crealytics / spark-excel

A Spark plugin for reading and writing Excel files
Apache License 2.0

java.lang.ClassNotFoundException: shadeio.poi.ss.usermodel.WorkbookFactory error #464

Open grajee-everest opened 2 years ago

grajee-everest commented 2 years ago

All,

I'm getting the error below when I run the code in a notebook on a SQL Server 2019 Big Data Cluster from Microsoft. I'm not sure whether I have all the jars added, or the right versions of them, and I'm unable to find any relevant documentation.

Does anyone know what I'm doing wrong?

Please see the image below; the jar files are in the red rectangle.

image

```
%%configure -f
{"conf": {"spark.jars": "hdfs:///modules/jar/spark-excel/commons-collections4-4.4.jar,hdfs:///modules/jar/spark-excel/poi-ooxml-schemas-4.1.2.jar,hdfs:///modules/jar/spark-excel/spark-excel_2.12-3.1.2_0.15.0.jar,hdfs:///modules/jar/spark-excel/xmlbeans-3.1.0.jar"}}
```

```python
from pyspark.sql.functions import input_file_name

df = spark.read.format("excel") \
    .option("header", True) \
    .option("inferSchema", True) \
    .load("hdfs:///test/users.xls") \
    .withColumn("file_name", input_file_name())
```

Expected Behavior

See the records from the Excel file

Current Behavior

An error was encountered:

```
An error occurred while calling o79.load.
: java.lang.NoClassDefFoundError: shadeio/poi/ss/usermodel/WorkbookFactory
  at com.crealytics.spark.v2.excel.ExcelHelper.getWorkbook(ExcelHelper.scala:107)
  at com.crealytics.spark.v2.excel.ExcelHelper.getRows(ExcelHelper.scala:122)
  at com.crealytics.spark.v2.excel.ExcelTable.infer(ExcelTable.scala:69)
  at com.crealytics.spark.v2.excel.ExcelTable.inferSchema(ExcelTable.scala:42)
  at org.apache.spark.sql.execution.datasources.v2.FileTable.$anonfun$dataSchema$4(FileTable.scala:69)
  at scala.Option.orElse(Option.scala:447)
  at org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema$lzycompute(FileTable.scala:69)
  at org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema(FileTable.scala:63)
  at org.apache.spark.sql.execution.datasources.v2.FileTable.schema$lzycompute(FileTable.scala:82)
  at org.apache.spark.sql.execution.datasources.v2.FileTable.schema(FileTable.scala:80)
  at com.crealytics.spark.v2.excel.ExcelDataSource.inferSchema(ExcelDataSource.scala:85)
  at org.apache.spark.sql.execution.datasources.v2.DataSourceV2Utils$.getTableFromProvider(DataSourceV2Utils.scala:81)
  at org.apache.spark.sql.DataFrameReader.$anonfun$load$1(DataFrameReader.scala:295)
  at scala.Option.map(Option.scala:230)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:265)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:239)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
  at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
  at py4j.Gateway.invoke(Gateway.java:282)
  at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
  at py4j.commands.CallCommand.execute(CallCommand.java:79)
  at py4j.GatewayConnection.run(GatewayConnection.java:238)
  at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: shadeio.poi.ss.usermodel.WorkbookFactory
  at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
  ... 27 more
```

```
Traceback (most recent call last):
  File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 204, in load
    return self._df(self._jreader.load(path))
  File "/opt/spark/python/lib/py4j-src.zip/py4j/java_gateway.py", line 1304, in __call__
    return_value = get_return_value(
  File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 111, in deco
    return f(*a, **kw)
  File "/opt/spark/python/lib/py4j-src.zip/py4j/protocol.py", line 326, in get_return_value
    raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o79.load.
: java.lang.NoClassDefFoundError: shadeio/poi/ss/usermodel/WorkbookFactory
  (Java stack trace identical to the one above)
```

Possible Solution

It looks like I might not have all the jar files, or not the correct versions of them.
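One quick way to check the "missing jar" theory is to look inside each uploaded jar for the class named in the error. A jar is just a zip archive, so plain Python is enough; this is a sketch, and the jar path in the example is hypothetical:

```python
import zipfile

def jar_contains_class(jar_path: str, class_name: str) -> bool:
    """Return True if the jar (a zip archive) contains the given class.

    class_name uses dotted form, e.g. "shadeio.poi.ss.usermodel.WorkbookFactory".
    """
    # class files are stored under their package path, e.g.
    # shadeio/poi/ss/usermodel/WorkbookFactory.class
    entry = class_name.replace(".", "/") + ".class"
    with zipfile.ZipFile(jar_path) as jar:
        return entry in jar.namelist()

# Hypothetical usage against a locally downloaded copy of the jar:
# jar_contains_class("spark-excel_2.12-3.1.2_0.15.0.jar",
#                    "shadeio.poi.ss.usermodel.WorkbookFactory")
```

If none of the jars listed in `spark.jars` contains the class, the error above is expected.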

Steps to Reproduce (for bugs)

```
%%configure -f
{"conf": {"spark.jars": "hdfs:///modules/jar/spark-excel/commons-collections4-4.4.jar,hdfs:///modules/jar/spark-excel/poi-ooxml-schemas-4.1.2.jar,hdfs:///modules/jar/spark-excel/spark-excel_2.12-3.1.2_0.15.0.jar,hdfs:///modules/jar/spark-excel/xmlbeans-3.1.0.jar"}}
```

```python
from pyspark.sql.functions import input_file_name

df = spark.read.format("excel") \
    .option("header", True) \
    .option("inferSchema", True) \
    .load("hdfs:///test/users.xls") \
    .withColumn("file_name", input_file_name())
```

Context

Not able to read from Excel files

Your Environment

Environment details (SQL Server Big Data Cluster CU13): https://docs.microsoft.com/en-us/sql/big-data-cluster/release-notes-cumulative-update-13?view=sql-server-ver15

- Operating System: Ubuntu 20.04.3 LTS
- Microsoft Spark Runtime: 2021.1
- Spark: 3.1.2
- Delta Lake: 1.0.0
- Java: Azul Zulu JRE 1.8.0_275
- Scala: 2.12
- Python: 3.8 (miniforge 4.9)
- R: Microsoft R 3.5.2
- Spark SQL Connector: 1.2.0

quanghgx commented 2 years ago

Hi @grajee-everest, thank you so much for sharing the details of the issue. Let me try it out this weekend and get back to you.

Adding @nightscape, as this issue might be related to the POI upgrade in 0.15.0.

nightscape commented 2 years ago

Might be related to https://github.com/crealytics/spark-excel/pull/465 Can you try 0.15.1?

grajee-everest commented 2 years ago

I copied the jar file "spark-excel_2.12-3.1.2_0.15.1.jar" from https://search.maven.org/artifact/com.crealytics/spark-excel_2.12/3.1.2_0.15.1/jar but it did not work as you can see in the error message. Are there other jar files that need to be replaced?

image

When I originally tried deploying the spark-excel library, I merely went by what is in the link and copied those jar files.

An error was encountered:

```
An error occurred while calling o79.load.
: java.lang.NoClassDefFoundError: org/apache/commons/io/output/UnsynchronizedByteArrayOutputStream
  at shadeio.poi.poifs.filesystem.FileMagic.valueOf(FileMagic.java:209)
  at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:206)
  at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:172)
  at com.crealytics.spark.excel.DefaultWorkbookReader.$anonfun$openWorkbook$1(WorkbookReader.scala:55)
  at scala.Option.fold(Option.scala:251)
  at com.crealytics.spark.excel.DefaultWorkbookReader.openWorkbook(WorkbookReader.scala:55)
  at com.crealytics.spark.excel.WorkbookReader.withWorkbook(WorkbookReader.scala:16)
  at com.crealytics.spark.excel.WorkbookReader.withWorkbook$(WorkbookReader.scala:15)
  at com.crealytics.spark.excel.DefaultWorkbookReader.withWorkbook(WorkbookReader.scala:50)
  at com.crealytics.spark.excel.ExcelRelation.excerpt$lzycompute(ExcelRelation.scala:32)
  at com.crealytics.spark.excel.ExcelRelation.excerpt(ExcelRelation.scala:32)
  at com.crealytics.spark.excel.ExcelRelation.headerColumns$lzycompute(ExcelRelation.scala:104)
  at com.crealytics.spark.excel.ExcelRelation.headerColumns(ExcelRelation.scala:103)
  at com.crealytics.spark.excel.ExcelRelation.$anonfun$inferSchema$1(ExcelRelation.scala:172)
  at scala.Option.getOrElse(Option.scala:189)
  at com.crealytics.spark.excel.ExcelRelation.inferSchema(ExcelRelation.scala:171)
  at com.crealytics.spark.excel.ExcelRelation.<init>(ExcelRelation.scala:36)
  at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:36)
  at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:13)
  at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:8)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:355)
  at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:325)
  at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:307)
  at scala.Option.getOrElse(Option.scala:189)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:307)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:239)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
  at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
  at py4j.Gateway.invoke(Gateway.java:282)
  at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
  at py4j.commands.CallCommand.execute(CallCommand.java:79)
  at py4j.GatewayConnection.run(GatewayConnection.java:238)
  at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: org.apache.commons.io.output.UnsynchronizedByteArrayOutputStream
  at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
  ... 37 more
```

```
Traceback (most recent call last):
  File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 204, in load
    return self._df(self._jreader.load(path))
  File "/opt/spark/python/lib/py4j-src.zip/py4j/java_gateway.py", line 1304, in __call__
    return_value = get_return_value(
  File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 111, in deco
    return f(*a, **kw)
  File "/opt/spark/python/lib/py4j-src.zip/py4j/protocol.py", line 326, in get_return_value
    raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o79.load.
: java.lang.NoClassDefFoundError: org/apache/commons/io/output/UnsynchronizedByteArrayOutputStream
  (Java stack trace identical to the one above)
```

image

grajee-everest commented 2 years ago

I googled for more info on the error and realized that it needs "commons-io-2.11.0.jar". I copied it to HDFS and re-ran the notebook, but this time I'm getting another error, as in the image: "Caused by: java.lang.ClassNotFoundException: org.apache.logging.log4j.LogManager"

Based on the errors, it seems there is a whole set of jar files that I need to upload. How do I find out which jar files spark-excel needs?
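One way to answer that question, assuming Maven is available locally, is to write a minimal `pom.xml` that declares only spark-excel and let `mvn dependency:tree` print everything Maven would pull in. This is a sketch; the `groupId`/`artifactId` of the wrapper project are placeholders, and the spark-excel version is the one used above:

```xml
<!-- minimal pom.xml; "mvn dependency:tree" against it lists the
     transitive dependencies of spark-excel that Maven knows about -->
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <groupId>example</groupId>
  <artifactId>spark-excel-deps</artifactId>
  <version>1.0</version>
  <dependencies>
    <dependency>
      <groupId>com.crealytics</groupId>
      <artifactId>spark-excel_2.12</artifactId>
      <version>3.1.2_0.15.0</version>
    </dependency>
  </dependencies>
</project>
```

Each jar in the printed tree would then need to be copied to HDFS and added to `spark.jars`.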

image

An error was encountered:

```
An error occurred while calling o79.load.
: java.lang.NoClassDefFoundError: org/apache/logging/log4j/LogManager
  at shadeio.poi.util.IOUtils.<clinit>(IOUtils.java:43)
  at shadeio.poi.poifs.filesystem.FileMagic.valueOf(FileMagic.java:209)
  at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:206)
  at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:172)
  at com.crealytics.spark.excel.DefaultWorkbookReader.$anonfun$openWorkbook$1(WorkbookReader.scala:55)
  at scala.Option.fold(Option.scala:251)
  at com.crealytics.spark.excel.DefaultWorkbookReader.openWorkbook(WorkbookReader.scala:55)
  at com.crealytics.spark.excel.WorkbookReader.withWorkbook(WorkbookReader.scala:16)
  at com.crealytics.spark.excel.WorkbookReader.withWorkbook$(WorkbookReader.scala:15)
  at com.crealytics.spark.excel.DefaultWorkbookReader.withWorkbook(WorkbookReader.scala:50)
  at com.crealytics.spark.excel.ExcelRelation.excerpt$lzycompute(ExcelRelation.scala:32)
  at com.crealytics.spark.excel.ExcelRelation.excerpt(ExcelRelation.scala:32)
  at com.crealytics.spark.excel.ExcelRelation.headerColumns$lzycompute(ExcelRelation.scala:104)
  at com.crealytics.spark.excel.ExcelRelation.headerColumns(ExcelRelation.scala:103)
  at com.crealytics.spark.excel.ExcelRelation.$anonfun$inferSchema$1(ExcelRelation.scala:172)
  at scala.Option.getOrElse(Option.scala:189)
  at com.crealytics.spark.excel.ExcelRelation.inferSchema(ExcelRelation.scala:171)
  at com.crealytics.spark.excel.ExcelRelation.<init>(ExcelRelation.scala:36)
  at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:36)
  at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:13)
  at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:8)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:355)
  at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:325)
  at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:307)
  at scala.Option.getOrElse(Option.scala:189)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:307)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:239)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
  at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
  at py4j.Gateway.invoke(Gateway.java:282)
  at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
  at py4j.commands.CallCommand.execute(CallCommand.java:79)
  at py4j.GatewayConnection.run(GatewayConnection.java:238)
  at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: org.apache.logging.log4j.LogManager
  at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
  ... 38 more
```

```
Traceback (most recent call last):
  File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 204, in load
    return self._df(self._jreader.load(path))
  File "/opt/spark/python/lib/py4j-src.zip/py4j/java_gateway.py", line 1304, in __call__
    return_value = get_return_value(
  File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 111, in deco
    return f(*a, **kw)
  File "/opt/spark/python/lib/py4j-src.zip/py4j/protocol.py", line 326, in get_return_value
    raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o79.load.
: java.lang.NoClassDefFoundError: org/apache/logging/log4j/LogManager
  (Java stack trace identical to the one above)
```

nightscape commented 2 years ago

Why are you trying to add the jars manually? Usually you would add spark-excel as a package; Spark would then pull in all of its dependencies by itself. You will have to find out yourself how to do that in your environment, though.
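In a Livy-based notebook such as this one, that usually means pointing `spark.jars.packages` at the Maven coordinates instead of listing individual jars under `spark.jars`. A sketch, assuming the cluster can reach Maven Central and using the 3.1.2_0.15.1 build mentioned above:

```
%%configure -f
{"conf": {"spark.jars.packages": "com.crealytics:spark-excel_2.12:3.1.2_0.15.1"}}
```

On a plain spark-shell or spark-submit, the equivalent is `--packages com.crealytics:spark-excel_2.12:3.1.2_0.15.1`.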

prayagkr commented 2 years ago

Spark versions: 3.1.2 and 3.2.0, with Java 11. Spark-Excel versions: 3.1.2_0.15.0, 3.1.2_0.15.2, 3.1.2_0.15.2 and 3.1.2_0.16.0. I tried all the versions mentioned above but got an exception with each.

Version 0.14.0 is working: com.crealytics:spark-excel_2.12:0.14.0

bmdoss commented 2 years ago

Hi, I tried @prayagkr's approach above and it works fine with Java, but the same is not working with Scala, even though I have tried different versions of Scala and Spark with spark-excel. Same error every time; please see below. Also attached a sample screenshot (one of the tries).

```
Exception in thread "main" java.lang.ClassNotFoundException: Failed to find data source: excel. Please find packages at http://spark.apache.org/third-party-projects.html
  at org.apache.spark.sql.errors.QueryExecutionErrors$.failedToFindDataSourceError(QueryExecutionErrors.scala:443)
  at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:670)
  at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSourceV2(DataSource.scala:720)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:210)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:188)
  at org.learnSpark.application.ReadTSV$.delayedEndpoint$org$learnSpark$application$ReadTSV$1(ReadTSV.scala:16)
  at org.learnSpark.application.ReadTSV$delayedInit$body.apply(ReadTSV.scala:5)
  at scala.Function0.apply$mcV$sp(Function0.scala:39)
  at scala.Function0.apply$mcV$sp$(Function0.scala:39)
  at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:17)
  at scala.App.$anonfun$main$1(App.scala:76)
  at scala.App.$anonfun$main$1$adapted(App.scala:76)
  at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:563)
  at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:561)
  at scala.collection.AbstractIterable.foreach(Iterable.scala:926)
  at scala.App.main(App.scala:76)
  at scala.App.main$(App.scala:74)
  at org.learnSpark.application.ReadTSV$.main(ReadTSV.scala:5)
  at org.learnSpark.application.ReadTSV.main(ReadTSV.scala)
Caused by: java.lang.ClassNotFoundException: excel.DefaultSource
  at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
  at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:355)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
  at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$5(DataSource.scala:656)
  at scala.util.Try$.apply(Try.scala:210)
  at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$4(DataSource.scala:656)
  at scala.util.Failure.orElse(Try.scala:221)
  at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:656)
  ... 17 more
22/02/06 21:12:40 INFO SparkContext: Invoking stop() from shutdown hook
```

Capture

santhoshdesikachari commented 2 years ago

@nightscape - I am frustrated; I have been trying to get this working for the past 2 days. As @prayagkr said, I tried all possible things and nothing works.

Firstly, getting spark-shell working with the right Maven coordinates is itself difficult; one library dependency or another is always missing (log4j and commons-io).

Next, as I keep going back to older versions (0.16.4), I get this error:

java.lang.NoClassDefFoundError: org/apache/commons/io/output/UnsynchronizedByteArrayOutputStream
  at shadeio.poi.poifs.filesystem.FileMagic.valueOf(FileMagic.java:209)
  at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:222)
  at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:185)
  at com.crealytics.spark.excel.DefaultWorkbookReader.$anonfun$openWorkbook$1(WorkbookReader.scala:55)
  at scala.Option.fold(Option.scala:251)
  at com.crealytics.spark.excel.DefaultWorkbookReader.openWorkbook(WorkbookReader.scala:55)
  at com.crealytics.spark.excel.WorkbookReader.withWorkbook(WorkbookReader.scala:16)
  at com.crealytics.spark.excel.WorkbookReader.withWorkbook$(WorkbookReader.scala:15)
  at com.crealytics.spark.excel.DefaultWorkbookReader.withWorkbook(WorkbookReader.scala:50)
  at com.crealytics.spark.excel.ExcelRelation.excerpt$lzycompute(ExcelRelation.scala:32)
  at com.crealytics.spark.excel.ExcelRelation.excerpt(ExcelRelation.scala:32)
  at com.crealytics.spark.excel.ExcelRelation.headerColumns$lzycompute(ExcelRelation.scala:104)
  at com.crealytics.spark.excel.ExcelRelation.headerColumns(ExcelRelation.scala:103)
  at com.crealytics.spark.excel.ExcelRelation.$anonfun$inferSchema$1(ExcelRelation.scala:172)
  at scala.Option.getOrElse(Option.scala:189)
  at com.crealytics.spark.excel.ExcelRelation.inferSchema(ExcelRelation.scala:171)
  at com.crealytics.spark.excel.ExcelRelation.<init>(ExcelRelation.scala:36)
  at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:36)
  at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:13)
  at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:8)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:355)
  at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:325)
  at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:307)
  at scala.Option.getOrElse(Option.scala:189)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:307)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:239)
  ... 49 elided
Caused by: java.lang.ClassNotFoundException: org.apache.commons.io.output.UnsynchronizedByteArrayOutputStream
  at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
  at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:351)

I am running out of ideas to get this working.

My environment:

- macOS Monterey 12.2.1
- Java 1.8.0_322
- spark-shell 3.1.2
- Scala 2.12

Any help would be appreciated. I have a deadline to complete some data movement from excel and I am badly in need of this. Thanks in advance.

tirancm commented 2 years ago

For Java 11:

```xml
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.12</artifactId>
  <version>3.0.1</version>
</dependency>
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql_2.12</artifactId>
  <version>3.0.1</version>
  <scope>provided</scope>
</dependency>
<dependency>
  <groupId>com.crealytics</groupId>
  <artifactId>spark-excel_2.12</artifactId>
  <version>0.14.0</version>
</dependency>
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-common</artifactId>
  <version>3.3.3</version>
</dependency>
```
nightscape commented 2 years ago

Can you try 0.17.2?