crealytics / spark-excel

A Spark plugin for reading and writing Excel files
Apache License 2.0
451 stars 145 forks

Reading excel file in Azure Databricks #467

Open grajee-everest opened 2 years ago

grajee-everest commented 2 years ago

I tried to use spark-excel in Azure Databricks but I seem to be running into an error. I earlier tried the same on a SQL Server Big Data Cluster and was unsuccessful there as well.

Current Behavior

I'm getting the error java.lang.NoSuchMethodError: org.apache.commons.io.IOUtils.byteArray(I)[B

[screenshot]

I first installed the library via its Maven coordinates and got the error. I then followed the dependency link and loaded the JAR files manually, yet got the same error, as shown in the screenshot.

[screenshot]

Steps to Reproduce (for bugs)

from pyspark.sql.functions import input_file_name

df = spark.read.format("excel") \
    .option("header", True) \
    .option("inferSchema", True) \
    .load("dbfs:/FileStore/tables/users.xls") \
    .withColumn("file_name", input_file_name())

Your Environment

Azure Databricks [screenshot]

nightscape commented 2 years ago

I would really try not to download and add the JARs manually, but to use Maven's package resolution instead. What was the error you got there?
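For anyone debugging this: the NoSuchMethodError typically means an older commons-io on the cluster classpath is shadowing the newer release POI needs (IOUtils.byteArray(int) does not exist in the old version). A minimal PySpark sketch for checking which JAR the driver actually loaded the class from — the class name comes from the stack trace above, and getCodeSource() can return null for classes on the boot classpath:

# Ask the JVM where the conflicting class was loaded from. If this prints
# a Databricks system JAR rather than a recent commons-io JAR, the
# classpath conflict is confirmed.
cls = spark._jvm.java.lang.Class.forName("org.apache.commons.io.IOUtils")
src = cls.getProtectionDomain().getCodeSource()
print(src.getLocation().toString() if src is not None else "boot classpath")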

grajee-everest commented 2 years ago

I got the same error when I first tried it with the Maven coordinates, as in the screenshot below. Seeing this error, I went through the dependencies at the link and manually loaded the JAR files, hoping that it would help. But it did not.

[screenshot]

[screenshot]

Here is the error that was generated:


Py4JJavaError                             Traceback (most recent call last)
in
      1 #sampleDataFilePath = "dbfs:/FileStore/tables/users.xls"
      2
----> 3 df = spark.read.format("excel") \
      4     .option("header", True) \
      5     .option("inferSchema", True) \

/databricks/spark/python/pyspark/sql/readwriter.py in load(self, path, format, schema, **options)
    202         self.options(**options)
    203         if isinstance(path, str):
--> 204             return self._df(self._jreader.load(path))
    205         elif path is not None:
    206             if type(path) != list:

/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1302
   1303         answer = self.gateway_client.send_command(command)
-> 1304         return_value = get_return_value(
   1305             answer, self.gateway_client, self.target_id, self.name)
   1306

/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
    115     def deco(*a, **kw):
    116         try:
--> 117             return f(*a, **kw)
    118         except py4j.protocol.Py4JJavaError as e:
    119             converted = convert_exception(e.java_exception)

/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    324             value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
    325             if answer[1] == REFERENCE_TYPE:
--> 326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
    328                     format(target_id, ".", name), value)

Py4JJavaError: An error occurred while calling o307.load.
: java.lang.NoSuchMethodError: org.apache.commons.io.IOUtils.byteArray(I)[B
    at org.apache.commons.io.output.AbstractByteArrayOutputStream.needNewBuffer(AbstractByteArrayOutputStream.java:104)
    at org.apache.commons.io.output.UnsynchronizedByteArrayOutputStream.<init>(UnsynchronizedByteArrayOutputStream.java:51)
    at shadeio.poi.util.IOUtils.peekFirstNBytes(IOUtils.java:110)
    at shadeio.poi.poifs.filesystem.FileMagic.valueOf(FileMagic.java:209)
    at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:206)
    at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:172)
    at com.crealytics.spark.v2.excel.ExcelHelper.getWorkbook(ExcelHelper.scala:107)
    at com.crealytics.spark.v2.excel.ExcelHelper.getRows(ExcelHelper.scala:122)
    at com.crealytics.spark.v2.excel.ExcelTable.infer(ExcelTable.scala:69)
    at com.crealytics.spark.v2.excel.ExcelTable.inferSchema(ExcelTable.scala:42)
    at org.apache.spark.sql.execution.datasources.v2.FileTable.$anonfun$dataSchema$4(FileTable.scala:69)
    at scala.Option.orElse(Option.scala:447)
    at org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema$lzycompute(FileTable.scala:69)
    at org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema(FileTable.scala:63)
    at org.apache.spark.sql.execution.datasources.v2.FileTable.schema$lzycompute(FileTable.scala:82)
    at org.apache.spark.sql.execution.datasources.v2.FileTable.schema(FileTable.scala:80)
    at com.crealytics.spark.v2.excel.ExcelDataSource.inferSchema(ExcelDataSource.scala:85)
    at org.apache.spark.sql.execution.datasources.v2.DataSourceV2Utils$.getTableFromProvider(DataSourceV2Utils.scala:81)
    at org.apache.spark.sql.DataFrameReader.$anonfun$load$1(DataFrameReader.scala:388)
    at scala.Option.map(Option.scala:230)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:367)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:287)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
    at py4j.Gateway.invoke(Gateway.java:295)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:251)
    at java.lang.Thread.run(Thread.java:748)
grajee-everest commented 2 years ago

If it helps, these are the JAR files that I see loaded for the session:

%scala
spark.sparkContext.listJars.foreach(println)

spark://xx.xxx.xx.xx:40525/jars/addedFile2184961893124998763poi_shared_strings_2_2_3-55036.jar
spark://xx.xxx.xx.xx:40525/jars/addedFile8325949175049880530poi_5_1_0-6eaa4.jar
spark://xx.xxx.xx.xx:40525/jars/addedFile3805277380370442712spoiwo_2_12_2_0_0-5d426.jar
spark://xx.xxx.xx.xx:40525/jars/addedFile4821020784640732815commons_text_1_9-9ec33.jar
spark://xx.xxx.xx.xx:40525/jars/addedFile6096385456097086834commons_collections4_4_4-86bd5.jar
spark://xx.xxx.xx.xx:40525/jars/addedFile5503460718089690954poi_ooxml_5_1_0-dcd47.jar
spark://xx.xxx.xx.xx:40525/jars/addedFile1801717094295843813commons_compress_1_21-ae1b7.jar
spark://xx.xxx.xx.xx:40525/jars/addedFile3469926387869248457h2_1_4_200-17cf6.jar
spark://xx.xxx.xx.xx:40525/jars/addedFile7124418099051517404curvesapi_1_6-ef037.jar
spark://xx.xxx.xx.xx:40525/jars/addedFile3524630059114379065slf4j_api_1_7_32-db310.jar
spark://xx.xxx.xx.xx:40525/jars/addedFile621063403924903495SparseBitSet_1_2-c8237.jar
spark://xx.xxx.xx.xx:40525/jars/addedFile5513775878198382075commons_io_2_11_0-998c5.jar
spark://xx.xxx.xx.xx:40525/jars/addedFile6021795642522535665spark_excel_2_12_3_1_2_0_15_1-54852.jar
spark://xx.xxx.xx.xx:40525/jars/addedFile2561448775843921624poi_ooxml_lite_5_1_0-a9fef.jar
spark://xx.xxx.xx.xx:40525/jars/addedFile1605810761903966851commons_lang3_3_11-82b59.jar
spark://xx.xxx.xx.xx:40525/jars/addedFile2616706435049414994commons_codec_1_15-3e3d3.jar
spark://xx.xxx.xx.xx:40525/jars/addedFile3670030969644712160log4j_api_2_14_1-5a13d.jar
spark://xx.xxx.xx.xx:40525/jars/addedFile6859359805595503404xmlbeans_5_0_2-db545.jar
spark://xx.xxx.xx.xx:40525/jars/addedFile5420236778608197626scala_xml_2_12_2_0_0-e8c94.jar
spark://xx.xxx.xx.xx:40525/jars/addedFile2437294818883127996commons_math3_3_6_1-876c0.jar
spark://xx.xxx.xx.xx:40525/jars/addedFile3167668463888463121excel_streaming_reader_3_2_3-b4c68.jar

grajee-everest commented 2 years ago

FYI - I was able to get the elastacloud module working

nightscape commented 2 years ago

Wow, I didn't even know that project. Thanks for pointing it out!

grajee-everest commented 2 years ago

I would still like to get this working on Databricks and SQL Server BDC. Note that it is not working in Databricks for a few others as well.

KanakVantaku commented 2 years ago

I am also facing a similar issue. Has it been resolved? java.lang.NoSuchMethodError: org.apache.commons.io.IOUtils.byteArray(I)[B

soleyjh commented 2 years ago

FWIW I'm getting the exact same error: java.lang.NoSuchMethodError: org.apache.commons.io.IOUtils.byteArray(I)[B

Azure Databricks: 9.1 LTS (includes Apache Spark 3.1.2, Scala 2.12). Maven Install from Databricks: com.crealytics:spark-excel_2.12:3.1.2_0.15.1

Any help would be greatly appreciated @nightscape

MounikaKolisetty commented 2 years ago

Hello guys, I am also facing the same issue. Is there any solution, or has someone found an alternative method?

nightscape commented 2 years ago

@grajee-everest @KanakVantaku @soleyjh @MounikaKolisetty could one of you try whether changing https://github.com/crealytics/spark-excel/blob/main/build.sbt#L32-L37 to

shadeRenames ++= Seq(
  "org.apache.poi.**" -> "shadeio.poi.@1",
  "spoiwo.**" -> "shadeio.spoiwo.@1",
  "com.github.pjfanning.**" -> "shadeio.pjfanning.@1",
  "org.apache.commons.io.**" -> "shadeio.commons.io.@1",
  "org.apache.commons.compress.**" -> "shadeio.commons.compress.@1"
)

resolves this problem?

You would then need to sbt publishLocal a version of the JAR and upload that to Databricks.
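If someone builds that locally, a quick sanity check that the renaming took effect is to list the shaded entries inside the resulting JAR; the path below is a placeholder for wherever your build lands:

import zipfile

jar_path = "spark-excel-assembly.jar"  # hypothetical path to the locally built JAR
with zipfile.ZipFile(jar_path) as jar:
    shaded = [name for name in jar.namelist() if name.startswith("shadeio/commons/io/")]
# A successful shade rename should produce a non-empty list here.
print(len(shaded), "shaded commons-io entries")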

MounikaKolisetty commented 2 years ago

Hello @nightscape, what does sbt publishLocal mean? Can you please explain it in detail?

nightscape commented 2 years ago

Hi @MounikaKolisetty,

SBT is the Scala (or Simple) Build Tool. You can get instructions on how to install it here: https://www.scala-sbt.org/. Once you have it installed, you should be able to run

cd /path/to/spark-excel
# Make the changes from above
sbt publishLocal

This should start building the project and copy the generated JAR files to a path like ~/.ivy2/.../spark-excel...jar. You can take the JAR from there and try to upload and use it in Databricks.
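If you are unsure where publishLocal put the artifact, a rough Python lookup under the default local Ivy repository (the com.crealytics directory name is an assumption about the module's organisation; adjust as needed):

import glob
import os

# Hypothetical search for the locally published spark-excel JAR.
pattern = os.path.expanduser("~/.ivy2/local/com.crealytics/spark-excel*/**/*.jar")
for jar in glob.glob(pattern, recursive=True):
    print(jar)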

ciaeric commented 2 years ago

same error here : java.lang.NoSuchMethodError: org.apache.commons.io.IOUtils.byteArray(I)[B

Azure Databricks: 9.1 LTS (includes Apache Spark 3.1.2, Scala 2.12). Maven Install from Databricks: com.crealytics:spark-excel_2.12:3.1.2_0.15.1

Any solution yet?

Dunehub commented 2 years ago

Same error.

Azure Databricks: 9.0 (includes Apache Spark 3.1.2, Scala 2.12). Maven Install from Databricks: com.crealytics:spark-excel_2.12:3.1.2_0.16.0

nightscape commented 2 years ago

@ciaeric @Dunehub if possible, please try my proposal in https://github.com/crealytics/spark-excel/issues/467#issuecomment-984506825

spaw6065 commented 2 years ago

I am also facing the same issue on Azure Databricks and looking for possible solutions. Adding dependencies as per the attachment, but it is not working.

[screenshot]

nightscape commented 2 years ago

Once the build here finishes successfully, you can try version 0.16.1-pre1: https://github.com/crealytics/spark-excel/actions/runs/1607777770

nightscape commented 2 years ago

@ciaeric @Dunehub @spaw6065 @MounikaKolisetty @grajee-everest Please provide feedback here if the shading worked. The change is still on a branch, so if you don't provide feedback, it won't get merged and won't be part of the next release.

spaw6065 commented 2 years ago

For me it still didn't work.

Steps followed:

  1. Added the JAR from DBFS [screenshot]

  2. Executed the sample code [screenshot]

spaw6065 commented 2 years ago

Error trace:

NoClassDefFoundError: shadeio/commons/io/output/UnsynchronizedByteArrayOutputStream
Caused by: ClassNotFoundException: shadeio.commons.io.output.UnsynchronizedByteArrayOutputStream
    at shadeio.poi.poifs.filesystem.FileMagic.valueOf(FileMagic.java:209)
    at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:206)
    at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:172)
    at com.crealytics.spark.excel.DefaultWorkbookReader.$anonfun$openWorkbook$1(WorkbookReader.scala:55)
    at scala.Option.fold(Option.scala:251)
    at com.crealytics.spark.excel.DefaultWorkbookReader.openWorkbook(WorkbookReader.scala:55)
    at com.crealytics.spark.excel.WorkbookReader.withWorkbook(WorkbookReader.scala:16)
    at com.crealytics.spark.excel.WorkbookReader.withWorkbook$(WorkbookReader.scala:15)
    at com.crealytics.spark.excel.DefaultWorkbookReader.withWorkbook(WorkbookReader.scala:50)
    at com.crealytics.spark.excel.ExcelRelation.excerpt$lzycompute(ExcelRelation.scala:32)
    at com.crealytics.spark.excel.ExcelRelation.excerpt(ExcelRelation.scala:32)
    at com.crealytics.spark.excel.ExcelRelation.headerColumns$lzycompute(ExcelRelation.scala:104)
    at com.crealytics.spark.excel.ExcelRelation.headerColumns(ExcelRelation.scala:103)
    at com.crealytics.spark.excel.ExcelRelation.$anonfun$inferSchema$1(ExcelRelation.scala:172)
    at scala.Option.getOrElse(Option.scala:189)
    at com.crealytics.spark.excel.ExcelRelation.inferSchema(ExcelRelation.scala:171)
    at com.crealytics.spark.excel.ExcelRelation.<init>(ExcelRelation.scala:36)
    at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:36)
    at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:13)
    at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:8)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:385)
    at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:390)
    at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:346)
    at scala.Option.getOrElse(Option.scala:189)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:346)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:237)
    at $line03e3e0503061413eab90de3bf6be643427.$read$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-2541019815824441:3)
    at $line03e3e0503061413eab90de3bf6be643427.$read$$iw$$iw$$iw$$iw$$iw.<init>(command-2541019815824441:47)
    at $line03e3e0503061413eab90de3bf6be643427.$read$$iw$$iw$$iw$$iw.<init>(command-2541019815824441:49)
    at $line03e3e0503061413eab90de3bf6be643427.$read$$iw$$iw$$iw.<init>(command-2541019815824441:51)
    at $line03e3e0503061413eab90de3bf6be643427.$read$$iw$$iw.<init>(command-2541019815824441:53)
    at $line03e3e0503061413eab90de3bf6be643427.$read$$iw.<init>(command-2541019815824441:55)
    at $line03e3e0503061413eab90de3bf6be643427.$read.<init>(command-2541019815824441:57)
    at $line03e3e0503061413eab90de3bf6be643427.$read$.<init>(command-2541019815824441:61)
    at $line03e3e0503061413eab90de3bf6be643427.$read$.<clinit>(command-2541019815824441)
    at $line03e3e0503061413eab90de3bf6be643427.$eval$.$print$lzycompute(<notebook>:7)
    at $line03e3e0503061413eab90de3bf6be643427.$eval$.$print(<notebook>:6)
    at $line03e3e0503061413eab90de3bf6be643427.$eval.$print(<notebook>)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:747)
    at scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1020)
    at scala.tools.nsc.interpreter.IMain.$anonfun$interpret$1(IMain.scala:568)
    at scala.reflect.internal.util.ScalaClassLoader.asContext(ScalaClassLoader.scala:36)
    at scala.reflect.internal.util.ScalaClassLoader.asContext$(ScalaClassLoader.scala:116)
    at scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:41)
    at scala.tools.nsc.interpreter.IMain.loadAndRunReq$1(IMain.scala:567)
    at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:594)
    at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:564)
    at com.databricks.backend.daemon.driver.DriverILoop.execute(DriverILoop.scala:219)
    at com.databricks.backend.daemon.driver.ScalaDriverLocal.$anonfun$repl$1(ScalaDriverLocal.scala:235)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at com.databricks.backend.daemon.driver.DriverLocal$TrapExitInternal$.trapExit(DriverLocal.scala:902)
    at com.databricks.backend.daemon.driver.DriverLocal$TrapExit$.apply(DriverLocal.scala:855)
    at com.databricks.backend.daemon.driver.ScalaDriverLocal.repl(ScalaDriverLocal.scala:235)
    at com.databricks.backend.daemon.driver.DriverLocal.$anonfun$execute$13(DriverLocal.scala:541)
    at com.databricks.logging.UsageLogging.$anonfun$withAttributionContext$1(UsageLogging.scala:266)
    at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
    at com.databricks.logging.UsageLogging.withAttributionContext(UsageLogging.scala:261)
    at com.databricks.logging.UsageLogging.withAttributionContext$(UsageLogging.scala:258)
    at com.databricks.backend.daemon.driver.DriverLocal.withAttributionContext(DriverLocal.scala:50)
    at com.databricks.logging.UsageLogging.withAttributionTags(UsageLogging.scala:305)
    at com.databricks.logging.UsageLogging.withAttributionTags$(UsageLogging.scala:297)
    at com.databricks.backend.daemon.driver.DriverLocal.withAttributionTags(DriverLocal.scala:50)
    at com.databricks.backend.daemon.driver.DriverLocal.execute(DriverLocal.scala:518)
    at com.databricks.backend.daemon.driver.DriverWrapper.$anonfun$tryExecutingCommand$1(DriverWrapper.scala:689)
    at scala.util.Try$.apply(Try.scala:213)
    at com.databricks.backend.daemon.driver.DriverWrapper.tryExecutingCommand(DriverWrapper.scala:681)
    at com.databricks.backend.daemon.driver.DriverWrapper.getCommandOutputAndError(DriverWrapper.scala:522)
    at com.databricks.backend.daemon.driver.DriverWrapper.executeCommand(DriverWrapper.scala:634)
    at com.databricks.backend.daemon.driver.DriverWrapper.runInnerLoop(DriverWrapper.scala:427)
    at com.databricks.backend.daemon.driver.DriverWrapper.runInner(DriverWrapper.scala:370)
    at com.databricks.backend.daemon.driver.DriverWrapper.run(DriverWrapper.scala:221)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: shadeio.commons.io.output.UnsynchronizedByteArrayOutputStream
    at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
    at com.databricks.backend.daemon.driver.ClassLoaders$LibraryClassLoader.loadClass(ClassLoaders.scala:151)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
    at shadeio.poi.poifs.filesystem.FileMagic.valueOf(FileMagic.java:209)
    at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:206)
    at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:172)
    at com.crealytics.spark.excel.DefaultWorkbookReader.$anonfun$openWorkbook$1(WorkbookReader.scala:55)
    at scala.Option.fold(Option.scala:251)
    at com.crealytics.spark.excel.DefaultWorkbookReader.openWorkbook(WorkbookReader.scala:55)
    at com.crealytics.spark.excel.WorkbookReader.withWorkbook(WorkbookReader.scala:16)
    at com.crealytics.spark.excel.WorkbookReader.withWorkbook$(WorkbookReader.scala:15)
    at com.crealytics.spark.excel.DefaultWorkbookReader.withWorkbook(WorkbookReader.scala:50)
    at com.crealytics.spark.excel.ExcelRelation.excerpt$lzycompute(ExcelRelation.scala:32)
    at com.crealytics.spark.excel.ExcelRelation.excerpt(ExcelRelation.scala:32)
    at com.crealytics.spark.excel.ExcelRelation.headerColumns$lzycompute(ExcelRelation.scala:104)
    at com.crealytics.spark.excel.ExcelRelation.headerColumns(ExcelRelation.scala:103)
    at com.crealytics.spark.excel.ExcelRelation.$anonfun$inferSchema$1(ExcelRelation.scala:172)
    at scala.Option.getOrElse(Option.scala:189)
    at com.crealytics.spark.excel.ExcelRelation.inferSchema(ExcelRelation.scala:171)
    at com.crealytics.spark.excel.ExcelRelation.<init>(ExcelRelation.scala:36)
    at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:36)
    at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:13)
    at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:8)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:385)
    at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:390)
    at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:346)
    at scala.Option.getOrElse(Option.scala:189)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:346)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:237)
    at $line03e3e0503061413eab90de3bf6be643427.$read$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-2541019815824441:3)
    at $line03e3e0503061413eab90de3bf6be643427.$read$$iw$$iw$$iw$$iw$$iw.<init>(command-2541019815824441:47)
    at $line03e3e0503061413eab90de3bf6be643427.$read$$iw$$iw$$iw$$iw.<init>(command-2541019815824441:49)
    at $line03e3e0503061413eab90de3bf6be643427.$read$$iw$$iw$$iw.<init>(command-2541019815824441:51)
    at $line03e3e0503061413eab90de3bf6be643427.$read$$iw$$iw.<init>(command-2541019815824441:53)
    at $line03e3e0503061413eab90de3bf6be643427.$read$$iw.<init>(command-2541019815824441:55)
    at $line03e3e0503061413eab90de3bf6be643427.$read.<init>(command-2541019815824441:57)
    at $line03e3e0503061413eab90de3bf6be643427.$read$.<init>(command-2541019815824441:61)
    at $line03e3e0503061413eab90de3bf6be643427.$read$.<clinit>(command-2541019815824441)
    at $line03e3e0503061413eab90de3bf6be643427.$eval$.$print$lzycompute(<notebook>:7)
    at $line03e3e0503061413eab90de3bf6be643427.$eval$.$print(<notebook>:6)
    at $line03e3e0503061413eab90de3bf6be643427.$eval.$print(<notebook>)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:747)
    at scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1020)
    at scala.tools.nsc.interpreter.IMain.$anonfun$interpret$1(IMain.scala:568)
    at scala.reflect.internal.util.ScalaClassLoader.asContext(ScalaClassLoader.scala:36)
    at scala.reflect.internal.util.ScalaClassLoader.asContext$(ScalaClassLoader.scala:116)
    at scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:41)
    at scala.tools.nsc.interpreter.IMain.loadAndRunReq$1(IMain.scala:567)
    at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:594)
    at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:564)
    at com.databricks.backend.daemon.driver.DriverILoop.execute(DriverILoop.scala:219)
    at com.databricks.backend.daemon.driver.ScalaDriverLocal.$anonfun$repl$1(ScalaDriverLocal.scala:235)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at com.databricks.backend.daemon.driver.DriverLocal$TrapExitInternal$.trapExit(DriverLocal.scala:902)
    at com.databricks.backend.daemon.driver.DriverLocal$TrapExit$.apply(DriverLocal.scala:855)
    at com.databricks.backend.daemon.driver.ScalaDriverLocal.repl(ScalaDriverLocal.scala:235)
    at com.databricks.backend.daemon.driver.DriverLocal.$anonfun$execute$13(DriverLocal.scala:541)
    at com.databricks.logging.UsageLogging.$anonfun$withAttributionContext$1(UsageLogging.scala:266)
    at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
    at com.databricks.logging.UsageLogging.withAttributionContext(UsageLogging.scala:261)
    at com.databricks.logging.UsageLogging.withAttributionContext$(UsageLogging.scala:258)
    at com.databricks.backend.daemon.driver.DriverLocal.withAttributionContext(DriverLocal.scala:50)
    at com.databricks.logging.UsageLogging.withAttributionTags(UsageLogging.scala:305)
    at com.databricks.logging.UsageLogging.withAttributionTags$(UsageLogging.scala:297)
    at com.databricks.backend.daemon.driver.DriverLocal.withAttributionTags(DriverLocal.scala:50)
    at com.databricks.backend.daemon.driver.DriverLocal.execute(DriverLocal.scala:518)
    at com.databricks.backend.daemon.driver.DriverWrapper.$anonfun$tryExecutingCommand$1(DriverWrapper.scala:689)
    at scala.util.Try$.apply(Try.scala:213)
    at com.databricks.backend.daemon.driver.DriverWrapper.tryExecutingCommand(DriverWrapper.scala:681)
    at com.databricks.backend.daemon.driver.DriverWrapper.getCommandOutputAndError(DriverWrapper.scala:522)
    at com.databricks.backend.daemon.driver.DriverWrapper.executeCommand(DriverWrapper.scala:634)
    at com.databricks.backend.daemon.driver.DriverWrapper.runInnerLoop(DriverWrapper.scala:427)
    at com.databricks.backend.daemon.driver.DriverWrapper.runInner(DriverWrapper.scala:370)
    at com.databricks.backend.daemon.driver.DriverWrapper.run(DriverWrapper.scala:221)
    at java.lang.Thread.run(Thread.java:748)

Rajendramca commented 2 years ago

Any luck? I am facing the same issue with the latest JAR, com.crealytics:spark-excel_2.13:3.2.0_0.16.1-pre1.

Py4JJavaError                             Traceback (most recent call last)
in
     20 # Do not load the kpi entries into the entry dataframe it's automated according to the csv file stored in GroupFunctions/Daily_Management/RAW/BULKUPLOADPA/DF CSV Files/Automated KPIs.csv
     21 # Reading the Entries data from the Excel file
---> 22 DF_entry = spark.read.format("com.crealytics.spark.excel") \
     23     .option("header", "true") \
     24     .option("dataAddress" , "'" + sheetname + "'"+"!A1") \

/databricks/spark/python/pyspark/sql/readwriter.py in load(self, path, format, schema, **options)
    202         self.options(**options)
    203         if isinstance(path, str):
--> 204             return self._df(self._jreader.load(path))
    205         elif path is not None:
    206             if type(path) != list:

/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1302
   1303         answer = self.gateway_client.send_command(command)
-> 1304         return_value = get_return_value(
   1305             answer, self.gateway_client, self.target_id, self.name)
   1306

/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
    115     def deco(*a, **kw):
    116         try:
--> 117             return f(*a, **kw)
    118         except py4j.protocol.Py4JJavaError as e:
    119             converted = convert_exception(e.java_exception)

/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    324             value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
    325             if answer[1] == REFERENCE_TYPE:
--> 326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
    328                     format(target_id, ".", name), value)

Py4JJavaError: An error occurred while calling o2609.load.
: java.lang.NoSuchMethodError: org.apache.commons.io.IOUtils.byteArray(I)[B
    at org.apache.commons.io.output.AbstractByteArrayOutputStream.needNewBuffer(AbstractByteArrayOutputStream.java:104)
    at org.apache.commons.io.output.UnsynchronizedByteArrayOutputStream.<init>(UnsynchronizedByteArrayOutputStream.java:51)
    at shadeio.poi.util.IOUtils.peekFirstNBytes(IOUtils.java:110)
    at shadeio.poi.poifs.filesystem.FileMagic.valueOf(FileMagic.java:209)
    at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:206)
    at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:172)
    at com.crealytics.spark.excel.DefaultWorkbookReader.$anonfun$openWorkbook$1(WorkbookReader.scala:55)
    at scala.Option.fold(Option.scala:251)
    at com.crealytics.spark.excel.DefaultWorkbookReader.openWorkbook(WorkbookReader.scala:55)
    at com.crealytics.spark.excel.WorkbookReader.withWorkbook(WorkbookReader.scala:16)
    at com.crealytics.spark.excel.WorkbookReader.withWorkbook$(WorkbookReader.scala:15)
    at com.crealytics.spark.excel.DefaultWorkbookReader.withWorkbook(WorkbookReader.scala:50)
    at com.crealytics.spark.excel.ExcelRelation.excerpt$lzycompute(ExcelRelation.scala:32)
    at com.crealytics.spark.excel.ExcelRelation.excerpt(ExcelRelation.scala:32)
    at com.crealytics.spark.excel.ExcelRelation.headerColumns$lzycompute(ExcelRelation.scala:104)
    at com.crealytics.spark.excel.ExcelRelation.headerColumns(ExcelRelation.scala:103)
    at com.crealytics.spark.excel.ExcelRelation.$anonfun$inferSchema$1(ExcelRelation.scala:172)
    at scala.Option.getOrElse(Option.scala:189)
    at com.crealytics.spark.excel.ExcelRelation.inferSchema(ExcelRelation.scala:171)
    at com.crealytics.spark.excel.ExcelRelation.<init>(ExcelRelation.scala:36)
    at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:36)
    at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:13)
    at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:8)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:390)
    at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:444)
    at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:400)
    at scala.Option.getOrElse(Option.scala:189)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:400)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:287)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
    at py4j.Gateway.invoke(Gateway.java:295)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:251)
    at java.lang.Thread.run(Thread.java:748)
sabrishami commented 2 years ago

I've also faced this error. Is there any solution? Is there another way to read an .xls file on Databricks?

mayankgupta15 commented 2 years ago

Faced the same error with the 0.16 and 0.16.1 versions of this library. But then I tried an older version (com.crealytics:spark-excel_2.12:0.14.0) and it is working like a charm now.
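For anyone trying this workaround, a minimal sketch of a read against 0.14.0. Older releases register the V1 source name, so the fully qualified format is the safe choice (file path reused from the original report):

df = spark.read.format("com.crealytics.spark.excel") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("dbfs:/FileStore/tables/users.xls")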

chowdarykish commented 2 years ago

NoClassDefFoundError: shadeio/commons/io/output/UnsynchronizedByteArrayOutputStream
Caused by: ClassNotFoundException

Getting the same error in AWS Glue (Glue 3.0, Spark 3.1) using PySpark.

chowdarykish commented 2 years ago

I'm looking for the new version, as I want to utilize the dynamic partitioning to create multiple Excel files in parallel.

chowdarykish commented 2 years ago

FYI, getting the same error even with Scala (AWS Glue 3.0, Scala 2, Spark 3.1).

sabrishami commented 2 years ago

Faced the same error with the 0.16 and 0.16.1 versions of this library. But then I tried an older version (com.crealytics:spark-excel_2.12:0.14.0) and it is working like a charm now.

This clue works for me!!

Thanks

chowdarykish commented 2 years ago

but I need the dynamic partitioning feature, which is not available in the older version.

nightscape commented 2 years ago

As I don't have time to look into this, your best option is to try different versions of shading deps here. Once you have found a version that works on AWS / Azure Databricks, I'd be happy to do another pre-release and get it merged if it works for everyone.

Krishnapriya-RK commented 2 years ago

This

Faced the same error with the 0.16 and 0.16.1 versions of this library. But then I tried an older version (com.crealytics:spark-excel_2.12:0.14.0) and it is working like a charm now.

I was facing the same issue and this worked. Thank you

NageswarJetti commented 2 years ago

This

Faced the same error with the 0.16 and 0.16.1 versions of this library. But then I tried an older version (com.crealytics:spark-excel_2.12:0.14.0) and it is working like a charm now.

I was facing the same issue and this worked. Thank you

This worked for me. Thank you.

nightscape commented 2 years ago

Can you give v0.16.1-pre2 a try?

JoeNSalloum commented 2 years ago

It seems the GitHub Action for v0.16.1-pre2 has failed.

nightscape commented 2 years ago

Right. I'm not sure where the GPG problem is coming from, or whether it has anything to do with the rather small changes I made to build.sbt.

rsradulescu commented 2 years ago

I had a lot of trouble running spark-excel because of incompatible dependencies. Finally, with these library versions, I was able to read and write Excel files! Sharing them for you: [attachment: versiones_librerias_excel]

patchatarch commented 2 years ago

I'm experiencing the same issue. I'm using the following versions: Scala 2.1.2, spark-excel 0.14.0, Spark 3.1.2.

I've tried different versions such as 0.13.1, 0.13.4, 0.14.0, 0.15.0, and 0.16.1-pre1.

Has anybody found a solution to this? [screenshot]

vinodprathipati commented 2 years ago

As I don't have time to look into this, your best option is to try different versions of shading deps here. Once you have found a version that works on AWS / Azure Databricks, I'd be happy to do another pre-release and get it merged if it works for everyone.

Hi,

I took the "pre2" code, built a local JAR file, and deployed it in Databricks. I am no longer getting "commons-io" errors, but now I am getting a different error.

IOException: Your InputStream was neither an OLE2 stream, nor an OOXML stream or you haven't provide the poi-ooxml*.jar in the classpath/modulepath - FileMagic: OOXML, having providers: []
    at shadeio.poi.ss.usermodel.WorkbookFactory.wp(WorkbookFactory.java:334)
    at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:224)
    at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:185)

It seems the class loader is not loading the providers?

Any tips?

Vinod
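The empty having providers: [] suggests POI's ServiceLoader lookup is coming back empty, which can happen when shading drops or fails to rewrite the META-INF/services entries. A rough py4j sketch to enumerate the providers from a notebook; the shaded interface name shadeio.poi.ss.usermodel.WorkbookProvider is an assumption based on the shadeio.poi rename above:

# Enumerate workbook providers visible to the driver's service loader.
jvm = spark._jvm
iface = jvm.java.lang.Class.forName("shadeio.poi.ss.usermodel.WorkbookProvider")
providers = jvm.java.util.ServiceLoader.load(iface).iterator()
while providers.hasNext():
    print(providers.next().getClass().getName())  # expect XSSF/HSSF factories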

nightscape commented 2 years ago

This might be fixed by https://github.com/crealytics/spark-excel/pull/513. Can you merge that into your local branch and give it a try? Unfortunately, GitHub Actions is failing because of some GPG error and I haven't had time to look into it...

ymopurpg commented 2 years ago

While all the other versions were failing, I was able to install "com.crealytics:spark-excel_2.13:3.2.0_0.16.1-pre1"

ymopurpg commented 2 years ago

I get the following error when I try to read an Excel file: ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: Provider com.crealytics.spark.v2.excel.ExcelDataSource could not be instantiated

Any ideas?

nightscape commented 2 years ago

@ymopurpg yes, that's probably what was fixed in #513. Unfortunately the build is currently broken, and I don't have time to look into that...

Vilka284 commented 2 years ago

I experienced the same issue as @grajee-everest mentioned. The following package versions helped me: spark-submit ... --packages com.crealytics:spark-excel_2.12:0.14.0,commons-io:commons-io:2.11.0 ...
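For a self-managed PySpark session (outside Databricks, where libraries are attached via the cluster UI instead), roughly the same pinning can be expressed through spark.jars.packages; the coordinates are copied from the command above and must be set before the session starts:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Pin commons-io 2.11.0 next to the 0.14.0 plugin so POI finds
    # the IOUtils methods it needs at runtime.
    .config("spark.jars.packages",
            "com.crealytics:spark-excel_2.12:0.14.0,commons-io:commons-io:2.11.0")
    .getOrCreate()
)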

nightscape commented 2 years ago

I gave it another try in 0.16.5-pre1. Can someone try that out?

ymopurpg commented 2 years ago

@nightscape - I would really appreciate it if you could provide the full version (com.crealytics:spark-excel_2.12:0.16.5-pre1 or something else?). I tried so many versions in the past few days and don't know which one to use.

abrocklesbykpmg commented 2 years ago

Tried v0.16.5-pre1 this morning and I'm getting the following error...

Py4JJavaError: An error occurred while calling o228.load.
: java.lang.NoClassDefFoundError: Could not initialize class shadeio.poi.util.IOUtils
    at shadeio.poi.poifs.filesystem.FileMagic.valueOf(FileMagic.java:209)
    at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:222)
    at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:185)
    at com.crealytics.spark.v2.excel.ExcelHelper.getWorkbook(ExcelHelper.scala:111)
    at com.crealytics.spark.v2.excel.ExcelHelper.getRows(ExcelHelper.scala:127)
    at com.crealytics.spark.v2.excel.ExcelTable.infer(ExcelTable.scala:72)
    at com.crealytics.spark.v2.excel.ExcelTable.inferSchema(ExcelTable.scala:43)
    at org.apache.spark.sql.execution.datasources.v2.FileTable.$anonfun$dataSchema$4(FileTable.scala:69)
    at scala.Option.orElse(Option.scala:447)
    at org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema$lzycompute(FileTable.scala:69)
    at org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema(FileTable.scala:63)
    at org.apache.spark.sql.execution.datasources.v2.FileTable.schema$lzycompute(FileTable.scala:82)
    at org.apache.spark.sql.execution.datasources.v2.FileTable.schema(FileTable.scala:80)
    at com.crealytics.spark.v2.excel.ExcelDataSource.inferSchema(ExcelDataSource.scala:85)
    at org.apache.spark.sql.execution.datasources.v2.DataSourceV2Utils$.getTableFromProvider(DataSourceV2Utils.scala:81)
    at org.apache.spark.sql.DataFrameReader.$anonfun$load$1(DataFrameReader.scala:296)
    at scala.Option.map(Option.scala:230)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:266)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:240)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)
ln-data-bass commented 2 years ago

I gave it another try in 0.16.5-pre1. Can someone try that out?

I had the same issue described above, but upgrading to com.crealytics:spark-excel_2.12:3.1.2_0.16.5-pre1 resolved it and it's now working for me. I have no custom libraries installed on the Databricks cluster other than com.crealytics:spark-excel_2.12:3.1.2_0.16.5-pre1.

nightscape commented 2 years ago

Can everybody try 0.16.5-pre2 and report back here please?

alexjbush commented 2 years ago

I just tried com.crealytics:spark-excel_2.12:3.2.1_0.16.5-pre2 on 10.4 LTS (includes Apache Spark 3.2.1, Scala 2.12) and get the following error:

val existing = spark.read.format("excel").option("header", "true").load("example.xlsx")
display(existing)

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3) (10.75.73.234 executor 0): java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.util.FailureSafeParser.<init>(Lscala/Function1;Lorg/apache/spark/sql/catalyst/util/ParseMode;Lorg/apache/spark/sql/types/StructType;Ljava/lang/String;)V
    at com.crealytics.spark.v2.excel.ExcelParser$.parseIterator(ExcelParser.scala:423)
    at com.crealytics.spark.v2.excel.ExcelPartitionReaderFactory.readFile(ExcelPartitionReaderFactory.scala:75)
    at com.crealytics.spark.v2.excel.ExcelPartitionReaderFactory.buildReader(ExcelPartitionReaderFactory.scala:61)
    at org.apache.spark.sql.execution.datasources.v2.FilePartitionReaderFactory.$anonfun$createReader$1(FilePartitionReaderFactory.scala:30)
    at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
    at org.apache.spark.sql.execution.datasources.v2.FilePartitionReader.getNextReader(FilePartitionReader.scala:99)
    at org.apache.spark.sql.execution.datasources.v2.FilePartitionReader.next(FilePartitionReader.scala:43)
    at org.apache.spark.sql.execution.datasources.v2.PartitionIterator.hasNext(DataSourceRDD.scala:94)
    at org.apache.spark.sql.execution.datasources.v2.MetricsIterator.hasNext(DataSourceRDD.scala:131)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
    at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.encodeUnsafeRows(UnsafeRowBatchUtils.scala:80)
    at org.apache.spark.sql.execution.collect.Collector.$anonfun$processFunc$1(Collector.scala:155)
    at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$3(ResultTask.scala:75)
    at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
    at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$1(ResultTask.scala:75)
    at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:55)
    at org.apache.spark.scheduler.Task.doRunTask(Task.scala:156)
    at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:125)
    at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
    at org.apache.spark.scheduler.Task.run(Task.scala:95)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$13(Executor.scala:825)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1644)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:828)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:683)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2984)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2931)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2925)
    at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
    at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2925)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1345)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1345)
    at scala.Option.foreach(Option.scala:407)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1345)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3193)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3134)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3122)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:1107)
    at org.apache.spark.SparkContext.runJobInternal(SparkContext.scala:2561)
    at org.apache.spark.sql.execution.collect.Collector.runSparkJobs(Collector.scala:266)
    at org.apache.spark.sql.execution.collect.Collector.collect(Collector.scala:276)
    at org.apache.spark.sql.execution.collect.Collector$.collect(Collector.scala:81)
    at org.apache.spark.sql.execution.collect.Collector$.collect(Collector.scala:87)
    at org.apache.spark.sql.execution.collect.InternalRowFormat$.collect(cachedSparkResults.scala:75)
    at org.apache.spark.sql.execution.collect.InternalRowFormat$.collect(cachedSparkResults.scala:62)
    at org.apache.spark.sql.execution.ResultCacheManager.collectResult$1(ResultCacheManager.scala:587)
    at org.apache.spark.sql.execution.ResultCacheManager.computeResult(ResultCacheManager.scala:596)
    at org.apache.spark.sql.execution.ResultCacheManager.$anonfun$getOrComputeResultInternal$1(ResultCacheManager.scala:542)
    at scala.Option.getOrElse(Option.scala:189)
    at org.apache.spark.sql.execution.ResultCacheManager.getOrComputeResultInternal(ResultCacheManager.scala:541)
    at org.apache.spark.sql.execution.ResultCacheManager.getOrComputeResult(ResultCacheManager.scala:438)
    at org.apache.spark.sql.execution.ResultCacheManager.getOrComputeResult(ResultCacheManager.scala:417)
    at org.apache.spark.sql.execution.SparkPlan.executeCollectResult(SparkPlan.scala:422)
    at org.apache.spark.sql.Dataset.collectResult(Dataset.scala:3132)
    at org.apache.spark.sql.Dataset.$anonfun$collectResult$1(Dataset.scala:3123)
    at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3930)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$6(SQLExecution.scala:195)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:342)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$1(SQLExecution.scala:153)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:852)
    at org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:115)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:292)
    at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3928)
    at org.apache.spark.sql.Dataset.collectResult(Dataset.scala:3122)
    at com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation0(OutputAggregator.scala:268)
    at com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation(OutputAggregator.scala:102)
    at com.databricks.backend.daemon.driver.ScalaDriverLocal.$anonfun$getResultBufferInternal$3(ScalaDriverLocal.scala:345)
    at scala.Option.map(Option.scala:230)
    at com.databricks.backend.daemon.driver.ScalaDriverLocal.$anonfun$getResultBufferInternal$1(ScalaDriverLocal.scala:325)
    at scala.Option.map(Option.scala:230)
    at com.databricks.backend.daemon.driver.ScalaDriverLocal.getResultBufferInternal(ScalaDriverLocal.scala:289)
    at com.databricks.backend.daemon.driver.DriverLocal.getResultBuffer(DriverLocal.scala:712)
    at com.databricks.backend.daemon.driver.ScalaDriverLocal.repl(ScalaDriverLocal.scala:267)
    at com.databricks.backend.daemon.driver.DriverLocal.$anonfun$execute$11(DriverLocal.scala:602)
    at com.databricks.logging.UsageLogging.$anonfun$withAttributionContext$2(UsageLogging.scala:232)
    at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
    at com.databricks.logging.AttributionContext$.withValue(AttributionContext.scala:94)
    at com.databricks.logging.UsageLogging.withAttributionContext(UsageLogging.scala:230)
    at com.databricks.logging.UsageLogging.withAttributionContext$(UsageLogging.scala:212)
    at com.databricks.backend.daemon.driver.DriverLocal.withAttributionContext(DriverLocal.scala:60)
    at com.databricks.logging.UsageLogging.withAttributionTags(UsageLogging.scala:276)
    at com.databricks.logging.UsageLogging.withAttributionTags$(UsageLogging.scala:261)
    at com.databricks.backend.daemon.driver.DriverLocal.withAttributionTags(DriverLocal.scala:60)
    at com.databricks.backend.daemon.driver.DriverLocal.execute(DriverLocal.scala:579)
    at com.databricks.backend.daemon.driver.DriverWrapper.$anonfun$tryExecutingCommand$1(DriverWrapper.scala:615)
    at scala.util.Try$.apply(Try.scala:213)
    at com.databricks.backend.daemon.driver.DriverWrapper.tryExecutingCommand(DriverWrapper.scala:607)
    at com.databricks.backend.daemon.driver.DriverWrapper.executeCommandAndGetError(DriverWrapper.scala:526)
    at com.databricks.backend.daemon.driver.DriverWrapper.executeCommand(DriverWrapper.scala:561)
    at com.databricks.backend.daemon.driver.DriverWrapper.runInnerLoop(DriverWrapper.scala:431)
    at com.databricks.backend.daemon.driver.DriverWrapper.runInner(DriverWrapper.scala:374)
    at com.databricks.backend.daemon.driver.DriverWrapper.run(DriverWrapper.scala:225)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.util.FailureSafeParser.<init>(Lscala/Function1;Lorg/apache/spark/sql/catalyst/util/ParseMode;Lorg/apache/spark/sql/types/StructType;Ljava/lang/String;)V
    at com.crealytics.spark.v2.excel.ExcelParser$.parseIterator(ExcelParser.scala:423)
    at com.crealytics.spark.v2.excel.ExcelPartitionReaderFactory.readFile(ExcelPartitionReaderFactory.scala:75)
    at com.crealytics.spark.v2.excel.ExcelPartitionReaderFactory.buildReader(ExcelPartitionReaderFactory.scala:61)
    at org.apache.spark.sql.execution.datasources.v2.FilePartitionReaderFactory.$anonfun$createReader$1(FilePartitionReaderFactory.scala:30)
    at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
    at org.apache.spark.sql.execution.datasources.v2.FilePartitionReader.getNextReader(FilePartitionReader.scala:99)
    at org.apache.spark.sql.execution.datasources.v2.FilePartitionReader.next(FilePartitionReader.scala:43)
    at org.apache.spark.sql.execution.datasources.v2.PartitionIterator.hasNext(DataSourceRDD.scala:94)
    at org.apache.spark.sql.execution.datasources.v2.MetricsIterator.hasNext(DataSourceRDD.scala:131)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
    at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.encodeUnsafeRows(UnsafeRowBatchUtils.scala:80)
    at org.apache.spark.sql.execution.collect.Collector.$anonfun$processFunc$1(Collector.scala:155)
    at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$3(ResultTask.scala:75)
    at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
    at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$1(ResultTask.scala:75)
    at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:55)
    at org.apache.spark.scheduler.Task.doRunTask(Task.scala:156)
    at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:125)
    at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
    at org.apache.spark.scheduler.Task.run(Task.scala:95)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$13(Executor.scala:825)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1644)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:828)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:683)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    ... 1 more
nightscape commented 2 years ago

@alexjbush strange, that looks like the Spark version doesn't match the one spark-excel was built against, but from your description it clearly should... Maybe there is something wrong with our cross-publishing build.sbt.
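A quick check for anyone comparing versions: the middle part of the artifact version encodes the Spark release it was built against (the 3.2.1 in spark-excel_2.12:3.2.1_0.16.5-pre2), so the runtime should report the same release:

# Compare against the Spark-version prefix of the installed coordinate,
# e.g. "3.2.1" for com.crealytics:spark-excel_2.12:3.2.1_0.16.5-pre2.
print(spark.version)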

abhisrphoenix commented 2 years ago

Adding spark-excel as a Maven dependency instead of a JAR resolved my issue.

[screenshot]