crealytics / spark-excel

A Spark plugin for reading and writing Excel files
Apache License 2.0

Loading Excel with PERMISSIVE on EMR fails while it works locally (on Windows) #864

Closed christianknoepfle closed 2 weeks ago

christianknoepfle commented 2 weeks ago

Am I using the newest version of the library?

Is there an existing issue for this?

Current Behavior

We have an Excel file whose rows vary in the number of populated cells (some rows have excess values, some have missing values). Using PERMISSIVE mode, which just ignores excess values and sets missing ones to null, is fine for us.
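For illustration, here is a minimal pure-Scala sketch of the drop-excess/pad-missing behavior we expect from PERMISSIVE mode (`RowNormalizer` is a hypothetical helper for this issue, not part of the spark-excel API):

```scala
// Hypothetical helper (not spark-excel code) illustrating what PERMISSIVE
// mode should do with ragged rows: excess cells are dropped and missing
// cells become null, so every row ends up at the schema width.
object RowNormalizer {
  def normalize(row: Seq[String], width: Int): Seq[String] =
    row.take(width).padTo(width, null)
}
```

For example, `RowNormalizer.normalize(Seq("a", "b"), 4)` yields `List(a, b, null, null)`, and a five-cell row normalized to width 4 is truncated to its first four cells.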

On AWS EMR (6.14.0, Spark 3.4.1) we see inconsistent/random behavior. On some clusters it works, on others it fails with:

```
24/06/14 09:36:15 WARN TaskSetManager: Lost task 6.0 in stage 0.0 (TID 6) (ip-10-107-10-248.eu-central-1.compute.internal executor 2): org.apache.spark.SparkException: Encountered error while reading file s3://somebucket/somefile Details:
	at org.apache.spark.sql.errors.QueryExecutionErrors$.cannotReadFilesError(QueryExecutionErrors.scala:878)
	at org.apache.spark.sql.execution.datasources.v2.FilePartitionReader.next(FilePartitionReader.scala:80)
	[...]
Caused by: java.lang.ClassCastException: scala.Some cannot be cast to [Ljava.lang.Object;
	at org.apache.spark.sql.catalyst.util.FailureSafeParser.parse(FailureSafeParser.scala:74)
	at com.crealytics.spark.excel.v2.ExcelParser$.$anonfun$parseIterator$2(ExcelParser.scala:432)
	at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
	at org.apache.spark.sql.execution.datasources.v2.PartitionReaderFromIterator.next(PartitionReaderFromIterator.scala:26)
	at org.apache.spark.sql.execution.datasources.v2.PartitionReaderWithPartitionValues.next(PartitionReaderWithPartitionValues.scala:48)
	at org.apache.spark.sql.execution.datasources.v2.PartitionedFileReader.next(FilePartitionReaderFactory.scala:58)
	at org.apache.spark.sql.execution.datasources.v2.FilePartitionReader.next(FilePartitionReader.scala:65)
	... 49 more
```

Surprisingly, it does not matter whether `mode` is FAILFAST or PERMISSIVE.

Cluster config is always the same; the clusters differ only in hardware specs and number of workers.

On my Windows machine I do not have a problem when using PERMISSIVE, and with FAILFAST I get a good error message:

```
Caused by: org.apache.spark.SparkException: [MALFORMED_RECORD_IN_PARSING] Malformed records are detected in record parsing: [Merkmale,null,null,null]. Parse Mode: FAILFAST. To process malformed records as null result, try setting the option 'mode' as 'PERMISSIVE'.
```

Note: I am not using the official spark-excel package, because the fat-jar package does not work on EMR for various reasons. The source code is the same; only the bundled libraries and their versions differ.

Expected Behavior

I would expect AWS EMR to give me results comparable to a local installation. I have no real idea why the behavior differs; maybe the code paths are exercised differently when work is distributed across multiple machines.

I root-caused the whole thing down to `v2.ExcelParser`. When I handle the bad-record case myself (instead of throwing the NonFatal exception) and assume we are running in PERMISSIVE mode, it works on EMR. So basically something like this:

(screenshot of the modified `ExcelParser` code was attached here)
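Since the screenshot may not render, here is a rough pure-Scala sketch of the idea (hypothetical names, not the actual `ExcelParser` code): instead of letting the NonFatal exception escape to Spark's `FailureSafeParser`, catch it inside the parser and emit a null-padded row when the mode is PERMISSIVE:

```scala
import scala.util.control.NonFatal

// Hypothetical stand-in for the bad-record handling in v2.ExcelParser:
// rather than rethrowing the parse error, PERMISSIVE mode swallows it and
// returns the row truncated/padded to the schema width with nulls.
object PermissiveParse {
  def parseRow(cells: Seq[String], width: Int, permissive: Boolean): Seq[String] =
    try {
      require(cells.length == width, s"malformed record: $cells")
      cells
    } catch {
      case NonFatal(_) if permissive =>
        cells.take(width).padTo(width, null) // null-fill instead of failing
    }
}
```

With `permissive = false` the exception still propagates, so FAILFAST semantics are preserved.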

I would like to port something along these lines back to spark-excel so that my source code does not diverge and I do not have to worry about it (I still have to do my own packaging, but that is not a big deal). @nightscape Would you generally support such a change? If so, I would start working on a PR.

Any other/further thoughts on this issue?

Thanks

Christian

Steps To Reproduce

No response

Environment

- Spark version:
- Spark-Excel version:
- OS:
- Cluster environment:

Anything else?

No response

christianknoepfle commented 2 weeks ago

Ah, I found #808, but IMO AWS EMR uses plain Spark. Also, I was not on the latest code :( I will check again and let you know the results.

christianknoepfle commented 2 weeks ago

It was my fault. Sorry for bothering you.

nightscape commented 2 weeks ago

No worries @christianknoepfle 😃