elastacloud / spark-excel

A Spark data source for reading Microsoft Excel files
https://www.elastacloud.com
Apache License 2.0
13 stars 5 forks

Unable to load files larger than 14 MB #18

Closed asemaan closed 2 years ago

asemaan commented 2 years ago

I'm using Synapse Analytics PySpark (Python). The spreadsheet is 33 MB and has two sheets with 58 columns each: sheet 1 has 5K rows and sheet 2 has 100K rows. I have another Excel spreadsheet that is 88 MB. I'm not sure why the limitation exists, but any help or direction toward solving this is welcome.

I'm getting the following error, and after much searching it seems there is a 14 MB limit on the spreadsheet size. To confirm, I was able to load a smaller file. I was unable to find a way to leverage IOUtils.setByteArrayMaxOverride().

Py4JJavaError: An error occurred while calling o700.load. : elastashade.poi.util.RecordFormatException: Tried to allocate an array of length 205,369,965, but the maximum length for this record type is 100,000,000. If the file is not corrupt, please open an issue on bugzilla to request increasing the maximum allowable size for this record type. As a temporary workaround, consider setting a higher override value with IOUtils.setByteArrayMaxOverride()

dazfuller commented 2 years ago

There's no specific limit in the library; that looks like an issue in the POI library used for accessing the Excel file itself. And it looks like the fix is something that needs adding in the code directly, at least going by this SO question.

asemaan commented 2 years ago

The message mentions "As a temporary workaround, consider setting a higher override value with IOUtils.setByteArrayMaxOverride()"

I have not been able to figure out how to do this in PySpark in Synapse Analytics. Any guidance is appreciated.

dazfuller commented 2 years ago

So it looks like that won't work either, because of a known bug in Apache POI (which means I need to get a move on with the upgrade to 5.2.1). But I have seen in the bug report that running this should work:

elastashade.poi.openxml4j.util.ZipInputStreamZipEntrySource.setThresholdBytesForTempFiles(100000000)

The jar shades the POI libraries, which is why it's got "elastashade" at the front.

You'll need to run this in a Scala cell; otherwise you'll have to go through the Py4J interop.
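For anyone staying in PySpark, the Py4J route might look like the sketch below. Note this is an assumption-laden example, not a confirmed API: `spark._jvm` is an internal PySpark handle to the JVM gateway, the helper name `raise_poi_threshold` is made up for illustration, and the class path is the shaded POI class quoted above.

```python
def raise_poi_threshold(jvm, limit_bytes=100_000_000):
    """Call the shaded POI threshold setter through a Py4J JVM gateway.

    `jvm` would typically be `spark._jvm` from an active SparkSession with
    the spark-excel jar on the cluster classpath. Entries larger than
    `limit_bytes` are spilled to temp files instead of held in memory.
    """
    # Walk the Java package path to the shaded POI class and invoke the
    # static setter named in the error message / bug report.
    jvm.elastashade.poi.openxml4j.util.ZipInputStreamZipEntrySource \
        .setThresholdBytesForTempFiles(limit_bytes)

# In a Synapse notebook cell (hypothetical usage):
# raise_poi_threshold(spark._jvm)
```

This mirrors the one-line Scala call, just reached through the Py4J gateway instead of a Scala cell.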

dazfuller commented 2 years ago

This should be resolved in version 0.1.9 @asemaan

asemaan commented 2 years ago

@dazfuller, this is awesome. I was able to load the jar to my cluster and process an 88 MB file and a 39 MB file that had previously failed.

Big THANKS!

dazfuller commented 2 years ago

Glad it worked 😄