elastacloud / spark-excel

A Spark data source for reading Microsoft Excel files
https://www.elastacloud.com
Apache License 2.0
13 stars 5 forks source link

Java Heap Out of Memory Error #11

Open soleyjh opened 2 years ago

soleyjh commented 2 years ago

Hi! And thanks so much for writing this great package!

When I run the following command:

df = spark.read.format("com.elastacloud.spark.excel").option("cellAddress", "A1").load(file_location)

I get the following error:

java.lang.OutOfMemoryError: Java heap space

The excel file is 218MB roughly 750K rows with ~50 or so fields (no long text strings).

and I'm running in Azure Databricks: 9.1 LTS (includes Apache Spark 3.1.2, Scala 2.12) Running: Standard_F4 with 8GB Memory and 4 Cores

I downloaded the following JAR and Installed on the Cluster (Above): spark_excel_3_1_2_0_1_7.jar

Any Help or Advice would be appreciated.

dazfuller commented 2 years ago

Hi, and thanks for the nice feedback 🙂

I'm guessing that the file you're opening is an .xlsx file which means that 218MB is it's compressed size. The library has to decompress the entire file so that it can read it (this is so the formula evaluation works), which can lead to OOM errors. I'm looking at having a simplified version which lets the library stream the data out without the formula evaluation.

One of the things you could try is to use a memory optimized sku, or a standard sku with more memory available. There's also the option of setting the JVM options to allow for a bigger heap size, -Xmx4096m for instance.

dazfuller commented 2 years ago

Hi @soleyjh did this help out at all?

josecsotomorales commented 2 months ago

I also faced this issue, trying to think of a workaround, can we disable the formula evaluation as part of the configuration and just read static data from Excel?

dazfuller commented 2 months ago

It's definitely doable, I'll need to handle the cell evaluation slightly differently based on the config, but let me have a look and see what I can do with it

dazfuller commented 2 months ago

Okay, the code is in there to handle this now in the 0.1.13 branch, so you can pass in an option like this

.option("evaluateFormulae", "false")

What this will do instead is extract the formula itself (e.g. "A7*2"), but it won't attempt to evaluate it. If you let me know which spark version you're targeting I can create a jar for you from that branch to test with. Or check out the branch and build it yourself if fancy :)

josecsotomorales commented 2 months ago

I'm going to test it on my local by checking out the branch

josecsotomorales commented 2 months ago

Alright, I have some insights, it helped a LOT but I still have a Java Heap OOM. The file is XLSX approx 300MB. I processed it with https://github.com/crealytics/spark-excel using the streaming reader config and it works:

.option("maxRowsInMemory", 20) // Optional, default None. If set, uses a streaming reader which can help with big files (will fail if used with xls format files)

I don't know how we can add this streaming as an option to this project.

dazfuller commented 2 months ago

It is something I've looked at before, but you lose a lot of the features using the streaming reader. My current thought is to have a different parser which uses the stream reader and a heavy check on the options provided.

Is processing locally your usage environment?

josecsotomorales commented 2 months ago

That makes sense. I'm currently performing my testing by running Spark on K8S driver 8CPU 64GB RAM + executors same size.

dazfuller commented 2 months ago

What's the JVM heap size set to?

josecsotomorales commented 2 months ago

It's set up to 4GB

dazfuller commented 2 months ago

Any success if you increase that size?

dazfuller commented 2 months ago

I'm going to get the 0.1.13 release out and then start looking at the 0.1.14 with a view to creating a companion parser which uses the SXSSF (streaming) reader. It means that it won't have the ability to do formula evaluation, and merged cells over a certain size, but it should reduce the memory required by the POI library

dazfuller commented 2 months ago

For info, there's a branch called dazfuller/streaming-reading which has a working version of the streaming reader. Needs quite a lot more testing yet, but the core if it is in place

josecsotomorales commented 2 months ago

Excellent work @dazfuller!! Will review and perform some testing on that branch