crealytics / spark-excel

A Spark plugin for reading and writing Excel files
Apache License 2.0

[Read an Excel File]: GC overhead limit exceeded #322

Open cedtnee opened 3 years ago

cedtnee commented 3 years ago

Expected Behavior

I am trying to read an Excel file of about 35 MB and write the result as ORC files. The input file has one sheet, which contains only values, no formulas or macros.

Current Behavior

In my readerExcel.scala, I do this:

    val df = spark.read
      .format("com.crealytics.spark.excel")
      .option("header", "true") // Required
      .option("treatEmptyValuesAsNulls", "false") // Optional, default: true
      .option("inferSchema", "false") // Optional, default: false
      .option("sheetName", "Feuil1")
      .schema(StructType(
        fields.map(fieldName => StructField(fieldName, StringType, nullable = true)))) // Optional, default: either the inferred schema, or all columns as Strings
      .load("file.xlsx")

I get this error:

    Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
        at java.util.regex.Pattern.matcher(Pattern.java:1093)
        at shadeio.poi.ss.util.CellReference.separateRefParts(CellReference.java:396)
        at shadeio.poi.ss.util.CellReference.<init>(CellReference.java:113)
        at shadeio.poi.xssf.usermodel.XSSFCell.<init>(XSSFCell.java:118)
        at shadeio.poi.xssf.usermodel.XSSFRow.<init>(XSSFRow.java:75)
        at shadeio.poi.xssf.usermodel.XSSFSheet.initRows(XSSFSheet.java:268)
        at shadeio.poi.xssf.usermodel.XSSFSheet.read(XSSFSheet.java:231)
        at shadeio.poi.xssf.usermodel.XSSFSheet.onDocumentRead(XSSFSheet.java:218)
        at shadeio.poi.xssf.usermodel.XSSFWorkbook.parseSheet(XSSFWorkbook.java:448)
        at shadeio.poi.xssf.usermodel.XSSFWorkbook.onDocumentRead(XSSFWorkbook.java:413)
        at shadeio.poi.ooxml.POIXMLDocument.load(POIXMLDocument.java:184)
        at shadeio.poi.xssf.usermodel.XSSFWorkbook.<init>(XSSFWorkbook.java:282)
        at shadeio.poi.xssf.usermodel.XSSFWorkbookFactory.createWorkbook(XSSFWorkbookFactory.java:88)
        at shadeio.poi.xssf.usermodel.XSSFWorkbookFactory.createWorkbook(XSSFWorkbookFactory.java:135)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at shadeio.poi.ss.usermodel.WorkbookFactory.createWorkbook(WorkbookFactory.java:339)
        at shadeio.poi.ss.usermodel.WorkbookFactory.createXSSFWorkbook(WorkbookFactory.java:314)
        at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:232)
        at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:198)
        at com.crealytics.spark.excel.DefaultWorkbookReader$$anonfun$openWorkbook$1.apply(WorkbookReader.scala:50)
        at com.crealytics.spark.excel.DefaultWorkbookReader$$anonfun$openWorkbook$1.apply(WorkbookReader.scala:50)
        at scala.Option.fold(Option.scala:158)
        at com.crealytics.spark.excel.DefaultWorkbookReader.openWorkbook(WorkbookReader.scala:50)
        at com.crealytics.spark.excel.WorkbookReader$class.withWorkbook(WorkbookReader.scala:14)
        at com.crealytics.spark.excel.DefaultWorkbookReader.withWorkbook(WorkbookReader.scala:46)
        at com.crealytics.spark.excel.ExcelRelation.excerpt$lzycompute(ExcelRelation.scala:30)
        at com.crealytics.spark.excel.ExcelRelation.excerpt(ExcelRelation.scala:30)
        at com.crealytics.spark.excel.ExcelRelation.headerColumns$lzycompute(ExcelRelation.scala:104)
        at com.crealytics.spark.excel.ExcelRelation.headerColumns(ExcelRelation.scala:103)

    2020-11-26 21:36:57,729 [Driver] INFO org.apache.spark.deploy.yarn.ApplicationMaster - Final app status: FAILED, exitCode: 15, (reason: User class threw exception: java.io.IOException: GC overhead limit exceeded
        at shadeio.poi.ss.usermodel.WorkbookFactory.createWorkbook(WorkbookFactory.java:351)
        at shadeio.poi.ss.usermodel.WorkbookFactory.createXSSFWorkbook(WorkbookFactory.java:314)
        at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:232)
        at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:198)
        at com.crealytics.spark.excel.DefaultWorkbookReader$$anonfun$openWorkbook$1.apply(WorkbookReader.scala:50)
        at com.crealytics.spark.excel.DefaultWorkbookReader$$anonfun$openWorkbook$1.apply(WorkbookReader.scala:50)
        at scala.Option.fold(Option.scala:158)
        at com.crealytics.spark.excel.DefaultWorkbookReader.openWorkbook(WorkbookReader.scala:50)
        at com.crealytics.spark.excel.WorkbookReader$class.withWorkbook(WorkbookReader.scala:14)
        at com.crealytics.spark.excel.DefaultWorkbookReader.withWorkbook(WorkbookReader.scala:46)
        at com.crealytics.spark.excel.ExcelRelation.excerpt$lzycompute(ExcelRelation.scala:30)
        at com.crealytics.spark.excel.ExcelRelation.excerpt(ExcelRelation.scala:30)
        at com.crealytics.spark.excel.ExcelRelation.headerColumns$lzycompute(ExcelRelation.scala:104)
        at com.crealytics.spark.excel.ExcelRelation.headerColumns(ExcelRelation.scala:103)
        at com.crealytics.spark.excel.ExcelRelation.headerColumnForName$lzycompute(ExcelRelation.scala:32)
        at com.crealytics.spark.excel.ExcelRelation.headerColumnForName(ExcelRelation.scala:32)
        at com.crealytics.spark.excel.ExcelRelation.com$crealytics$spark$excel$ExcelRelation$$columnExtractor(ExcelRelation.scala:51)
        at com.crealytics.spark.excel.ExcelRelation$$anonfun$4.apply(ExcelRelation.scala:61)
        at com.crealytics.spark.excel.ExcelRelation$$anonfun$4.apply(ExcelRelation.scala:61)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
        at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
        at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
        at com.crealytics.spark.excel.ExcelRelation.buildScan(ExcelRelation.scala:61)
        at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$11.apply(DataSourceStrategy.scala:300)
        at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$11.apply(DataSourceStrategy.scala:300)
        at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$pruneFilterProject$1.apply(DataSourceStrategy.scala:338)
        at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$pruneFilterProject$1.apply(DataSourceStrategy.scala:337)
        at org.apache.spark.sql.execution.datasources.DataSourceStrategy.pruneFilterProjectRaw(DataSourceStrategy.scala:415)
        at org.apache.spark.sql.execution.datasources.DataSourceStrategy.pruneFilterProject(DataSourceStrategy.scala:333)
        at org.apache.spark.sql.execution.datasources.DataSourceStrategy.apply(DataSourceStrategy.scala:296)
        at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:63)
        at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:63)
        at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
        at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
        at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
        at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
        at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:78)
        at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:75)
        at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
        at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
        at scala.collection.Iterator$class.foreach(Iterator.scala:893)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
        at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
        at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336)
        at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:75)
        at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:67)
        at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
        at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
        at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
        at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:72)
        at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:68)
        at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:77)
        at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:77)
        at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3248)
        at org.apache.spark.sql.Dataset.head(Dataset.scala:2484)

Troubleshooting

I have increased the memory size, but I still get the same issue. I also set the Spark options related to the heap, but the issue persists.

Your Environment

  • Spark version: 2.3
  • Language: Scala
  • Spark-Excel version: com.crealytics:spark-excel_2.11:0.13.1
nightscape commented 3 years ago

Can you try something along the lines of .option("maxRowsInMemory", 20) (see the README)?
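For context, the README documents maxRowsInMemory as enabling a streaming reader that keeps only a window of rows on the heap instead of loading the whole workbook. A minimal sketch against the reader from the original report (sheet name and file path are the reporter's):

    val df = spark.read
      .format("com.crealytics.spark.excel")
      .option("header", "true")
      .option("sheetName", "Feuil1")
      .option("maxRowsInMemory", 20) // stream the sheet rather than materializing the whole workbook
      .load("file.xlsx")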

cedtnee commented 3 years ago

When I used .option("maxRowsInMemory", 20), the resulting DataFrame was empty. When I increased it to .option("maxRowsInMemory", 30000), I got this error:

    An exception or error caused a run to abort: GC overhead limit exceeded
    java.lang.OutOfMemoryError: GC overhead limit exceeded
        at com.sun.xml.internal.stream.events.StartElementEvent.init(StartElementEvent.java:76)
        at com.sun.xml.internal.stream.events.StartElementEvent.<init>(StartElementEvent.java:64)
        at com.sun.xml.internal.stream.events.XMLEventAllocatorImpl.getXMLEvent(XMLEventAllocatorImpl.java:76)
        at com.sun.xml.internal.stream.events.XMLEventAllocatorImpl.allocate(XMLEventAllocatorImpl.java:53)
        at com.sun.xml.internal.stream.XMLEventReaderImpl.nextEvent(XMLEventReaderImpl.java:84)
        at shadeio.monitorjbl.xlsx.impl.StreamingSheetReader.getRow(StreamingSheetReader.java:71)
        at shadeio.monitorjbl.xlsx.impl.StreamingSheetReader.access$200(StreamingSheetReader.java:32)
        at shadeio.monitorjbl.xlsx.impl.StreamingSheetReader$StreamingRowIterator.hasNext(StreamingSheetReader.java:402)
        at shadeio.monitorjbl.xlsx.impl.StreamingSheetReader$StreamingRowIterator.<init>(StreamingSheetReader.java:396)
        at shadeio.monitorjbl.xlsx.impl.StreamingSheetReader.iterator(StreamingSheetReader.java:382)
        at shadeio.monitorjbl.xlsx.impl.StreamingSheet.iterator(StreamingSheet.java:49)
        at com.crealytics.spark.excel.AreaDataLocator$class.readFromSheet(DataLocator.scala:91)
        at com.crealytics.spark.excel.CellRangeAddressDataLocator.readFromSheet(DataLocator.scala:134)
        at com.crealytics.spark.excel.CellRangeAddressDataLocator.readFrom(DataLocator.scala:144)
        at com.crealytics.spark.excel.ExcelRelation$$anonfun$buildScan$2.apply(ExcelRelation.scala:63)
        at com.crealytics.spark.excel.ExcelRelation$$anonfun$buildScan$2.apply(ExcelRelation.scala:62)
        at com.crealytics.spark.excel.WorkbookReader$class.withWorkbook(WorkbookReader.scala:15)
        at com.crealytics.spark.excel.StreamingWorkbookReader.withWorkbook(WorkbookReader.scala:55)
        at com.crealytics.spark.excel.ExcelRelation.buildScan(ExcelRelation.scala:62)
        at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$11.apply(DataSourceStrategy.scala:300)
        at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$11.apply(DataSourceStrategy.scala:300)
        at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$pruneFilterProject$1.apply(DataSourceStrategy.scala:338)
        at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$pruneFilterProject$1.apply(DataSourceStrategy.scala:337)
        at org.apache.spark.sql.execution.datasources.DataSourceStrategy.pruneFilterProjectRaw(DataSourceStrategy.scala:415)
        at org.apache.spark.sql.execution.datasources.DataSourceStrategy.pruneFilterProject(DataSourceStrategy.scala:333)
        at org.apache.spark.sql.execution.datasources.DataSourceStrategy.apply(DataSourceStrategy.scala:296)
        at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:63)
        at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:63)
        at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
        at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
        at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
        at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)

nightscape commented 3 years ago

I'm not surprised by the OutOfMemoryError, but getting an empty result is weird... Could you try some more values in between 20 and 30000?
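One way to probe that boundary is a quick sweep (a hypothetical snippet, reusing the reporter's sheet and file; df.count() forces a full read at each setting):

    for (maxRows <- Seq(50, 200, 1000, 5000, 10000)) {
      val df = spark.read
        .format("com.crealytics.spark.excel")
        .option("header", "true")
        .option("sheetName", "Feuil1")
        .option("maxRowsInMemory", maxRows)
        .load("file.xlsx")
      // Print the row count observed at each window size
      println(s"maxRowsInMemory=$maxRows -> rows=${df.count()}")
    }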

AlexZhang267 commented 3 years ago

I used .option("maxRowsInMemory", 200), but I still often get an OOM error. My Excel file is only 16 MB. This is the error message:
    diagnostics: User class threw exception: java.lang.OutOfMemoryError: Java heap space
        at org.apache.xmlbeans.impl.store.Cur.createElementXobj(Cur.java:260)
        at org.apache.xmlbeans.impl.store.Cur$CurLoadContext.startElement(Cur.java:2997)
        at org.apache.xmlbeans.impl.store.Locale$SaxHandler.startElement(Locale.java:3164)
        at org.apache.xerces.parsers.AbstractSAXParser.startElement(Unknown Source)
        at org.apache.xerces.impl.XMLNSDocumentScannerImpl.scanStartElement(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
        at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
        at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
        at org.apache.xmlbeans.impl.store.Locale$SaxLoader.load(Locale.java:3422)
        at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1272)
        at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1259)
        at org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase.parse(SchemaTypeLoaderBase.java:345)
        at org.openxmlformats.schemas.spreadsheetml.x2006.main.SstDocument$Factory.parse(Unknown Source)
        at shadeio.poi.xssf.model.SharedStringsTable.readFrom(SharedStringsTable.java:123)
        at shadeio.poi.xssf.model.SharedStringsTable.<init>(SharedStringsTable.java:111)
        at shadeio.poi.xssf.eventusermodel.XSSFReader.getSharedStringsTable(XSSFReader.java:115)
        at shadeio.pjfanning.xlsx.impl.StreamingWorkbookReader.init(StreamingWorkbookReader.java:132)
        at shadeio.pjfanning.xlsx.impl.StreamingWorkbookReader.init(StreamingWorkbookReader.java:90)
        at shadeio.pjfanning.xlsx.StreamingReader$Builder.open(StreamingReader.java:307)
        at com.crealytics.spark.excel.StreamingWorkbookReader.openWorkbook(WorkbookReader.scala:63)
        at com.crealytics.spark.excel.WorkbookReader$class.withWorkbook(WorkbookReader.scala:14)
        at com.crealytics.spark.excel.StreamingWorkbookReader.withWorkbook(WorkbookReader.scala:54)
        at com.crealytics.spark.excel.ExcelRelation.excerpt$lzycompute(ExcelRelation.scala:31)
        at com.crealytics.spark.excel.ExcelRelation.excerpt(ExcelRelation.scala:31)
        at com.crealytics.spark.excel.ExcelRelation.headerColumns$lzycompute(ExcelRelation.scala:102)
        at com.crealytics.spark.excel.ExcelRelation.headerColumns(ExcelRelation.scala:101)
        at com.crealytics.spark.excel.ExcelRelation$$anonfun$inferSchema$1.apply(ExcelRelation.scala:163)
        at com.crealytics.spark.excel.ExcelRelation$$anonfun$inferSchema$1.apply(ExcelRelation.scala:162)

nightscape commented 3 years ago

@AlexZhang267 I unfortunately can't invest much time into spark-excel at the moment. If possible, I would recommend increasing Spark's memory to work around this.
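For anyone hitting this, a minimal sketch of the usual memory knobs (values are illustrative, not recommendations; the app name is hypothetical):

    // Driver-side memory and GC flags must be in place before the driver
    // JVM starts, so pass them at submit time, e.g.:
    //   spark-submit --driver-memory 8g \
    //     --conf "spark.driver.extraJavaOptions=-XX:+UseG1GC" app.jar
    import org.apache.spark.sql.SparkSession

    // Executor memory can still be set when building the session:
    val spark = SparkSession.builder()
      .appName("excel-to-orc")
      .config("spark.executor.memory", "8g") // illustrative value
      .getOrCreate()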

kennydataml commented 2 years ago

I'm getting the same problem here. I tried using "maxRowsInMemory" with values like 1000, 10000, and 100000. It works fine for the first 3 Excel files, but I have 8 to process and the driver node always dies on the 4th file. Each file is roughly 600 MB. I tried spark.driver.extraJavaOptions -XX:+UseG1GC and increased driver memory to 56 GB, but the driver node still crashes. My understanding is that the driver will crash if full garbage collection happens too often. I also noticed the cluster doesn't scale up beyond 1 worker node even though I have autoscaling set to 4 worker nodes. So clearly some kind of collect() is happening on the driver node with this library, and it doesn't seem to use Spark's parallelism very well?

quanghgx commented 2 years ago

Hi @kennydataml, I am not sure if this is related. How do you read these 8 files? It would help if we had simplified (without sensitive data) and reproducible code. I am preparing a PR for native support of loading multiple files. Sincerely,

kennydataml commented 2 years ago

I unfortunately can't get a scrubbed Excel file for you since it's on a client laptop. The code is simple: I only read in one Excel file at a time with a for loop. So basically:

for xlpath in excels:
    csvpath = ...  # derived from xlpath via split/join (details elided)
    try:  # exception handling since we don't know the number of sheets
        for i in range(15):  # dynamic number of sheets
            df = (spark.read
                  .format("com.crealytics.spark.excel")
                  .option("dataAddress", f"{i}!A1")  # substitute sheet index here
                  .option("header", "true")
                  .option("maxRowsInMemory", 100000)
                  .load(xlpath))
            # write the Excel sheet out as CSV
            (df.repartition(200)  # attempting to circumvent memory issues
               .write
               .format("csv")
               .mode("append")
               .option("header", "true")
               .save(csvpath))
    except Exception as err:
        print(repr(err))

I've narrowed the problem down to 1 of the 8 Excel files, and I can consistently reproduce it on that particular file. It opens just fine in Microsoft Excel, so I'm puzzled why only this one file gives me an issue. The behaviour I'm observing: I can read the first 6 sheets, but it hangs on the 7th sheet when it tries to append-write to the CSV path. No Spark job is initiated for the write as expected; it just hangs for an hour straight. I've tried running the code without looping through all the Excel files, targeting only that particular file, and it consistently hangs on the 7th sheet. It's really weird. I'm going to do some more testing tomorrow with a fresh copy of the data.

NestorAGC123 commented 2 years ago

I have the same issue and my Excel file is only 2 MB. It also only happens on some specific files.

quanghgx commented 2 years ago

Hi @NestorAGC123, is it possible to share your Excel file after removing sensitive data?

pjfanning commented 2 years ago

The problem is that spark-excel reads the file as an input stream, and that uses far more memory than reading it as a java.io.File. I have logged https://bz.apache.org/bugzilla/show_bug.cgi?id=65581 for a possible solution.
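To illustrate the distinction with plain Apache POI (which spark-excel shades under shadeio.poi; both WorkbookFactory.create overloads are standard POI API, and the file name is assumed):

    import java.io.{File, FileInputStream}
    import org.apache.poi.ss.usermodel.WorkbookFactory

    // From an InputStream, POI must buffer the entire OOXML package on
    // the heap before it can parse it:
    val wbFromStream = WorkbookFactory.create(new FileInputStream("file.xlsx"))

    // From a java.io.File, POI can open the underlying zip package and
    // read its parts lazily, which needs far less memory:
    val wbFromFile = WorkbookFactory.create(new File("file.xlsx"))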

yanghong commented 2 years ago

Is this problem still open? I have the same problem.

pjfanning commented 2 years ago

@yanghong maybe you could do a PR based on the changes in https://bz.apache.org/bugzilla/show_bug.cgi?id=65581?

yanghong commented 2 years ago

> @yanghong maybe you could do a PR based on the changes in https://bz.apache.org/bugzilla/show_bug.cgi?id=65581?

Fine, it's POI's limitation.