elastacloud / spark-excel

A Spark data source for reading Microsoft Excel files
https://www.elastacloud.com
Apache License 2.0

ClassCastException while reading XLSX files #25

Open prassee opened 1 year ago

prassee commented 1 year ago

While reading XLSX files, the job throws a ClassCastException. The artifact, sample code, and error log are given below. (Note: since the artifact is not available on Maven Central, I loaded the jar from the lib directory.)

Spark version: 3.2.0
Scala version: 2.12.15

  val s3Path = s"s3a://.../*.xlsx"

  val xlsStmts = spark.read
    .format("com.elastacloud.spark.excel")
    .option("cellAddress", "A1")                    // The table starts at cell A1
    .option("sheetNamePattern", """Xns""")          // Read data from all sheets whose names match this pattern
    .option("maxRowCount", 100)                     // Read only the first 100 records to determine the schema
    .option("thresholdBytesForTempFiles", 50000000) // Byte threshold for temp files (50 MB)
    .load(s3Path)

Error Log

java.lang.ClassCastException: class org.apache.xmlbeans.impl.values.XmlComplexContentImpl cannot be cast to class elastashade.poi.schemas.vmldrawing.XmlDocument (org.apache.xmlbeans.impl.values.XmlComplexContentImpl and elastashade.poi.schemas.vmldrawing.XmlDocument are in unnamed module of loader java.net.URLClassLoader @10bf3464)
    at elastashade.poi.xssf.usermodel.XSSFVMLDrawing.read(XSSFVMLDrawing.java:147)
    at elastashade.poi.xssf.usermodel.XSSFVMLDrawing.<init>(XSSFVMLDrawing.java:123)
    at elastashade.poi.ooxml.POIXMLFactory.createDocumentPart(POIXMLFactory.java:61)
    at elastashade.poi.ooxml.POIXMLDocumentPart.read(POIXMLDocumentPart.java:661)
    at elastashade.poi.ooxml.POIXMLDocumentPart.read(POIXMLDocumentPart.java:678)
    at elastashade.poi.ooxml.POIXMLDocument.load(POIXMLDocument.java:165)
    at elastashade.poi.xssf.usermodel.XSSFWorkbook.<init>(XSSFWorkbook.java:259)
    at elastashade.poi.xssf.usermodel.XSSFWorkbookFactory.createWorkbook(

dazfuller commented 1 year ago

Do you have a sample of the source file that you can share?

prassee commented 1 year ago

It's mentioned in the description of the ticket.

dazfuller commented 1 year ago

Sorry, I meant the xlsx file specifically. The library happily parses every xlsx file I've given it, so it would be useful to see the kind of data that leads to this error.
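
Because the trace fails inside XSSFVMLDrawing.read, one way to narrow this down without sharing the workbook itself might be to list the VML drawing parts inside the .xlsx package. An .xlsx file is a zip archive, and legacy VML drawings (typically created by cell comments or form controls) are stored as .vml entries such as xl/drawings/vmlDrawing1.vml. The following is a minimal sketch using only the JDK zip API; the object name `VmlPartCheck`, its helper methods, and the idea that a local copy of a failing file is passed as the first argument are assumptions for illustration, not part of spark-excel.

```scala
import java.util.zip.ZipFile
import scala.collection.mutable.ListBuffer

// Hypothetical diagnostic helper; not part of spark-excel or Apache POI.
object VmlPartCheck {

  /** Names of all parts (zip entries) inside an OOXML package such as an .xlsx file. */
  def partNames(path: String): List[String] = {
    val zip = new ZipFile(path)
    try {
      val names   = ListBuffer.empty[String]
      val entries = zip.entries()
      while (entries.hasMoreElements) names += entries.nextElement().getName
      names.toList
    } finally zip.close()
  }

  /** Keeps only the parts whose names end in .vml (legacy VML drawings). */
  def vmlParts(names: Seq[String]): Seq[String] =
    names.filter(_.toLowerCase.endsWith(".vml"))

  def main(args: Array[String]): Unit = {
    // args(0): assumed to be the path of a local copy of one failing workbook
    vmlParts(partNames(args(0))).foreach(println)
  }
}
```

If this prints any entries for a failing file and none for a file that loads correctly, that would point to comments or form controls as the trigger, and a small synthetic workbook with the same feature could be shared in place of the real data.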

prassee commented 1 year ago

Sorry again, but the XLSX files contain financial transactions, so I cannot share them here. However, each workbook has the following hierarchy of worksheets:

root
|_ Xns
|_ Xns inbound
|_ Xns outbound

Each worksheet mentioned above has the same columns:

Sl. No. | Date | Cheque No. | Description | Amount | Category | Balance

On the other hand I have the following points