crealytics / spark-excel

A Spark plugin for reading and writing Excel files
Apache License 2.0

[BUG] No thrown exception if schema is provided, but there is no workbook/sheet (PDF with XLSX extension) #844

Open dabruehl opened 3 months ago

dabruehl commented 3 months ago

Current Behavior

If you store a PDF test.pdf with the extension .xlsx (test.pdf -> test.xlsx) and read the file with a schema provided, you get an empty DataFrame, even though there is no workbook and no data to read.

Expected Behavior

If there is no workbook/sheet, the reader should throw an exception or error, exactly as it does when no schema is provided:

options = {
    "header": "true",
    "encoding": "UTF-8",
    "dataAddress": "workbook:A1",
    "mode": "FAILFAST",
}
df = spark.read.format("excel").options(**options).load("PATH_TO_YOUR_FILE/FILENAME.xlsx")
df.show()

Py4JJavaError: An error occurred while calling o575.load.
: java.io.IOException: Can't open workbook - unsupported file type: PDF
	at org.apache.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:228)
	at org.apache.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:185)
	at com.crealytics.spark.excel.v2.ExcelHelper.getWorkbook(ExcelHelper.scala:120)
	at com.crealytics.spark.excel.v2.ExcelHelper.getSheetData(ExcelHelper.scala:137)
	at com.crealytics.spark.excel.v2.ExcelHelper.parseSheetData(ExcelHelper.scala:160)
	at com.crealytics.spark.excel.v2.ExcelTable.infer(ExcelTable.scala:77)
	at com.crealytics.spark.excel.v2.ExcelTable.inferSchema(ExcelTable.scala:48)
	at org.apache.spark.sql.execution.datasources.v2.FileTable.$anonfun$dataSchema$4(FileTable.scala:71)
	at scala.Option.orElse(Option.scala:447)
	at org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema$lzycompute(FileTable.scala:71)
	at org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema(FileTable.scala:65)
	at org.apache.spark.sql.execution.datasources.v2.FileTable.schema$lzycompute(FileTable.scala:83)
	at org.apache.spark.sql.execution.datasources.v2.FileTable.schema(FileTable.scala:81)
	at org.apache.spark.sql.connector.catalog.Table.columns(Table.java:68)
	at org.apache.spark.sql.execution.datasources.v2.FileTable.columns(FileTable.scala:37)
	at org.apache.spark.sql.execution.datasources.v2.FileDataSourceV2.inferSchema(FileDataSourceV2.scala:95)
	at org.apache.spark.sql.execution.datasources.v2.FileDataSourceV2.inferSchema$(FileDataSourceV2.scala:92)
	at com.crealytics.spark.excel.v2.ExcelDataSource.inferSchema(ExcelDataSource.scala:27)
	at org.apache.spark.sql.execution.datasources.v2.DataSourceV2Utils$.getTableFromProvider(DataSourceV2Utils.scala:99)
	at org.apache.spark.sql.execution.datasources.v2.DataSourceV2Utils$.loadV2Source(DataSourceV2Utils.scala:237)
	at org.apache.spark.sql.DataFrameReader.$anonfun$load$1(DataFrameReader.scala:344)
	at scala.Option.flatMap(Option.scala:271)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:342)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:245)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:397)
	at py4j.Gateway.invoke(Gateway.java:306)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:199)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:119)
	at java.lang.Thread.run(Thread.java:750)
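As a workaround until the reader validates the file type under a user-supplied schema as well, the caller can pre-check the file signature before handing the path to Spark. This is a minimal sketch (the helper name `looks_like_xlsx` is hypothetical, not part of spark-excel): a genuine .xlsx file is a ZIP archive and starts with the `PK\x03\x04` magic bytes, while a PDF starts with `%PDF`.

```python
def looks_like_xlsx(path: str) -> bool:
    """Cheap signature check: an .xlsx file is a ZIP archive, so its
    first four bytes are PK\\x03\\x04. A PDF renamed to .xlsx starts
    with %PDF and fails this check."""
    with open(path, "rb") as f:
        return f.read(4) == b"PK\x03\x04"
```

Running this before `spark.read` lets the caller raise its own error for a mislabeled PDF instead of silently receiving an empty DataFrame.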

Steps To Reproduce

Store any PDF with the file extension .xlsx and then run the following code:

from pyspark.sql.types import *

schema = StructType([
    StructField("Code", DoubleType()),
    StructField("Text", StringType())
])
options = {
    "header": "true",
    "encoding": "UTF-8",
    "dataAddress": "workbook:A1",
    "mode": "FAILFAST",
}
df = spark.read.format("excel").schema(schema).options(**options).load("PATH_TO_YOUR_FILE/FILENAME.xlsx")
df.show()

+----+----+
|Code|Text|
+----+----+
+----+----+

Environment

- Spark version: 3.5.0
- Spark-Excel version: 0.20.2
- OS: Databricks Runtime 14.3 LTS
- Cluster environment

Anything else?

No response