crealytics / spark-excel

A Spark plugin for reading and writing Excel files
Apache License 2.0

[BUG] Excel File with Macros Detected as "Potentially" Malicious. Unable to read Excel as a result. #832

Open nova-jj opened 7 months ago

nova-jj commented 7 months ago

Is there an existing issue for this?

Current Behavior

Within an Azure Databricks environment we use this library to read Excel files stored in a Storage Account, accessed via either the ABFSS or DBFS protocol. The error occurs with both, suggesting this is a file issue and not a protocol issue. Attempting to read the file with newer versions of the spark-excel library fails with the following error, caused by macros in the workbook: java.io.IOException: The file appears to be potentially malicious. "This file embeds more internal file entries than expected."

We have reverted to a previous version that does not present this error, and we are looking for a way to bypass the macro detection. Our workbook does contain macros, but they are required as part of the workbook.

Expected Behavior

Reading the file into a DataFrame should not fail with this error, OR there should be an option to override the detection and force-read a file that is flagged as "potentially" malicious.
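For illustration, a rough sketch of what such an override might look like, if the check turns out to come from Apache POI's ZipSecureFile limits rather than from spark-excel itself. That origin is an assumption at this point, not a documented spark-excel option; the class name, the py4j access via spark.sparkContext._jvm, and the threshold values below are guesses, and setMaxFileCount only exists in fairly recent POI releases.

# Hedged workaround sketch, NOT a documented spark-excel option.
# ASSUMPTION: the "potentially malicious" IOException is raised by Apache POI's
# org.apache.poi.openxml4j.util.ZipSecureFile checks, and that class is on the
# driver classpath pulled in by the spark-excel package.
jvm = spark.sparkContext._jvm
ZipSecureFile = jvm.org.apache.poi.openxml4j.util.ZipSecureFile

# Relax the compression-ratio ("zip bomb") heuristic (POI's default is 0.01).
ZipSecureFile.setMinInflateRatio(0.0)

# Recent POI versions also cap the number of embedded entries, which matches the
# "more internal file entries than expected" wording; setMaxFileCount is an
# assumption about the POI version bundled with spark-excel 0.20.x.
ZipSecureFile.setMaxFileCount(10000)

df = (
    spark.read.format("com.crealytics.spark.excel")
    .option("header", "true")
    .load("dbfs:/FileStore/our_excel_file.xlsm")
)

These static setters would only affect the driver JVM; if spark-excel parses parts of the workbook on executors as well, the same limits would have to be raised there too, so this is best treated as a diagnostic experiment rather than a fix.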

Steps To Reproduce

The following Python code produces the error:

# Path to the macro-enabled workbook (.xlsm) in DBFS
file_path = "dbfs:/FileStore/our_excel_file.xlsm"
# The load below fails with the IOException quoted above
df = spark.read.format("com.crealytics.spark.excel").option("header", "true").load(file_path)
df = df.toPandas()

Environment

- Spark version: 3.4.1 via Databricks Runtime 13.3
- Spark-Excel version: 3.5.0_0.20.3
- OS: Windows locally; jobs run remotely on Databricks clusters
- Cluster environment: multiple cluster configurations (dev/stg/prd) using the same Databricks Runtime and Spark versions

Anything else?

We have reverted our install to the previous version (Maven coordinates: com.crealytics:spark-excel_2.12:0.13.7), which does not produce this issue.

nightscape commented 7 months ago

spark-excel doesn't do anything in that regard. It must be an upstream library that performs this check. Can you try to find out if this comes from POI?
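One way to check this from the same Databricks notebook is to trigger the read and inspect the Java-side stack trace attached to the error; if frames from org.apache.poi.* appear above the IOException, the check is coming from POI. A minimal sketch, assuming the py4j-based PySpark driver used by Databricks Runtime 13.3 (not Spark Connect) and the file path from the report:

# Minimal diagnostic sketch: surface the JVM stack trace behind the IOException.
file_path = "dbfs:/FileStore/our_excel_file.xlsm"  # path taken from the report above

try:
    # Schema inference already opens the workbook, so .load() alone should be
    # enough to trigger the failure in the affected versions.
    (spark.read.format("com.crealytics.spark.excel")
        .option("header", "true")
        .load(file_path))
except Exception as err:
    # For py4j-backed errors the message includes the full Java stack trace;
    # frames from org.apache.poi.* would confirm the check originates in POI
    # rather than in spark-excel itself.
    print(err)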