crealytics / spark-excel

A Spark plugin for reading and writing Excel files
Apache License 2.0

Extract sheet names using pyspark #856

Open Krukosz opened 2 months ago

Krukosz commented 2 months ago

- Am I using the newest version of the library?
- Is there an existing issue for this?

Current Behavior

I have a problem with the `WorkbookReader` class. My Python code looks like:

```python
reader = spark._jvm.com.crealytics.spark.excel.WorkbookReader(
    {"path": "Worktime.xlsx"},
    spark.sparkContext._jsc.hadoopConfiguration()
)
sheetnames = reader.sheetNames()
```

My problems:

  1. I cannot use `hadoopConfiguration` explicitly due to security options.
  2. When I omit the second argument in the constructor, I get this error:

```
py4j.Py4JException: Constructor com.crealytics.spark.excel.WorkbookReader([class java.util.HashMap]) does not exist
```

In PR #196 there's a discussion about using the `apply` method, but I don't know how to call it.

Has anyone gotten this working in PySpark? I can't use Scala, because it's blocked by the administrator in my environment.

Expected Behavior

No response

Steps To Reproduce

No response

Environment

- Spark version: Apache Spark 3.4.1
- Spark-Excel version: Scala 2.1
- OS:
- Cluster environment: Databricks 13.3 LTS

Anything else?

No response

nightscape commented 2 months ago

Does this help? https://github.com/crealytics/spark-excel/pull/196#issuecomment-1376972780

Krukosz commented 2 months ago

Oh, I tested it on a "legacy" Databricks cluster and it works.

My code:

```python
reader = spark._jvm.com.crealytics.spark.excel.WorkbookReader.apply(
    {"path": "my_file.xlsx"},
    spark.sparkContext._jsc.hadoopConfiguration()
)
d = reader.sheetNames()
print(d)
```

In a Unity Catalog environment I'm getting this error (it's directly related to the cluster access mode, which cannot be changed in my case):

```
py4j.security.Py4JSecurityException: Method public org.apache.hadoop.conf.Configuration org.apache.spark.api.java.JavaSparkContext.hadoopConfiguration() is not whitelisted on class class org.apache.spark.api.java.JavaSparkContext
```

Is there any other way to get the sheet names, without the `WorkbookReader` constructor? I'd rather not mix the crealytics Spark code with pandas or any other library.
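
For what it's worth, if the JVM-side reader stays blocked, the sheet names can be recovered with only the Python standard library: an `.xlsx` file is a zip archive, and the sheet names are declared in `xl/workbook.xml`. This is a sketch outside spark-excel entirely (the function name `sheet_names` is made up here), and it assumes the driver can read the file directly rather than through Hadoop:

```python
import re
import zipfile

def sheet_names(xlsx_source):
    """Return the worksheet names of an .xlsx file, in workbook order.

    xlsx_source may be a filesystem path or a binary file-like object;
    zipfile.ZipFile accepts either.
    """
    with zipfile.ZipFile(xlsx_source) as zf:
        # The workbook part lists each sheet as <sheet name="..." .../>.
        xml = zf.read("xl/workbook.xml").decode("utf-8")
    return re.findall(r'<sheet[^>]*\sname="([^"]+)"', xml)
```

This only works when the path is visible to the driver's local filesystem (e.g. a `/dbfs/...` or Volumes path on Databricks), since it bypasses the Hadoop filesystem layer that `WorkbookReader` would normally use.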

nightscape commented 2 months ago

This sounds related: https://learn.microsoft.com/en-us/answers/questions/1193968/py4j-security-py4jsecurityexception-databricks