An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive, and APIs for Scala, Java, Rust, Ruby, and Python.
Bug: Unable to run PySpark code with Delta on a k8s cluster on GCP
Which Delta project/connector is this regarding?
[x] Spark
[ ] Standalone
[ ] Flink
[ ] Kernel
[ ] Other (fill in here)
Describe the problem
I am attempting to execute a PySpark job that reads from and writes to Delta tables on a GCS bucket. I am running it on a k8s cluster via the spark-submit command.
The same job works fine when it only reads and writes Parquet files, but as soon as it writes in the Delta format it fails with the error below. I have been trying different version combinations for the last week. A rough sketch of the script follows, then the full traceback.
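The script is not reproduced in full here; a minimal sketch of what pyspark_gcs_parquet_read.py does is below. The GCS path and app name are placeholders, and only the Delta write call is verbatim (it is the line the traceback points at):

```python
from pyspark.sql import SparkSession

def extract():
    spark = SparkSession.builder.appName("pyspark_gcs_parquet_read").getOrCreate()

    # Reading Parquet from the GCS bucket works on its own (placeholder path).
    input_df = spark.read.parquet("gs://<my-bucket>/input/")

    # This is the call that fails: the plain Parquet write succeeds, but the
    # Delta write raises java.lang.VerifyError (see "line 27" in the traceback).
    input_df.write.mode("overwrite").format("delta").save("/tmp/delta-table2")

if __name__ == "__main__":
    extract()
```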
File "/tmp/spark-f1f15d72-8494-4597-8d63-ccc0b33a52e4/pyspark_gcs_parquet_read.py", line 38, in <module>
extract()
File "/tmp/spark-f1f15d72-8494-4597-8d63-ccc0b33a52e4/pyspark_gcs_parquet_read.py", line 27, in extract
input_df.write.mode("overwrite").format("delta").save("/tmp/delta-table2")
File "/opt/bitnami/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 1463, in save
File "/opt/bitnami/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__
File "/opt/bitnami/spark/python/lib/pyspark.zip/pyspark/errors/exceptions/captured.py", line 179, in deco
File "/opt/bitnami/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o226.save.
: java.lang.VerifyError: Bad type on operand stack
Exception Details:
Location:
org/apache/spark/sql/delta/stats/StatisticsCollection$SqlParser$$anon$1.visitMultipartIdentifierList(Lorg/apache/spark/sql/catalyst/parser/SqlBaseParser$MultipartIdentifierListContext;)Lscala/collection/Seq; @17: invokevirtual
Reason:
Type 'org/apache/spark/sql/catalyst/parser/SqlBaseParser$MultipartIdentifierListContext' (current frame, stack[1]) is not assignable to 'org/antlr/v4/runtime/ParserRuleContext'
Current Frame:
bci: @17
flags: { }
locals: { 'org/apache/spark/sql/delta/stats/StatisticsCollection$SqlParser$$anon$1', 'org/apache/spark/sql/catalyst/parser/SqlBaseParser$MultipartIdentifierListContext' }
stack: { 'org/apache/spark/sql/catalyst/parser/ParserUtils$', 'org/apache/spark/sql/catalyst/parser/SqlBaseParser$MultipartIdentifierListContext', 'scala/Option', 'scala/Function0' }
Bytecode:
0000000: b200 232b b200 23b6 0027 2a2b ba00 3f00
0000010: 00b6 0043 c000 45b0
24/10/03 06:41:32 INFO SparkContext: SparkContext is stopping with exitCode 0.
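For what it's worth, the failure is inside Delta's SQL parser glue (StatisticsCollection$SqlParser) against the ANTLR-generated SqlBaseParser classes, which as far as I understand usually means the Delta jar was compiled against a different Spark version than the one inside the container image. A correctly paired submission would look roughly like the following; the coordinates, versions, and image name are illustrative, not my exact command (io.delta:delta-spark_2.12:3.2.0, for example, is built for Spark 3.5.x):

```bash
spark-submit \
  --master k8s://https://<k8s-apiserver>:443 \
  --deploy-mode cluster \
  --packages io.delta:delta-spark_2.12:3.2.0 \
  --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
  --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog \
  --conf spark.kubernetes.container.image=<spark-3.5-image-with-gcs-connector> \
  pyspark_gcs_parquet_read.py
```

Note that Delta 3.x publishes the Spark connector as io.delta:delta-spark_2.12, while Delta 2.x and earlier used io.delta:delta-core_2.12, so mixing artifact names and Spark versions is an easy way to end up with this kind of mismatch.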
Environment information
Delta Lake version:
Spark version:
Scala version:
Willingness to contribute
The Delta Lake Community encourages bug fix contributions. Would you or another member of your organization be willing to contribute a fix for this bug to the Delta Lake code base?
[ ] Yes. I can contribute a fix for this bug independently.
[ ] Yes. I would be willing to contribute a fix for this bug with guidance from the Delta Lake community.
[ ] No. I cannot contribute a bug fix at this time.