NVIDIA / spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs
https://nvidia.github.io/spark-rapids
Apache License 2.0

[BUG] Unable to read DeltaTable with columnMapping.mode = name #9255

Closed LIN-Yu-Ting closed 1 year ago

LIN-Yu-Ting commented 1 year ago

Describe the bug: We would like to apply the Spark RAPIDS plugin (23.08) in a Spark 3.3.0 environment in order to accelerate Spark SQL execution against Delta tables (Delta Lake 2.3.0). However, we discovered that Spark RAPIDS is not able to read data from a Delta table with the following TBLPROPERTIES:

+-------------------------------+-----+
|key                            |value|
+-------------------------------+-----+
|delta.columnMapping.maxColumnId|137  |
|delta.columnMapping.mode       |name |
|delta.minReaderVersion         |2    |
|delta.minWriterVersion         |5    |
+-------------------------------+-----+
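For reference, these properties can be inspected with a SHOW TBLPROPERTIES query (assuming the table name demo4 used in the reproduction below):

```python
# Display the Delta table's properties, including the column-mapping mode.
spark.sql("SHOW TBLPROPERTIES demo4").show(truncate=False)
```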

Steps/Code to reproduce bug: You can create such a Delta table with the following commands.

df = spark.read.format("csv").load("<path to any CSV or JSON data>")  # use format("json") for JSON input
spark.sql("SET spark.databricks.delta.properties.defaults.columnMapping.mode = name")
spark.sql("SET spark.databricks.delta.properties.defaults.minReaderVersion = 2")
spark.sql("SET spark.databricks.delta.properties.defaults.minWriterVersion = 5")
df.write.format("delta").saveAsTable("demo4")
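As a minimal alternative sketch (assuming df from the step above, and a hypothetical table name demo4_alt), the same column-mapping properties can also be set directly on the table via TBLPROPERTIES instead of session defaults:

```python
# Sketch: set the column-mapping properties per-table at creation time.
# "demo4_src" and "demo4_alt" are hypothetical names for illustration.
df.createOrReplaceTempView("demo4_src")
spark.sql("""
  CREATE TABLE demo4_alt
  USING delta
  TBLPROPERTIES (
    'delta.columnMapping.mode' = 'name',
    'delta.minReaderVersion' = '2',
    'delta.minWriterVersion' = '5'
  )
  AS SELECT * FROM demo4_src
""")
```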

Expected behavior: Once you have created the Delta table, you can use the following command to reproduce the error.

spark.sql("SELECT * FROM demo4").show()

Then, you might obtain results such as

+----+----+
| _c0| _c1|
+----+----+
|null|null|
|null|null|
|null|null|
|null|null|
|null|null|
|null|null|
|null|null|
|null|null|
|null|null|
|null|null|
|null|null|
|null|null|
|null|null|
|null|null|
|null|null|
|null|null|
|null|null|
|null|null|
|null|null|
|null|null|
+----+----+
only showing top 20 rows
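
One way to confirm that the scan was actually placed on the GPU (and is therefore subject to this bug) is to inspect the physical plan; with the plugin enabled, the Parquet scan should appear as a GPU operator rather than a plain FileScan:

```python
# Sketch: print the physical plan to see whether the RAPIDS Accelerator
# replaced the Parquet scan with a GPU scan (nodes prefixed with "Gpu").
spark.sql("SELECT * FROM demo4").explain()
```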

Environment details: Spark 3.3.0, Delta Lake 2.3.0, Spark RAPIDS plugin 23.08.


jlowe commented 1 year ago

Thanks for the detailed report, @LIN-Yu-Ting! I can reproduce the issue, looking into it now.

jlowe commented 1 year ago

The problem occurs because the RAPIDS Accelerator does not have a specific override for DeltaParquetFileFormat, which implements the column mapping mode on reads. DeltaParquetFileFormat derives from Apache Spark's ParquetFileFormat, and the RAPIDS Accelerator incorrectly assumes it can implement the read behavior simply because DeltaParquetFileFormat is an instance of ParquetFileFormat.

The plugin will need to recognize DeltaParquetFileFormat directly and replace it with equivalent functionality for the GPU that will implement the column mapping feature.
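As a rough illustration of the failure mode (the actual plugin code is Scala; the classes below are just stand-ins), a check against the base Parquet format class also matches the Delta subclass, so the accelerator claims a format whose extra column-mapping behavior it does not implement:

```python
# Illustrative Python sketch only; the real RAPIDS Accelerator checks are in
# Scala, and these classes are stand-ins for the Spark/Delta ones.
class ParquetFileFormat: ...
class DeltaParquetFileFormat(ParquetFileFormat): ...

fmt = DeltaParquetFileFormat()

# A subclass-tolerant check matches, so the Delta format is treated as
# plain Parquet and its column-mapping logic is silently dropped.
print(isinstance(fmt, ParquetFileFormat))   # True

# An exact-class check would reject the Delta format, letting the plugin
# substitute a GPU implementation that handles column mapping.
print(type(fmt) is ParquetFileFormat)       # False
```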

LIN-Yu-Ting commented 1 year ago

@jlowe Thanks for investigating this issue. Is adding support for DeltaParquetFileFormat difficult? Do you have a plan to override the Spark RAPIDS read path for DeltaParquetFileFormat? If so, can I expect this to be included in the next release, 23.10?

jlowe commented 1 year ago

I'm working on the fix for this now and hope to have a PR up soon so it can be included in 23.10.

LIN-Yu-Ting commented 1 year ago

OK, thanks for your efforts. If you don't mind, please let me know your development branch; I am also interested in how this kind of issue is fixed. Thanks.

jlowe commented 1 year ago

@LIN-Yu-Ting I just posted #9279 to fix this problem. The fix involved updating the CPU format checks, so it includes a number of changes not specific to Delta Lake support.