NVIDIA / spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs
https://nvidia.github.io/spark-rapids
Apache License 2.0

[BUG] Delta Lake tables with name mapping can throw exceptions on read #11201

Closed jlowe closed 4 months ago

jlowe commented 4 months ago

Describe the bug

Reading a Delta Lake table that is configured for name column mapping and contains one or more Parquet files with no field IDs on a column can throw an exception like the following:

java.lang.RuntimeException: Spark read schema expects field Ids, but Parquet file schema doesn't contain any field Ids.
Please remove the field ids from Spark schema or ignore missing ids by setting `spark.sql.parquet.fieldId.read.ignoreMissing = true`

Spark read schema:
{
  "type" : "struct",
  "fields" : [ {
    "name" : "id",
    "type" : "long",
    "nullable" : true,
    "metadata" : {
      "delta.columnMapping.id" : 1,
      "delta.columnMapping.physicalName" : "id",
      "parquet.field.id" : 1
    }
  } ]
}

Parquet file schema:
message spark_schema {
  required int64 id;
}

    at com.nvidia.spark.rapids.shims.ParquetSchemaClipShims$.checkIgnoreMissingIds(ParquetSchemaClipShims.scala:72)
    at com.nvidia.spark.rapids.GpuParquetFileFilterHandler.$anonfun$filterBlocks$1(GpuParquetScan.scala:726)
    at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
    at com.nvidia.spark.rapids.GpuParquetFileFilterHandler.filterBlocks(GpuParquetScan.scala:676)

Steps/Code to reproduce bug

// Disable the RAPIDS plugin and write a Parquet file without field IDs
spark.conf.set("spark.rapids.sql.enabled", "false")
spark.conf.set("spark.sql.parquet.fieldId.write.enabled", "false")
spark.range(10).coalesce(1).write.parquet("/tmp/p")
spark.conf.set("spark.sql.parquet.fieldId.write.enabled", "true")
// Convert the directory to a Delta table and turn on name column mapping
sql("CONVERT TO DELTA parquet.`/tmp/p`")
sql("ALTER TABLE delta.`/tmp/p` SET TBLPROPERTIES ('delta.minReaderVersion' = '2', 'delta.minWriterVersion' = '5', 'delta.columnMapping.mode' = 'name')")
// Re-enable the RAPIDS plugin and read the table; this read throws the exception
spark.conf.set("spark.rapids.sql.enabled", "true")
spark.read.format("delta").load("/tmp/p").collect

Expected behavior

The table contents should be read without an error.

jlowe commented 4 months ago

The problem occurs because GpuDeltaParquetFileFormat is missing the changes corresponding to those made to DeltaParquetFileFormat in this Delta Lake commit. With that change, field IDs are explicitly removed from the read schema when name column mapping is used, so the reader no longer expects the Parquet files to contain them.
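
To illustrate what "removing field IDs from the schema" could look like, here is a minimal sketch. It is not the code from the Delta Lake commit or from the plugin fix. `FieldIdStripper` and `stripFieldIds` are hypothetical names, and the only thing taken from this report is the `parquet.field.id` metadata key shown in the Spark read schema above.

```scala
import org.apache.spark.sql.types._

// Hypothetical helper, for illustration only.
object FieldIdStripper {
  // Metadata key that appears in the Spark read schema in the error message.
  val FieldIdKey = "parquet.field.id"

  // Recursively drop the field ID metadata entry from every struct field,
  // leaving all other metadata (such as delta.columnMapping.*) untouched.
  def stripFieldIds(dt: DataType): DataType = dt match {
    case s: StructType =>
      StructType(s.fields.map { f =>
        val cleaned = new MetadataBuilder()
          .withMetadata(f.metadata)
          .remove(FieldIdKey)
          .build()
        f.copy(dataType = stripFieldIds(f.dataType), metadata = cleaned)
      })
    case a: ArrayType =>
      a.copy(elementType = stripFieldIds(a.elementType))
    case m: MapType =>
      m.copy(keyType = stripFieldIds(m.keyType), valueType = stripFieldIds(m.valueType))
    case other => other
  }
}

// Example: a schema shaped like the one in the error message.
val withIds = StructType(Seq(
  StructField("id", LongType, nullable = true,
    new MetadataBuilder()
      .putLong("delta.columnMapping.id", 1L)
      .putString("delta.columnMapping.physicalName", "id")
      .putLong("parquet.field.id", 1L)
      .build())))

val withoutIds = FieldIdStripper.stripFieldIds(withIds).asInstanceOf[StructType]
```

With a schema cleaned this way, a reader has no field IDs to look for, so a file whose Parquet schema lacks them would not trip the missing-ID check, and columns would be matched by their physical names.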