NVIDIA / spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs
https://nvidia.github.io/spark-rapids
Apache License 2.0

[BUG] Failed to read data from iceberg #10831

Closed wbo4958 closed 1 week ago

wbo4958 commented 2 weeks ago

Bug Desc

I tried to use spark-rapids to read data from Iceberg, but it failed with the exception below regardless of whether spark.rapids.sql.format.iceberg.enabled is true or false.

24/05/17 10:50:40 ERROR GpuOverrideUtil: Encountered an exception applying GPU overrides java.lang.ClassCastException: org.apache.iceberg.BaseFileScanTask cannot be cast to org.apache.iceberg.CombinedScanTask
java.lang.ClassCastException: org.apache.iceberg.BaseFileScanTask cannot be cast to org.apache.iceberg.CombinedScanTask
    at com.nvidia.spark.rapids.iceberg.spark.source.GpuSparkBatchQueryScan.isMetadataScan(GpuSparkBatchQueryScan.java:92)
    at com.nvidia.spark.rapids.iceberg.IcebergProviderImpl$$anon$1.tagSelfForGpu(IcebergProviderImpl.scala:51)
    at com.nvidia.spark.rapids.RapidsMeta.tagForGpu(RapidsMeta.scala:318)
    at com.nvidia.spark.rapids.RapidsMeta.$anonfun$tagForGpu$1(RapidsMeta.scala:292)
    at com.nvidia.spark.rapids.RapidsMeta.$anonfun$tagForGpu$1$adapted(RapidsMeta.scala:292)

How to repro

prepare data for iceberg

spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2 \
    --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
    --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
    --conf spark.sql.catalog.spark_catalog.type=hive \
    --conf spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog \
    --conf spark.sql.catalog.local.type=hadoop \
    --conf spark.sql.catalog.local.warehouse=$PWD/warehouse

and execute the commands below

scala> spark.range(100).writeTo("local.db.demo").using("iceberg").create()

scala> spark.table("local.db.demo").show()
+---+
| id|
+---+
|  0|
|  1|
|  2|
|  3|
|  4|
|  5|
|  6|
|  7|
|  8|
|  9|
| 10|
| 11|
| 12|
| 13|
| 14|
| 15|
| 16|
| 17|
| 18|
| 19|
+---+
only showing top 20 rows

run with spark-rapids

$SPARK_HOME/bin/spark-shell \
     --master "local[1]" \
     --driver-memory 2G \
     --conf spark.plugins=com.nvidia.spark.SQLPlugin \
     --jars /home/bobwang/jars/rapids-4-spark_2.12-24.04.0.jar \
     --conf spark.sql.cache.serializer=com.nvidia.spark.ParquetCachedBatchSerializer \
     --conf spark.rapids.sql.enabled=true \
     --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2 \
     --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
     --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
     --conf spark.sql.catalog.spark_catalog.type=hive \
     --conf spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog \
     --conf spark.sql.catalog.local.type=hadoop \
     --conf spark.sql.catalog.local.warehouse=$PWD/warehouse \
     --conf spark.rapids.sql.format.iceberg.enabled=false

and execute

scala> spark.table("local.db.demo").show()
24/05/17 10:55:46 ERROR GpuOverrideUtil: Encountered an exception applying GPU overrides java.lang.ClassCastException: org.apache.iceberg.BaseFileScanTask cannot be cast to org.apache.iceberg.CombinedScanTask
java.lang.ClassCastException: org.apache.iceberg.BaseFileScanTask cannot be cast to org.apache.iceberg.CombinedScanTask
    at com.nvidia.spark.rapids.iceberg.spark.source.GpuSparkBatchQueryScan.isMetadataScan(GpuSparkBatchQueryScan.java:92)
    at com.nvidia.spark.rapids.iceberg.IcebergProviderImpl$$anon$1.tagSelfForGpu(IcebergProviderImpl.scala:51)
    at com.nvidia.spark.rapids.RapidsMeta.tagForGpu(RapidsMeta.scala:318)
    at com.nvidia.spark.rapids.RapidsMeta.$anonfun$tagForGpu$1(RapidsMeta.scala:292)
    at com.nvidia.spark.rapids.RapidsMeta.$anonfun$tagForGpu$1$adapted(RapidsMeta.scala:292)
    at scala.collection.immutable.List.foreach(List.scala:431)
    at com.nvidia.spark.rapids.RapidsMeta.tagForGpu(RapidsMeta.scala:292)

firestarman commented 2 weeks ago

So far, we only support Iceberg v0.13.x. Can you try that version?
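
For readers unfamiliar with this failure mode: the stack trace above shows isMetadataScan casting a scan task to CombinedScanTask, while the Iceberg 1.5.2 runtime supplied a BaseFileScanTask. The Java sketch below only illustrates that general pattern, an unchecked downcast that breaks when the runtime type changes, next to a guarded variant. The types FileTask and CombinedTask and both helper methods are hypothetical stand-ins; none of this is Iceberg or spark-rapids code, and it is not meant to represent the actual fix.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only: every type here is a hypothetical stand-in,
// not the real org.apache.iceberg API.
final class ScanTaskCastSketch {

    interface ScanTask {}

    // Stand-in for a single-file task (the role BaseFileScanTask plays in the trace).
    static final class FileTask implements ScanTask {
        final String path;
        FileTask(String path) { this.path = path; }
    }

    // Stand-in for a grouped task (the role CombinedScanTask plays in the trace).
    static final class CombinedTask implements ScanTask {
        final List<FileTask> files;
        CombinedTask(List<FileTask> files) { this.files = files; }
    }

    // Unchecked downcast: throws ClassCastException if the caller
    // passes anything other than a CombinedTask.
    static int fileCountUnchecked(ScanTask task) {
        return ((CombinedTask) task).files.size();
    }

    // Guarded variant: inspects the runtime type before casting and
    // returns -1 for a task type it does not recognize.
    static int fileCountGuarded(ScanTask task) {
        if (task instanceof CombinedTask) {
            return ((CombinedTask) task).files.size();
        }
        if (task instanceof FileTask) {
            return 1;
        }
        return -1;
    }

    public static void main(String[] args) {
        ScanTask single = new FileTask("data/part-00000.parquet");
        ScanTask grouped = new CombinedTask(Arrays.asList(
                new FileTask("data/part-00000.parquet"),
                new FileTask("data/part-00001.parquet")));

        System.out.println("guarded, single task:  " + fileCountGuarded(single));
        System.out.println("guarded, grouped task: " + fileCountGuarded(grouped));

        try {
            fileCountUnchecked(single);
        } catch (ClassCastException e) {
            // Same failure class as in the report above.
            System.out.println("unchecked cast failed: " + e.getClass().getName());
        }
    }
}
```

A guard like this keeps the type mismatch from surfacing as an exception, but it would not by itself make a newer Iceberg release supported.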