NVIDIA / spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs
https://nvidia.github.io/spark-rapids
Apache License 2.0

[BUG] GpuProjectExec doesn't allow a non-serializable child plan #8095


GaryShen2008 commented 1 year ago

Describe the bug
When testing Kyuubi Spark authorization with Ranger, I got the exception below.

Caused by: java.io.NotSerializableException: org.apache.spark.sql.execution.datasources.v2.V2SessionCatalog
Serialization stack:
    - object not serializable (class: org.apache.spark.sql.execution.datasources.v2.V2SessionCatalog, value: V2SessionCatalog(spark_catalog))
    - field (class: org.apache.spark.sql.execution.datasources.v2.ShowNamespacesExec, name: catalog, type: interface org.apache.spark.sql.connector.catalog.SupportsNamespaces)
    - object (class org.apache.spark.sql.execution.datasources.v2.ShowNamespacesExec, ShowNamespaces [namespace#0], V2SessionCatalog(spark_catalog)
)
    - field (class: org.apache.kyuubi.plugin.spark.authz.ranger.FilteredShowNamespaceExec, name: delegated, type: class org.apache.spark.sql.execution.SparkPlan)
    - object (class org.apache.kyuubi.plugin.spark.authz.ranger.FilteredShowNamespaceExec, FilteredShowNamespace ShowNamespaces [namespace#0], V2SessionCatalog(spark_catalog)
)
    - field (class: com.nvidia.spark.rapids.GpuRowToColumnarExec, name: child, type: class org.apache.spark.sql.execution.SparkPlan)
    - object (class com.nvidia.spark.rapids.GpuRowToColumnarExec, GpuRowToColumnar targetsize(2147483647)
+- FilteredShowNamespace ShowNamespaces [namespace#0], V2SessionCatalog(spark_catalog)
)
    - field (class: com.nvidia.spark.rapids.GpuProjectExec, name: child, type: class org.apache.spark.sql.execution.SparkPlan)
    - object (class com.nvidia.spark.rapids.GpuProjectExec, GpuProject true
+- GpuRowToColumnar targetsize(2147483647)
   +- FilteredShowNamespace ShowNamespaces [namespace#0], V2SessionCatalog(spark_catalog)
)
    - element of array (index: 0)
    - array (class [Ljava.lang.Object;, size 6)
    - field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, type: class [Ljava.lang.Object;)
    - object (class java.lang.invoke.SerializedLambda, SerializedLambda[capturingClass=class com.nvidia.spark.rapids.GpuProjectExec, functionalInterfaceMethod=scala/Function1.apply:(Ljava/lang/Object;)Ljava/lang/Object;, implementation=invokeStatic com/nvidia/spark/rapids/GpuProjectExec.$anonfun$doExecuteColumnar$1:(Lcom/nvidia/spark/rapids/GpuProjectExec;Lcom/nvidia/spark/rapids/GpuMetric;Lcom/nvidia/spark/rapids/GpuMetric;Lscala/Option;Lcom/nvidia/spark/rapids/GpuMetric;Lscala/Option;Lorg/apache/spark/sql/vectorized/ColumnarBatch;)Lorg/apache/spark/sql/vectorized/ColumnarBatch;, instantiatedMethodType=(Lorg/apache/spark/sql/vectorized/ColumnarBatch;)Lorg/apache/spark/sql/vectorized/ColumnarBatch;, numCaptured=6])
    - writeReplace data (class: java.lang.invoke.SerializedLambda)
    - object (class com.nvidia.spark.rapids.GpuProjectExec$$Lambda$2951/1045016058, com.nvidia.spark.rapids.GpuProjectExec$$Lambda$2951/1045016058@6e774194)
    at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:41)
    at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:49)
    at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:115)
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:441)
    ... 61 more

Steps/Code to reproduce bug
A local reproduction: launch a Spark shell with spark-rapids and run the following in the REPL.

scala> Seq(1,2,3,4,5).toDF("a").repartition(1).show
23/04/13 13:08:42 WARN GpuOverrides: 
*Exec <ProjectExec> will run on GPU
  *Expression <Alias> cast(a#4 as string) AS a#7 will run on GPU
    *Expression <Cast> cast(a#4 as string) will run on GPU
  *Exec <ShuffleExchangeExec> will run on GPU
    *Partitioning <SinglePartition$> will run on GPU
    ! <LocalTableScanExec> cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.execution.LocalTableScanExec
      @Expression <AttributeReference> a#4 could run on GPU

Expected behavior
Even if the child plan isn't serializable, the job shouldn't fail.

Environment details
Any

revans2 commented 1 year ago

Looks like we have a lambda inside GpuProjectExec that is pulling in the GpuProjectExec node itself. We likely have this problem in a lot of other places too.
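
For illustration, here is a minimal, self-contained sketch of the capture mechanism the stack trace above shows. PlanLikeNode and NotSerializableCatalog are hypothetical stand-ins, not spark-rapids or Spark classes: referencing an instance field inside a Scala lambda compiles to this.field, so the SerializedLambda captures the whole enclosing node, including any non-serializable field it holds.

import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Hypothetical stand-in for V2SessionCatalog: an object Java serialization rejects.
class NotSerializableCatalog

class PlanLikeNode(val catalog: NotSerializableCatalog) extends Serializable {
  // BAD: `catalog` is really `this.catalog`, so the lambda captures `this`,
  // and serializing the lambda tries to serialize the catalog too.
  def badClosure: Int => Int = x => { val _ = catalog; x + 1 }

  // BETTER: copy what the closure needs into a local val first; only the
  // serializable local is captured, not the node.
  def goodClosure: Int => Int = {
    val bonus = 1
    x => x + bonus
  }
}

object CaptureDemo extends App {
  private def trySerialize(label: String, obj: AnyRef): Unit =
    try {
      new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(obj)
      println(s"$label: serialized OK")
    } catch {
      case e: NotSerializableException => println(s"$label: $e")
    }

  val node = new PlanLikeNode(new NotSerializableCatalog)
  trySerialize("badClosure", node.badClosure)   // fails, like the trace above
  trySerialize("goodClosure", node.goodClosure) // succeeds
}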

abellina commented 1 year ago

They showed that withResource was the culprit in this particular case: once they removed withResource, the project exec no longer needed serializing. But I agree that other things we are doing may be triggering serialization in other places. Kyuubi did fix a similar issue, but only in an unreleased version (https://github.com/apache/kyuubi/issues/4617).
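
A hedged sketch of why withResource in particular can drag the node into a closure, assuming the helper is provided by a mixed-in trait (ArmLike, ArmObject, and FakeExec are illustrative names, not the plugin's actual code): a call to a trait-inherited method inside a lambda compiles to this.withResource(...), so `this` is captured; the same helper on a standalone object is effectively static and captures nothing from the caller.

trait ArmLike {
  // Mixed-in helper: any call inside a lambda becomes `this.withResource(...)`.
  def withResource[T <: AutoCloseable, V](resource: T)(block: T => V): V =
    try block(resource) finally resource.close()
}

object ArmObject {
  // Object-based helper: calls are static-like; nothing of the caller is captured.
  def withResource[T <: AutoCloseable, V](resource: T)(block: T => V): V =
    try block(resource) finally resource.close()
}

class FakeExec(val child: AnyRef) extends ArmLike with Serializable {
  // BAD: the mixed-in helper pulls `this` (and the possibly non-serializable
  // `child`) into the closure Spark ships to executors.
  def badTask: () => Int =
    () => withResource(new java.io.StringReader("x"))(_ => 42)

  // BETTER: the object-based helper leaves the exec node out of the closure.
  def goodTask: () => Int =
    () => ArmObject.withResource(new java.io.StringReader("x"))(_ => 42)
}

If the helper does come from a trait, moving it to an object (or copying needed state into locals before the closure) would also explain why virtually every file needs to change: each call site has to be touched individually.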

abellina commented 1 year ago

I can take this on to be done by next sprint. Virtually all files need to change, and it doesn't seem that a trivial script could handle it, so it is just going to take time.

If this is needed for 23.04 and is really important, it would be great to know. @sameerz @GaryShen2008

GaryShen2008 commented 1 year ago

This issue happened when using Kyuubi 1.7.0 with Ranger authorization. Kyuubi has fixed the serializable issue in their latest code, but it's not released yet. If possible, I'd like a simple fix in GpuProjectExec so that our use of Kyuubi 1.7.0 isn't blocked; otherwise we'll need to use their master-snapshot version to bypass this issue with our coming 23.04 release. I don't know when Apache Kyuubi will release the next version, and I hope we don't need to depend on that.

GaryShen2008 commented 1 year ago

One update: I retested Kyuubi's master-snapshot image, and the issue still occurred. Given that, it seems this is a MUST fix on our plugin side.

mattahrens commented 1 year ago

@GaryShen2008 can you confirm that the merged PR resolves the issue? If so, please close this issue.