Marking as P1 to analyze.
This is really mind-boggling: SparkException is in the same Maven package, and hence the same jar, as the code that is trying to load it. It looks like the thread's context classloader must have been badly messed up somehow. I'll try to reproduce this at a smaller scale and see if I can make it work.
The REPL has been more difficult to get working in the prototype phase. Typically either the classloader was not set up in time before deserialization, or something was shimmed when it should not have been.
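To make the failure mode concrete, here is a minimal self-contained sketch (not plugin code; roundTrip is a made-up helper) of how Java deserialization resolves classes through whatever loader it is handed, mirroring what Spark's JavaDeserializationStream does with Class.forName:

import java.io._

// Round-trip an object through Java serialization, resolving classes with
// an explicit loader the way Spark's JavaDeserializationStream does.
def roundTrip(value: AnyRef, loader: ClassLoader): AnyRef = {
  val bytes = new ByteArrayOutputStream()
  val out = new ObjectOutputStream(bytes)
  out.writeObject(value)
  out.close()
  val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray)) {
    override def resolveClass(desc: ObjectStreamClass): Class[_] =
      Class.forName(desc.getName, false, loader)
  }
  in.readObject()
}

// Works when `loader` can see the class; throws ClassNotFoundException
// when it cannot -- the same shape of failure as the REPL loader missing
// a plugin class while deserializing a broadcast.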
Hmm, I think both this issue and #3468 are caused by using spark-shell. Switching to spark-submit seems to work.
I don't see how #3468 works with spark-submit... that class isn't exposed, so I would expect failures; perhaps you just mean it works for this issue?
If I switch to spark-submit, the query runs to completion, even after adding back GpuKryoRegistrator. ¯\_(ツ)_/¯
Here is the command:
/opt/spark/bin/spark-submit \
  --master spark://127.0.0.1:7077 \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.kryoserializer.buffer=128m \
  --conf spark.kryo.registrator=com.nvidia.spark.rapids.GpuKryoRegistrator \
  --conf spark.locality.wait=0s \
  --conf spark.sql.files.maxPartitionBytes=1g \
  --conf spark.sql.shuffle.partitions=200 \
  --conf spark.sql.adaptive.enabled=true \
  --conf spark.shuffle.manager=com.nvidia.spark.rapids.spark312.RapidsShuffleManager \
  --conf spark.shuffle.service.enabled=false \
  --conf spark.dynamicAllocation.enabled=false \
  --conf spark.sql.broadcastTimeout=600 \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.rapids.cudfVersionOverride=true \
  --conf spark.rapids.sql.concurrentGpuTasks=1 \
  --conf spark.rapids.memory.host.spillStorageSize=32G \
  --conf spark.rapids.memory.pinnedPool.size=8G \
  --conf spark.rapids.sql.batchSizeBytes=1g \
  --conf spark.rapids.memory.gpu.direct.storage.spill.enabled=false \
  --conf spark.rapids.memory.gpu.direct.storage.spill.useHostMemory=false \
  --conf spark.rapids.memory.gpu.direct.storage.spill.alignedIO=false \
  --conf spark.rapids.memory.gpu.direct.storage.spill.alignmentThreshold=8m \
  --conf spark.rapids.memory.gpu.unspill.enabled=false \
  --conf spark.rapids.shuffle.transport.enabled=true \
  --conf spark.executorEnv.UCX_ERROR_SIGNALS= \
  --conf spark.executorEnv.UCX_MEMTYPE_CACHE=n \
  --conf spark.executorEnv.UCX_IB_RX_QUEUE_LEN=1024 \
  --conf spark.executorEnv.UCX_TLS=cuda_copy,cuda_ipc,rc,tcp \
  --conf spark.executorEnv.UCX_RNDV_SCHEME=put_zcopy \
  --conf spark.executorEnv.UCX_MAX_RNDV_RAILS=1 \
  --conf spark.rapids.shuffle.maxMetadataSize=512K \
  --conf spark.rapids.shuffle.ucx.bounceBuffers.size=8M \
  --conf spark.driver.memory=10G \
  --conf spark.driver.maxResultSize=0 \
  --conf spark.driver.extraJavaOptions=-Dai.rapids.cudf.nvtx.enabled=false \
  --conf spark.executor.extraClassPath=/opt/rapids/cudf.jar:/opt/rapids/rapids-4-spark.jar \
  --conf spark.executor.extraJavaOptions=-Dai.rapids.cudf.nvtx.enabled=false \
  --conf spark.executor.instances=1 \
  --conf spark.executor.cores=24 \
  --conf spark.executor.memory=64G \
  --conf spark.executor.resource.gpu.amount=1 \
  --conf spark.task.cpus=1 \
  --conf spark.task.resource.gpu.amount=0.0416 \
  --jars /opt/rapids/cudf.jar,/opt/rapids/rapids-4-spark.jar \
  --class com.nvidia.spark.rapids.tests.BenchmarkRunner \
  /opt/rapids/rapids-4-spark-benchmarks.jar \
  --benchmark tpcds \
  --query q1 \
  --input /opt/data/tpcds/sf1000-parquet/useDecimal=false,useDate=true,filterNull=false \
  --input-format parquet \
  --summary-file-prefix tpcds-q1-gpu \
  --iterations 1
Thanks @rongou, this helps narrow it down.
I was able to reproduce something like this locally, though with a different class that cannot be loaded:
org.apache.spark.SparkException: Job aborted due to stage failure: ClassNotFound with classloader: scala.tools.nsc.interpreter.IMain$TranslatingClassLoader@8aeab9e
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2258)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2207)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2206)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2206)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1079)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1079)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1079)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2445)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2387)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2376)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:868)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2196)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2217)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2236)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2261)
at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1030)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
at org.apache.spark.rdd.RDD.collect(RDD.scala:1029)
at org.apache.spark.sql.rapids.execution.GpuBroadcastExchangeExecBase$$anon$1.$anonfun$call$2(GpuBroadcastExchangeExec.scala:307)
at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
at org.apache.spark.sql.rapids.execution.GpuBroadcastExchangeExecBase.withResource(GpuBroadcastExchangeExec.scala:253)
at org.apache.spark.sql.rapids.execution.GpuBroadcastExchangeExecBase$$anon$1.$anonfun$call$1(GpuBroadcastExchangeExec.scala:301)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withExecutionId$1(SQLExecution.scala:139)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
at org.apache.spark.sql.execution.SQLExecution$.withExecutionId(SQLExecution.scala:137)
at org.apache.spark.sql.rapids.execution.GpuBroadcastExchangeExecBase$$anon$1.call(GpuBroadcastExchangeExec.scala:294)
at org.apache.spark.sql.rapids.execution.GpuBroadcastExchangeExecBase$$anon$1.call(GpuBroadcastExchangeExec.scala:290)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
I should now be able to do some debugging.
So Spark was not including the actual ClassNotFoundException in the error message. I hacked up Spark to include it, and found:
Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.rapids.execution.SerializeBatchDeserializeHostBuffer
at scala.reflect.internal.util.AbstractFileClassLoader.findClass(AbstractFileClassLoader.scala:72)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:68)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1986)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1850)
at java.io.ObjectInputStream.readArray(ObjectInputStream.java:2048)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1655)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:503)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:461)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:109)
at org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:103)
at org.apache.spark.scheduler.TaskResultGetter$$anon$3.$anonfun$run$1(TaskResultGetter.scala:75)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1996)
at org.apache.spark.scheduler.TaskResultGetter$$anon$3.run(TaskResultGetter.scala:63)
... 3 more
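For anyone else digging here: the "ClassNotFound with classloader" message is built in Spark's TaskResultGetter, and the underlying ClassNotFoundException is dropped. A rough sketch of what the hacked-up message amounts to (hypothetical helper, not the actual patch to Spark):

import java.io.{PrintWriter, StringWriter}

// Hypothetical: the stock abort message is only the first line; the hack
// appends the swallowed exception so the missing class name shows up.
def abortMessage(cnf: ClassNotFoundException): String = {
  val loader = Thread.currentThread.getContextClassLoader
  val sw = new StringWriter()
  cnf.printStackTrace(new PrintWriter(sw))
  s"ClassNotFound with classloader: $loader\n$sw"
}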
So my guess right now is that the context classloader is not set properly for the driver-side thread that is trying to deserialize the broadcast data. I am going to have to dig in and see how all of this is set.
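A quick way to poke at that guess from the spark-shell prompt (note this inspects the shell's own thread, so it only approximates what the TaskResultGetter pool thread sees, and that difference is exactly the suspicion here):

// Can the current thread's context classloader see the class from the
// trace above?
val ctx = Thread.currentThread().getContextClassLoader
println(s"context classloader: $ctx")
Class.forName(
  "org.apache.spark.sql.rapids.execution.SerializeBatchDeserializeHostBuffer",
  false, ctx)
// Throws ClassNotFoundException if the loader chain cannot resolve it.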
I think I found a fix. The issue was that the tmpClassLoader was not good enough. It works just fine when you go through the front door to our plugin, but Java serialization does not go through that front door, so it could not find what it needed (the class to deserialize into as part of a broadcast). I was able to find a way to update the Scala REPL classloader on the driver side, similar to how we update the ExecutorClassLoader on the executor.
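A minimal sketch of the shape of that idea (helper name hypothetical; the real change lives in the plugin's shim/classloader handling): walk up from the REPL's context classloader to a URLClassLoader that can take the plugin jar, so driver-side Java deserialization can resolve plugin classes.

import java.net.{URL, URLClassLoader}

// Hypothetical sketch: add the plugin jar to a mutable point in the
// driver's classloader chain so classes like
// SerializeBatchDeserializeHostBuffer resolve during deserialization.
def addPluginJarToDriverLoader(pluginJar: URL): Unit = {
  var loader = Thread.currentThread().getContextClassLoader
  while (loader != null) {
    loader match {
      case ucl: URLClassLoader =>
        // URLClassLoader.addURL is protected; reach it via reflection
        // (fine on the JDK 8 shown in the stack traces above).
        val addURL = classOf[URLClassLoader].getDeclaredMethod("addURL", classOf[URL])
        addURL.setAccessible(true)
        addURL.invoke(ucl, pluginJar)
        return
      case _ =>
        loader = loader.getParent
    }
  }
}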
Describe the bug
Running a TPC-DS query fails with a ClassNotFoundException.

Steps/Code to reproduce bug
Run a TPC-DS query in a spark-shell.

The script:

Expected behavior
Should not fail.

Environment details (please complete the following information)
Snapshot build: 20210913.150557-51

Additional context
Full log: