Open obedmr opened 1 year ago
Hey Obed, this is a known issue in the Dataproc integration with Gluten. A fix is already in progress in Dataproc and will be released soon. I'll keep you updated.
Thank you for the info, Meha (@meharanjan318)! Would you happen to have a prospective release date for the fix?
@obedmr This is fixed now. Thanks for reporting! Please give it a try. cc: @meharanjan318
Hi @surnaik, @meharanjan318, we are getting the same error using DP 2.1 with gluten-velox-bundle-spark3.3_2.12-1.1.1.jar.
Hi @noamzz, thanks for reporting. Could you please tell me the full image version used? That will help me verify whether the fix has gone through. Also, please paste the exception, so I can check whether it's the same issue or a slightly different one.
Thanks @surnaik! We are using DP version 2-1-35-debian11.
```
org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 0.0 failed 10 times, most recent failure: Lost task 5.9 in stage 0.0 (TID 876): java.lang.NoSuchMethodError: 'void org.apache.spark.shuffle.IndexShuffleBlockResolver.writeMetadataFileAndCommit(int, long, long[], long[], java.io.File)'
	at org.apache.spark.shuffle.ColumnarShuffleWriter.internalWrite(ColumnarShuffleWriter.scala:219)
	at org.apache.spark.shuffle.ColumnarShuffleWriter.write(ColumnarShuffleWriter.scala:235)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
	at org.apache.spark.scheduler.Task.run(Task.scala:136)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1505)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2717)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2653)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2652)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2652)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1189)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1189)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1189)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2913)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2855)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2844)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
Caused by: java.lang.NoSuchMethodError: 'void org.apache.spark.shuffle.IndexShuffleBlockResolver.writeMetadataFileAndCommit(int, long, long[], long[], java.io.File)'
	at org.apache.spark.shuffle.ColumnarShuffleWriter.internalWrite(ColumnarShuffleWriter.scala:219)
	at org.apache.spark.shuffle.ColumnarShuffleWriter.write(ColumnarShuffleWriter.scala:235)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
	at org.apache.spark.scheduler.Task.run(Task.scala:136)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1505)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
```
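For anyone hitting this, a quick way to check whether the Spark runtime on the cluster actually exposes the missing method is a small reflection probe. This is just a minimal sketch: the class and method signature are copied verbatim from the stack trace above, and the check only means something when run with the cluster's Spark jars on the classpath (run standalone, it simply reports the class as absent).

```java
import java.io.File;

public class AbiCheck {
    /**
     * Returns true if the named class exposes a public method with the
     * given name and parameter types -- essentially the same lookup the
     * JVM performs before throwing NoSuchMethodError at link time.
     */
    public static boolean hasMethod(String className, String methodName, Class<?>... params) {
        try {
            Class.forName(className).getMethod(methodName, params);
            return true;
        } catch (ClassNotFoundException | NoSuchMethodException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Probe the exact signature from the stack trace; requires
        // spark-core on the classpath to report true.
        boolean present = hasMethod(
                "org.apache.spark.shuffle.IndexShuffleBlockResolver",
                "writeMetadataFileAndCommit",
                int.class, long.class, long[].class, long[].class, File.class);
        System.out.println("writeMetadataFileAndCommit present: " + present);
    }
}
```

If it prints `false` on the cluster, the image ships a Spark build whose shuffle API predates what the Gluten jar was compiled against.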
I checked internally, and the image version you used doesn't contain the fix; it went into 2.1.36 onwards. Could you please recreate the cluster with the latest released image and give it a try?
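Recreating the cluster on a patched image looks something like the sketch below. The cluster name and region are placeholders, and the exact sub-minor image version string should be checked against the Dataproc release notes for your project; the point is only that `--image-version` must resolve to 2.1.36 or later.

```shell
# Placeholder cluster name and region; image version must be >= 2.1.36.
gcloud dataproc clusters create gluten-test \
    --region=us-central1 \
    --image-version=2.1.36-debian11 \
    --optional-components=JUPYTER \
    --enable-component-gateway
```

The Jupyter optional component and component gateway are included here only because the original report runs from the Dataproc Jupyter environment.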
Thanks @surnaik! This issue is resolved, but we're now running into some other ones :)
Backend
VL (Velox)
Bug description
We're getting into a weird situation with a method (`org.apache.spark.shuffle.IndexShuffleBlockResolver.writeMetadataFileAndCommit`) that appears to be missing when we run TPC-H on Dataproc. When we go to the code, it's actually implemented, and the method seems to appear in the *.class files from the Gluten jar (but maybe it's a broken link). I'm attaching the log file spark_gluten_issue.txt.
We're running from the Jupyter Notebooks environment in Dataproc
Spark version
Spark-3.3.x
Spark configurations
System information
Relevant logs