NVIDIA / spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs
https://nvidia.github.io/spark-rapids
Apache License 2.0

[BUG] NDS power run hits GPU OOM on Databricks. #8240

Closed res-life closed 1 year ago

res-life commented 1 year ago

Describe the bug

Hit a GPU OOM when running the NDS power run on Databricks.

com.nvidia.spark.rapids.jni.SplitAndRetryOOM: GPU OutOfMemory
    at ai.rapids.cudf.Table.contiguousSplit(Native Method)
    at ai.rapids.cudf.Table.contiguousSplit(Table.java:2171)
    at com.nvidia.spark.rapids.SpillableColumnarBatch$.$anonfun$addBatch$2(SpillableColumnarBatch.scala:210)
    at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
    at com.nvidia.spark.rapids.SpillableColumnarBatch$.$anonfun$addBatch$1(SpillableColumnarBatch.scala:209)
    at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
    at com.nvidia.spark.rapids.SpillableColumnarBatch$.addBatch(SpillableColumnarBatch.scala:192)
    at com.nvidia.spark.rapids.SpillableColumnarBatch$.apply(SpillableColumnarBatch.scala:142)
    at com.nvidia.spark.rapids.GpuHashAggregateIterator$AggHelper.preProcess(aggregate.scala:275)
    at com.nvidia.spark.rapids.GpuHashAggregateIterator$.$anonfun$computeAggregateAndClose$1(aggregate.scala:421)
    at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
    at com.nvidia.spark.rapids.GpuHashAggregateIterator$.computeAggregateAndClose(aggregate.scala:416)
    at com.nvidia.spark.rapids.GpuHashAggregateIterator.aggregateInputBatches(aggregate.scala:604)
    at com.nvidia.spark.rapids.GpuHashAggregateIterator.$anonfun$next$2(aggregate.scala:556)
    at scala.Option.getOrElse(Option.scala:189)
    at com.nvidia.spark.rapids.GpuHashAggregateIterator.next(aggregate.scala:553)
    at com.nvidia.spark.rapids.GpuHashAggregateIterator.next(aggregate.scala:498)
    at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase$$anon$1.partNextBatch(GpuShuffleExchangeExecBase.scala:318)
    at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase$$anon$1.hasNext(GpuShuffleExchangeExecBase.scala:340)
    at org.apache.spark.sql.rapids.RapidsShuffleThreadedWriterBase.$anonfun$write$2(RapidsShuffleInternalManagerBase.scala:281)
    at org.apache.spark.sql.rapids.RapidsShuffleThreadedWriterBase.$anonfun$write$2$adapted(RapidsShuffleInternalManagerBase.scala:274)
    at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
    at org.apache.spark.sql.rapids.RapidsShuffleThreadedWriterBase.$anonfun$write$1(RapidsShuffleInternalManagerBase.scala:274)
    at org.apache.spark.sql.rapids.RapidsShuffleThreadedWriterBase.$anonfun$write$1$adapted(RapidsShuffleInternalManagerBase.scala:273)
    at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
    at org.apache.spark.sql.rapids.RapidsShuffleThreadedWriterBase.write(RapidsShuffleInternalManagerBase.scala:273)
    at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
    at org.apache.spark.scheduler.ShuffleMapTask.$anonfun$runTask$3(ShuffleMapTask.scala:81)
    at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
    at org.apache.spark.scheduler.ShuffleMapTask.$anonfun$runTask$1(ShuffleMapTask.scala:81)
    at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.doRunTask(Task.scala:156)
    at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:125)
    at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
    at org.apache.spark.scheduler.Task.run(Task.scala:95)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$13(Executor.scala:832)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1681)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:835)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:690)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
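
The exception name, SplitAndRetryOOM, suggests a split-and-retry recovery strategy: when an operation does not fit in GPU memory, the input is split into smaller pieces and each piece is retried. The sketch below is a generic illustration of that idea in plain Python, not the plugin's implementation. `FakeOOM`, `process`, and `with_split_retry` are hypothetical names, and the "memory budget" is simulated by a row count.

```python
class FakeOOM(Exception):
    """Hypothetical stand-in for an out-of-memory signal."""


def process(rows, budget):
    # Simulated work: pretend each row costs one unit of memory.
    if len(rows) > budget:
        raise FakeOOM(f"{len(rows)} rows exceed budget {budget}")
    return sum(rows)


def with_split_retry(rows, budget):
    """Process rows, halving the input whenever it does not fit."""
    try:
        return [process(rows, budget)]
    except FakeOOM:
        if len(rows) <= 1:
            raise  # cannot split any further, so surface the failure
        mid = len(rows) // 2
        left = with_split_retry(rows[:mid], budget)
        right = with_split_retry(rows[mid:], budget)
        return left + right


# Ten rows with a budget of three rows per attempt.
partials = with_split_retry(list(range(10)), budget=3)
total = sum(partials)
```

A failure like the one in this trace would correspond to the case where the out-of-memory signal escapes without any caller splitting the input and retrying, so the task itself fails.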

Steps/Code to reproduce bug

Check out branch-22.06 and compile the DB 321 jar.

Start a DB cluster with i3.2xlarge + 4 * g4dn.4xlarge instances.

Configurations are:

spark.task.resource.gpu.amount 0.125
spark.shuffle.manager com.nvidia.spark.rapids.spark321db.RapidsShuffleManager
spark.hadoop.fs.s3a.access.key {{secrets/chongg-s3/access_key}}
spark.plugins com.nvidia.spark.SQLPlugin
spark.locality.wait 0s
spark.rapids.alluxio.automount.enabled false
spark.rapids.filecache.enabled true            ################ enable filecache ##############
spark.rapids.memory.pinnedPool.size 4G
spark.hadoop.fs.s3a.path.style.access true
spark.hadoop.fs.s3a.secret.key {{secrets/chongg-s3/secret_access_key}}
spark.sql.files.maxPartitionBytes 2G
spark.rapids.sql.multiThreadedRead.numThreads 100
spark.rapids.sql.concurrentGpuTasks 2
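
As a rough reading of these settings, the arithmetic below estimates how many tasks Spark may schedule per GPU and how many input bytes the concurrently running GPU tasks could be scanning at once. It is a back-of-the-envelope sketch, not a measurement: the 16 GiB GPU size is an assumption about the T4 in g4dn instances, and on-disk partition bytes understate the decoded size in GPU memory.

```python
# Values copied from the cluster configuration above.
conf = {
    "spark.task.resource.gpu.amount": "0.125",
    "spark.rapids.sql.concurrentGpuTasks": "2",
    "spark.sql.files.maxPartitionBytes": "2G",
}


def parse_size(text):
    """Parse a size such as '2G' or '512M' using binary units."""
    units = {"K": 1024, "M": 1024**2, "G": 1024**3}
    suffix = text[-1].upper()
    if suffix in units:
        return int(text[:-1]) * units[suffix]
    return int(text)


# A fractional GPU amount lets several tasks share one GPU slot.
tasks_per_gpu = int(1 / float(conf["spark.task.resource.gpu.amount"]))

# Only this many of those tasks may use the GPU at the same time.
concurrent = int(conf["spark.rapids.sql.concurrentGpuTasks"])

# Upper bound on file bytes being scanned by concurrent GPU tasks.
scan_bytes = concurrent * parse_size(conf["spark.sql.files.maxPartitionBytes"])

# Assumption: 16 GiB of GPU memory per g4dn instance (one T4).
assumed_gpu_bytes = 16 * 1024**3
scan_fraction = scan_bytes / assumed_gpu_bytes

print(f"tasks schedulable per GPU: {tasks_per_gpu}")
print(f"tasks on the GPU at once:  {concurrent}")
print(f"scan input upper bound:    {scan_bytes / 1024**3:.0f} GiB "
      f"({scan_fraction:.0%} of the assumed GPU memory)")
```

Under these assumptions, scan input alone could account for a quarter of GPU memory before any decompression, aggregation buffers, or shuffle data are counted.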

Docker image is: gaochong365/rapids-4-spark-databricks:23.04.0-rc1

Zone is us-west-2a

Open a notebook and run NDS:

%python
spark.conf.get("spark.rapids.filecache.enabled")  # filecache is enabled ("true")
spark.conf.get("spark.rapids.alluxio.automount.enabled")  # Alluxio automount is disabled
spark.conf.set("spark.rapids.alluxio.automount.enabled", "false")
spark.sparkContext.addPyFile("s3://chongg/test/nds2.zip")
from nds2 import nds_power
nds_power.run_query_stream_on_EKS()

Environment details (please complete the following information)

Additional context

DB version is: 10.4 LTS (includes Apache Spark 3.2.1, Scala 2.12)

Detailed logs:

====== Creating TempView for table customer_address ======
Time taken: 8242 millis for table customer_address
====== Creating TempView for table customer_demographics ======
Time taken: 523 millis for table customer_demographics
====== Creating TempView for table date_dim ======
Time taken: 3284 millis for table date_dim
====== Creating TempView for table warehouse ======
Time taken: 3166 millis for table warehouse
====== Creating TempView for table ship_mode ======
Time taken: 378 millis for table ship_mode
====== Creating TempView for table time_dim ======
Time taken: 3249 millis for table time_dim
====== Creating TempView for table reason ======
Time taken: 355 millis for table reason
====== Creating TempView for table income_band ======
Time taken: 455 millis for table income_band
====== Creating TempView for table item ======
Time taken: 502 millis for table item
====== Creating TempView for table store ======
Time taken: 493 millis for table store
====== Creating TempView for table call_center ======
Time taken: 414 millis for table call_center
====== Creating TempView for table customer ======
Time taken: 442 millis for table customer
====== Creating TempView for table web_site ======
Time taken: 381 millis for table web_site
====== Creating TempView for table store_returns ======
Time taken: 4769 millis for table store_returns
====== Creating TempView for table household_demographics ======
Time taken: 360 millis for table household_demographics
====== Creating TempView for table web_page ======
Time taken: 336 millis for table web_page
====== Creating TempView for table promotion ======
Time taken: 419 millis for table promotion
====== Creating TempView for table catalog_page ======
Time taken: 431 millis for table catalog_page
====== Creating TempView for table inventory ======
Time taken: 1025 millis for table inventory
====== Creating TempView for table catalog_returns ======
Time taken: 4261 millis for table catalog_returns
====== Creating TempView for table web_returns ======
Time taken: 4430 millis for table web_returns
====== Creating TempView for table web_sales ======
Time taken: 3767 millis for table web_sales
====== Creating TempView for table catalog_sales ======
Time taken: 3796 millis for table catalog_sales
====== Creating TempView for table store_sales ======
Time taken: 3679 millis for table store_sales
====== Run query96 ======
Not found com.nvidia.spark.rapids.listener.Manager 'JavaPackage' object is not callable
Time taken: [48518] millis for query96
====== Run query7 ======
Not found com.nvidia.spark.rapids.listener.Manager 'JavaPackage' object is not callable
Time taken: [30427] millis for query7
====== Run query75 ======
Not found com.nvidia.spark.rapids.listener.Manager 'JavaPackage' object is not callable
Time taken: [98609] millis for query75
====== Run query44 ======
Not found com.nvidia.spark.rapids.listener.Manager 'JavaPackage' object is not callable
Time taken: [6147] millis for query44
====== Run query39_part1 ======
Not found com.nvidia.spark.rapids.listener.Manager 'JavaPackage' object is not callable
Time taken: [8311] millis for query39_part1
====== Run query39_part2 ======
Not found com.nvidia.spark.rapids.listener.Manager 'JavaPackage' object is not callable
Time taken: [5588] millis for query39_part2
====== Run query80 ======
Not found com.nvidia.spark.rapids.listener.Manager 'JavaPackage' object is not callable
Time taken: [31311] millis for query80
====== Run query32 ======
Not found com.nvidia.spark.rapids.listener.Manager 'JavaPackage' object is not callable
Time taken: [4926] millis for query32
====== Run query19 ======
Not found com.nvidia.spark.rapids.listener.Manager 'JavaPackage' object is not callable
Time taken: [6325] millis for query19
====== Run query25 ======
Not found com.nvidia.spark.rapids.listener.Manager 'JavaPackage' object is not callable
Time taken: [11260] millis for query25
====== Run query78 ======
Not found com.nvidia.spark.rapids.listener.Manager 'JavaPackage' object is not callable
Time taken: [140475] millis for query78
====== Run query86 ======
Not found com.nvidia.spark.rapids.listener.Manager 'JavaPackage' object is not callable
Time taken: [4918] millis for query86
====== Run query1 ======
Not found com.nvidia.spark.rapids.listener.Manager 'JavaPackage' object is not callable
Time taken: [7598] millis for query1
====== Run query91 ======
Not found com.nvidia.spark.rapids.listener.Manager 'JavaPackage' object is not callable
Time taken: [4085] millis for query91
====== Run query21 ======
Not found com.nvidia.spark.rapids.listener.Manager 'JavaPackage' object is not callable
Time taken: [3167] millis for query21
====== Run query43 ======
Not found com.nvidia.spark.rapids.listener.Manager 'JavaPackage' object is not callable
Time taken: [6280] millis for query43
====== Run query27 ======
Not found com.nvidia.spark.rapids.listener.Manager 'JavaPackage' object is not callable
Time taken: [15975] millis for query27
====== Run query94 ======
Not found com.nvidia.spark.rapids.listener.Manager 'JavaPackage' object is not callable
Time taken: [62340] millis for query94
====== Run query45 ======
Not found com.nvidia.spark.rapids.listener.Manager 'JavaPackage' object is not callable
Time taken: [4556] millis for query45
====== Run query58 ======
Not found com.nvidia.spark.rapids.listener.Manager 'JavaPackage' object is not callable
Time taken: [4456] millis for query58
====== Run query64 ======
Not found com.nvidia.spark.rapids.listener.Manager 'JavaPackage' object is not callable
Time taken: [135912] millis for query64
====== Run query36 ======
Not found com.nvidia.spark.rapids.listener.Manager 'JavaPackage' object is not callable
Time taken: [13811] millis for query36
====== Run query33 ======
Not found com.nvidia.spark.rapids.listener.Manager 'JavaPackage' object is not callable
Time taken: [4757] millis for query33
====== Run query46 ======
Not found com.nvidia.spark.rapids.listener.Manager 'JavaPackage' object is not callable
Time taken: [12847] millis for query46
====== Run query62 ======
Not found com.nvidia.spark.rapids.listener.Manager 'JavaPackage' object is not callable
Time taken: [15950] millis for query62
====== Run query16 ======
Not found com.nvidia.spark.rapids.listener.Manager 'JavaPackage' object is not callable
Time taken: [7796] millis for query16
====== Run query10 ======
Not found com.nvidia.spark.rapids.listener.Manager 'JavaPackage' object is not callable
Time taken: [6210] millis for query10
====== Run query63 ======
Not found com.nvidia.spark.rapids.listener.Manager 'JavaPackage' object is not callable
Time taken: [9824] millis for query63
====== Run query69 ======
Not found com.nvidia.spark.rapids.listener.Manager 'JavaPackage' object is not callable
Time taken: [6060] millis for query69
====== Run query60 ======
Not found com.nvidia.spark.rapids.listener.Manager 'JavaPackage' object is not callable
Time taken: [6260] millis for query60
====== Run query59 ======
Not found com.nvidia.spark.rapids.listener.Manager 'JavaPackage' object is not callable
Time taken: [19820] millis for query59
====== Run query37 ======
Not found com.nvidia.spark.rapids.listener.Manager 'JavaPackage' object is not callable
Time taken: [16616] millis for query37
====== Run query98 ======
Not found com.nvidia.spark.rapids.listener.Manager 'JavaPackage' object is not callable
Time taken: [4365] millis for query98
====== Run query85 ======
Not found com.nvidia.spark.rapids.listener.Manager 'JavaPackage' object is not callable
Time taken: [16484] millis for query85
====== Run query70 ======
Not found com.nvidia.spark.rapids.listener.Manager 'JavaPackage' object is not callable
Time taken: [11686] millis for query70
====== Run query67 ======
Not found com.nvidia.spark.rapids.listener.Manager 'JavaPackage' object is not callable
Time taken: [646572] millis for query67
====== Run query28 ======
Not found com.nvidia.spark.rapids.listener.Manager 'JavaPackage' object is not callable
Time taken: [105924] millis for query28
====== Run query81 ======
Not found com.nvidia.spark.rapids.listener.Manager 'JavaPackage' object is not callable
Time taken: [9061] millis for query81
====== Run query97 ======
Not found com.nvidia.spark.rapids.listener.Manager 'JavaPackage' object is not callable
Time taken: [25135] millis for query97
====== Run query66 ======
Not found com.nvidia.spark.rapids.listener.Manager 'JavaPackage' object is not callable
Time taken: [17316] millis for query66
====== Run query90 ======
Not found com.nvidia.spark.rapids.listener.Manager 'JavaPackage' object is not callable
Time taken: [46541] millis for query90
====== Run query17 ======
Not found com.nvidia.spark.rapids.listener.Manager 'JavaPackage' object is not callable
Time taken: [10266] millis for query17
====== Run query47 ======
Not found com.nvidia.spark.rapids.listener.Manager 'JavaPackage' object is not callable
Time taken: [14113] millis for query47
====== Run query95 ======
Not found com.nvidia.spark.rapids.listener.Manager 'JavaPackage' object is not callable
Time taken: [67166] millis for query95
====== Run query92 ======
Not found com.nvidia.spark.rapids.listener.Manager 'JavaPackage' object is not callable
Time taken: [3941] millis for query92
====== Run query3 ======
Not found com.nvidia.spark.rapids.listener.Manager 'JavaPackage' object is not callable
Time taken: [6169] millis for query3
====== Run query51 ======
Not found com.nvidia.spark.rapids.listener.Manager 'JavaPackage' object is not callable
Time taken: [12093] millis for query51
====== Run query35 ======
Not found com.nvidia.spark.rapids.listener.Manager 'JavaPackage' object is not callable
Time taken: [8164] millis for query35
====== Run query49 ======
Not found com.nvidia.spark.rapids.listener.Manager 'JavaPackage' object is not callable
Time taken: [24913] millis for query49
====== Run query9 ======
Not found com.nvidia.spark.rapids.listener.Manager 'JavaPackage' object is not callable
An error occurred while calling o73219.save.
: org.apache.spark.SparkException: Job aborted.
    at org.apache.spark.sql.rapids.GpuFileFormatWriter$.write(GpuFileFormatWriter.scala:284)
    at org.apache.spark.sql.rapids.GpuInsertIntoHadoopFsRelationCommand.runColumnar(GpuInsertIntoHadoopFsRelationCommand.scala:184)
    at com.nvidia.spark.rapids.GpuDataWritingCommandExec.sideEffectResult$lzycompute(GpuDataWritingCommandExec.scala:117)
    at com.nvidia.spark.rapids.GpuDataWritingCommandExec.sideEffectResult(GpuDataWritingCommandExec.scala:116)
    at com.nvidia.spark.rapids.GpuDataWritingCommandExec.internalDoExecuteColumnar(GpuDataWritingCommandExec.scala:140)
    at com.nvidia.spark.rapids.GpuExec.doExecuteColumnar(GpuExec.scala:351)
    at com.nvidia.spark.rapids.GpuExec.doExecuteColumnar$(GpuExec.scala:350)
    at com.nvidia.spark.rapids.GpuDataWritingCommandExec.doExecuteColumnar(GpuDataWritingCommandExec.scala:112)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeColumnar$1(SparkPlan.scala:253)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:270)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:165)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:266)
    at org.apache.spark.sql.execution.SparkPlan.executeColumnar(SparkPlan.scala:249)
    at com.nvidia.spark.rapids.GpuColumnarToRowExec.doExecute(GpuColumnarToRowExec.scala:333)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:226)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:270)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:165)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:266)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:222)
    at org.apache.spark.sql.execution.collect.Collector$.collect(Collector.scala:78)
    at org.apache.spark.sql.execution.collect.Collector$.collect(Collector.scala:87)
    at org.apache.spark.sql.execution.collect.InternalRowFormat$.collect(cachedSparkResults.scala:75)
    at org.apache.spark.sql.execution.collect.InternalRowFormat$.collect(cachedSparkResults.scala:62)
    at org.apache.spark.sql.execution.ResultCacheManager.collectResult$1(ResultCacheManager.scala:575)
    at org.apache.spark.sql.execution.ResultCacheManager.computeResult(ResultCacheManager.scala:582)
    at org.apache.spark.sql.execution.ResultCacheManager.$anonfun$getOrComputeResultInternal$1(ResultCacheManager.scala:528)
    at scala.Option.getOrElse(Option.scala:189)
    at org.apache.spark.sql.execution.ResultCacheManager.getOrComputeResultInternal(ResultCacheManager.scala:527)
    at org.apache.spark.sql.execution.ResultCacheManager.getOrComputeResult(ResultCacheManager.scala:424)
    at org.apache.spark.sql.execution.ResultCacheManager.getOrComputeResult(ResultCacheManager.scala:403)
    at org.apache.spark.sql.execution.SparkPlan.executeCollectResult(SparkPlan.scala:424)
    at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:400)
    at org.apache.spark.sql.execution.QueryExecution$$anonfun$$nestedInanonfun$eagerlyExecuteCommands$1$1.$anonfun$applyOrElse$1(QueryExecution.scala:160)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$8(SQLExecution.scala:239)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:386)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$1(SQLExecution.scala:186)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:968)
    at org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:141)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:336)
    at org.apache.spark.sql.execution.QueryExecution$$anonfun$$nestedInanonfun$eagerlyExecuteCommands$1$1.applyOrElse(QueryExecution.scala:160)
    at org.apache.spark.sql.execution.QueryExecution$$anonfun$$nestedInanonfun$eagerlyExecuteCommands$1$1.applyOrElse(QueryExecution.scala:156)
    at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:590)
    at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:168)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:590)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:268)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:264)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:566)
    at org.apache.spark.sql.execution.QueryExecution.$anonfun$eagerlyExecuteCommands$1(QueryExecution.scala:156)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:324)
    at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:156)
    at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:141)
    at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:132)
    at org.apache.spark.sql.execution.QueryExecution.assertCommandExecuted(QueryExecution.scala:186)
    at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:959)
    at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:427)
    at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:396)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:250)
    at sun.reflect.GeneratedMethodAccessor352.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
    at py4j.Gateway.invoke(Gateway.java:295)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:251)
    at java.lang.Thread.run(Thread.java:750)
Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult: 
    at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:446)
    at org.apache.spark.sql.execution.SubqueryExec.executeCollect(basicPhysicalOperators.scala:954)
    at org.apache.spark.sql.rapids.GpuScalarSubquery.updateResult(GpuScalarSubquery.scala:49)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$waitForSubqueries$1(SparkPlan.scala:308)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$waitForSubqueries$1$adapted(SparkPlan.scala:307)
    at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
    at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
    at org.apache.spark.sql.execution.SparkPlan.waitForSubqueries(SparkPlan.scala:307)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:269)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:165)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:266)
    at org.apache.spark.sql.execution.SparkPlan.executeColumnar(SparkPlan.scala:249)
    at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$doExecuteColumnar$1(AdaptiveSparkPlanExec.scala:517)
    at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.withFinalPlanUpdate(AdaptiveSparkPlanExec.scala:503)
    at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.doExecuteColumnar(AdaptiveSparkPlanExec.scala:517)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeColumnar$1(SparkPlan.scala:253)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:270)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:165)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:266)
    at org.apache.spark.sql.execution.SparkPlan.executeColumnar(SparkPlan.scala:249)
    at org.apache.spark.sql.rapids.GpuFileFormatWriter$.write(GpuFileFormatWriter.scala:211)
    ... 69 more
Caused by: java.util.concurrent.ExecutionException: org.apache.spark.SparkException: Job aborted due to stage failure: Task 142 in stage 1089.0 failed 4 times, most recent failure: Lost task 142.3 in stage 1089.0 (TID 24175) (10.59.240.71 executor 0): com.nvidia.spark.rapids.jni.SplitAndRetryOOM: GPU OutOfMemory
    at ai.rapids.cudf.Table.contiguousSplit(Native Method)
    at ai.rapids.cudf.Table.contiguousSplit(Table.java:2171)
    at com.nvidia.spark.rapids.SpillableColumnarBatch$.$anonfun$addBatch$2(SpillableColumnarBatch.scala:210)
    at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
    at com.nvidia.spark.rapids.SpillableColumnarBatch$.$anonfun$addBatch$1(SpillableColumnarBatch.scala:209)
    at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
    at com.nvidia.spark.rapids.SpillableColumnarBatch$.addBatch(SpillableColumnarBatch.scala:192)
    at com.nvidia.spark.rapids.SpillableColumnarBatch$.apply(SpillableColumnarBatch.scala:142)
    at com.nvidia.spark.rapids.GpuHashAggregateIterator$AggHelper.preProcess(aggregate.scala:275)
    at com.nvidia.spark.rapids.GpuHashAggregateIterator$.$anonfun$computeAggregateAndClose$1(aggregate.scala:421)
    at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
    at com.nvidia.spark.rapids.GpuHashAggregateIterator$.computeAggregateAndClose(aggregate.scala:416)
    at com.nvidia.spark.rapids.GpuHashAggregateIterator.aggregateInputBatches(aggregate.scala:604)
    at com.nvidia.spark.rapids.GpuHashAggregateIterator.$anonfun$next$2(aggregate.scala:556)
    at scala.Option.getOrElse(Option.scala:189)
    at com.nvidia.spark.rapids.GpuHashAggregateIterator.next(aggregate.scala:553)
    at com.nvidia.spark.rapids.GpuHashAggregateIterator.next(aggregate.scala:498)
    at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase$$anon$1.partNextBatch(GpuShuffleExchangeExecBase.scala:318)
    at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase$$anon$1.hasNext(GpuShuffleExchangeExecBase.scala:340)
    at org.apache.spark.sql.rapids.RapidsShuffleThreadedWriterBase.$anonfun$write$2(RapidsShuffleInternalManagerBase.scala:281)
    at org.apache.spark.sql.rapids.RapidsShuffleThreadedWriterBase.$anonfun$write$2$adapted(RapidsShuffleInternalManagerBase.scala:274)
    at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
    at org.apache.spark.sql.rapids.RapidsShuffleThreadedWriterBase.$anonfun$write$1(RapidsShuffleInternalManagerBase.scala:274)
    at org.apache.spark.sql.rapids.RapidsShuffleThreadedWriterBase.$anonfun$write$1$adapted(RapidsShuffleInternalManagerBase.scala:273)
    at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
    at org.apache.spark.sql.rapids.RapidsShuffleThreadedWriterBase.write(RapidsShuffleInternalManagerBase.scala:273)
    at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
    at org.apache.spark.scheduler.ShuffleMapTask.$anonfun$runTask$3(ShuffleMapTask.scala:81)
    at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
    at org.apache.spark.scheduler.ShuffleMapTask.$anonfun$runTask$1(ShuffleMapTask.scala:81)
    at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.doRunTask(Task.scala:156)
    at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:125)
    at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
    at org.apache.spark.scheduler.Task.run(Task.scala:95)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$13(Executor.scala:832)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1681)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:835)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:690)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)

Driver stacktrace:
    at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
    at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
    at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:438)
    ... 90 more
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 142 in stage 1089.0 failed 4 times, most recent failure: Lost task 142.3 in stage 1089.0 (TID 24175) (10.59.240.71 executor 0): com.nvidia.spark.rapids.jni.SplitAndRetryOOM: GPU OutOfMemory
    at ai.rapids.cudf.Table.contiguousSplit(Native Method)
    at ai.rapids.cudf.Table.contiguousSplit(Table.java:2171)
    at com.nvidia.spark.rapids.SpillableColumnarBatch$.$anonfun$addBatch$2(SpillableColumnarBatch.scala:210)
    at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
    at com.nvidia.spark.rapids.SpillableColumnarBatch$.$anonfun$addBatch$1(SpillableColumnarBatch.scala:209)
    at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
    at com.nvidia.spark.rapids.SpillableColumnarBatch$.addBatch(SpillableColumnarBatch.scala:192)
    at com.nvidia.spark.rapids.SpillableColumnarBatch$.apply(SpillableColumnarBatch.scala:142)
    at com.nvidia.spark.rapids.GpuHashAggregateIterator$AggHelper.preProcess(aggregate.scala:275)
    at com.nvidia.spark.rapids.GpuHashAggregateIterator$.$anonfun$computeAggregateAndClose$1(aggregate.scala:421)
    at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
    at com.nvidia.spark.rapids.GpuHashAggregateIterator$.computeAggregateAndClose(aggregate.scala:416)
    at com.nvidia.spark.rapids.GpuHashAggregateIterator.aggregateInputBatches(aggregate.scala:604)
    at com.nvidia.spark.rapids.GpuHashAggregateIterator.$anonfun$next$2(aggregate.scala:556)
    at scala.Option.getOrElse(Option.scala:189)
    at com.nvidia.spark.rapids.GpuHashAggregateIterator.next(aggregate.scala:553)
    at com.nvidia.spark.rapids.GpuHashAggregateIterator.next(aggregate.scala:498)
    at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase$$anon$1.partNextBatch(GpuShuffleExchangeExecBase.scala:318)
    at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase$$anon$1.hasNext(GpuShuffleExchangeExecBase.scala:340)
    at org.apache.spark.sql.rapids.RapidsShuffleThreadedWriterBase.$anonfun$write$2(RapidsShuffleInternalManagerBase.scala:281)
    at org.apache.spark.sql.rapids.RapidsShuffleThreadedWriterBase.$anonfun$write$2$adapted(RapidsShuffleInternalManagerBase.scala:274)
    at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
    at org.apache.spark.sql.rapids.RapidsShuffleThreadedWriterBase.$anonfun$write$1(RapidsShuffleInternalManagerBase.scala:274)
    at org.apache.spark.sql.rapids.RapidsShuffleThreadedWriterBase.$anonfun$write$1$adapted(RapidsShuffleInternalManagerBase.scala:273)
    at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
    at org.apache.spark.sql.rapids.RapidsShuffleThreadedWriterBase.write(RapidsShuffleInternalManagerBase.scala:273)
    at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
    at org.apache.spark.scheduler.ShuffleMapTask.$anonfun$runTask$3(ShuffleMapTask.scala:81)
    at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
    at org.apache.spark.scheduler.ShuffleMapTask.$anonfun$runTask$1(ShuffleMapTask.scala:81)
    at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.doRunTask(Task.scala:156)
    at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:125)
    at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
    at org.apache.spark.scheduler.Task.run(Task.scala:95)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$13(Executor.scala:832)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1681)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:835)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:690)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:3088)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:3035)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:3029)
    at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
    at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:3029)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1391)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1391)
    at scala.Option.foreach(Option.scala:407)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1391)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3297)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3238)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3226)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
Caused by: com.nvidia.spark.rapids.jni.SplitAndRetryOOM: GPU OutOfMemory
    at ai.rapids.cudf.Table.contiguousSplit(Native Method)
    at ai.rapids.cudf.Table.contiguousSplit(Table.java:2171)
    at com.nvidia.spark.rapids.SpillableColumnarBatch$.$anonfun$addBatch$2(SpillableColumnarBatch.scala:210)
    at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
    at com.nvidia.spark.rapids.SpillableColumnarBatch$.$anonfun$addBatch$1(SpillableColumnarBatch.scala:209)
    at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
    at com.nvidia.spark.rapids.SpillableColumnarBatch$.addBatch(SpillableColumnarBatch.scala:192)
    at com.nvidia.spark.rapids.SpillableColumnarBatch$.apply(SpillableColumnarBatch.scala:142)
    at com.nvidia.spark.rapids.GpuHashAggregateIterator$AggHelper.preProcess(aggregate.scala:275)
    at com.nvidia.spark.rapids.GpuHashAggregateIterator$.$anonfun$computeAggregateAndClose$1(aggregate.scala:421)
    at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
    at com.nvidia.spark.rapids.GpuHashAggregateIterator$.computeAggregateAndClose(aggregate.scala:416)
    at com.nvidia.spark.rapids.GpuHashAggregateIterator.aggregateInputBatches(aggregate.scala:604)
    at com.nvidia.spark.rapids.GpuHashAggregateIterator.$anonfun$next$2(aggregate.scala:556)
    at scala.Option.getOrElse(Option.scala:189)
    at com.nvidia.spark.rapids.GpuHashAggregateIterator.next(aggregate.scala:553)
    at com.nvidia.spark.rapids.GpuHashAggregateIterator.next(aggregate.scala:498)
    at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase$$anon$1.partNextBatch(GpuShuffleExchangeExecBase.scala:318)
    at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase$$anon$1.hasNext(GpuShuffleExchangeExecBase.scala:340)
    at org.apache.spark.sql.rapids.RapidsShuffleThreadedWriterBase.$anonfun$write$2(RapidsShuffleInternalManagerBase.scala:281)
    at org.apache.spark.sql.rapids.RapidsShuffleThreadedWriterBase.$anonfun$write$2$adapted(RapidsShuffleInternalManagerBase.scala:274)
    at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
    at org.apache.spark.sql.rapids.RapidsShuffleThreadedWriterBase.$anonfun$write$1(RapidsShuffleInternalManagerBase.scala:274)
    at org.apache.spark.sql.rapids.RapidsShuffleThreadedWriterBase.$anonfun$write$1$adapted(RapidsShuffleInternalManagerBase.scala:273)
    at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
    at org.apache.spark.sql.rapids.RapidsShuffleThreadedWriterBase.write(RapidsShuffleInternalManagerBase.scala:273)
    at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
    at org.apache.spark.scheduler.ShuffleMapTask.$anonfun$runTask$3(ShuffleMapTask.scala:81)
    at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
    at org.apache.spark.scheduler.ShuffleMapTask.$anonfun$runTask$1(ShuffleMapTask.scala:81)
    at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.doRunTask(Task.scala:156)
    at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:125)
    at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
    at org.apache.spark.scheduler.Task.run(Task.scala:95)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$13(Executor.scala:832)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1681)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:835)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:690)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)

Time taken: [34695] millis for query9
====== Run query31 ======
Not found com.nvidia.spark.rapids.listener.Manager 'JavaPackage' object is not callable
Time taken: [14528] millis for query31
====== Run query11 ======
Not found com.nvidia.spark.rapids.listener.Manager 'JavaPackage' object is not callable
Time taken: [62402] millis for query11
====== Run query93 ======
Not found com.nvidia.spark.rapids.listener.Manager 'JavaPackage' object is not callable
Time taken: [303393] millis for query93
====== Run query29 ======
Not found com.nvidia.spark.rapids.listener.Manager 'JavaPackage' object is not callable
Time taken: [30185] millis for query29
====== Run query38 ======
Not found com.nvidia.spark.rapids.listener.Manager 'JavaPackage' object is not callable
Time taken: [13714] millis for query38
====== Run query22 ======
Not found com.nvidia.spark.rapids.listener.Manager 'JavaPackage' object is not callable
Time taken: [5826] millis for query22
====== Run query89 ======
Not found com.nvidia.spark.rapids.listener.Manager 'JavaPackage' object is not callable
Time taken: [9580] millis for query89
====== Run query15 ======
Not found com.nvidia.spark.rapids.listener.Manager 'JavaPackage' object is not callable
Time taken: [4871] millis for query15
====== Run query6 ======
Not found com.nvidia.spark.rapids.listener.Manager 'JavaPackage' object is not callable
Time taken: [3484] millis for query6
====== Run query52 ======
Not found com.nvidia.spark.rapids.listener.Manager 'JavaPackage' object is not callable
Time taken: [3987] millis for query52
====== Run query50 ======
Not found com.nvidia.spark.rapids.listener.Manager 'JavaPackage' object is not callable
Time taken: [165529] millis for query50
====== Run query42 ======
Not found com.nvidia.spark.rapids.listener.Manager 'JavaPackage' object is not callable
Time taken: [3985] millis for query42
====== Run query41 ======
Not found com.nvidia.spark.rapids.listener.Manager 'JavaPackage' object is not callable
Time taken: [2328] millis for query41
====== Run query8 ======
Not found com.nvidia.spark.rapids.listener.Manager 'JavaPackage' object is not callable
Time taken: [4331] millis for query8
====== Run query12 ======
Not found com.nvidia.spark.rapids.listener.Manager 'JavaPackage' object is not callable
Time taken: [3327] millis for query12
====== Run query20 ======
Not found com.nvidia.spark.rapids.listener.Manager 'JavaPackage' object is not callable
Time taken: [3190] millis for query20
====== Run query88 ======
Not found com.nvidia.spark.rapids.listener.Manager 'JavaPackage' object is not callable
Time taken: [119017] millis for query88
====== Run query82 ======
Not found com.nvidia.spark.rapids.listener.Manager 'JavaPackage' object is not callable
Time taken: [28516] millis for query82
====== Run query23_part1 ======
Not found com.nvidia.spark.rapids.listener.Manager 'JavaPackage' object is not callable
Time taken: [52752] millis for query23_part1
====== Run query23_part2 ======
Not found com.nvidia.spark.rapids.listener.Manager 'JavaPackage' object is not callable
Time taken: [52281] millis for query23_part2
====== Run query14_part1 ======
Not found com.nvidia.spark.rapids.listener.Manager 'JavaPackage' object is not callable
Time taken: [59121] millis for query14_part1
====== Run query14_part2 ======
Not found com.nvidia.spark.rapids.listener.Manager 'JavaPackage' object is not callable
Time taken: [45100] millis for query14_part2
====== Run query57 ======
Not found com.nvidia.spark.rapids.listener.Manager 'JavaPackage' object is not callable
Time taken: [9789] millis for query57
====== Run query65 ======
Not found com.nvidia.spark.rapids.listener.Manager 'JavaPackage' object is not callable
Time taken: [37712] millis for query65
====== Run query71 ======
Not found com.nvidia.spark.rapids.listener.Manager 'JavaPackage' object is not callable
Time taken: [7045] millis for query71
====== Run query34 ======
Not found com.nvidia.spark.rapids.listener.Manager 'JavaPackage' object is not callable
Time taken: [6012] millis for query34
====== Run query48 ======
Not found com.nvidia.spark.rapids.listener.Manager 'JavaPackage' object is not callable
Time taken: [11606] millis for query48
====== Run query30 ======
Not found com.nvidia.spark.rapids.listener.Manager 'JavaPackage' object is not callable
Time taken: [7890] millis for query30
====== Run query74 ======
Not found com.nvidia.spark.rapids.listener.Manager 'JavaPackage' object is not callable
Time taken: [30204] millis for query74
====== Run query87 ======
Not found com.nvidia.spark.rapids.listener.Manager 'JavaPackage' object is not callable
Time taken: [14858] millis for query87
====== Run query77 ======
Not found com.nvidia.spark.rapids.listener.Manager 'JavaPackage' object is not callable
Time taken: [7240] millis for query77
====== Run query73 ======
Not found com.nvidia.spark.rapids.listener.Manager 'JavaPackage' object is not callable
Time taken: [4219] millis for query73
====== Run query84 ======
Not found com.nvidia.spark.rapids.listener.Manager 'JavaPackage' object is not callable
Time taken: [8392] millis for query84
====== Run query54 ======
Not found com.nvidia.spark.rapids.listener.Manager 'JavaPackage' object is not callable
Time taken: [5990] millis for query54
====== Run query55 ======
Not found com.nvidia.spark.rapids.listener.Manager 'JavaPackage' object is not callable
Time taken: [3284] millis for query55
====== Run query56 ======
Not found com.nvidia.spark.rapids.listener.Manager 'JavaPackage' object is not callable
Time taken: [4398] millis for query56
====== Run query2 ======
Not found com.nvidia.spark.rapids.listener.Manager 'JavaPackage' object is not callable

res-life commented 1 year ago

This is not related to the file cache. The OOM also occurs when the file cache is disabled and Alluxio is enabled, and it also occurs at query9.

revans2 commented 1 year ago

We increased the memory pressure in some places while adding the retry work. #7672 should fix the issue and is currently being worked on.
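
For readers unfamiliar with the retry work mentioned above, the following is a minimal sketch of the general split-and-retry idea that the `SplitAndRetryOOM` exception name suggests: when a step cannot fit its input in memory, the input is halved and each half is retried. This is not the plugin's implementation. `SplitAndRetry`, `with_split_retry`, and `make_sum_with_limit` are hypothetical names, and a simple row limit stands in for GPU memory.

```python
class SplitAndRetry(Exception):
    """Hypothetical stand-in for an OOM signal that asks the caller to split its input."""


def with_split_retry(batch, process):
    """Run process(batch); on SplitAndRetry, halve the failing piece and retry each half.

    Results come back in input order, one per piece that was finally processed.
    """
    pending = [batch]
    results = []
    while pending:
        piece = pending.pop(0)
        try:
            results.append(process(piece))
        except SplitAndRetry:
            if len(piece) < 2:
                # A single row cannot be split further, so the failure is final.
                raise
            mid = len(piece) // 2
            # Put both halves back at the front so output order is preserved.
            pending[:0] = [piece[:mid], piece[mid:]]
    return results


def make_sum_with_limit(max_rows):
    """Build a toy 'operator' that fails when given more than max_rows rows."""
    def process(rows):
        if len(rows) > max_rows:
            raise SplitAndRetry(f"{len(rows)} rows exceed the limit of {max_rows}")
        return sum(rows)
    return process


if __name__ == "__main__":
    # Eight rows with a limit of three are split twice, into four pieces of two.
    print(with_split_retry(list(range(8)), make_sum_with_limit(3)))
```

In this toy version a piece that cannot be split any further re-raises the error, which mirrors how the failure in this issue surfaces to the task as an exception instead of being absorbed by a retry.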

mattahrens commented 1 year ago

I confirmed that this is reproducible without Alluxio and without file caching enabled.

tgravescs commented 1 year ago

Databricks uses CUDA 11.4 by default, so we don't have the async allocator there. I have sometimes seen OOMs in the past due to fragmentation.

mattahrens commented 1 year ago

I validated that the NDS SF3K benchmark on an i3.2xlarge + 4 * g4dn.4xlarge cluster succeeds without any query failures when run with a test jar that includes the fix for https://github.com/NVIDIA/spark-rapids/issues/7672

mattahrens commented 1 year ago

Validated with the latest snapshot jar: the NDS SF3K benchmark on an i3.2xlarge + 4 * g4dn.4xlarge cluster completes with no query failures.

Results:

====== Run query96 ======
TaskFailureListener is registered.
Time taken: [19880] millis for query96
====== Run query7 ======
TaskFailureListener is registered.
Time taken: [12176] millis for query7
====== Run query75 ======
TaskFailureListener is registered.
Time taken: [49120] millis for query75
====== Run query44 ======
TaskFailureListener is registered.
Time taken: [4414] millis for query44
====== Run query39_part1 ======
TaskFailureListener is registered.
Time taken: [5818] millis for query39_part1
====== Run query39_part2 ======
TaskFailureListener is registered.
Time taken: [3724] millis for query39_part2
====== Run query80 ======
TaskFailureListener is registered.
Time taken: [25905] millis for query80
====== Run query32 ======
TaskFailureListener is registered.
Time taken: [4189] millis for query32
====== Run query19 ======
TaskFailureListener is registered.
Time taken: [4544] millis for query19
====== Run query25 ======
TaskFailureListener is registered.
Time taken: [7889] millis for query25
====== Run query78 ======
TaskFailureListener is registered.
Time taken: [106146] millis for query78
====== Run query86 ======
TaskFailureListener is registered.
Time taken: [3474] millis for query86
====== Run query1 ======
TaskFailureListener is registered.
Time taken: [5370] millis for query1
====== Run query91 ======
TaskFailureListener is registered.
Time taken: [3278] millis for query91
====== Run query21 ======
TaskFailureListener is registered.
Time taken: [1612] millis for query21
====== Run query43 ======
TaskFailureListener is registered.
Time taken: [4416] millis for query43
====== Run query27 ======
TaskFailureListener is registered.
Time taken: [9193] millis for query27
====== Run query94 ======
TaskFailureListener is registered.
Time taken: [23573] millis for query94
====== Run query45 ======
TaskFailureListener is registered.
Time taken: [2857] millis for query45
====== Run query58 ======
TaskFailureListener is registered.
Time taken: [2903] millis for query58
====== Run query64 ======
TaskFailureListener is registered.
Time taken: [78379] millis for query64
====== Run query36 ======
TaskFailureListener is registered.
Time taken: [7803] millis for query36
====== Run query33 ======
TaskFailureListener is registered.
Time taken: [4199] millis for query33
====== Run query46 ======
TaskFailureListener is registered.
Time taken: [9450] millis for query46
====== Run query62 ======
TaskFailureListener is registered.
Time taken: [8595] millis for query62
====== Run query16 ======
TaskFailureListener is registered.
Time taken: [4460] millis for query16
====== Run query10 ======
TaskFailureListener is registered.
Time taken: [5022] millis for query10
====== Run query63 ======
TaskFailureListener is registered.
Time taken: [5986] millis for query63
====== Run query69 ======
TaskFailureListener is registered.
Time taken: [3991] millis for query69
====== Run query60 ======
TaskFailureListener is registered.
Time taken: [5732] millis for query60
====== Run query59 ======
TaskFailureListener is registered.
Time taken: [15394] millis for query59
====== Run query37 ======
TaskFailureListener is registered.
Time taken: [8350] millis for query37
====== Run query98 ======
TaskFailureListener is registered.
Time taken: [3504] millis for query98
====== Run query85 ======
TaskFailureListener is registered.
Time taken: [10754] millis for query85
====== Run query70 ======
TaskFailureListener is registered.
Time taken: [9281] millis for query70
====== Run query67 ======
TaskFailureListener is registered.
Time taken: [647998] millis for query67
====== Run query28 ======
TaskFailureListener is registered.
Time taken: [80697] millis for query28
====== Run query81 ======
TaskFailureListener is registered.
Time taken: [6388] millis for query81
====== Run query97 ======
TaskFailureListener is registered.
Time taken: [15391] millis for query97
====== Run query66 ======
TaskFailureListener is registered.
Time taken: [12784] millis for query66
====== Run query90 ======
TaskFailureListener is registered.
Time taken: [6108] millis for query90
====== Run query17 ======
TaskFailureListener is registered.
Time taken: [9418] millis for query17
====== Run query47 ======
TaskFailureListener is registered.
Time taken: [11350] millis for query47
====== Run query95 ======
TaskFailureListener is registered.
Time taken: [38455] millis for query95
====== Run query92 ======
TaskFailureListener is registered.
Time taken: [2664] millis for query92
====== Run query3 ======
TaskFailureListener is registered.
Time taken: [7839] millis for query3
====== Run query51 ======
TaskFailureListener is registered.
Time taken: [7959] millis for query51
====== Run query35 ======
TaskFailureListener is registered.
Time taken: [7043] millis for query35
====== Run query49 ======
TaskFailureListener is registered.
Time taken: [16491] millis for query49
====== Run query9 ======
TaskFailureListener is registered.
Time taken: [21549] millis for query9
====== Run query31 ======
TaskFailureListener is registered.
Time taken: [12092] millis for query31
====== Run query11 ======
TaskFailureListener is registered.
Time taken: [33898] millis for query11
====== Run query93 ======
TaskFailureListener is registered.
Time taken: [229439] millis for query93
====== Run query29 ======
TaskFailureListener is registered.
Time taken: [20076] millis for query29
====== Run query38 ======
TaskFailureListener is registered.
Time taken: [11556] millis for query38
====== Run query22 ======
TaskFailureListener is registered.
Time taken: [3954] millis for query22
====== Run query89 ======
TaskFailureListener is registered.
Time taken: [5318] millis for query89
====== Run query15 ======
TaskFailureListener is registered.
Time taken: [3497] millis for query15
====== Run query6 ======
TaskFailureListener is registered.
Time taken: [2011] millis for query6
====== Run query52 ======
TaskFailureListener is registered.
Time taken: [4053] millis for query52
====== Run query50 ======
TaskFailureListener is registered.
Time taken: [140514] millis for query50
====== Run query42 ======
TaskFailureListener is registered.
Time taken: [2867] millis for query42
====== Run query41 ======
TaskFailureListener is registered.
Time taken: [816] millis for query41
====== Run query8 ======
TaskFailureListener is registered.
Time taken: [3618] millis for query8
====== Run query12 ======
TaskFailureListener is registered.
Time taken: [1881] millis for query12
====== Run query20 ======
TaskFailureListener is registered.
Time taken: [1950] millis for query20
====== Run query88 ======
TaskFailureListener is registered.
Time taken: [66945] millis for query88
====== Run query82 ======
TaskFailureListener is registered.
Time taken: [14500] millis for query82
====== Run query23_part1 ======
TaskFailureListener is registered.
Time taken: [48910] millis for query23_part1
====== Run query23_part2 ======
TaskFailureListener is registered.
Time taken: [48231] millis for query23_part2
====== Run query14_part1 ======
TaskFailureListener is registered.
Time taken: [30410] millis for query14_part1
====== Run query14_part2 ======
TaskFailureListener is registered.
Time taken: [24861] millis for query14_part2
====== Run query57 ======
TaskFailureListener is registered.
Time taken: [7074] millis for query57
====== Run query65 ======
TaskFailureListener is registered.
Time taken: [31166] millis for query65
====== Run query71 ======
TaskFailureListener is registered.
Time taken: [9616] millis for query71
====== Run query34 ======
TaskFailureListener is registered.
Time taken: [4916] millis for query34
====== Run query48 ======
TaskFailureListener is registered.
Time taken: [6922] millis for query48
====== Run query30 ======
TaskFailureListener is registered.
Time taken: [6529] millis for query30
====== Run query74 ======
TaskFailureListener is registered.
Time taken: [22272] millis for query74
====== Run query87 ======
TaskFailureListener is registered.
Time taken: [12607] millis for query87
====== Run query77 ======
TaskFailureListener is registered.
Time taken: [7064] millis for query77
====== Run query73 ======
TaskFailureListener is registered.
Time taken: [3262] millis for query73
====== Run query84 ======
TaskFailureListener is registered.
Time taken: [4278] millis for query84
====== Run query54 ======
TaskFailureListener is registered.
Time taken: [5747] millis for query54
====== Run query55 ======
TaskFailureListener is registered.
Time taken: [2622] millis for query55
====== Run query56 ======
TaskFailureListener is registered.
Time taken: [4163] millis for query56
====== Run query2 ======
TaskFailureListener is registered.
Time taken: [13934] millis for query2
====== Run query26 ======
TaskFailureListener is registered.
Time taken: [6269] millis for query26
====== Run query40 ======
TaskFailureListener is registered.
Time taken: [11906] millis for query40
====== Run query72 ======
TaskFailureListener is registered.
Time taken: [39690] millis for query72
====== Run query53 ======
TaskFailureListener is registered.
Time taken: [4198] millis for query53
====== Run query79 ======
TaskFailureListener is registered.
Time taken: [8341] millis for query79
====== Run query18 ======
TaskFailureListener is registered.
Time taken: [11206] millis for query18
====== Run query13 ======
TaskFailureListener is registered.
Time taken: [9851] millis for query13
====== Run query24_part1 ======
TaskFailureListener is registered.
Time taken: [52977] millis for query24_part1
====== Run query24_part2 ======
TaskFailureListener is registered.
Time taken: [50473] millis for query24_part2
====== Run query4 ======
TaskFailureListener is registered.
Time taken: [100992] millis for query4
====== Run query99 ======
TaskFailureListener is registered.
Time taken: [16626] millis for query99
====== Run query68 ======
TaskFailureListener is registered.
Time taken: [5855] millis for query68
====== Run query83 ======
TaskFailureListener is registered.
Time taken: [2701] millis for query83
====== Run query61 ======
TaskFailureListener is registered.
Time taken: [5336] millis for query61
====== Run query5 ======
TaskFailureListener is registered.
Time taken: [8548] millis for query5
====== Run query76 ======
TaskFailureListener is registered.
Time taken: [26556] millis for query76
====== Power Test Time: 2690208 milliseconds ======
====== Total Time: 2858784 milliseconds ======
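
To compare per-query times between two runs like the ones in this thread, the `Time taken: [N] millis for queryX` lines can be parsed mechanically. The sketch below assumes that exact line format as it appears in the logs above. `parse_query_times` and `compare_runs` are hypothetical helper names and not part of the NDS tooling. Note that a failed query still prints a `Time taken` line (query9 in the failing run does), so a duration from a failed query should not be read as a valid timing.

```python
import re

# Matches lines such as: Time taken: [21549] millis for query9
TIME_LINE = re.compile(r"^Time taken: \[(\d+)\] millis for (\S+)$")


def parse_query_times(log_text):
    """Return {query_name: milliseconds} for every 'Time taken' line in the log."""
    times = {}
    for line in log_text.splitlines():
        match = TIME_LINE.match(line.strip())
        if match:
            times[match.group(2)] = int(match.group(1))
    return times


def compare_runs(before, after):
    """Return (query, before_ms, after_ms, delta_ms) for queries present in both runs."""
    return [
        (query, before[query], after[query], after[query] - before[query])
        for query in sorted(before.keys() & after.keys())
    ]


if __name__ == "__main__":
    failing_run = """\
Time taken: [34695] millis for query9
====== Run query31 ======
Time taken: [14528] millis for query31
"""
    fixed_run = """\
====== Run query9 ======
TaskFailureListener is registered.
Time taken: [21549] millis for query9
====== Run query31 ======
TaskFailureListener is registered.
Time taken: [12092] millis for query31
"""
    rows = compare_runs(parse_query_times(failing_run), parse_query_times(fixed_run))
    for query, before_ms, after_ms, delta_ms in rows:
        print(f"{query}: {before_ms} -> {after_ms} ms ({delta_ms:+d})")
```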