NVIDIA / spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs
https://nvidia.github.io/spark-rapids
Apache License 2.0

[BUG] OOM error in split and retry with multifile coalesce reader with parquet data #9060

Closed: mattahrens closed this issue 1 year ago

mattahrens commented 1 year ago

A customer encountered the following OOM error:


23/06/29 06:34:56 WARN DeviceMemoryEventHandler: [RETRY 2] Retrying allocation of 12421594616 after a synchronize. Total RMM allocated is 5213155072 bytes.
23/06/29 06:34:56 INFO DeviceMemoryEventHandler: Device allocation of 12421594616 bytes failed, device store has 0 total and 0 spillable bytes. Attempt 2. Total RMM allocated is 5213155072 bytes. 
23/06/29 06:34:56 WARN DeviceMemoryEventHandler: Device store exhausted, unable to allocate 12421594616 bytes. Total RMM allocated is 5213155072 bytes.
23/06/29 06:34:56 ERROR Executor: Exception in task 49.3 in stage 10.0 (TID 560)
com.nvidia.spark.rapids.jni.SplitAndRetryOOM: GPU OutOfMemory: could not split inputs and retry
    at com.nvidia.spark.rapids.RmmRapidsRetryIterator$AutoCloseableAttemptSpliterator.split(RmmRapidsRetryIterator.scala:410)
    at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryIterator.next(RmmRapidsRetryIterator.scala:520)
    at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryAutoCloseableIterator.next(RmmRapidsRetryIterator.scala:458)
    at com.nvidia.spark.rapids.RmmRapidsRetryIterator$.drainSingleWithVerification(RmmRapidsRetryIterator.scala:275)
    at com.nvidia.spark.rapids.RmmRapidsRetryIterator$.withRetryNoSplit(RmmRapidsRetryIterator.scala:128)
    at com.nvidia.spark.rapids.MultiFileCoalescingPartitionReaderBase.$anonfun$readBatch$1(GpuMultiFileReader.scala:1156)
    at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
    at com.nvidia.spark.rapids.MultiFileCoalescingPartitionReaderBase.readBatch(GpuMultiFileReader.scala:1130)
    at com.nvidia.spark.rapids.MultiFileCoalescingPartitionReaderBase.next(GpuMultiFileReader.scala:1115)
    at com.nvidia.spark.rapids.PartitionIterator.hasNext(dataSourceUtil.scala:29)
    at com.nvidia.spark.rapids.MetricsBatchIterator.hasNext(dataSourceUtil.scala:46)
    at com.nvidia.spark.rapids.shims.GpuDataSourceRDD$$anon$1.$anonfun$hasNext$1(GpuDataSourceRDD.scala:62)
    at com.nvidia.spark.rapids.shims.GpuDataSourceRDD$$anon$1.$anonfun$hasNext$1$adapted(GpuDataSourceRDD.scala:62)
    at scala.Option.exists(Option.scala:376)
    at com.nvidia.spark.rapids.shims.GpuDataSourceRDD$$anon$1.hasNext(GpuDataSourceRDD.scala:62)
    at com.nvidia.spark.rapids.shims.GpuDataSourceRDD$$anon$1.advanceToNextIter(GpuDataSourceRDD.scala:87)
    at com.nvidia.spark.rapids.shims.GpuDataSourceRDD$$anon$1.hasNext(GpuDataSourceRDD.scala:62)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at org.apache.spark.sql.rapids.GpuFileSourceScanExec$$anon$1.hasNext(GpuFileSourceScanExec.scala:477)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at com.nvidia.spark.rapids.GpuHashAggregateIterator.$anonfun$hasNext$2(aggregate.scala:548)
    at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)
    at scala.Option.getOrElse(Option.scala:189)
    at com.nvidia.spark.rapids.GpuHashAggregateIterator.hasNext(aggregate.scala:548)
    at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase$$anon$1.partNextBatch(GpuShuffleExchangeExecBase.scala:317)
    at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase$$anon$1.hasNext(GpuShuffleExchangeExecBase.scala:340)
    at org.apache.spark.sql.rapids.RapidsShuffleThreadedWriterBase.$anonfun$write$2(RapidsShuffleInternalManagerBase.scala:281)
    at org.apache.spark.sql.rapids.RapidsShuffleThreadedWriterBase.$anonfun$write$2$adapted(RapidsShuffleInternalManagerBase.scala:274)
    at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
    at org.apache.spark.sql.rapids.RapidsShuffleThreadedWriterBase.$anonfun$write$1(RapidsShuffleInternalManagerBase.scala:274)
    at org.apache.spark.sql.rapids.RapidsShuffleThreadedWriterBase.$anonfun$write$1$adapted(RapidsShuffleInternalManagerBase.scala:273)
    at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
    at org.apache.spark.sql.rapids.RapidsShuffleThreadedWriterBase.write(RapidsShuffleInternalManagerBase.scala:273)
    at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
    at org.apache.spark.scheduler.Task.run(Task.scala:136)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:552)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1533)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:555)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
23/06/29 06:34:56 INFO CoarseGrainedExecutorBackend: Got assigned task 761
23/06/29 06:34:56 INFO Executor: Running task 48.3 in stage 10.0 (TID 761)
23/06/29 06:34:56 INFO GpuParquetMultiFilePartitionReaderFactory: Using the coalesce multi-file Parquet reader
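
The trace shows the failure surfacing from the plugin's retry framework: `MultiFileCoalescingPartitionReaderBase.readBatch` runs the GPU read under `RmmRapidsRetryIterator.withRetryNoSplit`, which can retry an allocation after a synchronize/spill but has no way to split the coalesced buffer into smaller pieces, so the attempt ends in `SplitAndRetryOOM`. Below is a minimal sketch of that control flow, using simplified stand-ins for the plugin's retry exceptions and retry policy; it illustrates the pattern in the stack trace, not the plugin's actual implementation.

```scala
// Minimal sketch, assuming simplified stand-ins for the plugin's retry exceptions
// (the real ones live in com.nvidia.spark.rapids.jni). Illustrative only.
object RetryNoSplitSketch {

  class RetryOOM extends RuntimeException("GPU OutOfMemory: retry after memory is freed")
  class SplitAndRetryOOM
      extends RuntimeException("GPU OutOfMemory: could not split inputs and retry")

  // Retry `attempt` whenever the GPU signals a retriable OOM. Because this variant has
  // no way to split its input (here, a single coalesced parquet buffer), exhausting the
  // retries escalates to SplitAndRetryOOM, which is what shows up as the task failure.
  def withRetryNoSplit[T](maxRetries: Int)(attempt: () => T): T = {
    var triesLeft = maxRetries
    while (true) {
      try {
        return attempt()
      } catch {
        case _: RetryOOM if triesLeft > 0 =>
          // The real framework synchronizes the device and waits for spillable
          // buffers to be released before trying the same input again.
          triesLeft -= 1
        case _: RetryOOM =>
          throw new SplitAndRetryOOM
      }
    }
    throw new IllegalStateException("unreachable")
  }

  def main(args: Array[String]): Unit = {
    // Simulate an allocation that never succeeds: once the retries run out,
    // the sketch fails the same way the executor log above does.
    try {
      withRetryNoSplit(maxRetries = 2) { () => throw new RetryOOM }
    } catch {
      case e: SplitAndRetryOOM => println(s"task failed: ${e.getMessage}")
    }
  }
}
```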
mattahrens commented 1 year ago

Initial fix completed in 23.10: https://github.com/rapidsai/cudf/pull/14079. The next phase is tracked here: https://github.com/rapidsai/cudf/issues/14270.
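
While the remaining work is pending, one possible mitigation from the plugin configuration side is to steer the Parquet scan away from the coalescing reader (which the log shows in use) or to lower the target GPU batch size. The sketch below assumes the standard spark-rapids config keys `spark.rapids.sql.format.parquet.reader.type` and `spark.rapids.sql.batchSizeBytes`; the values are illustrative only and not taken from this issue.

```scala
// Hedged workaround sketch (not from this issue), e.g. pasted into spark-shell.
// Values are illustrative and should be tuned per workload.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("parquet-scan-mitigation")
  .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
  // Avoid the coalescing multi-file reader that hit the SplitAndRetryOOM above.
  .config("spark.rapids.sql.format.parquet.reader.type", "MULTITHREADED")
  // Target smaller GPU batches so a single read asks for less device memory.
  .config("spark.rapids.sql.batchSizeBytes", (512L * 1024 * 1024).toString)
  .getOrCreate()
```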

mattahrens commented 3 months ago

What additional issue needs to be fixed? The OOM issues should all be resolved now.


From: Hongbin Ma (Mahone)
Sent: Monday, August 5, 2024 1:01 AM
To: NVIDIA/spark-rapids
Subject: Re: [NVIDIA/spark-rapids] [BUG] OOM error in split and retry with multifile coalesce reader with parquet data (Issue #9060)

> Initial fix completed in 23.10: rapidsai/cudf#14079 (https://github.com/rapidsai/cudf/pull/14079). The next phase is tracked here: rapidsai/cudf#14270 (https://github.com/rapidsai/cudf/issues/14270).

Hi @mattahrens (https://github.com/mattahrens), now that rapidsai/cudf#14270 (https://github.com/rapidsai/cudf/issues/14270) is closed, do we have any plan for a further fix?
