We met the OOM error as below in some customer queries, which led to the Spark task retries.
1) Caused by: com.nvidia.spark.rapids.jni.GpuSplitAndRetryOOM: GPU OutOfMemory: could not split inputs and retry
at com.nvidia.spark.rapids.RmmRapidsRetryIterator$NoInputSpliterator.split(RmmRapidsRetryIterator.scala:386)
at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryIterator.next(RmmRapidsRetryIterator.scala:588)
at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryAutoCloseableIterator.next(RmmRapidsRetryIterator.scala:517)
at com.nvidia.spark.rapids.RmmRapidsRetryIterator$.drainSingleWithVerification(RmmRapidsRetryIterator.scala:291)
at com.nvidia.spark.rapids.RmmRapidsRetryIterator$.withRetryNoSplit(RmmRapidsRetryIterator.scala:185)
at com.nvidia.spark.rapids.cudf_utils.HostConcatResultUtil$.getColumnarBatch(HostConcatResultUtil.scala:54)
at com.nvidia.spark.rapids.GpuShuffleCoalesceIterator.$anonfun$next$4(GpuShuffleCoalesceExec.scala:229)
The process of moving to coalesced buffer to GPU currently support only retry with no split, however we can try to implement the split-and-retry to improve its stability.
We met the OOM error as below in some customer queries, which led to the Spark task retries.
The process of moving to coalesced buffer to GPU currently support only retry with no split, however we can try to implement the split-and-retry to improve its stability.