When we hit GPU OOMs, we usually see error stacks like the ones below, which only tell us where the OOM happened. For error triage it would also be helpful to know the size of the batch that was being processed at the time.
com.nvidia.spark.rapids.jni.GpuSplitAndRetryOOM: GPU OutOfMemory: could not split inputs and retry
at com.nvidia.spark.rapids.RmmRapidsRetryIterator$AutoCloseableAttemptSpliterator.split(RmmRapidsRetryIterator.scala:458)
at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryIterator.next(RmmRapidsRetryIterator.scala:588)
com.nvidia.spark.rapids.jni.GpuRetryOOM: GPU OutOfMemory
at ai.rapids.cudf.Table.contiguousSplit(Native Method)
at ai.rapids.cudf.Table.contiguousSplit(Table.java:2766)
at com.nvidia.spark.rapids.RmmRapidsRetryIterator$.$anonfun$splitSpillableInHalfByRows$4(RmmRapidsRetryIterator.scala:681)
at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
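As a rough illustration of the kind of triage output being asked for, here is a minimal sketch, not the spark-rapids implementation: it wraps a batch-level operation and rethrows the RAPIDS OOM with the batch's row count and approximate byte size attached. The helper name withBatchSizeOnOom and the idea of the caller passing in rows/bytes are assumptions for illustration only; the exception classes are the ones from the stacks above.

// Minimal sketch, NOT the spark-rapids implementation. The helper name and
// the caller-supplied rows/bytes are assumptions for illustration only.
import com.nvidia.spark.rapids.jni.{GpuRetryOOM, GpuSplitAndRetryOOM}

object OomTriage {
  // Run `body` and, if a RAPIDS OOM escapes, rethrow it with the batch
  // dimensions so the triage info appears right next to the OOM stack.
  def withBatchSizeOnOom[T](numRows: Int, approxBytes: Long)(body: => T): T = {
    try {
      body
    } catch {
      case oom @ (_: GpuRetryOOM | _: GpuSplitAndRetryOOM) =>
        throw new RuntimeException(
          s"GPU OOM while processing batch: rows=$numRows, approxBytes=$approxBytes",
          oom)
    }
  }
}

With something like this, the executor log would show the batch dimensions alongside the GpuRetryOOM / GpuSplitAndRetryOOM stack instead of only the location of the failure.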