NVIDIA / spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs
https://nvidia.github.io/spark-rapids
Apache License 2.0

[FEA] Print out the size of the batch currently being processed for GPU OOM. #11732

Open firestarman opened 2 days ago

firestarman commented 2 days ago

When we hit a GPU OOM, we usually see an error stack like the one below, which only tells us where the OOM happened.

For error triage, it would be helpful to also print the size of the batch that was being processed when the OOM occurred.

com.nvidia.spark.rapids.jni.GpuSplitAndRetryOOM: GPU OutOfMemory: could not split inputs and retry
        at com.nvidia.spark.rapids.RmmRapidsRetryIterator$AutoCloseableAttemptSpliterator.split(RmmRapidsRetryIterator.scala:458)
        at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryIterator.next(RmmRapidsRetryIterator.scala:588)
com.nvidia.spark.rapids.jni.GpuRetryOOM: GPU OutOfMemory
        at ai.rapids.cudf.Table.contiguousSplit(Native Method)
        at ai.rapids.cudf.Table.contiguousSplit(Table.java:2766)
        at com.nvidia.spark.rapids.RmmRapidsRetryIterator$.$anonfun$splitSpillableInHalfByRows$4(RmmRapidsRetryIterator.scala:681)
        at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
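A minimal sketch of the idea: wrap the batch operation and, when the OOM propagates, rethrow it with the batch's row count and byte size attached to the message. This is illustrative only; `BatchOom` stands in for `com.nvidia.spark.rapids.jni.GpuRetryOOM`, and `withBatchSizeOnOom` is a hypothetical helper, not an existing spark-rapids API.

```java
public final class OomBatchSizeDemo {
    /** Stand-in for com.nvidia.spark.rapids.jni.GpuRetryOOM (hypothetical). */
    static class BatchOom extends RuntimeException {
        BatchOom(String msg) { super(msg); }
        BatchOom(String msg, Throwable cause) { super(msg, cause); }
    }

    /**
     * Hypothetical wrapper: runs the operation and, if it OOMs, rethrows
     * with the size of the batch being processed included in the message.
     */
    static <T> T withBatchSizeOnOom(long batchRows, long batchBytes,
                                    java.util.function.Supplier<T> op) {
        try {
            return op.get();
        } catch (BatchOom e) {
            throw new BatchOom(String.format(
                "GPU OutOfMemory while processing batch of %d rows (%d bytes)",
                batchRows, batchBytes), e);
        }
    }

    public static void main(String[] args) {
        try {
            // Simulate an OOM thrown while processing a 1M-row, 512 MiB batch.
            withBatchSizeOnOom(1_000_000, 512L << 20, () -> {
                throw new BatchOom("GPU OutOfMemory");
            });
        } catch (BatchOom e) {
            System.out.println(e.getMessage());
        }
    }
}
```

With this shape, the stack trace above would carry the batch size at the top of the message while the original exception is preserved as the cause.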