NVIDIA / spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs
https://nvidia.github.io/spark-rapids
Apache License 2.0

[BUG] ChunkedPacker can fail at construction or runtime if CUDA runs OOM for kernels #11694

Open abellina opened 2 weeks ago

abellina commented 2 weeks ago

We recently saw what looks like a CUDA OOM (not your classic RMM OOM) with the stack below.

It looks like the exclusive_scan_by_key kernel failed to launch. This suggests we have a job where our reserve memory (640MB by default, configured by spark.rapids.memory.gpu.reserve) is either not being respected or is not large enough to launch all of the kernels CUDA needs. We think we could have synchronized to force the ASYNC allocator to return to within its thresholds, but we are not sure what the guarantees are. We also need to repro this independently so we can properly handle and document it.

24/08/01 23:01:20 INFO DeviceMemoryEventHandler: Device allocation of 60625520 bytes failed, device store has 5003256884 total and 1560204129 spillable bytes. First attempt. Total RMM allocated is 9108818176 bytes. 
24/08/01 23:01:20 WARN RapidsDeviceMemoryStore: Targeting a device memory size of 1499578609. Current total 5003256884. Current spillable 1560204129
24/08/01 23:01:20 WARN RapidsDeviceMemoryStore: device memory store spilling to reduce usage from 5003256884 total (1560204129 spillable) to 1499578609 bytes 
24/08/01 23:01:20 ERROR DeviceMemoryEventHandler: Error handling allocation failure 
ai.rapids.cudf.CudfException: after dispatching exclusive_scan_by_key kernel: cudaErrorMemoryAllocation: out of memory 
    at ai.rapids.cudf.Table.makeChunkedPack(Native Method) 
    at ai.rapids.cudf.Table.makeChunkedPack(Table.java:2672) 
    at com.nvidia.spark.rapids.ChunkedPacker.$anonfun$chunkedPack$1(RapidsBuffer.scala:97) 
    at scala.Option.flatMap(Option.scala:271) 
    at com.nvidia.spark.rapids.ChunkedPacker.liftedTree1$1(RapidsBuffer.scala:96) 
    at com.nvidia.spark.rapids.ChunkedPacker.<init>(RapidsBuffer.scala:95) 
    at com.nvidia.spark.rapids.RapidsDeviceMemoryStore$RapidsTable.makeChunkedPacker(RapidsDeviceMemoryStore.scala:272) 
    at com.nvidia.spark.rapids.RapidsBufferCopyIterator.<init>(RapidsBuffer.scala:180) 
    at com.nvidia.spark.rapids.RapidsBuffer.getCopyIterator(RapidsBuffer.scala:248) 
    at com.nvidia.spark.rapids.RapidsBuffer.getCopyIterator$(RapidsBuffer.scala:247) 
    at com.nvidia.spark.rapids.RapidsBufferStore$RapidsBufferBase.getCopyIterator(RapidsBufferStore.scala:405) 
    at com.nvidia.spark.rapids.RapidsHostMemoryStore.createBuffer(RapidsHostMemoryStore.scala:133) 
    at com.nvidia.spark.rapids.RapidsBufferStore.copyBuffer(RapidsBufferStore.scala:224) 
    at com.nvidia.spark.rapids.RapidsBufferStore.spillBuffer(RapidsBufferStore.scala:374) 
    at com.nvidia.spark.rapids.RapidsBufferStore.$anonfun$synchronousSpill$4(RapidsBufferStore.scala:311) 
    at com.nvidia.spark.rapids.RapidsBufferStore.$anonfun$synchronousSpill$4$adapted(RapidsBufferStore.scala:304) 
    at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30) 
    at com.nvidia.spark.rapids.RapidsBufferStore.$anonfun$synchronousSpill$2(RapidsBufferStore.scala:304) 
    at com.nvidia.spark.rapids.RapidsBufferStore.$anonfun$synchronousSpill$2$adapted(RapidsBufferStore.scala:290) 
    at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30) 
    at com.nvidia.spark.rapids.RapidsBufferStore.synchronousSpill(RapidsBufferStore.scala:290) 
    at com.nvidia.spark.rapids.RapidsBufferCatalog.synchronousSpill(RapidsBufferCatalog.scala:614) 
    at com.nvidia.spark.rapids.DeviceMemoryEventHandler.onAllocFailure(DeviceMemoryEventHandler.scala:154) 
    at ai.rapids.cudf.Table.groupByAggregate(Native Method) 
    at ai.rapids.cudf.Table.access$3300(Table.java:41) 
    at ai.rapids.cudf.Table$GroupByOperation.aggregate(Table.java:4099) 
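
For illustration only, here is a minimal sketch (not the plugin's actual handling) of how one might tell this kind of CUDA runtime OOM apart from a normal RMM OOM and synchronize the device before retrying, as suggested above. The `isCudaOom` heuristic and `withCudaOomRetry` helper are hypothetical; the only real cudf call is `Cuda.deviceSynchronize()`, and the assumption that the CUDA error name shows up in the exception message is based on the stack trace above.

```scala
import ai.rapids.cudf.{Cuda, CudfException}

// Hypothetical heuristic: the stack trace above shows cudf surfacing the CUDA
// error name ("cudaErrorMemoryAllocation: out of memory") in the exception
// message, which is different from the usual RMM allocation-failure path.
def isCudaOom(e: CudfException): Boolean =
  e.getMessage != null && e.getMessage.contains("cudaErrorMemoryAllocation")

// Hypothetical retry wrapper: synchronize the device between attempts so the
// ASYNC allocator gets a chance to return freed pages to the driver, then try
// the operation again a bounded number of times.
def withCudaOomRetry[T](maxRetries: Int)(body: => T): T = {
  var attempt = 0
  var result: Option[T] = None
  while (result.isEmpty) {
    try {
      result = Some(body)
    } catch {
      case e: CudfException if isCudaOom(e) && attempt < maxRetries =>
        Cuda.deviceSynchronize()
        attempt += 1
    }
  }
  result.get
}
```

In the failure above such a wrapper would sit around the Table.makeChunkedPack call made on the spill path, though whether a device synchronize is actually guaranteed to free enough memory for the kernel launch is exactly the open question.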
mattahrens commented 1 week ago

Scope: analyze the amount of GPU memory used on cuDF load and compare it to the amount we have set for the default reserve memory.
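
As a rough in-process check of this (a sketch, assuming the `Cuda.memGetInfo()` call and its `free`/`total` fields in the cudf Java bindings), one could compare what the process has already consumed on the device against the 640 MiB reserve default mentioned above:

```scala
import ai.rapids.cudf.Cuda

// Sketch: after the CUDA context is up and kernels have been loaded (e.g. with
// CUDA_MODULE_LOADING=EAGER and no RMM pool), compare device usage to the reserve.
// Note: total - free counts everything resident on the device, not just this process.
val reserveBytes = 640L * 1024 * 1024            // spark.rapids.memory.gpu.reserve default
val info = Cuda.memGetInfo()
val usedBytes = info.total - info.free           // context + loaded kernels (+ anything else on the GPU)
println(f"device used: ${usedBytes / 1048576.0}%.0f MiB, reserve: ${reserveBytes / 1048576.0}%.0f MiB")
if (usedBytes > reserveBytes) {
  println("startup footprint already exceeds the configured reserve")
}
```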

revans2 commented 1 week ago

Okay, so this is a little scary. It looks like it takes 902 MiB to load all of the kernels/data.

CUDA_MODULE_LOADING=EAGER SPARK_CONF_DIR=/.../spark_cluster_scripts/spark_conf/gpu/ spark-shell --conf 'spark.rapids.memory.gpu.pool=NONE'

This disables the memory pool and asks CUDA to eagerly load all of the kernels and all of the data.

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#lazy-loading

After this comes up and finishes loading, I ran:

$ nvidia-smi 
Tue Nov 12 16:27:38 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.05             Driver Version: 550.127.05     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
...

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
...
|    1   N/A  N/A   2231030      C   ...m/java-8-openjdk-amd64/jre/bin/java        902MiB |
+-----------------------------------------------------------------------------------------+

There may be some things we can optimize. We should probably take a look at all of the kernels and see if anything pops out as being a problem.

https://github.com/NVIDIA/spark-rapids-jni/issues/2582, for example, might cause us to create more kernels than cuDF would on its own and might be inflating the size of our binary by a lot, even though those kernels would never be loaded.

But assuming that does not drop our memory usage by a lot (say 300+ MiB), I think we need to work with the CUDA team to see if there is a way to recover from this type of failure.

First off, we should make sure that the kernels we need for spill are loaded on startup. Then, if this is a type of error that we can recover from, we need to detect these errors and treat them similarly to a retry, assuming that we are running with the ASYNC allocator. To recover, I hope that we could roll back to a known good state, drop the maximum/target pool size so that there is room to load more kernels (by 64 MiB, so that we get some free pages), spill until we are well below that new maximum, and then do a device synchronize before we try again.
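
A rough sketch of that recovery sequence, with the plugin-side pieces stubbed out as hypothetical placeholders (`rollbackToLastGoodState`, `shrinkPoolTarget`, and `spillBelow` are stand-ins for logic that would live in the plugin's spill framework; the only real cudf call here is `Cuda.deviceSynchronize()`):

```scala
import ai.rapids.cudf.Cuda

// Hypothetical hooks into the plugin's spill/pool machinery, stubbed so the
// sketch is self-contained.
def rollbackToLastGoodState(): Unit = ()               // undo the partially-applied operation
def shrinkPoolTarget(newTargetBytes: Long): Unit = ()  // lower the ASYNC pool's maximum/target size
def spillBelow(targetBytes: Long): Unit = ()           // spill device buffers until usage is below the target

// On a CUDA kernel-load OOM (as opposed to an RMM OOM), try to make room for
// module loads before retrying, assuming the ASYNC allocator is in use.
def recoverFromCudaKernelOom(currentTargetBytes: Long): Unit = {
  val headroom = 64L * 1024 * 1024      // leave some free pages for CUDA to load kernels into
  rollbackToLastGoodState()
  val newTarget = currentTargetBytes - headroom
  shrinkPoolTarget(newTarget)
  spillBelow(newTarget)
  Cuda.deviceSynchronize()              // let freed pages actually go back to the driver
  // ...then retry the failed operation, as with a normal RMM retry
}
```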