NVIDIA / spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs
https://nvidia.github.io/spark-rapids
Apache License 2.0

[BUG] Allow memory allocation from pinned memory pool to avoid task fail #11426

Open winningsix opened 2 weeks ago

winningsix commented 2 weeks ago

Describe the bug
Pinned memory is treated as a precious resource and is typically configured to be smaller than the rest of the overhead memory. Following this philosophy, some allocations explicitly avoid using pinned memory. However, users may configure pinned memory to be larger than the rest of the overhead memory. In that case, the non-pinned portion can be exhausted while pinned memory is still available, and the task is at risk of failing. For example:

Image

Expected behavior
Similar to https://github.com/rapidsai/cudf/blob/ad1369d2d6eabf4b0ae480a10463a74f3034aece/java/src/main/java/ai/rapids/cudf/PinnedMemoryPool.java#L173-L180, we would expect a fallback approach: try to allocate non-pinned memory first, and then allocate from the pinned memory pool if the non-pinned allocation fails.
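
To make the requested order concrete, here is a minimal sketch of the fallback policy in Python. It is only an illustration of the idea and not cudf or spark-rapids code: `HostPool`, `try_allocate`, and `allocate_with_fallback` are hypothetical names, and the pools are simple byte counters rather than real host memory.

```python
class HostPool:
    """Toy fixed-capacity pool standing in for a host memory budget (hypothetical)."""

    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity
        self.used = 0

    def try_allocate(self, nbytes):
        """Reserve nbytes and return True if they fit, otherwise return False."""
        if nbytes < 0:
            raise ValueError("nbytes must be non-negative")
        if self.used + nbytes > self.capacity:
            return False
        self.used += nbytes
        return True


def allocate_with_fallback(nbytes, non_pinned, pinned):
    """Try the non-pinned pool first, then fall back to the pinned pool.

    Returns the name of the pool that served the request and raises
    MemoryError only when both pools are exhausted.
    """
    for pool in (non_pinned, pinned):
        if pool.try_allocate(nbytes):
            return pool.name
    raise MemoryError(f"could not allocate {nbytes} bytes from either pool")


# Assumed scenario from the report: pinned memory is configured larger
# than the rest of the overhead memory.
non_pinned = HostPool("non-pinned", capacity=100)
pinned = HostPool("pinned", capacity=400)

first = allocate_with_fallback(80, non_pinned, pinned)   # fits in non-pinned
second = allocate_with_fallback(80, non_pinned, pinned)  # non-pinned is full, falls back
print(first, second)
```

With this ordering, pinned memory is still used sparingly, but a request that no longer fits in non-pinned memory is served from the pinned pool instead of failing the task.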

mattahrens commented 1 week ago

The actual fix will be in the cudf code base.