NVIDIA / spark-rapids-ml

Spark RAPIDS MLlib – accelerate Apache Spark MLlib with GPUs
https://nvidia.github.io/spark-rapids-ml/
Apache License 2.0

Consider using RMM for all device allocations instead of `cudaMalloc` #3

Open cjnolet opened 3 years ago

cjnolet commented 3 years ago

While briefly browsing through the new PCA implementation, I noticed there are still several places where GPU memory is being allocated with `cudaMalloc` (throughout this file, for example). In RAPIDS, we replace all direct calls to `cudaMalloc` in our code and use RMM to allocate all device memory.

There are a few reasons why this is important:

  1. A user can set a single pool allocation size and apply it per device (see the pool-resource sketch after this list).
  2. Allocations are guaranteed to be aligned to 256-byte boundaries.
  3. Every direct call to `cudaMalloc` imposes a device-wide synchronization. With a pool allocator configured, this synchronization can be avoided entirely.
  4. Allocations can be tied to asynchronous streams, and that association is maintained for the lifetime of the allocation.
  5. All allocations on a device are guaranteed to come from the same memory pool. Mixing allocators, where some allocations go through RMM and others through `cudaMalloc`, can be a source of problems.
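
For illustration, a minimal sketch of configuring a per-device pool resource with RMM's C++ API might look like the following. This is not this repository's actual setup, and the 1 GiB initial pool size is an arbitrary example value:

```cpp
#include <rmm/mr/device/cuda_memory_resource.hpp>
#include <rmm/mr/device/pool_memory_resource.hpp>
#include <rmm/mr/device/per_device_resource.hpp>

#include <cstddef>

int main() {
  // Upstream resource that performs the actual cudaMalloc/cudaFree calls.
  rmm::mr::cuda_memory_resource cuda_mr;

  // Pool that sub-allocates from one large upstream allocation.
  // The 1 GiB initial size is an arbitrary example value.
  rmm::mr::pool_memory_resource<rmm::mr::cuda_memory_resource> pool_mr{
      &cuda_mr, std::size_t{1} << 30};

  // Route all subsequent RMM allocations on this device through the pool.
  rmm::mr::set_current_device_resource(&pool_mr);

  // ... run GPU work; RMM allocations now come from the pool ...
  return 0;
}
```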

Further, RMM provides an RAII-style C++ API that makes managing these pointers easier and less prone to memory leaks. By using `rmm::device_uvector` and RMM's smart pointers instead of raw `cudaMalloc`, the algorithms in this repository will automatically benefit from the items listed above.
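
As a small illustration of the RAII pattern (the `scale_on_gpu` function name is hypothetical, not from this repository):

```cpp
#include <rmm/cuda_stream.hpp>
#include <rmm/device_uvector.hpp>

#include <cstddef>

void scale_on_gpu(std::size_t n) {
  rmm::cuda_stream stream;

  // Stream-ordered, RAII-managed device allocation drawn from the
  // current device resource (e.g. the pool configured above).
  rmm::device_uvector<float> data(n, stream.view());

  // ... launch kernels on `stream` using data.data() and data.size() ...

  // No explicit cudaFree needed: the memory is returned to the
  // current resource when `data` goes out of scope.
}
```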

wjxiz1992 commented 3 years ago

Thanks for pointing this out, Corey! This has been added to the TODO plan. I've been working on a virtual review for the recent release; I'll update once I've finished the current work.