NVIDIA / spark-rapids-ml

Spark RAPIDS MLlib – accelerate Apache Spark MLlib with GPUs
https://nvidia.github.io/spark-rapids-ml/
Apache License 2.0

Consider using RMM for all device allocations instead of `cudaMalloc` #3

Open cjnolet opened 3 years ago

cjnolet commented 3 years ago

While briefly browsing through the new PCA implementation, I noticed there are still several places where GPU memory is being allocated with `cudaMalloc` (throughout this file, for example). In RAPIDS, we replace all direct calls to `cudaMalloc` in our code and use RMM to allocate all device memory.

There are a few reasons why this is important:

  1. A user can set a single pool allocation size and apply it per device (see the pool-resource sketch after this list).
  2. Allocations are guaranteed to be aligned to 256-byte boundaries.
  3. Every direct call to `cudaMalloc` imposes a device-wide synchronization. With a pool allocator configured, this synchronization can be avoided entirely.
  4. Allocations can be tied to asynchronous streams, and that association is maintained for the lifetime of the allocation.
  5. All allocations on a device are guaranteed to come from the same memory pool. Mixing allocators, where some allocations go through RMM and others through `cudaMalloc`, can be a source of problems.
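
For illustration, a minimal sketch of configuring a per-device pool resource with RMM's C++ API might look like the following. This is not this repository's actual setup, and the 1 GiB initial pool size is an arbitrary example value:

```cpp
#include <rmm/mr/device/cuda_memory_resource.hpp>
#include <rmm/mr/device/pool_memory_resource.hpp>
#include <rmm/mr/device/per_device_resource.hpp>

#include <cstddef>

int main() {
  // Upstream resource that performs the actual cudaMalloc/cudaFree calls.
  rmm::mr::cuda_memory_resource cuda_mr;

  // Pool that sub-allocates from one large upstream allocation.
  // The 1 GiB initial size is an arbitrary example value.
  rmm::mr::pool_memory_resource<rmm::mr::cuda_memory_resource> pool_mr{
      &cuda_mr, std::size_t{1} << 30};

  // Route all subsequent RMM allocations on this device through the pool.
  rmm::mr::set_current_device_resource(&pool_mr);

  // ... run GPU work; RMM allocations now come from the pool ...
  return 0;
}
```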

Further, RMM provides an RAII-style C++ API that makes managing these pointers easier and less prone to memory leaks. By using `rmm::device_uvector` and RMM's smart pointers instead of raw `cudaMalloc`, the algorithms in this repository will automatically benefit from the items listed above.
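
As a small illustration of the RAII pattern (the `scale_on_gpu` function name is hypothetical, not from this repository):

```cpp
#include <rmm/cuda_stream.hpp>
#include <rmm/device_uvector.hpp>

#include <cstddef>

void scale_on_gpu(std::size_t n) {
  rmm::cuda_stream stream;

  // Stream-ordered, RAII-managed device allocation drawn from the
  // current device resource (e.g. the pool configured above).
  rmm::device_uvector<float> data(n, stream.view());

  // ... launch kernels on `stream` using data.data() and data.size() ...

  // No explicit cudaFree needed: the memory is returned to the
  // current resource when `data` goes out of scope.
}
```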

wjxiz1992 commented 3 years ago

Thanks for pointing this out, Corey! This has been added to the TODO plan. I've been working on a virtual review for the recent release; I'll update once I've finished the current work.