NVIDIA / spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs
https://nvidia.github.io/spark-rapids
Apache License 2.0

[FEA] Look at using cudf::sample in GpuRangePartitioner #3180

Open revans2 opened 3 years ago

revans2 commented 3 years ago

Is your feature request related to a problem? Please describe.

The range partitioner needs to sample without replacement. The current code uses our version of SamplingUtils and plays some games to build a gather map that samples rows from a batch of data. cudf now supports this functionality directly through cudf::sample, and it looks fairly simple to plumb it all together. The main thing we need to check is whether it improves performance.
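For readers unfamiliar with the gather-map approach described above, here is a minimal CPU-side sketch in Python. It is not the plugin's code: it assumes a reservoir-style sampler (Algorithm R), and the names `sample_gather_map` and `gather` are hypothetical, chosen only for illustration. The idea is to pick row indices without replacement first, then use those indices to pull the sampled rows out of the batch.

```python
import random


def sample_gather_map(num_rows, sample_size, seed=0):
    """Pick up to sample_size distinct row indices using reservoir sampling.

    Hypothetical helper for illustration; not the plugin's SamplingUtils.
    """
    rng = random.Random(seed)
    k = min(sample_size, num_rows)
    # Start with the first k row indices in the reservoir.
    reservoir = list(range(k))
    for i in range(k, num_rows):
        # randint is inclusive on both ends, so j is uniform over [0, i].
        j = rng.randint(0, i)
        if j < k:
            # Row i replaces a uniformly chosen slot in the reservoir.
            reservoir[j] = i
    return reservoir


def gather(column, gather_map):
    """Return the rows of column selected by the indices in gather_map."""
    return [column[i] for i in gather_map]


# Sample 20 rows from a 1000-row "batch" (a plain list standing in for a column).
batch = [10 * i for i in range(1000)]
gather_map = sample_gather_map(len(batch), 20, seed=42)
sampled = gather(batch, gather_map)
```

A library call such as cudf::sample would replace both steps with a single operation on the GPU, which is why the comparison in this issue comes down to measured performance rather than functionality.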

revans2 commented 3 years ago

I did some profiling on a simple sort, and it looks to be a very small win at best. For this simple case, the sampling itself took only about 2 ms of the 5 ms spent on all of the sub-sampling for the task, which processed a single batch. So there is a lot of overhead there that we should probably also look into. That said, the entire query ran for over 60 seconds, and the sub-sampling portion of it, including generating the data, took only 100 ms. So we are looking at a very small improvement at best. This is probably a low priority at this time.
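To put those numbers in perspective, the arithmetic can be sketched as a small Amdahl's-law style calculation. This assumes the 100 ms and the 60 seconds are directly comparable measurements of the same run, and it treats "over 60 seconds" as exactly 60 seconds, which makes the estimate an upper bound. The helper name `max_speedup` is hypothetical.

```python
def max_speedup(total_ms, part_ms, part_speedup=float("inf")):
    """Overall speedup if only the part_ms portion gets part_speedup times faster."""
    remaining_ms = (total_ms - part_ms) + part_ms / part_speedup
    return total_ms / remaining_ms


total_ms = 60_000      # query ran for over 60 seconds; 60 s used as a floor
subsample_ms = 100     # sub-sampling portion, including generating the data

# Share of the query spent in sub-sampling: about 0.17%.
fraction = subsample_ms / total_ms

# Best case: sub-sampling becomes free, giving well under a 1% overall gain.
best_case = max_speedup(total_ms, subsample_ms)

print(f"sub-sampling share of query: {fraction:.4%}")
print(f"best-case overall speedup:   {best_case:.5f}x")
```

Even if the sub-sampling cost dropped to zero, the query would finish less than 0.2% sooner, which is consistent with treating this as low priority.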