NVIDIA / spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs
https://nvidia.github.io/spark-rapids
Apache License 2.0

[FEA] Improve Performance of GPU shuffle on Celeborn #10790

Open winningsix opened 4 months ago

winningsix commented 4 months ago

Is your feature request related to a problem? Please describe. To achieve better stability, remote shuffle has become a new technology trend, and Uniffle and Celeborn are the most widely used options in the PRC. Beginning with Celeborn, we should provide good GPU-accelerated support for the normal shuffle path.

Describe the solution you'd like On the client side, this shuffle bypasses the sort just like normal shuffle, so we can perform partitioning, serialization, and compression directly on the GPU per batch, rather than on the host.
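To make the proposed client-side flow concrete, here is a minimal, illustrative-only sketch of the per-batch pipeline (partition, then serialize, then compress). In the actual proposal these stages would run on the GPU with a ZSTD codec; plain Java and `Deflater` stand in here, and all class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.zip.Deflater;

// Hypothetical sketch: per batch, partition -> serialize -> compress,
// before handing compressed bytes to the Celeborn client.
public class GpuShuffleWriteSketch {

    // Partition rows by content hash and serialize each partition into one
    // contiguous buffer (done on the device in the actual proposal).
    static Map<Integer, ByteArrayOutputStream> partitionAndSerialize(
            byte[][] rows, int numPartitions) {
        Map<Integer, ByteArrayOutputStream> parts = new HashMap<>();
        for (byte[] row : rows) {
            int p = Math.floorMod(Arrays.hashCode(row), numPartitions);
            parts.computeIfAbsent(p, k -> new ByteArrayOutputStream())
                 .writeBytes(row);
        }
        return parts;
    }

    // Compress one partition's serialized buffer (GPU ZSTD in the proposal;
    // java.util.zip.Deflater is only a CPU stand-in here).
    static byte[] compress(byte[] bytes) {
        Deflater deflater = new Deflater();
        deflater.setInput(bytes);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        return out.toByteArray();
    }

    // One write path for a single batch; each compressed partition would
    // then be pushed to the Celeborn workers.
    static Map<Integer, byte[]> writeBatch(byte[][] rows, int numPartitions) {
        Map<Integer, byte[]> compressed = new HashMap<>();
        partitionAndSerialize(rows, numPartitions)
            .forEach((p, buf) -> compressed.put(p, compress(buf.toByteArray())));
        return compressed;
    }
}
```

The point of the sketch is only the ordering of stages: because the compressed buffers (not the raw batch) cross the device-to-host boundary, a good compression ratio shrinks the copy.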

Performance wise, we want a 20X performance gain (op time) against a single CPU core of the last few generations.

Feature scope wise, we want to: (1) Move shuffle partitioning, serialization, and compression onto the GPU; the targeted compression codec is ZSTD (https://github.com/NVIDIA/spark-rapids/issues/10841). (2) Based on (1), it could work seamlessly with the vanilla Celeborn shuffle manager, but that involves one memory copy from native to Java. One alternative is to invoke pushData or mergeData natively to avoid the extra memory copy. (3) Introduce a heuristic based on compression ratio. The initial state could be either CPU or GPU shuffle. By analyzing the first incoming batch, we can compute its compression ratio; if the compression ratio of the first few batches is above a threshold, the GPU-based approach is used, otherwise the CPU-based one.
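The heuristic in point (3) could be sketched as follows. This is an assumption-laden illustration, not the plugin's implementation: the class, method names, and the default threshold are all hypothetical.

```java
// Hypothetical sketch of the compression-ratio heuristic: observe the
// ratio on the first few batches, then commit to GPU or CPU shuffle.
public class ShuffleModeHeuristic {
    enum Mode { GPU_SHUFFLE, CPU_SHUFFLE }

    // compression ratio = uncompressed size / compressed size
    static double compressionRatio(long uncompressedBytes, long compressedBytes) {
        return (double) uncompressedBytes / (double) compressedBytes;
    }

    // Decide once per executor from the first few batches' ratios;
    // the threshold value is an assumption for illustration.
    static Mode chooseMode(double[] firstBatchRatios, double threshold) {
        double sum = 0;
        for (double r : firstBatchRatios) {
            sum += r;
        }
        double avg = sum / firstBatchRatios.length;
        return avg > threshold ? Mode.GPU_SHUFFLE : Mode.CPU_SHUFFLE;
    }
}
```

For example, batches compressing at ratios 5.0 and 4.0 against a threshold of 3.0 would select GPU shuffle, while a single batch at 1.2 would fall back to CPU shuffle.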

Non-goal: encryption-at-rest support.

Describe alternatives you've considered Celeborn works seamlessly without moving anything onto the GPU, so a CPU-based implementation is the alternative.

zhanglistar commented 4 months ago

Great!

firestarman commented 4 months ago

Celeborn works as a normal Spark shuffle manager, so the plugin already works well with it. This issue just tracks whether enabling GPU slicing and compression can get better perf when working with Celeborn, or at least better perf on some customer queries.

waitinfuture commented 4 months ago

Hi, as a committer from Celeborn community, I'd like to help if any features are required from Celeborn, and you're always welcome to contribute to Celeborn :)

firestarman commented 4 months ago

Really appreciate that @waitinfuture. Will let you know if any action is required from Celeborn side.

revans2 commented 4 months ago

I updated the name of this issue to make it clear. Our shuffle works with Celeborn, but the goal here is to improve the performance of that shuffle.

@firestarman and @winningsix If we have patches that improve the performance, could you please explain how GPU compression and slicing achieve it? In the past we tried to do compression on the GPU as part of shuffle, and the performance was generally worse because of the opportunity cost: the CPU was mostly idle waiting for the GPU to finish, and offloading the shuffle data to the CPU for compression improved performance, especially when we could compress with multiple CPU threads. I really would like to understand how this improves performance and what tests have been run, so we know in which situations this should be enabled and in which it should not.

winningsix commented 4 months ago

> I updated the name of this issue to make it clear. Our shuffle works with Celeborn, but the goal here is to improve the performance of that shuffle.
>
> @firestarman and @winningsix If we have patches that improve the performance could you please explain how GPU compression and slicing improves the performance? In the past we tried to do compression on the GPU as a part of shuffle and the performance was generally worse because of the opportunity cost. Generally the CPU was mostly idle waiting for the GPU to finish, and by offloading the shuffle data to the CPU for compression improved the performance, especially if we could do the compression using multiple CPU threads. I really would like to understand how this improves performance, what tests have been run so we can know which situations should enable this and which should not.

Thanks for the title update; it looks more suitable. The benefit comes from a notable compression ratio (say 3 to 10) seen in customer queries: the higher the compression ratio, the less time is spent in the device-to-host copy. We should introduce a heuristic to choose GPU shuffle based on the data pattern. For the current case, I would suggest starting with the per-batch compression ratio as the deciding factor. For example, if the compression ratio of the first batch handled by the current executor is above 3, the executor stays on GPU shuffle; otherwise it uses CPU shuffle compression.
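The device-to-host argument can be made concrete with back-of-envelope arithmetic: compressing on-device at ratio r means the device-to-host copy moves 1/r of the bytes. This is only an illustrative model; the class and method names are hypothetical, and any bandwidth figures plugged in are assumptions, not measurements.

```java
// Back-of-envelope model of why a high compression ratio shrinks the
// device-to-host copy. Illustrative only; no real bandwidths assumed.
public class D2hSavings {
    // seconds to copy `bytes` at `bandwidthBytesPerSec`
    static double copySeconds(double bytes, double bandwidthBytesPerSec) {
        return bytes / bandwidthBytesPerSec;
    }

    // fraction of device-to-host copy time saved at compression ratio r,
    // assuming the copy is bandwidth-bound: 1 - 1/r
    static double fractionSaved(double r) {
        return 1.0 - 1.0 / r;
    }
}
```

Under this model, a ratio of 4 cuts the copy time by 75%, while a ratio of 1.2 saves only about 17%, which is one way to motivate a threshold around 3 for switching to GPU shuffle.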