NVIDIA / spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs
https://nvidia.github.io/spark-rapids
Apache License 2.0

Introduce low shuffle merge. #10786

Closed liurenjie1024 closed 1 week ago

liurenjie1024 commented 1 month ago

Closes #10905. This PR is the first to introduce the low shuffle merge optimization to speed up merge. Currently we only support Databricks 13.3; we will add support for more versions once this PR is merged.

liurenjie1024 commented 1 month ago

build

liurenjie1024 commented 1 month ago

build

liurenjie1024 commented 1 month ago

build

liurenjie1024 commented 1 month ago

cc @jlowe @razajafri I've addressed the comments and added integration tests, PTAL.

liurenjie1024 commented 1 month ago

build

liurenjie1024 commented 1 month ago

build

liurenjie1024 commented 1 month ago

cc @jlowe I've addressed all comments, PTAL.

liurenjie1024 commented 1 month ago

build

liurenjie1024 commented 4 weeks ago

cc @jlowe I have fixed all the tests and it should work now, but there are some follow-up issues to resolve:

  1. Implement true row index for the other Parquet scan modes; currently only the PERFILE scan is supported.
  2. Push filename grouping into GpuFileSourceScanExec to remove the limitation of one file per partition.
  3. Add support for all other platforms.

liurenjie1024 commented 4 weeks ago

build

liurenjie1024 commented 3 weeks ago

build

liurenjie1024 commented 3 weeks ago

build

liurenjie1024 commented 3 weeks ago

build

liurenjie1024 commented 3 weeks ago

build

liurenjie1024 commented 3 weeks ago

build

liurenjie1024 commented 3 weeks ago

build

liurenjie1024 commented 3 weeks ago

build

liurenjie1024 commented 3 weeks ago

build

liurenjie1024 commented 3 weeks ago

build

liurenjie1024 commented 3 weeks ago

> Thanks for all the updates, @liurenjie1024! This is getting close. Would be good to file the followup issues, ideally pointing to them with TODOs in the code. Also need performance numbers as mentioned before.

Sure, I will run some experiments to measure the performance improvements.

liurenjie1024 commented 1 week ago

Closed by #10979.