Support serializing packed tables directly for the normal shuffle path

firestarman commented 1 month ago

Contributes to https://github.com/NVIDIA/spark-rapids/issues/10790 Fixes https://github.com/NVIDIA/spark-rapids/issues/10841

This PR tries to accelerate the normal shuffle path by partitioning and slicing tables on the GPU.

The sliced table is already serializable, so it can be written directly to the shuffle output stream, along with lightweight metadata (a TableMeta) used to rebuild the table on the shuffle read side.
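To illustrate the write side described above, here is a minimal CPU-only sketch in Python. It is not the plugin's code or wire format: `partition_and_slice`, `write_packed_table`, the JSON metadata, and the `<IQ` length header are all hypothetical stand-ins for the GPU partitioning, the packed table, and the TableMeta.

```python
import io
import json
import struct

def partition_and_slice(rows, num_partitions):
    """Group int64 rows by partition id and pack them into one contiguous buffer.

    Returns (packed, slices), where slices[i] is the (offset, length) in bytes
    of partition i inside the packed buffer.
    """
    buckets = [[] for _ in range(num_partitions)]
    for value in rows:
        buckets[value % num_partitions].append(value)
    packed = bytearray()
    slices = []
    for bucket in buckets:
        data = struct.pack(f"<{len(bucket)}q", *bucket)
        slices.append((len(packed), len(data)))
        packed += data
    return bytes(packed), slices

def write_packed_table(out, meta, data):
    """Write one slice as: header (meta length, data length), meta, raw data."""
    meta_bytes = json.dumps(meta).encode("utf-8")
    # Hypothetical framing, assumed only for this sketch.
    out.write(struct.pack("<IQ", len(meta_bytes), len(data)))
    out.write(meta_bytes)
    out.write(data)

# Usage: slice once, then write each partition's bytes without re-encoding rows.
rows = [5, 2, 8, 3, 6]
packed, slices = partition_and_slice(rows, 2)
out = io.BytesIO()
for pid, (offset, length) in enumerate(slices):
    meta = {"partition": pid, "num_rows": length // 8}
    write_packed_table(out, meta, packed[offset:offset + length])
```

The point of the sketch is that each slice is written as raw bytes plus a small descriptor, so no per-row serialization happens on the write path.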

On the shuffle read side, the newly introduced PackedTableIterator reads the tables from the shuffle input stream and rebuilds them on the GPU by leveraging the existing utils (MetaUtils, GpuCompressedColumnVector). Next, the existing GpuCoalesceBatches node does the batch concatenation for the downstream operators, similar to what the RAPIDS Shuffle does.
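The read side can be sketched the same way. This is a CPU-only Python illustration under stated assumptions, not the actual PackedTableIterator or GpuCoalesceBatches: the framing (`<IQ` header, JSON metadata with a `num_rows` field, little-endian int64 data) and the helpers `packed_table_iterator`, `coalesce`, and `frame` are hypothetical, and the coalesce target is counted in rows here purely for simplicity.

```python
import io
import json
import struct

HEADER = struct.Struct("<IQ")  # hypothetical framing: meta length, data length

def read_exact(stream, n):
    """Read exactly n bytes or fail, so truncated input is not silently accepted."""
    buf = stream.read(n)
    if len(buf) != n:
        raise EOFError(f"expected {n} bytes, got {len(buf)}")
    return buf

def packed_table_iterator(stream):
    """Yield (meta, rows) for each framed table until the stream is exhausted."""
    while True:
        header = stream.read(HEADER.size)
        if not header:
            return  # clean end of stream
        if len(header) != HEADER.size:
            raise EOFError("truncated header")
        meta_len, data_len = HEADER.unpack(header)
        meta = json.loads(read_exact(stream, meta_len))
        data = read_exact(stream, data_len)
        # The metadata is what lets the reader rebuild the table from raw bytes.
        rows = list(struct.unpack(f"<{meta['num_rows']}q", data))
        yield meta, rows

def coalesce(tables, target_rows):
    """Concatenate small tables into batches of at least target_rows rows."""
    batch = []
    for _meta, rows in tables:
        batch.extend(rows)
        if len(batch) >= target_rows:
            yield batch
            batch = []
    if batch:
        yield batch

def frame(rows):
    """Demo helper: build one framed table in the assumed format."""
    meta = json.dumps({"num_rows": len(rows)}).encode("utf-8")
    data = struct.pack(f"<{len(rows)}q", *rows)
    return HEADER.pack(len(meta), len(data)) + meta + data

stream = io.BytesIO(frame([1, 2]) + frame([3]) + frame([4, 5, 6]))
batches = list(coalesce(packed_table_iterator(stream), target_rows=3))
```

Keeping the iterator and the coalescing step separate mirrors the description: the reader only rebuilds tables, and concatenation for downstream operators is left to an existing stage.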

It led to some perf regression in NDS runs, so this feature is disabled by default. But we got about a 2x speedup for a customer query (we got this only when setting the executor cores to 2, but it is supposed to be 16).

Waiting for more tests ...

Numbers for 3k Parquet data on our cluster.

// ==GPU Serde
app-20240517075217-0003,Power Test Time,607000

// ==CPU Serde
app-20240517070754-0000,Power Test Time,585000

firestarman commented 1 month ago

Marking this as draft because 5 unit tests are still failing.

firestarman commented 4 weeks ago

Worked around (WAR) the failing tests by disabling the GPU serde, and filed an issue (https://github.com/NVIDIA/spark-rapids/issues/10823) to track the follow-up.

firestarman commented 4 weeks ago

build

firestarman commented 4 weeks ago

build

abellina commented 4 weeks ago

> It led to some perf regression in NDS runs, so this feature is disabled by default.

Which queries were slower? It would be great to get some feedback from you on what is different between the customer query and the NDS queries.

Also which queries got faster from NDS? That would be interesting.

I also wrote internally, as I'd like to see more standard configurations used for this benchmark on the next run, so we can compare apples-to-apples with our baseline.

firestarman commented 3 weeks ago

build

sameerz commented 3 weeks ago

Please add more context about why the test cases in #10823 are failing before merging this PR. We'd like to understand if that issue needs to be addressed as part of this PR.

firestarman commented 2 weeks ago

> Please add more context about why the test cases in https://github.com/NVIDIA/spark-rapids/issues/10823 are failing before merging this PR. We'd like to understand if that issue needs to be addressed as part of this PR.

Done

firestarman commented 2 weeks ago

Moving to draft since the perf is not as good as we expected. The previous 2x speedup was obtained only when setting the executor cores to 2, but it is supposed to be 16.

NVIDIA / spark-rapids

Support serializing packed tables directly for the normal shuffle path #10818