NVIDIA / spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs
https://nvidia.github.io/spark-rapids
Apache License 2.0

Add support for asynchronous writing for parquet #11730

Open jihoonson opened 3 days ago

jihoonson commented 3 days ago

Description

This PR is the first step toward https://github.com/NVIDIA/spark-rapids/issues/11342. It adds the new configurations described below. Please see the configuration docs in the PR for more details.

Performance test results

query time, spilled bytes, retry count, batchSize=100M

retry count, retry block time, batchSize=100M

The charts above show some performance test results of the async output writing. The test setup was:

The results show that async writing + holding the GPU between batches improved the query time by about 11% compared to sync writing + releasing the GPU between batches (the current behavior). This was due to lower memory pressure, and thus fewer spills and retries. Interestingly, the retry block time increased with async writing + holding the GPU. This seems to be because async writing reduced the memory pressure, so many tasks were able to proceed further and even finish without throwing out-of-memory errors. As a result, the tasks blocked by a memory allocation failure had to wait longer for the running tasks to finish their work and release memory.
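For readers unfamiliar with the general pattern, here is a minimal, generic sketch of asynchronous output writing: the producer hands each finished buffer to a background thread and moves on to the next batch instead of waiting for the write. It is not the code in this PR. The class name `AsyncOutputWriter`, the `max_in_flight_bytes` cap, and the in-memory sink are all hypothetical names chosen for illustration, and the sketch does not model GPU memory or the GPU semaphore at all.

```python
import io
import threading
from concurrent.futures import ThreadPoolExecutor


class AsyncOutputWriter:
    """Hypothetical sketch: writes run on a background thread, with a cap on buffered bytes."""

    def __init__(self, sink, max_in_flight_bytes):
        self._sink = sink
        self._max = max_in_flight_bytes  # assumed budget, not a setting from the PR
        self._in_flight = 0
        self._cond = threading.Condition()
        # A single worker keeps writes in submission order.
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._futures = []

    def write(self, data):
        with self._cond:
            # Block the producer while the budget is exhausted.
            # A buffer larger than the budget is admitted once nothing else is pending.
            while self._in_flight > 0 and self._in_flight + len(data) > self._max:
                self._cond.wait()
            self._in_flight += len(data)
        self._futures.append(self._pool.submit(self._do_write, data))

    def _do_write(self, data):
        try:
            self._sink.write(data)
        finally:
            # Release the budget even if the write failed, so the producer never hangs.
            with self._cond:
                self._in_flight -= len(data)
                self._cond.notify_all()

    def close(self):
        # Wait for all queued writes, then surface the first write error, if any.
        self._pool.shutdown(wait=True)
        for future in self._futures:
            future.result()

    @property
    def in_flight_bytes(self):
        with self._cond:
            return self._in_flight


# Usage: the producer loop returns from write() as soon as the buffer is queued.
sink = io.BytesIO()
writer = AsyncOutputWriter(sink, max_in_flight_bytes=8)
for chunk in (b"batch-1;", b"batch-2;", b"batch-3;"):
    writer.write(chunk)
writer.close()
```

The cap on in-flight bytes is one way to keep queued buffers from growing without bound when the sink is slower than the producer. Whether and how the PR throttles writes is covered by its configuration docs, not by this sketch.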

Future work

jihoonson commented 3 days ago

build

jihoonson commented 3 days ago

build

jihoonson commented 2 days ago

build

jihoonson commented 2 days ago

build

jihoonson commented 2 days ago

build

abellina commented 13 hours ago

Will do another pass today.

jihoonson commented 8 hours ago

build