Description
This PR is the first work for https://github.com/NVIDIA/spark-rapids/issues/11342. It adds the new configurations explained below. Please see the configuration docs in the PR for more details.
spark.rapids.sql.asyncWrite.queryOutput.enabled: enables the async write. Only Parquet is supported currently; ORC support is planned next. This option is off by default, and may be enabled by default in the future once it has been tested more in production.
spark.rapids.sql.queryOutput.holdGpuInTask: enables holding the GPU between processing batches during the output write. Enabling it is recommended when the async output write is enabled. It defaults to the value of spark.rapids.sql.asyncWrite.queryOutput.enabled.
spark.rapids.sql.asyncWrite.maxInFlightHostMemoryBytes: the maximum number of in-flight bytes per executor that can be buffered for async writes. A usage sketch follows this list.
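As a usage sketch, the options can be set on the Spark session builder (or passed via --conf at submit time); the 2 GB limit below is an arbitrary example value, not a recommendation:

import org.apache.spark.sql.SparkSession

// Illustrative only: enable async query output writing with an explicit
// host memory cap for in-flight writes.
val spark = SparkSession.builder()
  .appName("async-output-write-example")
  .config("spark.rapids.sql.asyncWrite.queryOutput.enabled", "true")
  // Recommended when async write is enabled; defaults to the value above.
  .config("spark.rapids.sql.queryOutput.holdGpuInTask", "true")
  // 2 GB, an arbitrary example value.
  .config("spark.rapids.sql.asyncWrite.maxInFlightHostMemoryBytes",
    (2L * 1024 * 1024 * 1024).toString)
  .getOrCreate()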
Performance test results
The charts above show performance test results for the async output writing. The test setup was as follows:
The NDS dataset at sf=100 was used for testing. The generated dataset was stored in Parquet format in Google Cloud Storage.
The query below was used for testing. The output size was about 37.9 GB.
select s.* from store s, store_sales ss
where s.s_store_sk <= ss.ss_store_sk and s.s_store_sk + 41 > ss.ss_store_sk;
Two workers were used for testing. The executor and task counts were set to 32.
The GPU parallelism was set to 4.
The results show that async writing + holding the GPU between batches improved the query time by about 11% compared to sync writing + releasing the GPU between batches (the current behavior). This was due to lower memory pressure, and thus fewer spills and retries. Interestingly, the retry block time increased with async writing + holding the GPU. This seems to be because the async write reduced the memory pressure, so many tasks were able to proceed further and even finish without hitting out-of-memory errors. As a result, the tasks blocked by memory allocation failures had to wait longer until the running tasks finished their work and released memory.
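To make the buffering behavior concrete, here is a minimal sketch of the bounded async-write pattern, assuming batches are first copied to host memory and written by a background thread. This is not the PR's actual implementation, and AsyncWriterSketch is a hypothetical name:

import java.util.concurrent.{Executors, TimeUnit}

// Minimal sketch, not the PR's implementation: host-memory batches are queued
// to a single background writer thread, and callers block once the bytes in
// flight would exceed the configured limit. Assumes every batch is smaller
// than maxInFlightBytes; otherwise scheduleWrite would wait forever.
class AsyncWriterSketch(maxInFlightBytes: Long, writeBatch: Array[Byte] => Unit) {
  private val executor = Executors.newSingleThreadExecutor()
  private val lock = new Object
  private var inFlightBytes = 0L

  def scheduleWrite(hostBatch: Array[Byte]): Unit = {
    lock.synchronized {
      // Throttle the producer until this batch fits under the limit.
      while (inFlightBytes + hostBatch.length > maxInFlightBytes) lock.wait()
      inFlightBytes += hostBatch.length
    }
    executor.submit(new Runnable {
      override def run(): Unit = {
        try writeBatch(hostBatch)
        finally lock.synchronized {
          inFlightBytes -= hostBatch.length
          lock.notifyAll()
        }
      }
    })
  }

  // Drain all queued writes and stop the writer thread.
  def close(): Unit = {
    executor.shutdown()
    executor.awaitTermination(Long.MaxValue, TimeUnit.SECONDS)
  }
}

With holdGpuInTask enabled, a task keeps the GPU while writes like these are queued, so GPU processing of the next batch can overlap with the file write instead of releasing and re-acquiring the GPU between batches.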
Future work
I did not add any integration tests in this PR. I'm planning to add some in a follow-up.
ORC format will be supported in a follow-up.
The default of spark.rapids.sql.asyncWrite.maxInFlightHostMemoryBytes is currently a fixed value. This is not only inflexible, but could also cause OOM errors if it conflicts with other memory settings. It would be nice if its default could be computed from the executor memory settings. This can be done in a follow-up as well.
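For illustration only, such a default could be derived as a fraction of the configured executor memory; the helper below and the 10% fraction are hypothetical, not what the plugin does:

import org.apache.spark.SparkConf
import org.apache.spark.network.util.JavaUtils

// Hypothetical sketch: derive the async-write host buffer limit from the
// executor memory settings instead of a fixed constant. The 10% fraction
// is an arbitrary illustrative choice.
def defaultMaxInFlightBytes(conf: SparkConf): Long = {
  // spark.executor.memory uses the JVM size format, e.g. "16g".
  val executorMemBytes =
    JavaUtils.byteStringAsBytes(conf.get("spark.executor.memory", "1g"))
  (executorMemBytes * 0.1).toLong
}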