facebookincubator / velox

A C++ vectorized database acceleration library aimed at optimizing query engines and data processing systems.
https://velox-lib.io/
Apache License 2.0
3.27k stars 1.08k forks

Parquet Writer Hotspot on memmove #5066

Open yyang52 opened 1 year ago

yyang52 commented 1 year ago

Description

ETL (Velox with Spark/Gluten) is an essential part of data analytics, and it involves writing data files. Currently, Velox implements its Parquet writer on top of the Arrow writer, which supports various compression codecs. For ETL workloads, ZSTD is a commonly used compression codec. We ran some benchmarks and tests to find the hotspots and see whether there is room for optimization.

Since Velox doesn't have a Parquet writer benchmark, we implemented a simple benchmark that writes Parquet files from TpchGen tables. Profiling this workload showed that the main hotspot is __memmove_avx_unaligned_erms (> 50%), while ZSTD_compress takes less than 10% of the time. In contrast, benchmarks on the Arrow side showed a higher share for ZSTD_compress (more than 15%, and > 80% when running column_io_benchmark).

Velox workload: [image]
Arrow workload: [image] [image]

I'm not sure whether this is due to a different memory pool implementation or to an improper workload on my side. Do we have any profiling data on the Parquet writer, or is there any plan to optimize this part with SW/HW accelerators?

yyang52 commented 1 year ago

@oerling Any thoughts on that? Thanks!

zhouyuan commented 1 year ago

CC: @JkSelf

JkSelf commented 1 year ago

Can you check whether PR #4854 solves your problem? It seems that the performance degradation is caused by frequent allocation and copying in the reallocate method of DataBuffer.
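
For readers who want to see why frequent reallocate-and-copy can dominate a profile with memmove, here is a minimal sketch. It is not Velox's DataBuffer and does not reproduce what PR #4854 changes; `GrowableBuffer`, `Growth`, and `bytesMovedFor` are hypothetical names used only for illustration. It assumes a buffer that copies its existing contents on every capacity increase, and compares growing to exactly the requested size against doubling the capacity.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical growth policies (illustration only, not Velox code).
enum class Growth { kExact, kDoubling };

// Hypothetical append-only buffer that counts the bytes it has to copy
// whenever it reallocates.
class GrowableBuffer {
 public:
  explicit GrowableBuffer(Growth policy) : policy_(policy) {}
  ~GrowableBuffer() { std::free(data_); }
  GrowableBuffer(const GrowableBuffer&) = delete;
  GrowableBuffer& operator=(const GrowableBuffer&) = delete;

  void append(const char* src, std::size_t n) {
    if (n == 0) {
      return;
    }
    if (size_ + n > capacity_) {
      reserve(size_ + n);
    }
    std::memcpy(data_ + size_, src, n);
    size_ += n;
  }

  std::size_t bytesMoved() const { return bytesMoved_; }
  std::size_t reallocations() const { return reallocations_; }

 private:
  void reserve(std::size_t needed) {
    std::size_t newCapacity = needed;  // kExact: grow to exactly what is needed.
    if (policy_ == Growth::kDoubling && capacity_ > 0) {
      newCapacity = capacity_;
      while (newCapacity < needed) {
        newCapacity *= 2;  // kDoubling: geometric growth.
      }
    }
    char* fresh = static_cast<char*>(std::malloc(newCapacity));
    assert(fresh != nullptr);
    if (size_ > 0) {
      // This copy is the kind of work that shows up as memmove/memcpy.
      std::memcpy(fresh, data_, size_);
      bytesMoved_ += size_;
    }
    std::free(data_);
    data_ = fresh;
    capacity_ = newCapacity;
    ++reallocations_;
  }

  Growth policy_;
  char* data_ = nullptr;
  std::size_t size_ = 0;
  std::size_t capacity_ = 0;
  std::size_t bytesMoved_ = 0;
  std::size_t reallocations_ = 0;
};

// Appends `chunks` chunks of `chunkSize` bytes (at most 256 each) and returns
// the number of bytes copied by reallocations alone.
std::size_t bytesMovedFor(Growth policy, std::size_t chunks, std::size_t chunkSize) {
  char chunk[256] = {};
  assert(chunkSize <= sizeof(chunk));
  GrowableBuffer buffer(policy);
  for (std::size_t i = 0; i < chunks; ++i) {
    buffer.append(chunk, chunkSize);
  }
  return buffer.bytesMoved();
}
```

Calling `bytesMovedFor(Growth::kExact, 1024, 64)` and `bytesMovedFor(Growth::kDoubling, 1024, 64)` shows the difference: with exact growth every append reallocates and the copied bytes grow quadratically with the number of appends, while with doubling the total copied stays below the final buffer size.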

yyang52 commented 1 year ago

Thanks for providing that! It seems we are hitting a similar issue, as I also found that the reserve method took a lot of time. I will try this PR to see whether it fixes the problem.

yyang52 commented 1 year ago

> Can you check whether PR #4854 solves your problem? It seems that the performance degradation is caused by frequent allocation and copying in the reallocate method of DataBuffer.

I have tried this PR and it does solve the problem! The hotspot is now compression, since memmove no longer takes much time.