colin-ho opened this pull request 6 days ago
Comparing `colin/dynamic-parquet` (460b060) with `main` (274f300)

⚡ 1 improvement · ❌ 1 regression · ✅ 15 untouched benchmarks

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.
| Benchmark | `main` | `colin/dynamic-parquet` | Change |
|---|---|---|---|
| ⚡ `test_iter_rows_first_row[100 Small Files]` | 378.4 ms | 230.9 ms | +63.88% |
| ❌ `test_show[100 Small Files]` | 23.9 ms | 28.0 ms | -14.55% |
Attention: Patch coverage is 90.36545% with 29 lines in your changes missing coverage. Please review.

Project coverage is 77.43%. Comparing base (84db665) to head (460b060). Report is 10 commits behind head on main.
Implement a dynamically parallel local streaming parquet reader.
Background
The current streaming local parquet reader, while fast, has a key problem: it launches all of its deserialization tasks at once, regardless of how quickly the consumer reads the results. This leads to unnecessarily high memory usage, and it can potentially starve downstream tasks.
Solution
Instead of launching all tasks at once, we can incrementally increase the number of parallel deserialization tasks based on a few factors (a sketch of this decision rule follows the list):

- If reads take much longer than deserialization, don't bother spawning more tasks; they would just sit idle waiting on I/O.
- Conversely, if deserialization takes much longer than reads, spawn more tasks to get better pipelining.
- However, if the wait time to send data downstream is also long, the consumer is the bottleneck, so don't spawn more tasks either.
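Here is a minimal sketch of that heuristic, assuming running averages of the three stage timings are tracked; the function name and comparison thresholds are illustrative, not the PR's actual code:

```rust
use std::time::Duration;

/// Illustrative sketch of the spawn heuristic described above.
fn should_spawn_more(
    avg_read_time: Duration,
    avg_deserialize_time: Duration,
    avg_send_wait: Duration,
) -> bool {
    // Reads dominate: extra deserialization tasks would sit idle on I/O.
    if avg_read_time >= avg_deserialize_time {
        return false;
    }
    // The consumer is slow: spawning more tasks only buffers more
    // batches in memory.
    if avg_send_wait >= avg_deserialize_time {
        return false;
    }
    // Deserialization is the bottleneck and the consumer keeps up, so
    // more parallelism improves pipelining.
    true
}
```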
This is implemented with a dynamically updated semaphore. The read and compute tasks report their timings through a shared semaphore handle, which decides whether or not to add permits. To spawn a new compute task, a semaphore permit must first be acquired.
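A minimal sketch of that mechanism, assuming tokio; `DynamicSemaphore`, `maybe_grow`, and `spawn_compute_task` are illustrative names, not the PR's actual identifiers:

```rust
use std::sync::Arc;
use tokio::sync::{AcquireError, Semaphore};

/// Sketch of a semaphore whose permit count grows based on feedback
/// from the read and compute tasks.
struct DynamicSemaphore {
    inner: Arc<Semaphore>,
}

impl DynamicSemaphore {
    fn new(initial_permits: usize) -> Self {
        Self {
            inner: Arc::new(Semaphore::new(initial_permits)),
        }
    }

    /// Called by the feedback loop, e.g. with the result of the
    /// `should_spawn_more` heuristic sketched above. A tokio semaphore
    /// can only grow, so "not increasing" means leaving it alone.
    fn maybe_grow(&self, spawn_more: bool) {
        if spawn_more {
            self.inner.add_permits(1);
        }
    }

    /// A compute task may only be spawned once a permit is acquired;
    /// the permit is held for the task's lifetime, so at most
    /// `permits` deserialization tasks run concurrently.
    async fn spawn_compute_task(&self) -> Result<(), AcquireError> {
        let permit = self.inner.clone().acquire_owned().await?;
        tokio::spawn(async move {
            let _permit = permit; // released when the task finishes
            // ... deserialize one row group into record batches ...
        });
        Ok(())
    }
}
```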
Results
The most significant benefit is the memory usage of streaming queries. For example:

The new implementation peaks at 300 MB, while the old one goes over 1 GB.

Another example, where the entire file is streamed but consumed slowly:

The new implementation peaks at 1.2 GB, while the old one goes over 3 GB.
To verify performance parity, I also wrote benchmarks for parquet files with varying numbers of rows, columns, and row groups; the results show the new implementation is on par with the old, with only slight differences. On a TPC-H SF-1 lineitem table, the results are likewise essentially identical (~0.2 s for both).