One option is to decompress the entire row group's worth of data for a column when evaluating rowGroupSatisfiesFilter.
This would let us amortize the cost of some of the once-per-rowgroup checks that we currently do once-per-row.
More importantly, it would ensure we make progress on updating the row group stats - we'd always scan at least one column's chunk of the row group in its entirety.
A major downside to this is that it would require buffering an entire row group's worth of data in memory, whereas we currently stream values row by row.
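For concreteness, a minimal sketch of that option, assuming the parquet-cpp `ParquetFileReader` API and a required (non-null) INT64 column; the clause callback and function name are made up for illustration:

```cpp
#include <functional>
#include <memory>
#include <vector>

#include <parquet/api/reader.h>

// Hypothetical helper: decompress one column's entire chunk for a row group
// up front, then evaluate a clause over the buffered values. Assumes a
// required INT64 column for brevity; real code would dispatch on column type.
bool rowGroupSatisfiesInt64Clause(parquet::ParquetFileReader& reader,
                                  int rowGroup, int column,
                                  const std::function<bool(int64_t)>& clause) {
  std::shared_ptr<parquet::RowGroupReader> rg = reader.RowGroup(rowGroup);
  std::shared_ptr<parquet::ColumnReader> col = rg->Column(column);
  auto* typed = static_cast<parquet::Int64Reader*>(col.get());

  // Downside illustrated here: the whole chunk is materialized in memory
  // rather than streamed row by row.
  int64_t numRows = reader.metadata()->RowGroup(rowGroup)->num_rows();
  std::vector<int64_t> values(numRows);
  int64_t totalRead = 0;
  while (typed->HasNext() && totalRead < numRows) {
    int64_t valuesRead = 0;
    typed->ReadBatch(numRows - totalRead, nullptr, nullptr,
                     values.data() + totalRead, &valuesRead);
    totalRead += valuesRead;
  }

  for (int64_t i = 0; i < totalRead; i++) {
    if (clause(values[i]))
      return true;  // at least one row in this row group matches
  }
  return false;
}
```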
I wrote a tool to make it easier to sort a CSV before creating a parquet file: https://github.com/cldellow/csv2parquet
If you sort the dataset I'm working with by its most commonly used lookup key, cold query time drops from 400ms to 100ms.
Soooo, while I'm sure there's still interesting work to do here, it's not worth pursuing for my use case, so I'm going to close this. :)
Another option: parquet has size statistics, so we could have a knob for "batch decompress for size < XX KB". That would let people balance memory usage against speed. To get the best behaviour, we may have to use a different API -- check if https://github.com/apache/parquet-cpp/commit/2cf2af2dcbaaa13f45325e40b97c907689918411 is relevant
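A rough sketch of what that knob could check, using the column chunk size metadata that parquet-cpp exposes; the threshold constant and function name here are made up:

```cpp
#include <memory>

#include <parquet/api/reader.h>

// Hypothetical knob: only batch-decompress a column chunk when its
// decoded size is below a configurable threshold, so memory usage stays
// bounded. The constant is illustrative, not a real setting.
constexpr int64_t kBatchDecompressThresholdBytes = 64 * 1024;  // "XX KB"

bool shouldBatchDecompress(const parquet::FileMetaData& metadata,
                           int rowGroup, int column) {
  std::unique_ptr<parquet::RowGroupMetaData> rg = metadata.RowGroup(rowGroup);
  std::unique_ptr<parquet::ColumnChunkMetaData> cc = rg->ColumnChunk(column);
  // total_uncompressed_size() is the size of the chunk once decoded;
  // total_compressed_size() is what's on disk.
  return cc->total_uncompressed_size() < kBatchDecompressThresholdBytes;
}
```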
https://github.com/cldellow/sqlite-parquet-vtable/blob/d7c5002ceed2d045b7839c47d2b862418b5ad03d/parquet/parquet_cursor.cc#L681-L688
The code linked above does this for simplicity -- so we can populate the row group stats cache.
However, this comes at a cost. Imagine a query that has 3 clauses that operate on different columns. We'll have to decompress all the columns, even if the first clause would prune the row group. This can be a 3x perf penalty in the pathological case :(
If we do more bookkeeping, we should be able to track which clauses have already been evaluated fully for a given row group, and only decompress the columns we still need.
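As a rough illustration, the bookkeeping could be as small as a per-row-group bitmap of clauses; the names here are hypothetical and not from the actual cursor code:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical bookkeeping: for each row group, remember which clauses have
// already been evaluated in full, so their stats are cached and we can skip
// decompressing their columns again on later scans.
class ClauseEvaluationTracker {
 public:
  ClauseEvaluationTracker(size_t numRowGroups, size_t numClauses)
      : evaluated_(numRowGroups, std::vector<bool>(numClauses, false)) {}

  // Record that a clause has been fully evaluated against a row group.
  void markEvaluated(size_t rowGroup, size_t clause) {
    evaluated_[rowGroup][clause] = true;
  }

  // A clause only needs its column decompressed if we haven't already
  // evaluated it in full for this row group.
  bool needsEvaluation(size_t rowGroup, size_t clause) const {
    return !evaluated_[rowGroup][clause];
  }

 private:
  std::vector<std::vector<bool>> evaluated_;
};
```

With something like this, rowGroupSatisfiesFilter could stop decompressing further columns as soon as one clause prunes the row group, while still recording which clauses did get fully evaluated so the stats cache stays correct.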