cldellow / sqlite-parquet-vtable

A SQLite vtable extension to read Parquet files
Apache License 2.0

rowSatisfiesFilter eagerly evaluates constraints #10

Closed: cldellow closed this issue 6 years ago

cldellow commented 6 years ago

https://github.com/cldellow/sqlite-parquet-vtable/blob/d7c5002ceed2d045b7839c47d2b862418b5ad03d/parquet/parquet_cursor.cc#L681-L688

It does this for simplicity -- so we can populate the row group stats cache.

However, this comes at a cost. Imagine a query with 3 clauses that operate on different columns. We'll have to decompress all three columns for every row, even if the first clause alone would have pruned the row group. In the pathological case that's a 3x perf penalty :(

If we did more bookkeeping, we'd know which clauses have already been fully evaluated for a given row group, and could short-circuit the rest (see the sketch below).
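A minimal sketch of what that bookkeeping could look like -- this is not the vtable's actual code, and the names (`Constraint`, `statsCompleteForRowGroup`, `rowMatches`) are hypothetical:

```cpp
#include <functional>
#include <vector>

// Hypothetical stand-in for the vtable's per-constraint state.
struct Constraint {
  std::function<bool(int)> rowMatches;    // test this clause against a row
  bool statsCompleteForRowGroup = false;  // row group stats already fully populated?
};

bool rowSatisfiesFilter(int rowIndex, std::vector<Constraint>& constraints) {
  bool satisfies = true;
  for (auto& c : constraints) {
    // If an earlier clause already rejected the row and this clause's
    // row group stats are complete, there's nothing left to learn:
    // skip the column read instead of decompressing it eagerly.
    if (!satisfies && c.statsCompleteForRowGroup)
      continue;
    satisfies = c.rowMatches(rowIndex) && satisfies;
    // (updates to the row group stats cache would happen here)
  }
  return satisfies;
}
```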

cldellow commented 6 years ago

See comments in https://github.com/cldellow/sqlite-parquet-vtable/commit/e1a86954e517cdc00b61e3ede70e789524419298

cldellow commented 6 years ago

One option is to decompress the entire row group's worth of data for a column when evaluating rowGroupSatisfiesFilter.

This would let us amortize the cost of some of the once-per-rowgroup checks that we currently do once-per-row.

More importantly, it would ensure we make progress on updating the row group stats - we'd always scan at least one column's row group in its entirety.

A major downside is that it would require buffering an entire row group's worth of values in memory, whereas we can currently stream them row by row.
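Roughly, batch-decompressing one column chunk with parquet-cpp's `ReadBatch` might look like the sketch below (a sketch only: it assumes a required, non-nullable INT64 column, and the file path / column index are illustrative):

```cpp
#include <memory>
#include <string>
#include <vector>
#include <parquet/api/reader.h>

// Read every value of one INT64 column in one row group into memory.
std::vector<int64_t> readWholeColumnChunk(const std::string& path,
                                          int rowGroup, int column) {
  std::unique_ptr<parquet::ParquetFileReader> reader =
      parquet::ParquetFileReader::OpenFile(path);
  int64_t numRows = reader->metadata()->RowGroup(rowGroup)->num_rows();

  auto colReader = reader->RowGroup(rowGroup)->Column(column);
  auto* int64Reader = static_cast<parquet::Int64Reader*>(colReader.get());

  std::vector<int64_t> values(numRows);
  int64_t totalRead = 0;
  while (int64Reader->HasNext() && totalRead < numRows) {
    int64_t valuesRead = 0;
    // def/rep levels can be null for a required (non-nullable) column.
    int64Reader->ReadBatch(numRows - totalRead, nullptr, nullptr,
                           values.data() + totalRead, &valuesRead);
    totalRead += valuesRead;
  }
  return values;
}
```

This is where the memory cost shows up: `values` holds the whole chunk at once.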

cldellow commented 6 years ago

I wrote a tool to make it easier to sort a CSV before creating a parquet file: https://github.com/cldellow/csv2parquet

Sorting the dataset I'm working with by its most commonly used lookup key drops cold query time from 400ms to 100ms.

Soooo, while I'm sure there's still interesting work to do here, it's not worth pursuing for my use case, so I'm going to close this. :)

cldellow commented 6 years ago

Another option: Parquet exposes size statistics, so we could have a knob like "batch decompress when the chunk is smaller than XX KB". That would let people balance memory usage against speed. To get the best behaviour, we may have to use a different API -- check if https://github.com/apache/parquet-cpp/commit/2cf2af2dcbaaa13f45325e40b97c907689918411 is relevant.
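For illustration, the check behind such a knob could consult the per-column-chunk sizes in the row group metadata; this is a hedged sketch, and the threshold parameter (`maxBatchBytes`) is made up here:

```cpp
#include <parquet/api/reader.h>

// Decide whether a column chunk is small enough to batch-decompress.
bool shouldBatchDecompress(const parquet::FileMetaData& meta,
                           int rowGroup, int column,
                           int64_t maxBatchBytes /* e.g. 64 * 1024 */) {
  auto chunk = meta.RowGroup(rowGroup)->ColumnChunk(column);
  return chunk->total_uncompressed_size() < maxBatchBytes;
}
```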