cldellow / sqlite-parquet-vtable

A SQLite vtable extension to read Parquet files
Apache License 2.0

optionally memoize filter -> rowgroup mappings? #6

Closed: cldellow closed this 6 years ago

cldellow commented 6 years ago

In the census dataset, a query for city == 'Dawson Creek' can't effectively prune row groups, since the rows aren't sorted by city. A row group might have min=Abbotsford and max=Vancouver even if those are the only two cities it contains, so 'Dawson Creek' falls inside its min/max range and we still decode the whole row group. :(

E.g., if we split the data into ~256 row groups of 50k rows each, the min/max statistics prune only 12 row groups. We examine the remaining 244, but only 1 of those actually contains matching rows. (If we used the low-level page API, we could do something similar directly against the Parquet dictionary, but https://github.com/apache/arrow/issues/1426#issuecomment-352265833 makes me think that's going to be beyond my skills.)
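For reference, a minimal sketch of what that min/max pruning looks like against the parquet-cpp metadata API (the file name and column index are made up, and the exact API surface may differ across Arrow versions):

```cpp
#include <parquet/api/reader.h>

#include <iostream>
#include <string>

int main() {
  // Hypothetical file and column; the point is that pruning only sees min/max.
  auto reader = parquet::ParquetFileReader::OpenFile("census.parquet");
  auto md = reader->metadata();
  const std::string needle = "Dawson Creek";
  const int cityCol = 0;  // pretend column 0 is `city`

  for (int rg = 0; rg < md->num_row_groups(); rg++) {
    auto rgMd = md->RowGroup(rg);
    auto colMd = rgMd->ColumnChunk(cityCol);
    auto stats = colMd->statistics();
    if (!stats || !stats->HasMinMax()) {
      std::cout << "row group " << rg << ": must scan (no stats)\n";
      continue;
    }
    // For a BYTE_ARRAY/UTF8 column the encoded min/max compare like the raw
    // strings, so an unsorted city column easily spans Abbotsford..Vancouver
    // and this range check almost never excludes anything.
    bool prunable = needle < stats->EncodedMin() || needle > stats->EncodedMax();
    std::cout << "row group " << rg << (prunable ? ": pruned\n" : ": must scan\n");
  }
  return 0;
}
```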

So, it might be nice if we detected this and memoized the mapping for future queries. Optionally, we could persist it to disk, although then we'd have to invalidate the mapping if the parquet file changed.
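A rough sketch of what the in-memory half could look like (`FileFingerprint`, `RowGroupMemo`, and the clause-key format are all invented here, not anything in the extension): key the memo on the canonicalized constraint plus the file's size and mtime, so a rewritten parquet file naturally misses the cache.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

struct FileFingerprint {
  int64_t sizeBytes;
  int64_t mtimeNs;
  bool operator==(const FileFingerprint& o) const {
    return sizeBytes == o.sizeBytes && mtimeNs == o.mtimeNs;
  }
};

class RowGroupMemo {
 public:
  // `clauseKey` is something like "city = 'Dawson Creek'", canonicalized.
  // Returns nullptr on a miss, or if the file changed since we memoized.
  const std::vector<bool>* Lookup(const std::string& clauseKey,
                                  const FileFingerprint& fp) const {
    auto it = memo_.find(clauseKey);
    if (it == memo_.end() || !(it->second.fp == fp)) return nullptr;
    return &it->second.rowGroups;
  }

  // rowGroups[i] == true means row group i may contain matching rows.
  void Store(const std::string& clauseKey, const FileFingerprint& fp,
             std::vector<bool> rowGroups) {
    memo_[clauseKey] = Entry{fp, std::move(rowGroups)};
  }

 private:
  struct Entry {
    FileFingerprint fp;
    std::vector<bool> rowGroups;
  };
  std::unordered_map<std::string, Entry> memo_;
};
```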

Goal:

Maybe each clause should get its own mapping, and we could flexibly AND the mappings together at query time?
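If we do go per-clause, combining them at query time is just a bitmap intersection; a tiny sketch (names invented):

```cpp
#include <vector>

// One std::vector<bool> per memoized clause; a row group is scanned only if
// every clause admits it. Row groups beyond a clause's recorded range are left
// as "must scan".
std::vector<bool> IntersectClauseMaps(const std::vector<std::vector<bool>>& perClause,
                                      size_t numRowGroups) {
  // With no memoized clauses we can't prune anything: scan every row group.
  std::vector<bool> result(numRowGroups, true);
  for (const auto& m : perClause) {
    for (size_t rg = 0; rg < numRowGroups && rg < m.size(); rg++) {
      result[rg] = result[rg] && m[rg];
    }
  }
  return result;
}
```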

Implementation note: sqlite may bail before scanning the entire dataset (e.g., if a LIMIT clause is present) -- we should either store the mapping only when we detect that a full scan completed, or store a marker indicating which row group to resume processing from.
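One way to express the "only store after a full scan" guard, sketched with invented names:

```cpp
#include <optional>
#include <vector>

// Hypothetical per-query scan state; none of these names are from the extension.
struct ScanState {
  std::vector<bool> matched;   // row groups that produced at least one matching row
  int nextRowGroup = 0;        // where we'd resume if SQLite bailed early (e.g. LIMIT)
  int totalRowGroups = 0;
};

// Called when SQLite closes the cursor: return a mapping worth memoizing only
// if the scan actually covered every row group; otherwise return nothing (or,
// per the alternative above, persist nextRowGroup as a resume marker).
std::optional<std::vector<bool>> MappingToMemoize(const ScanState& s) {
  if (s.nextRowGroup >= s.totalRowGroups) return s.matched;
  return std::nullopt;
}
```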

cldellow commented 6 years ago

We could create a shadow table in sqlite to store this metadata; then people could opt into or out of persistence by choosing an in-memory or on-disk db.
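One possible shape for that shadow table, sketched with the plain sqlite3 C API (the table name and columns are guesses, not anything the extension currently does):

```cpp
#include <sqlite3.h>
#include <stdio.h>

static const char* kCreateShadow =
    "CREATE TABLE IF NOT EXISTS _parquet_rowgroup_memo ("
    "  file_path   TEXT NOT NULL,"
    "  file_size   INTEGER NOT NULL,"
    "  file_mtime  INTEGER NOT NULL,"
    "  clause      TEXT NOT NULL,"    // canonicalized filter, e.g. city='Dawson Creek'
    "  row_groups  BLOB NOT NULL,"    // bitmap: bit i set => row group i may match
    "  PRIMARY KEY (file_path, clause)"
    ");";

// Create the shadow table if it doesn't exist; whether it persists follows
// directly from whether `db` is an in-memory or on-disk database.
int EnsureShadowTable(sqlite3* db) {
  char* err = nullptr;
  int rc = sqlite3_exec(db, kCreateShadow, nullptr, nullptr, &err);
  if (rc != SQLITE_OK) {
    fprintf(stderr, "shadow table DDL failed: %s\n", err ? err : "unknown");
    sqlite3_free(err);
  }
  return rc;
}
```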