cldellow / sqlite-parquet-vtable

A SQLite vtable extension to read Parquet files
Apache License 2.0
267 stars 31 forks source link

add `MATCH` operator for faster `IN` lookups #35

Open cldellow opened 6 years ago

cldellow commented 6 years ago

See discussion at #34.

SQLite handles SELECT * FROM tbl WHERE col IN (a, b, c) by unioning three queries (col = a, col = b, col c). The overhead to seek to a rowgroup and uncompress a large batch just for a single match is inefficient.

Instead, let's implement the MATCH function. eg:

-- for a numeric field
SELECT * FROM tbl WHERE col MATCH '1,2,3'

-- for a string field
SELECT * FROM tbl WHERE col MATCH '''a'',''b'',''c'''

This should support dynamically constructing the clause, e.g.:

SELECT * FROM tbl WHERE col MATCH (SELECT group_concat(quote(value)) FROM lookup WHERE pred)

Ideally we'd support integers, doubles and strings. Integers are highest priority since doubles are less likely to be used as a join key and strings have reasonable support via LIKE already.

Also it'd be best to support this at both the row group and row level.

cldellow commented 6 years ago

By providing a single string that specifies multiple values, this would also make it possible to express variadic IN style queries in datasette's canned query feature.

ATM you need to do a slow/hacky workaround: instr(',35300027,35300028,', ',' || uid || ',') -- would be much cleaner to write uid MATCH '35300027,35300028'