cldellow / datasette-ui-extras

Add editing UI and other power-user features to Datasette.

faster autocomplete searches #67

Closed: cldellow closed this issue 1 year ago

cldellow commented 1 year ago

We currently do autosuggest with a query like:

SELECT value, count, pks
FROM dux_column_stats_values
WHERE "table" = ? AND "column" = ? AND lower(value) LIKE ?
ORDER BY count DESC
LIMIT 10

This searches all the values for the column, which can be slow for large tables. For example, searching the name field of the 200K-row geonames table takes 50-100 ms. Adding an index on value and searching with value > 'Needle' AND value < 'NeedleZ' makes it basically instantaneous, which I think would be necessary for some other ideas I have.
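For reference, a minimal sketch of that range-scan variant (the index name here is made up, and unlike the lower() query above it matches case-sensitively):

CREATE INDEX IF NOT EXISTS dux_column_stats_values_value
ON dux_column_stats_values(value);

-- The prefix match rewritten as a range scan, so SQLite can seek
-- directly into the index instead of scanning every value:
SELECT value, count, pks
FROM dux_column_stats_values
WHERE "table" = ? AND "column" = ?
  AND value > 'Needle' AND value < 'NeedleZ'
ORDER BY count DESC
LIMIT 10

The ORDER BY count DESC still has to sort whatever the range matches, but the range is tiny compared to the whole column.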

But then:

Could/should we just use FTS? Should we use a separate table per column? I want to keep things operationally simple...
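For comparison, an FTS5 sketch (the virtual table name and prefix settings are hypothetical); the prefix option builds extra indexes so prefix queries don't have to scan:

CREATE VIRTUAL TABLE dux_values_fts USING fts5(
  value,
  "table" UNINDEXED,
  "column" UNINDEXED,
  count UNINDEXED,
  pks UNINDEXED,
  prefix='2 3 4'
);

SELECT value, count, pks
FROM dux_values_fts
WHERE dux_values_fts MATCH 'Needle*'
  AND "table" = ? AND "column" = ?
ORDER BY count DESC
LIMIT 10

One wrinkle: FTS matches tokens anywhere in the value, not just its leading characters, and the UNINDEXED columns are filtered after the match rather than via an index.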

cldellow commented 1 year ago

SELECT name, SUM("pgsize") FROM "dbstat" group by 1 order by 2 desc shows the size used by each table and index. I think repeating the table and column name in every row might be kinda expensive. Hmm.

cldellow commented 1 year ago

E.g. in the superuser database:

sqlite> create table test as select value, pks from dux_column_stats_values;
sqlite> SELECT name, SUM("pgsize") FROM "dbstat" group by 1 order by 2 desc;
dux_column_stats_values|324923392
sqlite_autoindex_dux_column_stats_values_1|272764928
test|264204288

So storing the table/column names bloats the table from 264 MB to 324 MB, about 22%.

Oh, hmm. We already pay a big b-tree price because we have a pkey on (table, column, value), so yeah, it'd be "free" to store lowercased values there. And if we did a table per column, we'd save that 22% overhead (at the price of a lot more tables... which might make other parts of Datasette angry); sketch below.
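To make the table-per-column shape concrete, a hypothetical sketch for geonames.name (the naming scheme is invented; there'd be one such table per table/column pair):

CREATE TABLE dux_values_geonames_name(
  value TEXT PRIMARY KEY,  -- stored pre-lowercased, so the range scan is case-insensitive
  count INTEGER,
  pks TEXT
);

-- Same loose upper bound as the query above:
SELECT value, count, pks
FROM dux_values_geonames_name
WHERE value > lower('Needle') AND value < lower('Needle') || 'z'
ORDER BY count DESC
LIMIT 10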

cldellow commented 1 year ago

Should also test: does indexing only text values help? E.g. we're not going to autocomplete ints, so paying the storage cost for those rows (table name, column name, pkey) sucks.
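If the rows all stay in one table, a partial index (index name invented) would at least keep non-text values out of the index, though skipping those rows entirely at write time would also avoid the table-side cost:

-- Ints, reals, blobs and nulls take no space in this index:
CREATE INDEX dux_values_text_only
ON dux_column_stats_values(value)
WHERE typeof(value) = 'text'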

cldellow commented 1 year ago

Re the size of the stats table: normalizing the table and column names will go pretty far towards fixing that (sketch below).
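A sketch of that normalization (all names hypothetical): intern each (table, column) pair once, then reference it by a small integer:

CREATE TABLE dux_ids(
  id INTEGER PRIMARY KEY,
  table_name TEXT NOT NULL,
  column_name TEXT NOT NULL,
  UNIQUE(table_name, column_name)
);

-- Stats rows then carry one small integer instead of two repeated strings:
CREATE TABLE dux_column_stats_values(
  column_id INTEGER NOT NULL REFERENCES dux_ids(id),
  value TEXT,
  count INTEGER,
  pks TEXT,
  PRIMARY KEY (column_id, value)
);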

Other things to consider: make the pkey

(table ID, column ID, lower(label)[0:10], hash)

This bounds the size of the key, so we're not repeating ourselves too much in the table and the index. If we want the underlying value, we can always interrogate the underlying object.

sqlean's crypto extension (https://github.com/nalgeon/sqlean/blob/main/docs/crypto.md) can provide hash functions.
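Putting the pieces together, a sketch of the bounded key (names are hypothetical, and sha256() stands in for whichever function the crypto extension actually exposes, which varies by version):

CREATE TABLE dux_column_stats_values(
  column_id INTEGER NOT NULL,
  prefix TEXT NOT NULL,  -- substr(lower(value), 1, 10)
  hash BLOB NOT NULL,    -- hash of the full value, to disambiguate truncated prefixes
  count INTEGER,
  pks TEXT,
  PRIMARY KEY (column_id, prefix, hash)
);

-- On insert, derive both key parts from the raw value:
INSERT INTO dux_column_stats_values
VALUES (?, substr(lower(?), 1, 10), sha256(?), ?, ?)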

cldellow commented 1 year ago

Other idea: limit the posting list size. We have the counts separately; we don't really need to maintain the full list of pkeys.

We might also store the pkeys as tuples rather than objects, although storing them as objects means we can detect if the schema has changed in an incompatible way.
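For illustration, the two JSON encodings side by side (the values are made up); since the count lives in its own column, the list itself could also be truncated at build time:

-- Objects are self-describing, so schema drift is detectable:
SELECT length('[{"id": 101}, {"id": 102}]');  -- 26
-- Tuples are positional and roughly half the size here:
SELECT length('[[101], [102]]');              -- 14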

cldellow commented 1 year ago

Good experimenting here; closing in favour of #68.