cldellow / datasette-ui-extras

Add editing UI and other power-user features to Datasette.
Apache License 2.0

design autocomplete system #68

Closed cldellow closed 1 year ago

cldellow commented 1 year ago

I've built a prototype. It's not great! I think it can be improved a lot.

Current system

The current system does batch indexing, which happens automatically on startup. The batch size is capped so that we don't spend too much time at startup, but that means the index can end up incomplete, and incomplete data is misleading.

It stores data in dux_column_stats_values like:

CREATE TABLE dux_column_stats_values(
  "table" text not null,
  "column" text not null,
  value text,
  count integer not null,
  pks text, -- JSON array of pkeys that have this value
  primary key ("table", "column", value)
);

It has data for columns of text and numeric type. It tracks the pkeys of all rows that have the value, so pks can be quite large.

If the underlying value is a JSON array of strings, it tracks 1 row for each string. It doesn't track pks for these rows.
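
A couple of made-up rows make that shape concrete (illustrative data, not from a real database):

INSERT INTO dux_column_stats_values ("table", "column", value, count, pks) VALUES
  ('articles', 'author', 'alice', 3, '[1, 7, 42]'), -- scalar value: pks lists every row with this value
  ('articles', 'tags', 'sqlite', 2, null);          -- string pulled out of a JSON array: no pks tracked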

This table is quite inefficient.

User-facing cases I care about

I care about two use cases.

Autosuggesting a specific row

Autosuggesting popular values

Attributes I care about

Proposed new system

CREATE TABLE dux_ids(
  id integer primary key,
  name text not null
);

CREATE TABLE dux_column_stats_values(
  "table" integer not null REFERENCES dux_ids(id),
  "column" integer not null REFERENCES dux_ids(id),
  value text, -- substr(lower(raw_value), 1, 20)
  hash blob not null, -- md5(raw_value) -- consider: maybe we'll truncate this?
  count integer not null,
  pks text, -- JSON array of pkeys that have this value
  primary key ("table", "column", value, hash)
);
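
For illustration, recording one observed value could be an upsert along these lines; the bound parameters (:table_id, :column_id, :raw_value, :md5_of_raw_value, :pk) are placeholders, and the hash would be computed in Python since SQLite has no built-in md5:

INSERT INTO dux_column_stats_values ("table", "column", value, hash, count, pks)
VALUES (
  :table_id,
  :column_id,
  substr(lower(:raw_value), 1, 20),
  :md5_of_raw_value,
  1,
  json_array(:pk)
)
ON CONFLICT ("table", "column", value, hash) DO UPDATE SET
  count = count + 1,
  -- '$[#]' appends to a JSON array (SQLite 3.31+)
  pks = json_insert(pks, '$[#]', :pk);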

Table shape

What gets indexed

Try to index only string values that aren't really integers or date-like strings:

where typeof(value) == 'text' and not (value >= '1800-01-01' and value <= '9999-12-31') and cast(cast(value as integer) as text) != value
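
For concreteness, here's how that filter plays out on a few sample values. Note that the date check is a lexicographic range, not a real date parse, so it's a heuristic:

-- Illustrative only: which sample values survive the indexing filter.
-- 'alice'      -> kept    (plain text; sorts above '9999-12-31')
-- '2023-05-01' -> dropped (falls inside the date window)
-- '42'         -> dropped (lands in the date window, and the integer round-trip matches anyway)
SELECT value
FROM (
  SELECT 'alice' AS value
  UNION ALL SELECT '2023-05-01'
  UNION ALL SELECT '42'
)
WHERE typeof(value) = 'text'
  AND NOT (value >= '1800-01-01' AND value <= '9999-12-31')
  AND CAST(CAST(value AS integer) AS text) != value;
-- returns just 'alice'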

How to use

The autosuggest function will be more complicated. It'll do a first query to get candidate rows. Then it will have to spelunk in the base table to get one of the rows from pks so that it can expand the actual value.

If the value no longer matches (e.g. the row was updated and the index has drifted), we can prune the entry and queue the row for reindexing.
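
Roughly, those two steps might look like the following sketch; the table/column names ('my_table', 'my_column', id) and the :q / :pks parameters are placeholders, not part of the plugin:

-- Step 1: candidate values for the user's (lowercased) prefix, most popular first.
SELECT v.value, v.count, v.pks
FROM dux_column_stats_values v
JOIN dux_ids t ON t.id = v."table" AND t.name = 'my_table'
JOIN dux_ids c ON c.id = v."column" AND c.name = 'my_column'
WHERE v.value LIKE :q || '%'
ORDER BY v.count DESC
LIMIT 10;

-- Step 2: the stored value is truncated to 20 characters, so expand it by
-- reading one of the rows listed in pks back out of the base table.
SELECT my_column
FROM my_table
WHERE id = (SELECT value FROM json_each(:pks) LIMIT 1);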

How does it get built in the first place?

We'll have a table that we use to do full-table reindexes. This will also be how the initial index happens.

CREATE TABLE dux_column_stats_ops(
  "table" integer primary key REFERENCES dux_ids(id),
  last_key text not null default '[]',
  pending integer not null check (pending in (0, 1)),
  rows integer not null default 0
);

On startup, we'll ensure that there is a row in this table for every table.

We'll have a background thread that looks for rows with pending = 1. It'll enumerate the rows using keyset pagination, N rows at a time. It'll ensure there are default rows in dux_column_stats for each column before starting.
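
One batch of that loop might look like this sketch, assuming a single-column integer pkey, a batch size of 100, and placeholder parameter names:

-- Grab the next batch after the stored cursor (:last_id is NULL on the first pass).
SELECT *
FROM my_table
WHERE :last_id IS NULL OR id > :last_id
ORDER BY id
LIMIT 100;

-- After indexing the batch, advance the cursor; a short batch means we've reached the end.
UPDATE dux_column_stats_ops
SET last_key = json_array(:max_id_in_batch),
    rows = rows + :rows_in_batch,
    pending = CASE WHEN :rows_in_batch < 100 THEN 0 ELSE 1 END
WHERE "table" = :table_id;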

How does it get maintained?

This is still up in the air.

Option 1: update/insert/delete AFTER triggers on every table. I feel like this is the most robust, so I want to explore it.

Option 2: duct-tape together a bunch of things that mostly work, e.g. whenever you load a row page, we index that row.

cldellow commented 1 year ago

How does asyncio work? Should every database get its own task? Example of adding a task: https://github.com/simonw/datasette-scale-to-zero/blob/255296205750d5addc32b93fb8ed898d77f14ddc/datasette_scale_to_zero/__init__.py#L38-L69

I think my life is probably easier if every database gets its own task... but then it might be hard to throttle how much resources we're using, eg if you start up a datasette with 100 databases, it kind of sucks to have 100 coroutines duking it out.

Also, are asyncio tasks the right unit of computation? They aren't actually threads; they're designed for efficient non-blocking I/O. If we're doing any computation in the Python process, we probably want at least a thread, if not a process.

cldellow commented 1 year ago

I did a cursory read on how asyncio and threading work. Asyncio isn't a great fit if we're doing any computation in Python.

The current code does its computations in Python.

Soooo... is that OK? Options:

  1. Just live with it. Keep our batch sizes small, and sleep for X ms on each iteration so that someone else has a chance to run.
  2. Use multiprocessing. This makes the cross-platform story complicated.
  3. Try to push all updates into SQL, so that we aren't holding the GIL. My gut is that this is possible to do on a row-by-row basis, but maybe quite hard in batch mode.

We'll want (3), at least on a per-row basis, if we're going to do triggers... Although perhaps it'd make sense to treat them separately.

Hm, maybe a middle ground: do things as batches, in python. To handle incremental updates, have triggers that queue the old/new row values for processing by python. That keeps the triggers dead simple, which is attractive.
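
A minimal sketch of that queue, with invented names; the triggers only ever INSERT into it, and a Python worker drains it:

CREATE TABLE dux_pending_rows(
  id integer primary key,
  "table" text not null,
  old_row text, -- JSON object of the row before the change; NULL for inserts
  new_row text, -- JSON object of the row after the change; NULL for deletes
  queued_at text not null default (datetime('now'))
);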

cldellow commented 1 year ago

...ideally we'd be able to write a trigger that is table-agnostic. For example, in postgres, you can do:

select to_json(connection) from connection limit 1

to convert a row to a JSON object.

I don't think SQLite supports that -- you'd have to generate a trigger that expands things, like:

select json_object('id', old.id), ...

That's annoying, as it means we have to rewrite the trigger when the schema changes, but not the end of the world.
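
For example, a generated AFTER UPDATE trigger for a hypothetical table with columns (id, name), feeding the queue sketched above, might look like this; it would need to be regenerated whenever the table's schema changes:

CREATE TRIGGER my_table_dux_queue_update
AFTER UPDATE ON my_table
BEGIN
  INSERT INTO dux_pending_rows ("table", old_row, new_row)
  VALUES (
    'my_table',
    json_object('id', old.id, 'name', old.name),
    json_object('id', new.id, 'name', new.name)
  );
END;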

cldellow commented 1 year ago

Perhaps a question: do we need to capture the old value? I was thinking we wanted it so we could keep counts roughly updated, and remove things when their count reached 0.

If we don't, then we could just store the pkey, which is much more likely to be stable.

Perhaps that's not essential... though I think it'd feel obviously wrong if we didn't.

cldellow commented 1 year ago

It doesn't seem like we get a big perf boost by computing stats on multiple columns at once. So perhaps dux_column_stats_ops should track progress per column, and we'll round-robin across table/column pairs, collecting stats on a batch of rows at a time, so that we quickly get a little bit of coverage on every column.
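
If ops move to per-column granularity, dux_column_stats_ops could simply gain a column key. A sketch:

CREATE TABLE dux_column_stats_ops(
  "table" integer not null REFERENCES dux_ids(id),
  "column" integer not null REFERENCES dux_ids(id),
  last_key text not null default '[]',
  pending integer not null check (pending in (0, 1)),
  rows integer not null default 0,
  primary key ("table", "column")
);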

cldellow commented 1 year ago

Hm, I did the simplest possible thing with asyncio and it seemed to hang forever doing await asyncio.sleep(1)... Let's try a thread.

cldellow commented 1 year ago

D'oh, threads aren't enough, we need asyncio since we want to sit on top of datasette's Database class functions.

Oddly, I can use await in my task...I just can't sleep. Maybe sleep relinquishes control to some other task that never yields back?

cldellow commented 1 year ago

Hm, I'm missing something here. It works if I do it in actor_from_request, and Simon's plugin uses asgi_wrapper. So let's just do that for now.