cldellow / datasette-ui-extras

Add editing UI and other power-user features to Datasette.
Apache License 2.0

design autocomplete system #68

Closed cldellow closed 1 year ago

cldellow commented 1 year ago

I've built a prototype. It's not great! I think it can be improved a lot.

Current system

The current system does batch indexing, which happens automatically on startup. The batch size is capped so that we don't spend too much time at startup, but that means the index can end up incomplete, and incomplete data is misleading.

It stores data in dux_column_stats_values like:

CREATE TABLE dux_column_stats_values(
  "table" text not null,
  "column" text not null,
  value text,
  count integer not null,
  pks text, -- JSON array of pkeys that have this value
  primary key ("table", "column", value)
);

It has data for columns of text and numeric type. It tracks the pkeys of all rows that have the value, so pks can be quite large.

If the underlying value is a JSON array of strings, it tracks 1 row for each string. It doesn't track pks for these rows.
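
A couple of made-up rows make that shape concrete (illustrative data, not from a real database):

INSERT INTO dux_column_stats_values ("table", "column", value, count, pks) VALUES
  ('articles', 'author', 'alice', 3, '[1, 7, 42]'), -- scalar value: pks lists every row with this value
  ('articles', 'tags', 'sqlite', 2, null);          -- string pulled out of a JSON array: no pks tracked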

This table is quite inefficient.

User-facing cases I care about

I care about two use cases.

Autosuggesting a specific row

Autosuggesting popular values

Attributes I care about

Proposed new system

CREATE TABLE dux_ids(
  id integer primary key,
  name text not null
);

CREATE TABLE dux_column_stats_values(
  "table" integer not null REFERENCES dux_ids(id),
  "column" integer not null REFERENCES dux_ids(id),
  value text, -- substr(lower(raw_value), 1, 20)
  hash blob not null, -- md5(raw_value) -- consider: maybe we'll truncate this?
  count integer not null,
  pks text, -- JSON array of pkeys that have this value
  primary key ("table", "column", value, hash)
);
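
For illustration, recording one observed value could be an upsert along these lines; the bound parameters (:table_id, :column_id, :raw_value, :md5_of_raw_value, :pk) are placeholders, and the hash would be computed in Python since SQLite has no built-in md5:

INSERT INTO dux_column_stats_values ("table", "column", value, hash, count, pks)
VALUES (
  :table_id,
  :column_id,
  substr(lower(:raw_value), 1, 20),
  :md5_of_raw_value,
  1,
  json_array(:pk)
)
ON CONFLICT ("table", "column", value, hash) DO UPDATE SET
  count = count + 1,
  -- '$[#]' appends to a JSON array (SQLite 3.31+)
  pks = json_insert(pks, '$[#]', :pk);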

Table shape

What gets indexed

Try to index only string values that aren't really integers or date-like strings:

where typeof(value) == 'text' and not (value >= '1800-01-01' and value <= '9999-12-31') and cast(cast(value as integer) as text) != value
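
For concreteness, here's how that filter plays out on a few sample values. Note that the date check is a lexicographic range, not a real date parse, so it's a heuristic:

-- Illustrative only: which sample values survive the indexing filter.
-- 'alice'      -> kept    (plain text; sorts above '9999-12-31')
-- '2023-05-01' -> dropped (falls inside the date window)
-- '42'         -> dropped (lands in the date window, and the integer round-trip matches anyway)
SELECT value
FROM (
  SELECT 'alice' AS value
  UNION ALL SELECT '2023-05-01'
  UNION ALL SELECT '42'
)
WHERE typeof(value) = 'text'
  AND NOT (value >= '1800-01-01' AND value <= '9999-12-31')
  AND CAST(CAST(value AS integer) AS text) != value;
-- returns just 'alice'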

How to use

The autosuggest function will be more complicated. It'll do a first query to get candidate rows. Then it will have to spelunk in the base table to get one of the rows from pks so that it can expand the actual value.

If the value no longer matches (e.g. the row was updated and the index has drifted), we can prune the entry and queue the row for reindexing.
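
Roughly, those two steps might look like the following sketch; the table/column names ('my_table', 'my_column', id) and the :q / :pks parameters are placeholders, not part of the plugin:

-- Step 1: candidate values for the user's (lowercased) prefix, most popular first.
SELECT v.value, v.count, v.pks
FROM dux_column_stats_values v
JOIN dux_ids t ON t.id = v."table" AND t.name = 'my_table'
JOIN dux_ids c ON c.id = v."column" AND c.name = 'my_column'
WHERE v.value LIKE :q || '%'
ORDER BY v.count DESC
LIMIT 10;

-- Step 2: the stored value is truncated to 20 characters, so expand it by
-- reading one of the rows listed in pks back out of the base table.
SELECT my_column
FROM my_table
WHERE id = (SELECT value FROM json_each(:pks) LIMIT 1);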

How does it get built in the first place?

We'll have a table that we use to do full-table reindexes. This will also be how the initial index happens.

CREATE TABLE dux_column_stats_ops(
  "table" integer primary key REFERENCES dux_ids(id),
  last_key text not null default '[]',
  pending integer not null check (pending in (0, 1)),
  rows integer not null default 0
);

On startup, we'll ensure that there is a row in this table for every table.

We'll have a background thread that looks for rows with pending = 1. It'll enumerate the rows using keyset pagination, N rows at a time. It'll ensure there are default rows in dux_column_stats for each column before starting.
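
One batch of that loop might look like this sketch, assuming a single-column integer pkey, a batch size of 100, and placeholder parameter names:

-- Grab the next batch after the stored cursor (:last_id is NULL on the first pass).
SELECT *
FROM my_table
WHERE :last_id IS NULL OR id > :last_id
ORDER BY id
LIMIT 100;

-- After indexing the batch, advance the cursor; a short batch means we've reached the end.
UPDATE dux_column_stats_ops
SET last_key = json_array(:max_id_in_batch),
    rows = rows + :rows_in_batch,
    pending = CASE WHEN :rows_in_batch < 100 THEN 0 ELSE 1 END
WHERE "table" = :table_id;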

How does it get maintained?

This is still up in the air.

Option 1: update/insert/delete AFTER triggers on every table. I feel like this is the most robust, so I want to explore it.

Option 2: duct-tape together a bunch of things that mostly work, e.g. whenever you load a row page, we index that row.

cldellow commented 1 year ago

How does asyncio work? Should every database get its own task? Example of adding a task: https://github.com/simonw/datasette-scale-to-zero/blob/255296205750d5addc32b93fb8ed898d77f14ddc/datasette_scale_to_zero/__init__.py#L38-L69

I think my life is probably easier if every database gets its own task... but then it might be hard to throttle how much resources we're using, eg if you start up a datasette with 100 databases, it kind of sucks to have 100 coroutines duking it out.

Also, are asyncio tasks the right unit of computation? They aren't actually threads; they're designed for efficient non-blocking I/O. If we're doing any computation in the Python process, we probably want at least a thread, if not a process.

cldellow commented 1 year ago

I did a cursory read on how asyncio and threading work. Asyncio isn't a great fit if we're doing any computation in Python.

The current code does its computations in Python.

Soooo... is that OK? Options:

  1. Just live with it. Keep our batch sizes small, and sleep for X ms on each iteration so that someone else has a chance to run.
  2. Use multiprocessing. This makes the cross-platform story complicated.
  3. Try to push all updates into SQL, so that we aren't holding the GIL. My gut is that this is possible to do on a row-by-row basis, but maybe quite hard in batch mode.

We'll want (3), at least on a per-row basis, if we're going to do triggers... Although perhaps it'd make sense to treat them separately.

Hm, maybe a middle ground: do things as batches, in python. To handle incremental updates, have triggers that queue the old/new row values for processing by python. That keeps the triggers dead simple, which is attractive.
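
A minimal sketch of that queue, with invented names; the triggers only ever INSERT into it, and a Python worker drains it:

CREATE TABLE dux_pending_rows(
  id integer primary key,
  "table" text not null,
  old_row text, -- JSON object of the row before the change; NULL for inserts
  new_row text, -- JSON object of the row after the change; NULL for deletes
  queued_at text not null default (datetime('now'))
);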

cldellow commented 1 year ago

...ideally we'd be able to write a trigger that is table-agnostic. For example, in postgres, you can do:

select to_json(connection) from connection limit 1

to convert a row to a JSON object.

I don't think SQLite supports that -- you'd have to generate a trigger that expands things, like:

select json_object('id', old.id), ...

That's annoying, as it means we have to rewrite the trigger when the schema changes, but not the end of the world.
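
For example, a generated AFTER UPDATE trigger for a hypothetical table with columns (id, name), feeding the queue sketched above, might look like this; it would need to be regenerated whenever the table's schema changes:

CREATE TRIGGER my_table_dux_queue_update
AFTER UPDATE ON my_table
BEGIN
  INSERT INTO dux_pending_rows ("table", old_row, new_row)
  VALUES (
    'my_table',
    json_object('id', old.id, 'name', old.name),
    json_object('id', new.id, 'name', new.name)
  );
END;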

cldellow commented 1 year ago

Perhaps a question: do we need to capture the old value? I was thinking we wanted it so we could keep counts roughly updated, and remove things when their count reached 0.

If we don't, then we could just store the pkey, which is much more likely to be stable.

Perhaps that's not essential... though I think it'd feel obviously wrong if we didn't.

cldellow commented 1 year ago

It doesn't seem like we get a big perf boost by computing stats on multiple columns at once. So perhaps dux_column_stats_ops should track progress per column, and we'll round-robin across table/column pairs, collecting stats on a batch of rows at a time, so that we quickly get a little bit of coverage on every column.
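
If ops move to per-column granularity, dux_column_stats_ops could simply gain a column key. A sketch:

CREATE TABLE dux_column_stats_ops(
  "table" integer not null REFERENCES dux_ids(id),
  "column" integer not null REFERENCES dux_ids(id),
  last_key text not null default '[]',
  pending integer not null check (pending in (0, 1)),
  rows integer not null default 0,
  primary key ("table", "column")
);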

cldellow commented 1 year ago

Hm, I did the simplest possible thing with asyncio and it seemed to hang forever doing await asyncio.sleep(1)... Let's try a thread.

cldellow commented 1 year ago

D'oh, threads aren't enough, we need asyncio since we want to sit on top of datasette's Database class functions.

Oddly, I can use await in my task...I just can't sleep. Maybe sleep relinquishes control to some other task that never yields back?

cldellow commented 1 year ago

Hm, I'm missing something here. It works if I do it in actor_from_request, and Simon's plugin uses asgi_wrapper. So let's just do that for now.