How does asyncio work? Should every database get its own task? Example of adding a task: https://github.com/simonw/datasette-scale-to-zero/blob/255296205750d5addc32b93fb8ed898d77f14ddc/datasette_scale_to_zero/__init__.py#L38-L69
I think my life is probably easier if every database gets its own task... but then it might be hard to throttle how much resources we're using, eg if you start up a datasette with 100 databases, it kind of sucks to have 100 coroutines duking it out.
Also, are asyncio tasks the right unit of computation? I think they may not actually be threads, they seem designed for efficient non-blocking I/O things. If we're doing any computation in the python process, we probably want at least a thread, if not a process.
I did a cursory read on how asyncio and threading work. asyncio isn't a great fit if we're doing any computations in python.
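For what it's worth, if we do stay on asyncio, the python computation can at least be kept off the event loop. A minimal sketch, assuming a hypothetical `compute_stats_batch` function — this keeps the server responsive, but the GIL means it doesn't buy real parallelism for pure-python work:

```python
import asyncio

def compute_stats_batch(rows):
    # Hypothetical CPU-bound work: summarize a batch of rows in pure python.
    return {"rows_seen": len(rows)}

async def index_batch(rows):
    # asyncio.to_thread (3.9+) runs the blocking function in a worker thread so
    # the event loop stays responsive while it runs.
    return await asyncio.to_thread(compute_stats_batch, rows)

if __name__ == "__main__":
    print(asyncio.run(index_batch([{"id": 1}, {"id": 2}])))
```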
The current code does computations in python.
Soooo... is that OK? Options:
We'll want (3), at least on a per-row basis, if we're going to do triggers... Although perhaps it'd make sense to treat them separately.
Hm, maybe a middle ground: do things as batches, in python. To handle incremental updates, have triggers that queue the old/new row values for processing by python. That keeps the triggers dead simple, which is attractive.
...ideally we'd be able to write a trigger that is table-agnostic. For example, in postgres, you can do:
```sql
select to_json(connection) from connection limit 1
```
to convert a row to a JSON object.
I don't think SQLite supports that -- you'd have to generate a trigger that expands things, like:
```sql
select json_object('id', old.id), ...
```
That's annoying, as it means we have to rewrite the trigger when the schema changes, but not the end of the world.
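A rough sketch of generating such a trigger from the live schema — the `dux_pending_rows` queue table and trigger name are made up, and this assumes SQLite's `json_object()` is available:

```python
import sqlite3

def create_update_trigger(conn: sqlite3.Connection, table: str) -> None:
    # Build a json_object('col1', old.col1, ...) expression from the current schema,
    # which is why the trigger has to be regenerated whenever the schema changes.
    cols = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    old_json = "json_object(" + ", ".join(f"'{c}', old.{c}" for c in cols) + ")"
    conn.executescript(
        f"""
        CREATE TABLE IF NOT EXISTS dux_pending_rows (tbl TEXT, old_row TEXT);
        DROP TRIGGER IF EXISTS dux_{table}_update;
        CREATE TRIGGER dux_{table}_update AFTER UPDATE ON {table}
        BEGIN
          -- keep the trigger dead simple: just queue the old row for python to process
          INSERT INTO dux_pending_rows (tbl, old_row) VALUES ('{table}', {old_json});
        END;
        """
    )
```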
Perhaps a question: do we need to capture the old value? I was thinking we wanted it so we could keep counts roughly updated, and remove things when their count reached 0.
If we don't, then we could just store their pkey, which is much more likely to be stable.
Perhaps that's not essential...though, I think it'd feel really obviously wrong if we didn't.
It doesn't seem like we get a big perf boost by computing stats on multiple columns at once. So perhaps `dux_column_stats_ops` should track per-column, and we'll do round-robin computation to collect stats on a batch of rows per table/column at a time, so that we quickly get coverage on a little bit of every column.
Hm, I did the simplest possible thing with asyncio and it seemed to hang forever doing `await asyncio.sleep(1)`... Let's try a thread.

D'oh, threads aren't enough; we need asyncio since we want to sit on top of datasette's `Database` class functions.
Oddly, I can use `await` in my task... I just can't sleep. Maybe `sleep` relinquishes control to some other task that never yields back?
Hm, I'm missing something here. It works if I do it in `request_from_actor`, and Simon's plugin uses `asgi_wrapper`. So let's just do that for now.
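A minimal sketch of the `asgi_wrapper` approach, loosely modeled on how the scale-to-zero plugin wires things up — the `index_loop` body here is just a placeholder:

```python
import asyncio
from datasette import hookimpl

async def index_loop(datasette):
    while True:
        # placeholder: do one small batch of indexing work per iteration,
        # going through datasette's Database methods
        await asyncio.sleep(1)

@hookimpl
def asgi_wrapper(datasette):
    def wrap(app):
        task = None

        async def wrapped(scope, receive, send):
            nonlocal task
            # Start the background task lazily, once we're inside a running event loop.
            if task is None:
                task = asyncio.create_task(index_loop(datasette))
            await app(scope, receive, send)

        return wrapped

    return wrap
```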
`ensure_schema`:

- ensures we have sane `dux_ids`, `dux_column_stats`, `dux_column_stats_values`, `dux_column_stats_cursor` tables
- `crypto.so` (or whatever) is loaded
- `dux_column_stats_cursor` is populated

I've built a prototype. It's not great! I think it can be improved a lot.
Current system
The current system does batch indexing, which happens automatically on startup. The batch size is capped to avoid spending too much time on startup. This can result in incomplete data, which is misleading.
It stores data in `dux_stats_columns_values` like:

It has data for columns of text and numeric type. It tracks the pkeys of all rows that have the value, so `pks` can be quite large.

If the underlying value is a JSON array of strings, it tracks 1 row for each string. It doesn't track pks for these rows.
This table is quite inefficient.
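For illustration, a guess at the rough shape of that table — the column names here are invented, not the actual schema:

```python
import sqlite3

# Illustrative only -- a guess at the shape described above.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dux_stats_columns_values (
      "table" TEXT,
      "column" TEXT,
      value TEXT,   -- one row per text/numeric value, or per string in a JSON array
      pks TEXT      -- JSON array of matching pkeys (can be quite large); empty for JSON-array rows
    );
    """
)
```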
User-facing cases I care about
I care about two use cases.
Autosuggesting a specific row
- Type `col`, see `Colin Dellow` as a suggestion, pick it, and be navigated to the row page.
- Type `can`, see `Canada` as a suggestion, pick it, and have the ID saved.

Autosuggesting popular values

- Type `ger`, see `Country = Germany` as a suggestion, pick it, and have an appropriate filter added to their search.
- Type `bak`, see `baking` as a suggestion and pick it.
- Type `tea`, see `Teacher` as a suggestion and pick it.

Attributes I care about
Proposed new system
Table shape
- `value >= 'foo' and value <= 'fooZ'` for fast prefix searches
- `substr(md5(raw_value), 1, 8)`. According to some random site, you have a 1 in a million chance of a collision with 6,000,000 64-bit hashes. Since I'm targeting tables with ~1M rows, this seems fine. The math, I think, is also even better, since we won't actually be looking up by hash, but by prefix + hash.

What gets indexed
Try to index only string values that don't contain integers or dates:
How to use
The autosuggest function will be more complicated. It'll do a first query to get candidate rows. Then it will have to spelunk in the base table to get one of the rows from `pks` so that it can expand the actual value.

If the value no longer matches (eg you updated the row and we've drifted), we can prune it + queue the row for reindexing.
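A sketch of that two-step lookup — the column layout of `dux_column_stats_values` is a guess, and the prune/reindex step is only a comment:

```python
import json
import sqlite3

def autosuggest(conn: sqlite3.Connection, table: str, column: str, prefix: str):
    # First query: candidate values via the fast prefix search on the stats table.
    candidates = conn.execute(
        """
        SELECT value, pks FROM dux_column_stats_values
        WHERE "table" = ? AND "column" = ? AND value >= ? AND value <= ?
        LIMIT 10
        """,
        (table, column, prefix, prefix + "Z"),
    ).fetchall()

    results = []
    for value, pks in candidates:
        # Second step: spelunk in the base table using one of the stored pks
        # to expand the actual value. (Assumes a rowid primary key here.)
        pk = json.loads(pks)[0]
        row = conn.execute(
            f'SELECT "{column}" FROM "{table}" WHERE rowid = ?', (pk,)
        ).fetchone()
        if row is None or not str(row[0]).startswith(prefix):
            # Value has drifted: prune this entry and queue the row for reindexing.
            continue
        results.append(row[0])
    return results
```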
How does it get built in the first place?
We'll have a table that we use to do full-table reindexes. This will also be how the initial index happens.
On startup, we'll ensure that there is a row in this table for every table.
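Something like this, maybe — the table and column names are invented for illustration:

```python
import sqlite3

def ensure_reindex_queue(conn: sqlite3.Connection) -> None:
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS dux_pending_reindexes (
          "table" TEXT PRIMARY KEY,
          pending INTEGER NOT NULL DEFAULT 1,  -- the background worker picks up pending = 1 rows
          last_pk TEXT                         -- cursor for keyset pagination
        );
        """
    )
    # On startup, make sure every user table has a row in the queue.
    tables = [
        r[0]
        for r in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' AND name NOT LIKE 'dux_%'"
        )
    ]
    conn.executemany(
        'INSERT OR IGNORE INTO dux_pending_reindexes ("table") VALUES (?)',
        [(t,) for t in tables],
    )
```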
We'll have a background thread that looks for rows with `pending = 1`. It'll enumerate the rows using keyset pagination, N rows at a time. It'll ensure there are default rows in `dux_column_stats` for each column before starting.

How does it get maintained?
This is still up in the air.
Option 1: update/insert/delete `AFTER` triggers on every table. I feel like this is the most robust, so I want to explore it.

Option 2: duct-tape a bunch of things together that mostly work. e.g. whenever you load a row page, we'll index it.