datasette / datasette-enrichments

Tools for running enrichments against data stored in Datasette
https://enrichments.datasette.io
Apache License 2.0

Design database schema #2

Open simonw opened 2 years ago

simonw commented 2 years ago

Originally posted by @simonw in https://github.com/simonw/datasette-enrichments/issues/1#issuecomment-1034384356

simonw commented 2 years ago

An enrichment starts with a user kicking one off. They will generally select a set of records (often, but not always, everything in a table), pick one or more columns (e.g. address to kick off geocoding, or image_url to start OCR, or both lat and lon to start reverse geocoding) and submit that as a new enrichment job (or task - I still need to pick the terminology).

The database schema needs to track:

The results of the enrichment must also be recorded - but this is likely out of scope for the enrichments schema itself as different enrichment types will write to different places.

simonw commented 2 years ago

Potential tables:

When a user starts a new enrichment run, a record is created in enrichment_run, and then one record is created in enrichment_task for every row that they have selected.
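
A rough sketch of what those two tables might look like at this stage - the table names come from the comment above, but the columns are guesses rather than a settled design:

-- Sketch only: not the shipped schema, column names are assumptions
create table enrichment_run (
    id integer primary key,
    enrichment text, -- slug of the enrichment being run
    database_name text,
    table_name text,
    config text, -- JSON configuration submitted by the user
    created_at text -- ISO8601
);
create table enrichment_task (
    id integer primary key,
    run_id integer references enrichment_run(id),
    row_pk text, -- primary key of the row being enriched
    status text, -- e.g. pending / done / error
    error text -- error message, if any
);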

simonw commented 2 years ago

Open question: should there be a mechanism by which plugins can optionally cache lookups - such that plugins which are doing things like "resolve this ID against this external source" don't end up processing the same ID more than once?

Those plugins could implement their own cache, but maybe there's value in having an optional centralized cache - an enrichment_cache table with a JSON column, for example - so that plugins that run multiple copies of themselves (for parallel execution) have an easy way to co-ordinate their cached values.
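
A minimal sketch of what that optional centralized cache could look like - the enrichment_cache name is from the idea above, everything else is an assumption:

-- Sketch only: a shared lookup cache keyed by enrichment slug and lookup key
create table enrichment_cache (
    enrichment text, -- slug of the enrichment
    key text, -- e.g. the external ID being resolved
    value text, -- JSON blob with the cached lookup result
    created_at text, -- ISO8601
    primary key (enrichment, key)
);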

I'm going to hold off building this until I work on a plugin that needs it.

simonw commented 1 year ago

An idea that could simplify things a bit: maybe all enrichments write to a shadow table with a naming convention - so if you are enriching blog_entry the results are written to _enrichments_blog_entry - and the two are related by their rowid.

That way enrichments can be joined against the primary table, and everything has a known place to put the data.
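
As a sketch of that join (the embedding column here is just an example of an enriched value, not part of any real schema):

-- Sketch only: joining enriched values back onto the source table by rowid
select blog_entry.*, shadow.embedding
from blog_entry
join _enrichments_blog_entry as shadow
    on shadow.rowid = blog_entry.rowid;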

The catch is that for some basic enrichments such as geocoding we won't want to do this - we want actual latitude and longitude columns so we can render the table on a map, without first having to solve the joins-on-the-table-page problem.

simonw commented 1 year ago

I do like the idea of using shadow tables instead of those enrichment_run and enrichment_task tables though, since those have to store the table name in addition to the rowid.

So maybe:

simonw commented 1 year ago

I think I can simplify this further: for the first version of this I can get away with just a single table, _eruns_blog_entry - which is the mechanism for tracking runs.

Each run will use a recorded pagination token to track how far through the designated table the enrichment has progressed - so no need to track individual tasks.

So the table will look like this:

_eruns_blog_entry

Note that it will be possible for an enrichment run to end up with a done_count that is higher than the start_count - if the table being enriched had more rows added to it while the enrichment was running. I think that's OK.

simonw commented 1 year ago

I tried a few different names for that table:

So I went with _eruns_X - not 100% happy with that yet either but it will do for the moment.

simonw commented 1 year ago

An OpenAI embeddings enrichment might end up with data that looks like this:

| id | enrichment | configuration | created_at | filters | start_count | done_count | next | completed_at | actor_id |
|----|------------|---------------|------------|---------|-------------|------------|------|--------------|----------|
| 1 | OpenAIEmbeddings | {"column":"embedding"} | 2021-01-01T00:00:00Z | null | 100 | 50 | "abcdefg" | null | 123 |

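A worker finishing the next batch would then bump that row along these lines (a sketch only - the statement and the cursor value are illustrative, not real plugin code):

-- Sketch only: record progress and the new pagination token after a batch of 50 rows
update _eruns_blog_entry
set done_count = done_count + 50,
    next = 'hijklmn' -- placeholder for the next cursor returned by Datasette's pagination
where id = 1;
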
simonw commented 1 year ago

I've changed my mind about having a shadow table for every table that might be enriched.

That idea made sense when I thought the table would hold the results of the enrichments themselves and be frequently joined with the parent table. I don't think it makes sense for bookkeeping though.

So I'll add a table column and rename the table to enrichments_runs.

simonw commented 1 year ago

I think clients can reserve batches of enrichments, which are marked by their start and end cursors - taking advantage of Datasette's implementation of cursor-based pagination.
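
Purely as a sketch of that idea - no such table exists in the plugin, and all of these names are placeholders:

-- Sketch only: one way a client's batch reservation could be recorded
create table enrichment_batches (
    id integer primary key,
    run_id integer, -- the enrichment run this batch belongs to
    start_cursor text, -- Datasette pagination token the batch starts after (null = start of table)
    end_cursor text, -- token at which the following batch begins
    claimed_at text, -- ISO8601 when a client reserved the batch
    completed_at text -- ISO8601 when the client finished it
);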

simonw commented 11 months ago

First draft of schema:

https://github.com/datasette/datasette-enrichments/blob/21432c7de1972524677bd3d0d0d8a3a00c56793c/datasette_enrichments/__init__.py#L15-L26

Still needs done_count and actor_id columns.

simonw commented 11 months ago

Also error_count - error recording in general feels like a good idea. It may warrant a whole separate table for logging errors against the rows that produced them.
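
Such a table might look something like this (a sketch, assuming errors are keyed by job and by the primary key of the failing row):

-- Sketch only: one row per error, keyed to the job and the source row that failed
create table _enrichment_errors (
    id integer primary key,
    job_id integer, -- ID of the enrichment job
    row_pks text, -- JSON array of primary key values for the failed row
    error text, -- the error message
    created_at text -- ISO8601
);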

simonw commented 11 months ago

Improved schema design with comments:

create table if not exists _enrichment_jobs (
    id integer primary key,
    status text, -- [p]ending, [r]unning, [c]ancelled, [f]inished
    enrichment text, -- slug of enrichment
    database_name text,
    table_name text,
    filter_querystring text, -- querystring used to filter rows
    config text, -- JSON dictionary of config
    started_at text, -- ISO8601 when added
    finished_at text, -- ISO8601 when completed or cancelled
    cancel_reason text, -- null or reason for cancellation
    next_cursor text, -- next cursor to fetch
    row_count integer, -- number of rows to enrich at start
    error_count integer, -- number of rows with errors encountered
    done_count integer, -- number of rows processed
    actor_id text, -- optional ID of actor who created the job
    cost_100ths_cent integer -- cost of job so far in 1/100ths of a cent
)

simonw commented 11 months ago

Accounting for cost is hard. Even storing cost_100ths_cent may not be right, because e.g. for an embedding you might spend just 15 tokens on a short sentence.

15 tokens at $0.0001 / 1K tokens is just 0.015 100ths of a cent. Do we round up to 1?
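
To make that concrete (a sketch in SQLite, assuming we round any fractional cost up to a whole 1/100th of a cent):

-- 15 tokens at $0.0001 per 1K tokens, expressed in 1/100ths of a cent
select (15 / 1000.0) -- thousands of tokens
    * 0.0001 -- dollars
    * 100 -- cents
    * 100 as cost_100ths_cent; -- 1/100ths of a cent: 0.015

-- one way to round a positive fractional cost up to a whole number in SQLite
select cast(0.015 as integer) + (0.015 > cast(0.015 as integer)) as rounded_up; -- 1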

simonw commented 10 months ago

I still need the database schema to cover:

simonw commented 10 months ago

I'm going to add two more columns:

simonw commented 10 months ago

The resume_at column is particularly relevant to APIs like https://opencagedata.com/api, which return rate limit information that includes the time your rate limit will reset.
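
For example (a sketch - resume_at is the new column, the other columns are from the draft schema above, and the literal values are made up):

-- Sketch only: park the job until the rate limit reset time reported by the API
update _enrichment_jobs
set resume_at = '2023-12-01T00:05:00Z'
where id = 1;

-- and only pick up jobs that are not waiting on a rate limit reset
select * from _enrichment_jobs
where status in ('p', 'r')
    and (resume_at is null or resume_at <= strftime('%Y-%m-%dT%H:%M:%SZ', 'now'));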