An enrichment starts with a user kicking one off. They will generally select a set of records (often but not always everything in a table), pick one or more columns (eg `address` to kick off geocoding, or `image_url` to start OCR, or both `lat` and `lon` to start reverse geocoding) and submit that as a new enrichment job (or task - need to pick terminology).
The database schema needs to track:
The results of the enrichment must also be recorded - but this is likely out of scope for the enrichments schema itself as different enrichment types will write to different places.
Potential tables:

- `enrichment_type` - ID, name - eg `RekognitionOCRImage`
- `enrichment_run` - ID, table, enrichment_type_id, created_at, created_by, settings (JSON)
- `enrichment_task` - enrichment_run_id, row_pk, status, completed_at, reserved_until
When a user starts a new enrichment run a record is created in `enrichment_run` and then one record is created in `enrichment_task` for every row that they have selected.
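Spelled out as a SQLite sketch, that three-table design might look like this - the names come from the list above, but the types and constraints are my assumptions:

```sql
-- Sketch only: table and column names are from the notes above,
-- types and constraints are assumed.
create table enrichment_type (
    id integer primary key,
    name text -- eg RekognitionOCRImage
);

create table enrichment_run (
    id integer primary key,
    "table" text, -- quoted because table is a reserved word
    enrichment_type_id integer references enrichment_type(id),
    created_at text, -- ISO8601
    created_by text,
    settings text -- JSON
);

create table enrichment_task (
    enrichment_run_id integer references enrichment_run(id),
    row_pk text,
    status text,
    completed_at text,
    reserved_until text -- lets a worker hold a task without blocking others forever
);
```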
Open question: should there be a mechanism by which plugins can optionally cache lookups - such that plugins which are doing things like "resolve this ID against this external source" don't end up processing the same ID more than once?
Those plugins could implement their own cache, but maybe there's value in having an optional centralized cache - an `enrichment_cache` table with a JSON column for example - so that plugins that run multiple copies of themselves (for parallel execution) have an easy way to co-ordinate their cached values.
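A minimal sketch of what that shared cache table could look like - the comment above only specifies "a JSON column", so everything else here is an assumption:

```sql
-- Hypothetical sketch of a centralized lookup cache.
create table enrichment_cache (
    enrichment text, -- slug of the enrichment that owns the entry
    cache_key text,  -- eg the external ID being resolved
    value text,      -- JSON with the cached lookup result
    created_at text, -- ISO8601
    primary key (enrichment, cache_key)
);
```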
I'm going to hold off building this until I work on a plugin that needs it.
An idea that could simplify things a bit: maybe all enrichments write to a shadow table with a naming convention - so if you are enriching `blog_entry` the results are written to `_enrichments_blog_entry` - and the two are related by their `rowid`.

That way enrichments can be joined against the primary table, and everything has a known place to put the data.
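Under that convention a join might look something like this - assuming the shadow table reuses the parent table's `rowid` values:

```sql
-- Hypothetical join between a table and its shadow enrichments table.
select blog_entry.*, enriched.*
from blog_entry
join _enrichments_blog_entry as enriched
    on enriched.rowid = blog_entry.rowid;
```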
The catch is that for some basic enrichments such as geocoding we won't want to do this - we want those `latitude` and `longitude` columns so we can render the table on a map, without first having to solve the joins-on-the-table-page problem.
I do like the idea of using shadow tables instead of those `enrichment_run` and `enrichment_task` tables which have to store `table` in addition to the `rowid` though.
So maybe:

- `blog_entry` - the table
- `_enrichment_run_blog_entry` - tracking the runs against that table
- `_enrichment_task_blog_entry` - tracking which tasks have been completed / are pending

I think I can simplify this further: for the first version of this I can get away with just a single table, `_eruns_blog_entry` - which is the mechanism for tracking runs.
Each run will use a recorded pagination token to track how far through the designated table the enrichment has progressed - so no need to track individual tasks.
So the table will look like this:
`_eruns_blog_entry`

- `id` - primary key
- `enrichment` - a text identifier showing which enrichment is running
- `configuration` - JSON with the configuration for that enrichment
- `created_at` - timestamp when it started
- `filters` - the `?...` URL filters against that table that were specified when it started
- `start_count` - a `count(*)` number run at the start of the operation
- `done_count` - count of how many rows have been processed
- `next` - the pagination text token that will be used for the next page of rows to process
- `completed_at` - a timestamp showing when it was completed
- `actor_id` - the ID of the actor that triggered the run, if applicable

Note that it will be possible for an enrichment run to end up with a `done_count` that is higher than the `start_count` - if the table being enriched had more rows added to it while the enrichment was running. I think that's OK.
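Translated into a SQLite sketch (the column types are assumptions based on the descriptions above):

```sql
-- Sketch of the single per-table run-tracking table described above.
create table _eruns_blog_entry (
    id integer primary key,
    enrichment text,     -- text identifier for the enrichment being run
    configuration text,  -- JSON configuration for that enrichment
    created_at text,     -- timestamp when it started
    filters text,        -- ?... URL filters specified when it started
    start_count integer, -- count(*) taken at the start of the operation
    done_count integer,  -- rows processed so far
    "next" text,         -- pagination token for the next page of rows
    completed_at text,   -- timestamp when it was completed
    actor_id text        -- actor that triggered the run, if applicable
);
```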
I tried a few different names for that table:

- `_enrichments_blog_entry` - I might want to use this later to store the results for each enriched row
- `_enrichments_runs_blog_entry` - would clash with a table called `runs_blog_entry` if it had a `_enrichments_X` pattern later on
- `_enrichmentruns_blog_entry` - a bit ugly and verbose

So I went with `_eruns_X` - not 100% happy with that yet either but it will do for the moment.
An OpenAI embeddings enrichment might end up with data that looks like this:
| id | enrichment | configuration | created_at | filters | start_count | done_count | next | completed_at | actor_id |
|----|------------|---------------|------------|---------|-------------|------------|------|--------------|----------|
| 1 | OpenAIEmbeddings | {"column":"embedding"} | 2021-01-01T00:00:00Z | null | 100 | 50 | "abcdefg" | null | 123 |
I've changed my mind about having a shadow table for every table that might be enriched.
That idea made sense when I thought the table would hold the results of the enrichments themselves and be frequently joined with the parent table. I don't think it makes sense for bookkeeping though.
So I'll add a `table` column and rename the table to `enrichments_runs`.
I think clients can reserve batches of enrichments, which are marked by their start and end cursors - taking advantage of Datasette's implementation of cursor-based pagination.
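One hypothetical way to record those reservations - neither this table nor its columns appear in the notes above, it's purely an illustration of the cursor-range idea:

```sql
-- Entirely hypothetical sketch: a reserved batch is identified by the
-- cursor range it covers, using Datasette's keyset pagination cursors.
create table _enrichment_batches (
    run_id integer,     -- the run this batch belongs to
    start_cursor text,  -- cursor the batch starts from (null = start of table)
    end_cursor text,    -- cursor to resume from once this batch is done
    reserved_until text -- ISO8601; an expired reservation can be handed out again
);
```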
First draft of schema:
Still needs `done_count` and `actor_id` columns.

Also `error_count` - error recording in general feels like a good idea. May warrant a whole separate table for logging errors against the row that produced them.
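That separate error table might look something like this - the name and columns are assumptions:

```sql
-- Hypothetical sketch of per-row error logging.
create table _enrichment_errors (
    job_id integer,  -- the job that produced the error
    row_pks text,    -- JSON array of primary keys for the affected rows
    error text,      -- the error message
    created_at text  -- ISO8601
);
```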
Improved schema design with comments:
```sql
create table if not exists _enrichment_jobs (
    id integer primary key,
    status text, -- [p]ending, [r]unning, [c]ancelled, [f]inished
    enrichment text, -- slug of enrichment
    database_name text,
    table_name text,
    filter_querystring text, -- querystring used to filter rows
    config text, -- JSON dictionary of config
    started_at text, -- ISO8601 when added
    finished_at text, -- ISO8601 when completed or cancelled
    cancel_reason text, -- null or reason for cancellation
    next_cursor text, -- next cursor to fetch
    row_count integer, -- number of rows to enrich at start
    error_count integer, -- number of rows with errors encountered
    done_count integer, -- number of rows processed
    actor_id text, -- optional ID of actor who created the job
    cost_100ths_cent integer -- cost of job so far in 1/100ths of a cent
)
```
Accounting for cost is hard. Even storing `cost_100ths_cent` may not be right, because e.g. for an embedding you might spend just 15 tokens on a short sentence.

15 tokens at $0.0001 / 1K tokens is just 0.015 100ths of a cent. Do we round up to 1?
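Worked through as a SQLite expression:

```sql
-- The arithmetic above: dollars * 100 (cents) * 100 (100ths of a cent).
select 15 * 0.0001 / 1000 * 100 * 100;
-- = 0.015 100ths of a cent, which would have to round up to 1
```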
I still need the database schema to cover:
I'm going to add two more columns:

- `state` - a JSON column that runs can optionally use to keep notes while they are running.
- `resume_at` - an ISO8601 timestamp when the run should be resumed. This is designed for use with plugins that manage their own rate limits and can occasionally pause themselves to resume later. I imagine this will be read on server startup to see if there are any runs that will need resuming - most of the time the resumption will be triggered by an `asyncio.sleep()`.

The `resume_at` is particularly relevant to APIs like https://opencagedata.com/api which return rate limit information that includes the time your rate limit will reset.
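A sketch of those columns plus the startup check - the `alter table` statements and the query are assumptions layered on the `_enrichment_jobs` schema above:

```sql
-- Sketch only: column and query details are assumed.
alter table _enrichment_jobs add column state text;     -- JSON scratchpad for a run
alter table _enrichment_jobs add column resume_at text; -- ISO8601 resume time

-- On server startup, find runs that are due to be resumed:
select id, enrichment, next_cursor
from _enrichment_jobs
where status in ('p', 'r') -- assumption: a paused run keeps one of these statuses
  and resume_at is not null
  and resume_at <= strftime('%Y-%m-%dT%H:%M:%SZ', 'now');
```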
Originally posted by @simonw in https://github.com/simonw/datasette-enrichments/issues/1#issuecomment-1034384356