An enrichment starts with a user kicking one off. They will generally select a set of records (often but not always everything in a table), pick one or more columns (eg `address` to kick off geocoding, or `image_url` to start OCR, or both `lat` and `lon` to start reverse geocoding) and submit that as a new enrichment job (or task - need to pick terminology).
The database schema needs to track:
The results of the enrichment must also be recorded - but this is likely out of scope for the enrichments schema itself as different enrichment types will write to different places.
Potential tables:

- `enrichment_type` - ID, name - eg `RekognitionOCRImage`
- `enrichment_run` - ID, table, enrichment_type_id, created_at, created_by, settings (JSON)
- `enrichment_task` - enrichment_run_id, row_pk, status, completed_at, reserved_until
When a user starts a new enrichment run a record is created in `enrichment_run` and then one record is created in `enrichment_task` for every row that they have selected.
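Spelled out as a SQLite sketch, that three-table design might look like this - the names come from the list above, but the types and constraints are my assumptions:

```sql
-- Sketch only: table and column names are from the notes above,
-- types and constraints are assumed.
create table enrichment_type (
    id integer primary key,
    name text -- eg RekognitionOCRImage
);

create table enrichment_run (
    id integer primary key,
    "table" text, -- quoted because table is a reserved word
    enrichment_type_id integer references enrichment_type(id),
    created_at text, -- ISO8601
    created_by text,
    settings text -- JSON
);

create table enrichment_task (
    enrichment_run_id integer references enrichment_run(id),
    row_pk text,
    status text,
    completed_at text,
    reserved_until text -- lets a worker hold a task without blocking others forever
);
```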
Open question: should there be a mechanism by which plugins can optionally cache lookups - such that plugins which are doing things like "resolve this ID against this external source" don't end up processing the same ID more than once?
Those plugins could implement their own cache, but maybe there's value in having an optional centralized cache - an `enrichment_cache` table with a JSON column for example - so that plugins that run multiple copies of themselves (for parallel execution) have an easy way to co-ordinate their cached values.
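A minimal sketch of what that shared cache table could look like - the comment above only specifies "a JSON column", so everything else here is an assumption:

```sql
-- Hypothetical sketch of a centralized lookup cache.
create table enrichment_cache (
    enrichment text, -- slug of the enrichment that owns the entry
    cache_key text,  -- eg the external ID being resolved
    value text,      -- JSON with the cached lookup result
    created_at text, -- ISO8601
    primary key (enrichment, cache_key)
);
```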
I'm going to hold off building this until I work on a plugin that needs it.
An idea that could simplify things a bit: maybe all enrichments write to a shadow table with a naming convention - so if you are enriching `blog_entry` the results are written to `_enrichments_blog_entry` - and the two are related by their `rowid`.

That way enrichments can be joined against the primary table, and everything has a known place to put the data.
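Under that convention a join might look something like this - assuming the shadow table reuses the parent table's `rowid` values:

```sql
-- Hypothetical join between a table and its shadow enrichments table.
select blog_entry.*, enriched.*
from blog_entry
join _enrichments_blog_entry as enriched
    on enriched.rowid = blog_entry.rowid;
```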
The catch is that for some basic enrichments such as geocoding we won't want to do this - we want those `latitude` and `longitude` columns so we can render the table on a map, without first having to solve the joins-on-the-table-page problem.
I do like the idea of using shadow tables instead of those `enrichment_run` and `enrichment_task` tables which have to store `table` in addition to the `rowid` though.
So maybe:

- `blog_entry` - the table
- `_enrichment_run_blog_entry` - tracking the runs against that table
- `_enrichment_task_blog_entry` - tracking which tasks have been completed / are pending

I think I can simplify this further: for the first version of this I can get away with just a single table, `_eruns_blog_entry` - which is the mechanism for tracking runs.
Each run will use a recorded pagination token to track how far through the designated table the enrichment has progressed - so no need to track individual tasks.
So the table will look like this:
`_eruns_blog_entry`

- `id` - primary key
- `enrichment` - a text identifier showing which enrichment is running
- `configuration` - JSON with the configuration for that enrichment
- `created_at` - timestamp when it started
- `filters` - the `?...` URL filters against that table that were specified when it started
- `start_count` - a `count(*)` number run at the start of the operation
- `done_count` - count of how many rows have been processed
- `next` - the pagination text token that will be used for the next page of rows to process
- `completed_at` - a timestamp showing when it was completed
- `actor_id` - the ID of the actor that triggered the run, if applicable

Note that it will be possible for an enrichment run to end up with a `done_count` that is higher than the `start_count` - if the table being enriched had more rows added to it while the enrichment was running. I think that's OK.
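Translated into a SQLite sketch (the column types are assumptions based on the descriptions above):

```sql
-- Sketch of the single per-table run-tracking table described above.
create table _eruns_blog_entry (
    id integer primary key,
    enrichment text,     -- text identifier for the enrichment being run
    configuration text,  -- JSON configuration for that enrichment
    created_at text,     -- timestamp when it started
    filters text,        -- ?... URL filters specified when it started
    start_count integer, -- count(*) taken at the start of the operation
    done_count integer,  -- rows processed so far
    "next" text,         -- pagination token for the next page of rows
    completed_at text,   -- timestamp when it was completed
    actor_id text        -- actor that triggered the run, if applicable
);
```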
I tried a few different names for that table:

- `_enrichments_blog_entry` - I might want to use this later to store the results for each enriched row
- `_enrichments_runs_blog_entry` - would clash with a table called `runs_blog_entry` if it had a `_enrichments_X` pattern later on
- `_enrichmentruns_blog_entry` - a bit ugly and verbose

So I went with `_eruns_X` - not 100% happy with that yet either but it will do for the moment.
An OpenAI embeddings enrichment might end up with data that looks like this:
| id | enrichment | configuration | created_at | filters | start_count | done_count | next | completed_at | actor_id |
|----|------------|---------------|------------|---------|-------------|------------|------|--------------|----------|
| 1 | OpenAIEmbeddings | {"column":"embedding"} | 2021-01-01T00:00:00Z | null | 100 | 50 | "abcdefg" | null | 123 |
I've changed my mind about having a shadow table for every table that might be enriched.
That idea made sense when I thought the table would hold the results of the enrichments themselves and be frequently joined with the parent table. I don't think it makes sense for bookkeeping though.
So I'll add a `table` column and rename the table to `enrichments_runs`.
I think clients can reserve batches of enrichments, which are marked by their start and end cursors - taking advantage of Datasette's implementation of cursor-based pagination.
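One hypothetical way to record those reservations - neither this table nor its columns appear in the notes above, it's purely an illustration of the cursor-range idea:

```sql
-- Entirely hypothetical sketch: a reserved batch is identified by the
-- cursor range it covers, using Datasette's keyset pagination cursors.
create table _enrichment_batches (
    run_id integer,     -- the run this batch belongs to
    start_cursor text,  -- cursor the batch starts from (null = start of table)
    end_cursor text,    -- cursor to resume from once this batch is done
    reserved_until text -- ISO8601; an expired reservation can be handed out again
);
```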
First draft of schema:
Still needs `done_count` and `actor_id` columns.

Also `error_count` - error recording in general feels like a good idea. May warrant a whole separate table for logging errors against the row that produced them.
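That separate error table might look something like this - the name and columns are assumptions:

```sql
-- Hypothetical sketch of per-row error logging.
create table _enrichment_errors (
    job_id integer,  -- the job that produced the error
    row_pks text,    -- JSON array of primary keys for the affected rows
    error text,      -- the error message
    created_at text  -- ISO8601
);
```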
Improved schema design with comments:
```sql
create table if not exists _enrichment_jobs (
    id integer primary key,
    status text, -- [p]ending, [r]unning, [c]ancelled, [f]inished
    enrichment text, -- slug of enrichment
    database_name text,
    table_name text,
    filter_querystring text, -- querystring used to filter rows
    config text, -- JSON dictionary of config
    started_at text, -- ISO8601 when added
    finished_at text, -- ISO8601 when completed or cancelled
    cancel_reason text, -- null or reason for cancellation
    next_cursor text, -- next cursor to fetch
    row_count integer, -- number of rows to enrich at start
    error_count integer, -- number of rows with errors encountered
    done_count integer, -- number of rows processed
    actor_id text, -- optional ID of actor who created the job
    cost_100ths_cent integer -- cost of job so far in 1/100ths of a cent
)
```
Accounting for cost is hard. Even storing `cost_100ths_cent` may not be right, because e.g. for an embedding you might spend just 15 tokens on a short sentence.

15 tokens at $0.0001 / 1K tokens is just 0.015 100ths of a cent. Do we round up to 1?
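Worked through as a SQLite expression:

```sql
-- The arithmetic above: dollars * 100 (cents) * 100 (100ths of a cent).
select 15 * 0.0001 / 1000 * 100 * 100;
-- = 0.015 100ths of a cent, which would have to round up to 1
```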
I still need the database schema to cover:
I'm going to add two more columns:

- `state` - a JSON column that runs can optionally use to keep notes while they are running.
- `resume_at` - an ISO8601 timestamp when the run should be resumed. This is designed for use with plugins that manage their own rate limits and can occasionally pause themselves to resume later. I imagine this will be read on server startup to see if there are any runs that will need resuming - most of the time the resumption will be triggered by an `asyncio.sleep()`.

The `resume_at` is particularly relevant to APIs like https://opencagedata.com/api which return rate limit information that includes the time your rate limit will reset.
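A sketch of those columns plus the startup check - the `alter table` statements and the query are assumptions layered on the `_enrichment_jobs` schema above:

```sql
-- Sketch only: column and query details are assumed.
alter table _enrichment_jobs add column state text;     -- JSON scratchpad for a run
alter table _enrichment_jobs add column resume_at text; -- ISO8601 resume time

-- On server startup, find runs that are due to be resumed:
select id, enrichment, next_cursor
from _enrichment_jobs
where status in ('p', 'r') -- assumption: a paused run keeps one of these statuses
  and resume_at is not null
  and resume_at <= strftime('%Y-%m-%dT%H:%M:%SZ', 'now');
```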
Originally posted by @simonw in https://github.com/simonw/datasette-enrichments/issues/1#issuecomment-1034384356