Some enrichments are going to run entirely in-process - in which case the class itself will implement the code that gets run within Datasette to apply the enrichment.
Others are going to require an external partner via the API in #4.
So the class design should be able to handle both of these cases.
The external ones still need a class, because they need information about what the enrichment is called, how it should be described to the user and what settings (if any) the user can add to an enrichment run - things like the API key to use, and the input columns.
This API design needs to take async into account, since enrichments that call external HTTP APIs might want to do so using httpx in async mode.
I'm going to call the plugin hook register_enrichments because it's likely to end up in Datasette core eventually and I won't want to rename it.
It will look like register_routes() and register_facet_classes().
I think this:
@hookspec
def register_enrichments(datasette):
    """A list of Enrichment subclasses"""
It might be simpler if I require ALL enrichment implementations to use async def functions for the actual work that they do.
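Here's a minimal sketch of what an all-async class design could look like. The base class layout, the name and description attributes and the body of the Uppercase subclass are assumptions for illustration, not the plugin's actual API:

```python
import asyncio


class Enrichment:
    """Hypothetical base class: metadata plus one async method that does the work."""

    name = ""
    description = ""

    async def enrich_batch(self, rows):
        raise NotImplementedError


class Uppercase(Enrichment):
    """In-process enrichment: the class itself contains the code that runs."""

    name = "Uppercase"
    description = "Convert a text column to upper case"

    def __init__(self, column):
        self.column = column

    async def enrich_batch(self, rows):
        # Pure in-process work still uses async def, so callers can always await it
        for row in rows:
            row[self.column] = row[self.column].upper()
        # Number of rows processed, so a caller can track progress
        return len(rows)


async def demo():
    rows = [{"id": 1, "title": "hello"}, {"id": 2, "title": "world"}]
    done = await Uppercase(column="title").enrich_batch(rows)
    return done, rows


done, rows = asyncio.run(demo())
```

With every implementation written as async def, the calling code never has to check whether a method is a coroutine before awaiting it.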
Based on the table structure in:
id | enrichment | configuration | created_at | filters | start_count | done_count | next | completed_at | actor_id |
---|---|---|---|---|---|---|---|---|---|
1 | OpenAIEmbeddings | {"column":"embedding"} | 2021-01-01T00:00:00Z | null | 100 | 50 | "abcdefg" | null | 123 |
This class will have a method that gets called with a batch of rows and Does Stuff to them, then returns information that helps update the done_count column.
I'm going to try to implement this using datasette.client against the existing paginated table API, passing through the filters and next token. I'll use ?_shape=objects (soon to be the default) but only consider the rows and next fields.
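A rough sketch of that pagination loop, under some assumptions: iterate_batches and fetch_page are hypothetical names, and fake_fetch_page stands in for the real API call so the example runs on its own. In the plugin, fetch_page would presumably wrap a datasette.client request for the table's JSON and return the parsed response.

```python
import asyncio


async def iterate_batches(fetch_page, filters):
    """Yield batches of rows, following the next token until it runs out."""
    next_token = None
    while True:
        page = await fetch_page(filters, next_token)
        # Only the "rows" and "next" fields of the response are used
        yield page["rows"]
        next_token = page.get("next")
        if not next_token:
            break


# Hypothetical stand-in for the real API: three rows served in two pages
FAKE_PAGES = {
    None: {"rows": [{"id": 1}, {"id": 2}], "next": "abcdefg"},
    "abcdefg": {"rows": [{"id": 3}], "next": None},
}


async def fake_fetch_page(filters, next_token):
    return FAKE_PAGES[next_token]


async def demo():
    done_count = 0
    async for rows in iterate_batches(fake_fetch_page, filters={}):
        # This is where the enrichment would be applied to the batch
        done_count += len(rows)
    return done_count


done_count = asyncio.run(demo())
```

Storing the most recent next token alongside done_count after each batch would, in principle, let an interrupted run resume from where it stopped.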
Core class method is enrich_batch(db, rows).
Should db be a writable connection? No, I think it's a regular database that the method calls write methods on.
Where does the code live that adds the embedding column if it doesn't exist yet? Probably in some kind of initialization method that runs once at the start of the run.
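One way such an initialization step could work, sketched with the standard library sqlite3 module rather than Datasette's own database object. The ensure_column helper, the docs table and the BLOB column type are all assumptions for illustration:

```python
import sqlite3


def ensure_column(conn, table, column, column_type="TEXT"):
    """Add the column to the table if it is missing. Returns True if it was added."""
    # Second field of each PRAGMA table_info row is the column name
    existing = [row[1] for row in conn.execute(f"PRAGMA table_info([{table}])")]
    if column in existing:
        return False
    conn.execute(f"ALTER TABLE [{table}] ADD COLUMN [{column}] {column_type}")
    return True


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT)")

# Safe to call at the start of every run: the second call is a no-op
added_first = ensure_column(conn, "docs", "embedding", "BLOB")
added_second = ensure_column(conn, "docs", "embedding", "BLOB")
```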
Need to think about how errors will work. They need to be recorded somewhere, and ideally the run should continue after a failure.
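A possible shape for this, as a sketch only: catch the exception for each batch, record it and move on. The run_batches function, the structure of the error records and the flaky_enrich_batch stand-in are hypothetical, and where the errors would actually be stored is still an open question.

```python
import asyncio


async def run_batches(enrich_batch, batches):
    """Run every batch; record failures instead of stopping the run."""
    done_count = 0
    errors = []
    for index, rows in enumerate(batches):
        try:
            await enrich_batch(rows)
        except Exception as ex:
            # Record which batch failed and why, then carry on with the next one
            errors.append({"batch": index, "error": str(ex)})
            continue
        done_count += len(rows)
    return done_count, errors


async def flaky_enrich_batch(rows):
    # Hypothetical enrichment that fails on any row marked as bad
    if any(row.get("bad") for row in rows):
        raise ValueError("could not enrich row")


batches = [[{"id": 1}], [{"id": 2, "bad": True}], [{"id": 3}, {"id": 4}]]
done_count, errors = asyncio.run(run_batches(flaky_enrich_batch, batches))
```

In this version a failed batch is not counted in done_count, so the gap between start_count and done_count at the end of a run shows how many rows were missed.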
Here's the class structure for my first working OpenAI embeddings prototype:
Next step: wire up the plugin hook so it actually does something, and rewrite the Uppercase example to use the new WTForms mechanism.
To help test this, I'm going to build a datasette-enrichments/example-enrichments folder full of examples, which in test mode and dev mode can be directly installed.
Originally posted by @simonw in https://github.com/simonw/datasette-enrichments/issues/1#issuecomment-1034384356