datasette / datasette-enrichments

Tools for running enrichments against data stored in Datasette
https://enrichments.datasette.io
Apache License 2.0

Design the class structure and plugin hook #3

Closed simonw closed 10 months ago

simonw commented 2 years ago

Originally posted by @simonw in https://github.com/simonw/datasette-enrichments/issues/1#issuecomment-1034384356

simonw commented 2 years ago

Some enrichments are going to run entirely in-process - in which case the class itself will implement the code that gets run within Datasette to apply the enrichment.

Others are going to require an external partner via the API in #4.

So the class design should be able to handle both of these cases.

The external ones still need a class, because they need information about what the enrichment is called, how it should be described to the user and what settings (if any) the user can add to an enrichment run - things like the API key to use, and the input columns.
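A minimal sketch of how one base class could cover both cases. Every name here (`Enrichment`, `in_process`, `settings()`, `describe()`, `apply()`, and the two subclasses) is hypothetical and chosen for illustration; none of it is the plugin's actual API.

```python
class Enrichment:
    """Hypothetical base class holding what every enrichment must declare."""

    name = ""           # what the enrichment is called
    description = ""    # how it is described to the user
    in_process = True   # False when an external partner does the work

    def settings(self):
        """Settings the user can supply for a run, as {name: help text}."""
        return {}

    def describe(self):
        """Summary that a UI could render for any enrichment."""
        return {
            "name": self.name,
            "description": self.description,
            "in_process": self.in_process,
            "settings": self.settings(),
        }


class Uppercase(Enrichment):
    """In-process example: the class itself implements the work."""

    name = "Uppercase"
    description = "Convert the selected columns to upper case"

    def settings(self):
        return {"columns": "Input columns to convert"}

    def apply(self, row, columns):
        # Upper-case only the selected string columns, leave the rest alone
        return {
            key: value.upper() if key in columns and isinstance(value, str) else value
            for key, value in row.items()
        }


class ExternalEmbeddings(Enrichment):
    """External example: only metadata and settings live in the class."""

    name = "External embeddings"
    description = "Embeddings calculated by an external partner"
    in_process = False

    def settings(self):
        return {"api_key": "API key to use", "columns": "Input columns"}
```

The external class carries no code for the work itself, but it can still be listed and configured in the same way as the in-process one.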

simonw commented 1 year ago

This API design needs to take async into account, since enrichments that call external HTTP APIs might want to do so using httpx in async mode.

simonw commented 1 year ago

I'm going to call the plugin hook register_enrichments because it's likely to end up in Datasette core eventually and I won't want to rename it.

It will look like register_routes() and register_facet_classes().

I think this:

@hookspec
def register_enrichments(datasette):
    """A list of Enrichment subclasses"""

simonw commented 1 year ago

Might be simpler if I require ALL enrichment implementations to use async def functions for the actual work that they do.

Based on the table structure in:

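| id | enrichment | configuration | created_at | filters | start_count | done_count | next | completed_at | actor_id |
|----|------------|---------------|------------|---------|-------------|------------|------|--------------|----------|
| 1 | OpenAIEmbeddings | {"column":"embedding"} | 2021-01-01T00:00:00Z | null | 100 | 50 | "abcdefg" | null | 123 |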

This class will have a method that gets called with a batch of rows and Does Stuff to them, then returns information that helps update the done_count column.

simonw commented 1 year ago

I'm going to try to implement this using datasette.client against the existing paginated table API, passing through the filters and next token. I'll use ?_shape=objects (soon to be the default) but only consider the rows and next fields.
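A rough sketch of that pagination loop. `iterate_batches` and `get_json` are hypothetical names: `get_json` stands in for an async callable that fetches a path and returns the parsed JSON, which in Datasette could wrap `datasette.client.get(...)`. The fake pages below exist only so the sketch runs on its own.

```python
import asyncio
from urllib.parse import parse_qs, urlencode, urlparse


async def iterate_batches(get_json, path, filters=None):
    """Yield one list of rows per page, following the next token.

    filters is an optional list of (key, value) query string pairs
    that are passed through unchanged on every request.
    """
    next_token = None
    while True:
        params = [("_shape", "objects")] + list(filters or [])
        if next_token:
            params.append(("_next", next_token))
        data = await get_json(path + "?" + urlencode(params))
        # Only the rows and next fields are considered
        if data["rows"]:
            yield data["rows"]
        next_token = data.get("next")
        if not next_token:
            break


# Fake two-page API response, keyed by the _next token that requests it
PAGES = {
    None: {"rows": [{"id": 1}, {"id": 2}], "next": "2"},
    "2": {"rows": [{"id": 3}], "next": None},
}


async def fake_get_json(url):
    query = parse_qs(urlparse(url).query)
    return PAGES[query.get("_next", [None])[0]]


async def collect():
    return [batch async for batch in iterate_batches(fake_get_json, "/data/items.json")]
```

Running `asyncio.run(collect())` walks both fake pages and stops once `next` comes back empty.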

simonw commented 1 year ago

Core class method is enrich_batch(db, rows).

Should db be a writable connection? No, I think it's a regular database that the method calls write methods on.
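A sketch of what that could look like, assuming db exposes an async `execute_write(sql, params)` method, as Datasette's Database objects do. `FakeDatabase` is a stand-in so the example runs without Datasette, and the `items` table, its `title` column, and the `UppercaseEnrichment` class are made up for illustration.

```python
import asyncio
import sqlite3


class FakeDatabase:
    """Stand-in for a Datasette Database, offering only execute_write()."""

    def __init__(self):
        self.conn = sqlite3.connect(":memory:")

    async def execute_write(self, sql, params=None):
        # Commit each write as its own transaction
        with self.conn:
            return self.conn.execute(sql, params or [])


class UppercaseEnrichment:
    async def enrich_batch(self, db, rows):
        """Upper-case the title of each row and write it back through db."""
        for row in rows:
            await db.execute_write(
                "update items set title = ? where id = ?",
                [row["title"].upper(), row["id"]],
            )
        # Returned so the caller can update done_count
        return len(rows)


async def demo():
    db = FakeDatabase()
    await db.execute_write("create table items (id integer primary key, title text)")
    await db.execute_write("insert into items values (1, 'hello'), (2, 'world')")
    rows = [{"id": 1, "title": "hello"}, {"id": 2, "title": "world"}]
    done = await UppercaseEnrichment().enrich_batch(db, rows)
    titles = [r[0] for r in db.conn.execute("select title from items order by id")]
    return done, titles
```

The method never sees a raw connection; it only calls write methods on the database object it is handed.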

simonw commented 1 year ago

Where does the code live that adds the embedding column if it doesn't exist yet? Probably in some kind of initialization method that runs once at the start of the run.

Need to think about how errors will work. They need to be recorded somewhere; ideally the run should continue.
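One possible shape for this, as a sketch only: a hypothetical `initialize()` method called once before the first batch, and a run loop that records a failed batch and carries on. Collecting errors in a list is an assumption made for the example; where they should really be stored is still open.

```python
import asyncio


class FlakyEnrichment:
    """Hypothetical enrichment that fails on one particular row."""

    def __init__(self):
        self.initialized = 0

    async def initialize(self, db):
        # Runs once per run, e.g. to add the output column if it is missing
        self.initialized += 1

    async def enrich_batch(self, db, rows):
        if any(row["id"] == 3 for row in rows):
            raise ValueError("could not process row 3")
        return len(rows)


async def run(enrichment, db, batches):
    """Initialize once, then process every batch, recording any errors."""
    await enrichment.initialize(db)
    done_count = 0
    errors = []
    for batch in batches:
        try:
            done_count += await enrichment.enrich_batch(db, batch)
        except Exception as ex:
            # Record the failure and keep going with the next batch
            errors.append({"row_ids": [row["id"] for row in batch], "error": str(ex)})
    return done_count, errors


async def demo_run():
    enrichment = FlakyEnrichment()
    batches = [[{"id": 1}, {"id": 2}], [{"id": 3}], [{"id": 4}]]
    # db is unused by this fake enrichment, so None is passed as a placeholder
    done_count, errors = await run(enrichment, None, batches)
    return done_count, errors, enrichment.initialized
```

In this sketch a failure costs the whole batch, so smaller batches would lose less work per error.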

simonw commented 11 months ago

Here's the class structure for my first working OpenAI embeddings prototype:

https://github.com/datasette/datasette-enrichments/blob/06b423b39cc44fdbf70d678b65ec621b13f524a1/datasette_enrichments/__init__.py#L213-L324

simonw commented 11 months ago

Next step: wire up the plugin hook so it actually does something, and rewrite the Uppercase example to use the new WTForms mechanism.

simonw commented 10 months ago

To help test this, I'm going to build a datasette-enrichments/example-enrichments folder full of examples, which can be installed directly in test mode and dev mode.