simonw closed this 1 year ago
It would be interesting if this mechanism could handle human-powered enrichments too - after all, saying "run OCR against everything in this column and write the discovered text back to this other column" isn't really any different from saying "ask a human being to type in the text from this image". They can work from the same APIs!
The main things that need to be designed then are:
I'm inclined to say that enrichments that want to work in parallel should implement that themselves - so a job can only be worked on by a single worker, but that worker is welcome to grab a batch of 100 items at once and execute a massively parallel architecture of some sort to crunch through that batch as fast as possible.
Or grab 10x100 batches and process 1000 in parallel.
That way I can outsource managing that parallelism and keep the core mechanism in Datasette as simple as possible.
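The batching idea above could be sketched like this - a single worker claims a batch and fans the work out itself with `asyncio`, so the parallelism lives in the enrichment rather than in Datasette's core. All names here (`process_item`, `run_batch`) are illustrative, not a real API:

```python
import asyncio

# Hypothetical sketch: one worker owns the whole batch, but crunches
# through it concurrently. Names are illustrative, not a real API.

async def process_item(item):
    # Stand-in for the real enrichment work (OCR, an embedding call, etc.)
    await asyncio.sleep(0)
    return {"id": item["id"], "result": item["text"].upper()}

async def run_batch(items, concurrency=100):
    # Cap in-flight work so a 1,000-item grab doesn't open 1,000 connections
    semaphore = asyncio.Semaphore(concurrency)

    async def bounded(item):
        async with semaphore:
            return await process_item(item)

    return await asyncio.gather(*(bounded(item) for item in items))

items = [{"id": i, "text": f"row {i}"} for i in range(1000)]
results = asyncio.run(run_batch(items))
```

The nice property of this shape is that the job tracker only ever sees one worker per job, while the worker is free to process 100 or 1,000 items at once.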
The prototype now successfully handles an embedding run against OpenAI! It needs a bunch of tidying up but it's looking very promising.
Here's the table after the demo run completed:
Persisting the OpenAI API key like that is clearly not good.
I'm also not convinced I got the cost calculation right - I think rounding is throwing away too much information.
I'm not sure which of these was that run:
$0.0001 / 1K tokens for 916,000 tokens is about 9c, so actually yeah I think I got it right, or at least close enough.
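Double-checking that arithmetic:

```python
# Sanity-check the cost estimate: $0.0001 per 1K tokens
tokens = 916_000
price_per_1k = 0.0001
cost = tokens / 1000 * price_per_1k
print(f"${cost:.4f}")  # $0.0916 - roughly 9 cents
```

One way to avoid the rounding problem would be to store costs as an integer count of the smallest unit (e.g. hundredths of a cent) and only format as dollars at display time.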
Thoughts on the API token problem:
- The `_internal` database inside Datasette, not exposed to users
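Storing the key that way might look something like this sketch - the table and column names are hypothetical, and an in-memory SQLite database stands in for `_internal`:

```python
import sqlite3

# Illustrative sketch only: persist secrets in a table that is never
# exposed through the Datasette UI. Schema here is hypothetical.
db = sqlite3.connect(":memory:")  # stand-in for Datasette's _internal database
db.execute(
    "create table if not exists _enrichment_secrets (name text primary key, value text)"
)
db.execute(
    "insert or replace into _enrichment_secrets (name, value) values (?, ?)",
    ("openai_api_key", "sk-..."),
)
key = db.execute(
    "select value from _enrichment_secrets where name = ?", ("openai_api_key",)
).fetchone()[0]
```

Because `_internal` is never served to users, the key would stay out of the tables that leaked in the demo run above.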
This plugin will work by providing its own plugin hook that can be used to register "enrichments" - classes that can enrich data in some way, for example:
Each of these enrichments will itself be a plugin. The `datasette-enrichments` plugin will be responsible for tracking which enrichments are to run against which columns and tracking progress along the way.

Crucially, many enrichment implementations will be expected to run as separate processes - so this plugin will offer an API that external enrichment processes can use to ask "what do I need to do?" and to then record their results back to the Datasette instance.
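That claim/record protocol could be modeled like this - an in-memory job store stands in for the real HTTP API, and the endpoint semantics and field names are assumptions:

```python
# Sketch of the protocol an external enrichment process might follow,
# modeled with an in-memory job store instead of real HTTP endpoints.
# Method names and field names are assumptions, not the actual API.

class JobStore:
    def __init__(self, rows):
        self.pending = list(rows)
        self.completed = {}

    def claim_batch(self, size):
        # "What do I need to do?" - hand out up to `size` unclaimed rows
        batch, self.pending = self.pending[:size], self.pending[size:]
        return batch

    def record_results(self, results):
        # Write the enriched values back against their row ids
        self.completed.update(results)

store = JobStore([{"id": i, "text": f"row {i}"} for i in range(5)])
while batch := store.claim_batch(2):
    store.record_results({row["id"]: row["text"].upper() for row in batch})
```

An external worker would loop exactly like the `while` loop here: claim a batch, enrich it (by machine or by a human typing in text), record the results, and repeat until nothing is pending.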