datasette / datasette-enrichments

Tools for running enrichments against data stored in Datasette
https://enrichments.datasette.io
Apache License 2.0

Initial design for this plugin #1

Closed by simonw 1 year ago

simonw commented 2 years ago

This plugin will work by providing its own plugin hook that can be used to register "enrichments" - classes that can enrich data in some way, for example:

Each of these enrichments will itself be a plugin. The datasette-enrichments plugin will be responsible for tracking which enrichments are to run against which columns and tracking progress along the way.

Crucially, many enrichment implementations will be expected to run as separate processes - so this plugin will offer an API that external enrichment processes can use to ask "what do I need to do?" and to then record their results back to the Datasette instance.

simonw commented 2 years ago

It would be interesting if this mechanism could handle human-powered enrichments too - after all, saying "run OCR against everything in this column and write the discovered text back to this other column" isn't really any different from saying "ask a human being to type in the text from this image". They can work from the same APIs!

simonw commented 2 years ago

The main things that need to be designed then are:

simonw commented 1 year ago

I'm inclined to say that enrichments that want to work in parallel should implement that themselves - so a job can only be worked on by a single worker, but that worker is welcome to grab a batch of 100 items at once and execute a massively parallel architecture of some sort to crunch through that batch as fast as possible.

Or grab 10x100 batches and process 1000 in parallel.

That way I can outsource managing that parallelism and keep the core mechanism in Datasette as simple as possible.
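As a rough illustration of that split, here is a sketch in which a single worker owns the job, pulls items in batches of 100, and fans each batch out to a thread pool. The names `chunked`, `process_batch` and `run_job` are hypothetical, and a thread pool is only one of many ways an enrichment could choose to parallelize its own work.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items, size):
    """Yield successive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def process_batch(batch, enrich_one, max_workers=10):
    """Enrich one batch in parallel; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(enrich_one, batch))


def run_job(items, enrich_one, batch_size=100):
    """A single worker owns the whole job and handles one batch at a time."""
    results = []
    for batch in chunked(items, batch_size):
        results.extend(process_batch(batch, enrich_one))
    return results
```

The core only needs to hand out batches and accept results; how many threads, processes or machines sit behind `process_batch` is entirely the enrichment's decision.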

simonw commented 1 year ago

The prototype now successfully handles an embedding run against OpenAI! It needs a bunch of tidying up but it's looking very promising.

Here's the table after the demo run completed:

[Screenshot: CleanShot 2023-11-05 at 21 22 41@2x]

Persisting the OpenAI API key like that is clearly not good.

I'm also not convinced I got the cost calculation right - I think rounding is throwing away too much information.

simonw commented 1 year ago

I'm not sure which of these was that run:

[Screenshot: CleanShot 2023-11-05 at 21 24 15@2x]

$0.0001 / 1K tokens for 916,000 tokens is $0.0916, about 9c, so actually yeah I think I got it right, or at least close enough.
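The arithmetic can be checked with a few lines of Python. This sketch uses the price and token count quoted above; the second half, which splits the run into 1,000 rows of 916 tokens each, is a made-up scenario to show how rounding each row to whole cents before summing could throw the total away.

```python
from decimal import Decimal

PRICE_PER_1K_TOKENS = Decimal("0.0001")  # dollars per 1K tokens, as quoted above


def cost_dollars(tokens):
    """Exact cost in dollars for a given number of tokens."""
    return Decimal(tokens) / 1000 * PRICE_PER_1K_TOKENS


total = cost_dollars(916_000)
print(total)  # 0.0916, i.e. roughly 9 cents

# Hypothetical split: 1,000 rows of 916 tokens each (not from the actual run).
per_row = [cost_dollars(916) for _ in range(1000)]

# Rounding every row to whole cents first loses the entire amount...
rounded_then_summed = sum(c.quantize(Decimal("0.01")) for c in per_row)

# ...while summing exact values and rounding once keeps it.
summed_then_rounded = sum(per_row).quantize(Decimal("0.01"))

print(rounded_then_summed)  # 0.00
print(summed_then_rounded)  # 0.09
```

Storing the token count, or the cost in a much smaller unit than cents, and rounding only for display would avoid the problem.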

simonw commented 1 year ago

Thoughts on the API token problem: