simonw closed this 1 year ago
It would be interesting if this mechanism could handle human-powered enrichments too - after all, saying "run OCR against everything in this column and write the discovered text back to this other column" isn't really any different from saying "ask a human being to type in the text from this image". They can work from the same APIs!
The main things that need to be designed then are:
I'm inclined to say that enrichments that want to work in parallel should implement that themselves - so a job can only be worked on by a single worker, but that worker is welcome to grab a batch of 100 items at once and execute a massively parallel architecture of some sort to crunch through that batch as fast as possible.
Or grab 10x100 batches and process 1000 in parallel.
That way I can outsource managing that parallelism and keep the core mechanism in Datasette as simple as possible.
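The batching idea above could be sketched like this - a single worker claims a batch and fans the work out itself with `asyncio`, so the parallelism lives in the enrichment rather than in Datasette's core. All names here (`process_item`, `run_batch`) are illustrative, not a real API:

```python
import asyncio

# Hypothetical sketch: one worker owns the whole batch, but crunches
# through it concurrently. Names are illustrative, not a real API.

async def process_item(item):
    # Stand-in for the real enrichment work (OCR, an embedding call, etc.)
    await asyncio.sleep(0)
    return {"id": item["id"], "result": item["text"].upper()}

async def run_batch(items, concurrency=100):
    # Cap in-flight work so a 1,000-item grab doesn't open 1,000 connections
    semaphore = asyncio.Semaphore(concurrency)

    async def bounded(item):
        async with semaphore:
            return await process_item(item)

    return await asyncio.gather(*(bounded(item) for item in items))

items = [{"id": i, "text": f"row {i}"} for i in range(1000)]
results = asyncio.run(run_batch(items))
```

The nice property of this shape is that the job tracker only ever sees one worker per job, while the worker is free to process 100 or 1,000 items at once.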
The prototype now successfully handles an embedding run against OpenAI! It needs a bunch of tidying up but it's looking very promising.
Here's the table after the demo run completed:
Persisting the OpenAI API key like that is clearly not good.
I'm also not convinced I got the cost calculation right - I think rounding is throwing away too much information.
I'm not sure which of these was that run:
$0.0001 / 1K tokens for 916,000 tokens is about 9c, so actually yeah I think I got it right, or at least close enough.
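Double-checking that arithmetic:

```python
# Sanity-check the cost estimate: $0.0001 per 1K tokens
tokens = 916_000
price_per_1k = 0.0001
cost = tokens / 1000 * price_per_1k
print(f"${cost:.4f}")  # $0.0916 - roughly 9 cents
```

One way to avoid the rounding problem would be to store costs as an integer count of the smallest unit (e.g. hundredths of a cent) and only format as dollars at display time.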
Thoughts on the API token problem:
- The `_internal` database inside Datasette, not exposed to users
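Storing the key that way might look something like this sketch - the table and column names are hypothetical, and an in-memory SQLite database stands in for `_internal`:

```python
import sqlite3

# Illustrative sketch only: persist secrets in a table that is never
# exposed through the Datasette UI. Schema here is hypothetical.
db = sqlite3.connect(":memory:")  # stand-in for Datasette's _internal database
db.execute(
    "create table if not exists _enrichment_secrets (name text primary key, value text)"
)
db.execute(
    "insert or replace into _enrichment_secrets (name, value) values (?, ?)",
    ("openai_api_key", "sk-..."),
)
key = db.execute(
    "select value from _enrichment_secrets where name = ?", ("openai_api_key",)
).fetchone()[0]
```

Because `_internal` is never served to users, the key would stay out of the tables that leaked in the demo run above.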
This plugin will work by providing its own plugin hook that can be used to register "enrichments" - classes that can enrich data in some way, for example:
Each of these enrichments will itself be a plugin. The `datasette-enrichments` plugin will be responsible for tracking which enrichments are to run against which columns and tracking progress along the way.

Crucially, many enrichment implementations will be expected to run as separate processes - so this plugin will offer an API that external enrichment processes can use to ask "what do I need to do?" and to then record their results back to the Datasette instance.
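That claim/record protocol could be modeled like this - an in-memory job store stands in for the real HTTP API, and the endpoint semantics and field names are assumptions:

```python
# Sketch of the protocol an external enrichment process might follow,
# modeled with an in-memory job store instead of real HTTP endpoints.
# Method names and field names are assumptions, not the actual API.

class JobStore:
    def __init__(self, rows):
        self.pending = list(rows)
        self.completed = {}

    def claim_batch(self, size):
        # "What do I need to do?" - hand out up to `size` unclaimed rows
        batch, self.pending = self.pending[:size], self.pending[size:]
        return batch

    def record_results(self, results):
        # Write the enriched values back against their row ids
        self.completed.update(results)

store = JobStore([{"id": i, "text": f"row {i}"} for i in range(5)])
while batch := store.claim_batch(2):
    store.record_results({row["id"]: row["text"].upper() for row in batch})
```

An external worker would loop exactly like the `while` loop here: claim a batch, enrich it (by machine or by a human typing in text), record the results, and repeat until nothing is pending.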