Closed: simonw closed this 3 months ago
Assorted ideas:
Most basic version: you select an existing table (hence avoiding the need to implement a schema editing tool) and paste text into a textarea. I'll build that first.
It's going to need a description for each column - it can guess in some cases, but the option to give it clues will help a lot.
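
A rough sketch of how the selected table plus the per-column descriptions could be turned into a function-calling schema. Everything named here is an assumption for illustration, not the plugin's actual code: the helper `build_extraction_schema`, the function name `extract_data`, and the top-level `items` array (chosen only because it lines up with the `items.item` prefix in the ijson recipe).

```python
import sqlite3

# Assumed mapping from SQLite declared types to JSON schema types;
# anything unrecognised falls back to "string".
TYPE_MAP = {"INTEGER": "integer", "INT": "integer", "REAL": "number", "FLOAT": "number"}


def build_extraction_schema(conn, table, descriptions=None):
    """Build a function schema whose items mirror the table's columns.

    descriptions is an optional {column_name: hint} dict, the "clues"
    a user could supply for columns the model cannot guess.
    """
    descriptions = descriptions or {}
    properties = {}
    rows = conn.execute("SELECT name, type FROM pragma_table_info(?)", [table])
    for name, col_type in rows:
        prop = {"type": TYPE_MAP.get((col_type or "").upper(), "string")}
        if name in descriptions:
            prop["description"] = descriptions[name]
        properties[name] = prop
    return {
        "name": "extract_data",  # hypothetical function name
        "description": "Extract rows for the {} table".format(table),
        "parameters": {
            "type": "object",
            "properties": {
                "items": {
                    "type": "array",
                    "items": {"type": "object", "properties": properties},
                }
            },
            "required": ["items"],
        },
    }
```

Columns without a description still end up in the schema, so the model can guess from the column name alone; the description is only added where a clue was given.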
I got this working, but it was really slow... because the OpenAI APIs take a while to stream back all of that JSON.
I had a note about that in https://til.simonwillison.net/gpt3/openai-python-functions-data-extraction where I mentioned that maybe ijson could help with that.
So I spent some time and figured out the ijson recipe for it, described in a new TIL: https://til.simonwillison.net/json/ijson-stream
Short version:
    import json
    import time

    import ijson

    # chunks is any iterable of string fragments of the JSON response
    events = ijson.sendable_list()
    coro = ijson.items_coro(events, "items.item")
    seen_events = set()
    for chunk in chunks:
        coro.send(chunk.encode("utf-8"))
        if events:
            # Any we have not seen yet?
            unseen_events = [e for e in events if json.dumps(e) not in seen_events]
            if unseen_events:
                for event in unseen_events:
                    seen_events.add(json.dumps(event))
                    print(json.dumps(event))
        # Pause between chunks to simulate a slow stream
        time.sleep(1)
The goal of this plugin is to provide a UI for extracting structured data from unstructured text, using the trick described in https://til.simonwillison.net/gpt3/openai-python-functions-data-extraction
Datasette is all about tables, so a plugin which makes it as easy as possible to turn unstructured data into table data makes a ton of sense.