datasette / datasette-extract

Import unstructured data (text and images) into structured tables
Apache License 2.0
Initial plugin design #1

Closed simonw closed 3 months ago

simonw commented 10 months ago

The goal of this plugin is to provide a UI for extracting structured data from unstructured text, using the trick described in https://til.simonwillison.net/gpt3/openai-python-functions-data-extraction

Datasette is all about tables, so a plugin which makes it as easy as possible to turn unstructured data into table data makes a ton of sense.

simonw commented 10 months ago

Assorted ideas:

simonw commented 10 months ago

Most basic version: you select an existing table (hence avoiding the need to implement a schema editing tool) and paste text into a textarea. I'll build that first.
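A minimal sketch of what "select an existing table" buys here, assuming the extracted data arrives as a list of dicts: the table's own columns define which keys are kept, so no schema editor is needed. The function name `insert_extracted_items` is hypothetical and not part of the plugin; it uses only the standard library `sqlite3` module.

```python
import sqlite3


def insert_extracted_items(conn, table, items):
    """Insert dicts into an existing table, ignoring keys that are not columns."""
    # Look up the column names of the table the user selected.
    columns = [
        row[0]
        for row in conn.execute("SELECT name FROM pragma_table_info(?)", (table,))
    ]
    if not columns:
        raise ValueError(f"No such table: {table}")
    column_list = ", ".join(f"[{c}]" for c in columns)
    placeholders = ", ".join("?" for _ in columns)
    sql = f"INSERT INTO [{table}] ({column_list}) VALUES ({placeholders})"
    # Missing keys become NULL; unknown keys are dropped.
    with conn:
        conn.executemany(sql, [[item.get(c) for c in columns] for item in items])
    return len(items)
```

Usage, with a made-up `people` table: `insert_extracted_items(conn, "people", [{"name": "Ann", "age": 30}])`.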

simonw commented 10 months ago

The plugin is going to need a description for each column. The model can guess in some cases, but the option to give it clues will help a lot.
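One way the per-column descriptions could feed into the extraction call, as a rough sketch: build a JSON Schema from the selected table's columns and attach a description wherever the user supplied one. The `{"items": [...]}` wrapper is an assumption based on the `items.item` prefix used with ijson later in this thread; the function name, the type mapping and the exact schema shape are all hypothetical.

```python
import sqlite3

# Assumed mapping from SQLite declared types to JSON Schema types.
SQLITE_TO_JSON_TYPE = {"INTEGER": "integer", "REAL": "number", "TEXT": "string"}


def build_extraction_schema(conn, table, descriptions=None):
    """Build a JSON Schema describing {"items": [row, ...]} for a table."""
    descriptions = descriptions or {}
    properties = {}
    rows = conn.execute("SELECT name, type FROM pragma_table_info(?)", (table,))
    for name, col_type in rows:
        # Unknown or missing declared types fall back to "string".
        prop = {"type": SQLITE_TO_JSON_TYPE.get((col_type or "").upper(), "string")}
        if name in descriptions:
            # The user-provided clue for this column.
            prop["description"] = descriptions[name]
        properties[name] = prop
    return {
        "type": "object",
        "properties": {
            "items": {
                "type": "array",
                "items": {"type": "object", "properties": properties},
            }
        },
        "required": ["items"],
    }
```

Columns without a description simply get a type, leaving the model to guess from the column name.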

simonw commented 10 months ago

I got this working, but it was really slow... because the OpenAI APIs take a while to stream back all of that JSON.

I had a note about that in https://til.simonwillison.net/gpt3/openai-python-functions-data-extraction where I mentioned that ijson might help.

So I spent some time and figured out the ijson recipe for it, described in a new TIL: https://til.simonwillison.net/json/ijson-stream

Short version:

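import json
import time

import ijson

# Hypothetical stand-in for the streamed API response: any iterable of str chunks.
doc = '{"items": [{"name": "one"}, {"name": "two"}]}'
chunks = [doc[i : i + 8] for i in range(0, len(doc), 8)]

# items_coro() appends each completed object in the "items" array to this list.
events = ijson.sendable_list()
coro = ijson.items_coro(events, "items.item")

seen_events = set()

for chunk in chunks:
    coro.send(chunk.encode("utf-8"))
    if events:
        # events keeps accumulating, so skip anything already printed.
        unseen_events = [e for e in events if json.dumps(e) not in seen_events]
        for event in unseen_events:
            seen_events.add(json.dumps(event))
            print(json.dumps(event))
            time.sleep(1)  # Demo pause between items.

coro.close()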