datasette / datasette-extract

Import unstructured data (text and images) into structured tables

Run import tasks in the background and show people a progress indicator while they run #4

Closed: simonw closed this issue 7 months ago

simonw commented 7 months ago

This currently blocks for several seconds while the job runs, which isn't ideal.

simonw commented 7 months ago

We can't easily show a 0-100% progress indicator, but showing parsed results as they are extracted would be neat.

simonw commented 7 months ago

Two ways to do this:

  1. Track state for in-progress jobs in an in-memory data structure
  2. Record each job in a database table

I like the table method, since that makes things easier to debug in the future. It could also help power future features, like remembering the last used column descriptions.

simonw commented 7 months ago

The database table will be similar to the one used by datasette-enrichments: https://github.com/datasette/datasette-enrichments/blob/eeded4173fff0614e2efe559022a9023e5ff8640/datasette_enrichments/__init__.py#L35-L54

```sql
create table if not exists _enrichment_jobs (
    id integer primary key,
    status text, -- pending, running, cancelled, finished
    enrichment text, -- slug of enrichment
    database_name text,
    table_name text,
    filter_querystring text, -- querystring used to filter rows
    config text, -- JSON dictionary of config
    started_at text, -- ISO8601 when added
    finished_at text, -- ISO8601 when completed or cancelled
    cancel_reason text, -- null or reason for cancellation
    next_cursor text, -- next cursor to fetch
    row_count integer, -- number of rows to enrich at start
    error_count integer, -- number of rows with errors encountered
    done_count integer, -- number of rows processed
    actor_id text, -- optional ID of actor who created the job
    cost_100ths_cent integer -- cost of job so far in 1/100ths of a cent
)
```

simonw commented 7 months ago

I'm torn on whether I should persist the full input text to the model. I think yes, for debuggability.

simonw commented 7 months ago

Maybe this as a starting point:

```sql
create table if not exists _extract_jobs (
    id integer primary key,
    database_name text,
    table_name text,
    input text,
    custom_prompt text, -- for custom system prompt
    columns text, -- JSON function definition
    started_at text, -- ISO8601 when added
    finished_at text, -- ISO8601 when completed or cancelled
    row_count integer, -- starts at 0
    error text
);
```
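
As a rough illustration (not the plugin's actual code), here is how a job row matching that schema might be recorded with sqlite-utils when a task starts; the helper name and database path are hypothetical:

```python
import json
from datetime import datetime, timezone

import sqlite_utils

db = sqlite_utils.Database("data.db")  # hypothetical path

def start_extract_job(database_name, table_name, input_text, columns):
    # Persist the full input text for debuggability, as discussed above
    return db["_extract_jobs"].insert({
        "database_name": database_name,
        "table_name": table_name,
        "input": input_text,
        "custom_prompt": None,
        "columns": json.dumps(columns),  # the JSON "function definition"
        "started_at": datetime.now(timezone.utc).isoformat(),
        "row_count": 0,
        "error": None,
    }).last_pk
```
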
simonw commented 7 months ago

Options for polling for extracted records to display to the user:

  1. Poll the database table that is being populated. That's tricky, though: we don't want to show any previously existing records. We could use rowid to identify newly inserted rows, but that won't help for replace-by-pk if we implement that.
  2. Poll an in-memory data structure that exists just for the duration of the task.
  3. Append the JSON to either a column in _extract_jobs or a separate table, and poll that.
  4. Skip polling entirely and use WebSockets or server-sent events (SSE).

I'm going to try option 2 first because I think it will be easiest to get working - I can very easily change how that works later.
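
Here is a minimal sketch of option 2, assuming a module-level dict keyed by task ID plus a route (registered via the register_routes() plugin hook) that pollers can hit for the current state; the names tasks, run_extraction and poll_job are illustrative, not the plugin's actual API:

```python
from datasette.utils.asgi import Response

# In-memory state: exists only for the lifetime of the server process
tasks = {}  # task_id -> {"rows": [...], "done": bool, "error": str or None}

async def run_extraction(task_id, chunks, extract_one):
    # extract_one is a hypothetical coroutine that calls the model on one chunk
    state = tasks[task_id] = {"rows": [], "done": False, "error": None}
    try:
        for chunk in chunks:
            row = await extract_one(chunk)
            state["rows"].append(row)  # becomes visible to pollers immediately
    except Exception as ex:
        state["error"] = str(ex)
    finally:
        state["done"] = True

async def poll_job(request):
    # Returns the rows accumulated so far as JSON for the progress display
    task_id = request.url_vars["task_id"]
    return Response.json(tasks.get(task_id, {"error": "unknown task"}))
```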

simonw commented 7 months ago

I'm going to display the extracted data to the user as a raw JSON array. I considered a more human-friendly format like a table or definition list, but I think JSON is a better representation of what's actually happening, plus it gives users the chance to copy and paste it into another tool if they are so inclined.

simonw commented 7 months ago

Should the rows be inserted one at a time as they are extracted, or should they all be inserted at once at the end of the operation?

I'll try one at a time first and see how that feels.
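
A sketch of the one-at-a-time approach, assuming sqlite-utils and a hypothetical async iterator of parsed row dictionaries:

```python
import sqlite_utils

db = sqlite_utils.Database("data.db")  # hypothetical path

async def insert_as_extracted(table_name, extracted_rows):
    async for row in extracted_rows:
        # Inserting per row means partial results survive a failure mid-job
        db[table_name].insert(row)
```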

simonw commented 7 months ago

I generated a new loading indicator using https://loading.io/ (unrestricted public domain license).

simonw commented 7 months ago

Demo:

(animated demo GIF: projects-demo)