datasette / datasette-extract

Import unstructured data (text and images) into structured tables
Apache License 2.0

Support drag and dropping files (including PDFs) #2

Closed: simonw closed this issue 4 months ago

simonw commented 4 months ago

I can use PDF.js to support dropping PDFs and extracting their text, which turns out to work pretty well.

Demo here: https://observablehq.com/@simonw/extract-text-content-from-a-pdf
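
A minimal sketch of the PDF.js approach, for illustration only. The function name `extractPdfText` is hypothetical, and `pdfjsLib` is passed in as a parameter because how the library gets loaded (script tag, bundler, CDN) is an assumption this sketch avoids making. It uses the `getDocument`, `getPage` and `getTextContent` calls from the PDF.js API and joins the text runs of each page with spaces, which is a simplification that ignores layout.

```javascript
// Hypothetical helper: extract the text of every page of a PDF using PDF.js.
// `pdfjsLib` is the loaded PDF.js library; `data` is the PDF as a Uint8Array.
async function extractPdfText(pdfjsLib, data) {
  const pdf = await pdfjsLib.getDocument({ data }).promise;
  const pages = [];
  // PDF.js page numbers start at 1.
  for (let pageNumber = 1; pageNumber <= pdf.numPages; pageNumber++) {
    const page = await pdf.getPage(pageNumber);
    const content = await page.getTextContent();
    // Keep only items that carry a text run; other item types are skipped.
    const text = content.items
      .filter((item) => typeof item.str === "string")
      .map((item) => item.str)
      .join(" ");
    pages.push(text);
  }
  // Separate pages with a blank line.
  return pages.join("\n\n");
}
```

In a drop handler this could be called with something like `extractPdfText(pdfjsLib, new Uint8Array(await file.arrayBuffer()))`, assuming `file` is the dropped `File` object.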

simonw commented 4 months ago

Dragging files on is a great way to blow through the token allowance. Do I care? As long as the user gets a useful error message I think that's OK for the moment.

In the future it might be nice to split their input and submit in multiple batches for them, but that sounds difficult to get right.
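
For reference, a naive version of that splitting might look like the sketch below. Everything here is an assumption: `splitIntoBatches` is a hypothetical name, and it budgets by character count as a rough stand-in for tokens. It breaks on blank lines where it can and hard-splits any paragraph that is too long on its own, so it does nothing to stop a single record from being cut across two batches, which is the part that is hard to get right.

```javascript
// Hypothetical sketch: split text into batches of at most maxChars characters,
// preferring to break on blank lines. Characters stand in for tokens here.
function splitIntoBatches(text, maxChars) {
  if (!Number.isInteger(maxChars) || maxChars < 1) {
    throw new RangeError("maxChars must be a positive integer");
  }
  const batches = [];
  let current = "";
  for (const paragraph of text.split(/\n{2,}/)) {
    // A paragraph longer than the budget is cut into maxChars-sized pieces.
    for (let i = 0; i < paragraph.length; i += maxChars) {
      const piece = paragraph.slice(i, i + maxChars);
      const candidate = current ? current + "\n\n" + piece : piece;
      if (candidate.length <= maxChars) {
        // Still fits: keep growing the current batch.
        current = candidate;
      } else {
        // Would overflow: close the current batch and start a new one.
        if (current) batches.push(current);
        current = piece;
      }
    }
  }
  if (current) batches.push(current);
  return batches;
}
```

Each batch would then be submitted as its own request, and the results merged afterwards.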

simonw commented 4 months ago

Not just PDFs: dragging and dropping in plain text files should work too.

Turns out it's tricky to detect if a file is binary or text, but this hack works I think:

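function isValidUtf8(str) {
    // Round-trip the string through UTF-8 bytes and back again.
    const encoder = new TextEncoder();
    const decoder = new TextDecoder();
    const encoded = encoder.encode(str);
    const decoded = decoder.decode(encoded);
    // Treat the input as text only if it survives the round trip unchanged.
    return decoded === str;
}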
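
The check above works on a string that has already been decoded. An alternative, sketched here as an assumption rather than what the plugin does, is to test the raw bytes of the file before turning them into a string. `TextDecoder` accepts a `fatal: true` option that makes `decode()` throw on malformed input instead of silently substituting replacement characters. The name `looksLikeUtf8Text` is hypothetical.

```javascript
// Hypothetical helper: report whether raw bytes decode cleanly as UTF-8.
function looksLikeUtf8Text(bytes) {
  // With fatal: true, decode() throws a TypeError on invalid byte sequences.
  const decoder = new TextDecoder("utf-8", { fatal: true });
  try {
    decoder.decode(bytes);
    return true;
  } catch (err) {
    return false;
  }
}
```

With a dropped `File` object this could be called as `looksLikeUtf8Text(new Uint8Array(await file.arrayBuffer()))`. It is still a heuristic: NUL bytes are valid UTF-8, so some binary files will pass.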