0x2447196 / raypeatarchive


Newsletters #19

Closed atalw closed 3 months ago

atalw commented 3 months ago

Used Surya OCR for all the conversions. I estimate around 90% accuracy, so it's not perfect and many docs may still need manual cleanup.

0x2447196 commented 3 months ago

hey thanks for this! i think it would be useful to have the file names indicate where each document was taken from rather than what it is about. Also I think gpt4o will be MUCH better for OCR; but I'm good to merge this now!

atalw commented 3 months ago

The filenames are the same as the PDF names that are on Chadnet. And unfortunately, GPT-4o doesn't do a good job of OCR. It hallucinates, skips paragraphs, and reorders them. Surya is actually SOTA for this narrow task.

0x2447196 commented 3 months ago

can you give me an example of a document that GPT-4o has a hard time with? The ones i've tried so far have been perfect.

atalw commented 3 months ago

Just tried one: 'Adaptogenic Milk' with a prompt: "Can you transcribe this document? Include all pages, order the paragraphs correctly, and do not make up new information."

It reworded sentences, made up new ones, changed the paragraph order, and didn't complete the doc either, probably because of context limits. Perhaps the prompt could be improved?
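The failure modes described here (dropped paragraphs, invented paragraphs, changed order) can be checked mechanically when a trusted transcription exists. The following is a minimal sketch using only Python's standard `difflib`; the function names `split_paragraphs` and `audit` and the 0.8 similarity threshold are illustrative assumptions, not something used in this PR.

```python
from __future__ import annotations

import difflib


def split_paragraphs(text: str) -> list[str]:
    """Split on blank lines and collapse internal whitespace."""
    return [" ".join(p.split()) for p in text.split("\n\n") if p.strip()]


def audit(reference: str, candidate: str, threshold: float = 0.8) -> dict:
    """Fuzzy-match each reference paragraph to a candidate paragraph.

    The threshold is an assumed cutoff: pairs scoring below it are
    treated as unrelated paragraphs.
    """
    ref = split_paragraphs(reference)
    cand = split_paragraphs(candidate)
    matches = []  # candidate index matched by each reference paragraph, or None
    for r in ref:
        best_idx, best_score = None, threshold
        for i, c in enumerate(cand):
            score = difflib.SequenceMatcher(None, r, c, autojunk=False).ratio()
            if score >= best_score:
                best_idx, best_score = i, score
        matches.append(best_idx)
    found = [m for m in matches if m is not None]
    return {
        # Reference paragraphs with no close match: possibly skipped.
        "missing": [ref[i] for i, m in enumerate(matches) if m is None],
        # Candidate paragraphs nothing matched: possibly invented.
        "unmatched": [c for i, c in enumerate(cand) if i not in found],
        # True when matched paragraphs appear in a different order.
        "reordered": found != sorted(found),
    }
```

Heavily reworded paragraphs will show up as both "missing" and "unmatched", so the output is a list of places to inspect by hand rather than a verdict.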

0x2447196 commented 3 months ago

Just tried it with "Adaptogenic Milk", first page was perfect.

https://chat.openai.com/share/b2c9b554-a4a7-4d33-8427-5b03f534d202

I used this prompt:

please extract the text of this document, output in markdown format
it has two columns

Here's a comparison: the original is copy/pasted directly from the PDF, and the modified is from gpt4o

https://www.diffchecker.com/9ANjq1Gv/
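The same kind of comparison can be run locally. This is a rough sketch using Python's standard `difflib`, assuming both transcriptions are available as plain-text strings; `normalize` and `compare` are hypothetical helper names. Whitespace is collapsed first so that spacing differences alone do not count as changes.

```python
from __future__ import annotations

import difflib
import re


def normalize(text: str) -> list[str]:
    """Split into paragraphs on blank lines and collapse runs of whitespace."""
    paragraphs = re.split(r"\n\s*\n", text)
    cleaned = [" ".join(p.split()) for p in paragraphs]
    return [p for p in cleaned if p]


def compare(original: str, modified: str) -> tuple[float, list[str]]:
    """Return a word-level similarity ratio and a paragraph-level unified diff."""
    orig_pars = normalize(original)
    mod_pars = normalize(modified)
    orig_words = " ".join(orig_pars).split()
    mod_words = " ".join(mod_pars).split()
    # autojunk=False keeps frequent words from being ignored in long texts.
    matcher = difflib.SequenceMatcher(None, orig_words, mod_words, autojunk=False)
    diff = list(
        difflib.unified_diff(orig_pars, mod_pars, "original", "modified", lineterm="")
    )
    return matcher.ratio(), diff
```

The ratio is only a coarse signal: a high value does not rule out a small but meaningful rewording, so the diff lines still need to be read.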

atalw commented 3 months ago

With my prompt it gets messed up later in the document and subtly in various places earlier. It didn't complete the doc either and made up the last paragraph. Here's the diff from the doc just merged (spacing adjusted) vs GPT.

I tried it with your prompt and it does much better. Maybe requesting a markdown output was the trick. Here's the diff. https://chat.openai.com/share/e1998e57-f21d-4357-af42-f74768f6abb3

IMO, I'm more confident in Surya as it won't hallucinate, but it's nice to have the actual spacing and punctuation that GPT outputs. This document itself is really clean, so both of these outputs are very similar, but I wonder how they'd fare against a document with more complications like handwritten notes, markings, ads, etc.

atalw commented 3 months ago

Since we just cleaned Adaptogenic Milk and checked it against the source, I raised a PR here. It just improves spacing and adds a few commas, so I think we can be confident that the current transcriptions are mostly good enough, though I can't guarantee it.