atalw closed this 3 months ago
hey thanks for this! i think it will be useful to have the file names indicate where it was taken from rather than what it is about. Also I think gpt4o will be MUCH better for OCR; but I'm good to merge this now!
The filenames are the same as the PDF names that are on Chadnet. And unfortunately, GPT-4o doesn't do a good job of OCR. It hallucinates, skips paragraphs, and reorders them. Surya is actually SOTA for this narrow task.
can you give me an example of a document that GPT-4o has a hard time with? The ones i've tried so far have been perfect.
Just tried one: 'Adaptogenic Milk' with a prompt: "Can you transcribe this document? Include all pages, order the paragraphs correctly, and do not make up new information."
It reworded sentences, made up new ones, changed the paragraph order, and didn't complete the doc either, probably because of context limits. Perhaps the prompt could be improved?
Just tried it with "Adaptogenic Milk", first page was perfect.
https://chat.openai.com/share/b2c9b554-a4a7-4d33-8427-5b03f534d202
I used this prompt
please extract the text of this document, output in markdown format
it has two columns
Here's a comparison: the original is copy/pasted directly from the PDF, and the modified is from gpt4o.
With my prompt it gets messed up later in the document and subtly in various places earlier. It didn't complete the doc either and made up the last paragraph. Here's the diff from the doc just merged (spacing adjusted) vs GPT.
I tried it with your prompt and it does much better. Maybe requesting a markdown output was the trick. Here's the diff. https://chat.openai.com/share/e1998e57-f21d-4357-af42-f74768f6abb3
IMO, I'm more confident in Surya as it won't hallucinate, but it's nice to have the actual spacing and punctuation that GPT outputs. This document itself is really clean, so both of these outputs are very similar, but I wonder how they'd fare against a document with more complications like handwritten notes, markings, ads, etc.
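For anyone who wants to repeat this comparison on other documents, here is a rough sketch of a paragraph-level check for the failure modes mentioned above (skipped paragraphs, made-up paragraphs, changed order). It is not what was used for the diffs in this thread; the function names are hypothetical, and it assumes both transcripts are plain text with paragraphs separated by blank lines. Matching is exact after whitespace is collapsed, so a reworded paragraph will show up as both missing and added.

```python
import re


def normalize(text: str) -> str:
    """Collapse runs of whitespace so spacing differences are ignored."""
    return re.sub(r"\s+", " ", text).strip()


def split_paragraphs(text: str) -> list[str]:
    """Split on blank lines (assumed paragraph separator) and normalize each piece."""
    return [normalize(p) for p in re.split(r"\n\s*\n", text) if p.strip()]


def compare_transcripts(reference: str, candidate: str) -> dict:
    """Report paragraphs that are missing, added, or reordered in the candidate."""
    ref = split_paragraphs(reference)
    cand = split_paragraphs(candidate)

    # Paragraphs in the reference that never appear in the candidate.
    missing = [p for p in ref if p not in cand]
    # Paragraphs in the candidate with no exact match in the reference.
    added = [p for p in cand if p not in ref]

    # Compare the order of the paragraphs that both transcripts share.
    shared_in_ref_order = [p for p in ref if p in cand]
    shared_in_cand_order = [p for p in cand if p in ref]

    return {
        "missing": missing,
        "added": added,
        "reordered": shared_in_ref_order != shared_in_cand_order,
    }


if __name__ == "__main__":
    reference = "First para.\n\nSecond   para.\n\nThird para."
    candidate = "Second para.\n\nFirst para.\n\nInvented para."
    print(compare_transcripts(reference, candidate))
```

Here the reference would be text copy/pasted from the PDF and the candidate would be the Surya or GPT-4o output.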
Used Surya OCR for all the conversions. I think accuracy is around 90%, so it's still not perfect and many docs may need manual cleanup.
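The 90% figure above is an estimate, not a measurement. If a trusted reference text exists for a document (for example, text copy/pasted from a PDF that has a text layer), a rough similarity score could be computed along these lines. This is only a sketch: `char_similarity` is a hypothetical helper, and the score is `difflib`'s sequence similarity ratio, which is not a standard OCR accuracy metric such as character error rate.

```python
import difflib
import re


def normalize(text: str) -> str:
    """Collapse runs of whitespace so spacing differences do not affect the score."""
    return re.sub(r"\s+", " ", text).strip()


def char_similarity(reference: str, ocr_output: str) -> float:
    """Return a 0..1 character-level similarity between two transcripts."""
    a = normalize(reference)
    b = normalize(ocr_output)
    # autojunk=False disables the popularity heuristic, which can distort
    # scores on long texts.
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return matcher.ratio()


if __name__ == "__main__":
    score = char_similarity("hello  world", "hello world")
    print(f"similarity: {score:.2%}")
```

Note that `SequenceMatcher` gets slow on very long inputs, so scoring page by page is probably more practical than scoring a whole document at once.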