getomni-ai / zerox

PDF to Markdown with vision models
https://getomni.ai/ocr-demo
MIT License
6.59k stars, 358 forks

Files with little text emit a description rather than just the text #66

Open OliverWales opened 1 month ago

OliverWales commented 1 month ago

I have a debug PDF I use to test cropping to specific text ranges. It consists of some textboxes with the text A1 to A4. When I pass it to zerox, it returns the description "The page is blank except for the labels in the corners: A1, A2, A3, and A4." rather than the text contents "A1 A2 A3 A4".

tylermaran commented 1 month ago

I've also noticed this with blank pages, which we can probably fix with the Tesseract pre-process step.
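
A rough sketch of what that blank-page check might look like, assuming the pre-process step already yields the OCR text for each page. The names `is_blank_page`, `page_to_markdown`, and `call_vision_model` are hypothetical and not part of zerox:

```python
from typing import Callable


def is_blank_page(ocr_text: str, min_chars: int = 1) -> bool:
    """Treat a page as blank when OCR found fewer than min_chars visible characters."""
    # Drop all whitespace so a page of stray newlines still counts as blank.
    visible = "".join(ocr_text.split())
    return len(visible) < min_chars


def page_to_markdown(ocr_text: str, call_vision_model: Callable[[], str]) -> str:
    """Skip the vision model for blank pages; otherwise defer to it.

    call_vision_model is a stand-in for whatever actually sends the page
    image to the model (an assumption, not zerox's real interface).
    """
    if is_blank_page(ocr_text):
        return ""
    return call_vision_model()
```

With this in place a blank page would yield an empty string instead of a sentence like "The page is blank". The `min_chars` threshold is a guess and would likely need tuning against OCR noise.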

For this one, probably a prompt change...
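
One possible shape for such a prompt change, as a sketch only. `BASE_PROMPT` is a placeholder rather than zerox's actual prompt, the rule wording is an untested assumption, and `build_system_prompt` is a hypothetical helper:

```python
# Placeholder base prompt; NOT the prompt zerox actually ships with.
BASE_PROMPT = "Convert the following PDF page to Markdown."

# Hypothetical extra rules aimed at pages with very little text.
TRANSCRIBE_ONLY_RULES = (
    "Return only the text that appears on the page. "
    "Do not describe the page, its layout, or where text is positioned. "
    "If the page contains no text, return an empty string."
)


def build_system_prompt(base_prompt: str = BASE_PROMPT) -> str:
    """Append the transcribe-only rules to whatever base prompt is in use."""
    return f"{base_prompt}\n{TRANSCRIBE_ONLY_RULES}"
```

Whether wording like this stops the model from describing the page would need checking against the A1 to A4 debug PDF.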