getomni-ai / zerox

PDF to Markdown with vision models
https://getomni.ai/ocr-demo
MIT License

Improve accuracy #80

Open ZeeshanZulfiqarAli opened 3 weeks ago


This PR makes several changes to node-zerox with the goal of improving overall OCR accuracy.

  1. The resolution of the image rendered for each PDF page has been increased from 1056 to 2048 pixels.
  2. Automatic orientation correction is now configurable. It can be turned off by setting correctOrientation to false.
  3. Low-confidence Tesseract orientation suggestions are now ignored. The confidence threshold is not configurable and is fixed at 60%.
  4. Whitespace around the page serves no purpose in OCR, so trimming it away lets more pixels show actual content. Trimming is on by default and can be switched off by setting trimEdges to false.
  5. After step 4, to make sure the content fills the maximum height of 2048px, we make an educated guess and render a second image of the same page at an increased height, so that after whitespace trimming the image is close to 2048px tall. This has the most impact on pages with a lot of whitespace and small content bunched together on one side.
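
To make items 2, 3, and 5 concrete, here is a minimal TypeScript sketch of the decision logic. It is not the code from this PR: the function names, the `OrientationGuess` shape, the 0-100 confidence scale, and the proportional-scaling formula for the second render are all assumptions made for illustration. Only `correctOrientation`, `trimEdges`, the 60% threshold, and the 2048px target come from the description above.

```typescript
// Illustrative sketch only; every name except correctOrientation and
// trimEdges is hypothetical and not taken from the node-zerox source.

const TARGET_HEIGHT = 2048; // maximum page image height (item 1)
const MIN_ORIENTATION_CONFIDENCE = 60; // fixed threshold (item 3), assumed 0-100 scale

// The two options described in the PR, with their stated defaults.
interface AccuracyOptions {
  correctOrientation: boolean;
  trimEdges: boolean;
}

const defaultOptions: AccuracyOptions = {
  correctOrientation: true,
  trimEdges: true,
};

// Assumed shape of an orientation suggestion.
interface OrientationGuess {
  degrees: number; // suggested rotation in degrees
  confidence: number; // assumed to be a percentage, 0-100
}

// Returns the rotation to apply, or 0 when correction is disabled
// or the suggestion falls below the confidence threshold.
function rotationToApply(
  guess: OrientationGuess,
  options: AccuracyOptions = defaultOptions
): number {
  if (!options.correctOrientation) return 0;
  if (guess.confidence < MIN_ORIENTATION_CONFIDENCE) return 0;
  return guess.degrees;
}

// Estimates the height for a second render so that, after trimming,
// the content is close to the target height. Assumes the whitespace
// fraction stays the same when the page is rendered at a larger size.
function estimateRenderHeight(
  renderedHeight: number,
  trimmedHeight: number,
  target: number = TARGET_HEIGHT
): number {
  // A blank page has nothing to scale up; keep the original height.
  if (trimmedHeight <= 0) return renderedHeight;
  return Math.round(renderedHeight * (target / trimmedHeight));
}
```

For example, under these assumptions a page rendered at 2048px whose content is only 1024px tall after trimming would be re-rendered at 4096px, so that the trimmed result lands near 2048px. A page with no surrounding whitespace keeps its original height.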