This PR makes a few changes to `node-zerox` with the goal of improving overall OCR accuracy.
1. The resolution of the image rendered for each PDF page has been increased from `1056` to `2048`.
2. Automatic orientation correction is now configurable. It can be turned off by setting `correctOrientation` to `false`.
3. Low-confidence Tesseract orientation suggestions are now ignored. The threshold is fixed at 60% and is not configurable.
4. Whitespace around the page serves no purpose in OCR, so trimming it away lets more pixels go to actual content. Trimming is on by default and can be switched off by setting `trimEdges` to `false`.
5. After step 4, to make sure the content fills the maximum height of `2048`, we make an educated guess and extract another image of the same page with an increased height, so that after whitespace trimming the image is close to `2048px` tall. This has the most impact on pages with lots of whitespace and small content bunched together on one side.
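
For reviewers, here is a rough sketch of the two pieces of arithmetic described above: the 60% confidence cut-off for orientation suggestions and the educated guess for the re-render height. It is illustrative only and not the code in this PR. The names `OrientationSuggestion`, `rotationToApply` and `estimateRenderHeight` are hypothetical, confidence is assumed to be a percentage, a suggestion at exactly 60% is assumed to be accepted, and content is assumed to scale linearly with the render height.

```typescript
// Hypothetical helpers for illustration; not node-zerox's actual implementation.

interface OrientationSuggestion {
  degrees: number; // suggested rotation, e.g. 0, 90, 180 or 270
  confidence: number; // assumed to be a percentage in [0, 100]
}

const MIN_ORIENTATION_CONFIDENCE = 60; // fixed cut-off from the description
const TARGET_HEIGHT = 2048; // maximum image height from the description

// Returns the rotation to apply, or 0 when correction is disabled
// or the suggestion falls below the confidence cut-off.
function rotationToApply(
  suggestion: OrientationSuggestion,
  correctOrientation: boolean = true
): number {
  if (!correctOrientation) return 0;
  if (suggestion.confidence < MIN_ORIENTATION_CONFIDENCE) return 0;
  // Normalise into [0, 360) so negative suggestions are handled too.
  return ((suggestion.degrees % 360) + 360) % 360;
}

// Given the height of the first render and the height left after trimming
// whitespace, guess a render height whose trimmed result is near targetHeight.
function estimateRenderHeight(
  renderedHeight: number,
  trimmedHeight: number,
  targetHeight: number = TARGET_HEIGHT
): number {
  const contentFraction = trimmedHeight / renderedHeight;
  // Nothing trimmed, or no usable measurement: keep the target height.
  if (!(contentFraction > 0) || contentFraction >= 1) return targetHeight;
  return Math.round(targetHeight / contentFraction);
}

// A page rendered at 2048px whose content is only 1024px tall after trimming
// would be re-rendered at about twice the height.
console.log(estimateRenderHeight(2048, 1024)); // 4096
console.log(rotationToApply({ degrees: 90, confidence: 45 })); // 0
```

A real implementation would probably also cap the re-render height to bound memory use on pages that are almost entirely blank; that cap is left out of this sketch.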