Open clemenshelm opened 7 years ago
There may be a problem in the process of converting the PDF to an image. Maybe the text layer still exists, and because of this tesseract has problems reading it. In the bill FQzFuH8puMb7Tb2DC.pdf (mySugar), the prices inside the table are not detected. I took a screenshot of it and ran tesseract locally (version 3.04.01), and everything worked perfectly. So if we really convert the PDF to an image, it should work.
3EagyvJYF2RJhNTQC.pdf is another example.
It would be good to test whether taking a screenshot and running it through the recognizer produces the same results.
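One way to run that comparison without manual screenshots is to rasterize the PDF and feed the resulting images to tesseract. The following is only a rough sketch: it assumes `pdftoppm` (poppler-utils) and `tesseract` are installed, treats rasterizing as a stand-in for a screenshot, and the helper names (`rasterize_command`, `ocr_command`, `normalize`, `ocr_pdf_via_raster`) are hypothetical, not part of our code base.

```ruby
require "open3"

# Hypothetical helper: command that renders each PDF page to a PNG.
# pdftoppm writes files named <prefix>-<page>.png.
def rasterize_command(pdf_path, prefix, dpi: 300)
  ["pdftoppm", "-r", dpi.to_s, "-png", pdf_path, prefix]
end

# Hypothetical helper: tesseract command that prints the recognized
# text to stdout ("stdout" is used as the output base).
def ocr_command(image_path, lang: "eng")
  ["tesseract", image_path, "stdout", "-l", lang]
end

# Collapse whitespace so two OCR results can be compared more fairly.
def normalize(text)
  text.split.join(" ")
end

# Rasterize the PDF into workdir, then OCR every page image.
def ocr_pdf_via_raster(pdf_path, workdir)
  prefix = File.join(workdir, "page")
  _out, err, status = Open3.capture3(*rasterize_command(pdf_path, prefix))
  raise "pdftoppm failed: #{err}" unless status.success?

  Dir.glob("#{prefix}-*.png").sort.map do |png|
    out, ocr_err, ocr_status = Open3.capture3(*ocr_command(png))
    raise "tesseract failed: #{ocr_err}" unless ocr_status.success?
    out
  end.join("\n")
end
```

Comparing `normalize(ocr_pdf_via_raster(...))` with the normalized output of our current pipeline for the same bill should show whether the conversion step is what loses the prices.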
The bill ErGHMMEzkorsEiLPQ.pdf is an accessible PDF, so it should not be rotated by rmagick. The text_box dimensions are > 1 BEFORE #332. So the command "img.deskew" with the parameter "auto_crop_width" in rmagick seems to solve a problem we had. Maybe it deletes the text layer. So it would be interesting to see what the new version (with #332) does with mySugar bills :)
Many PDF bills contain text, but at the moment we're converting them to images and running OCR on them. It would be much more reliable to extract the text directly from the PDF, though.
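A minimal sketch of what that could look like: try the embedded text first and fall back to OCR only when the text layer is missing or nearly empty. This assumes the `pdftotext` CLI (poppler-utils) is available. The 20-character threshold and the names `text_layer_usable?`, `extract_text_layer` and `read_bill` are made up for illustration, and the `ocr` callable stands in for whatever our existing image pipeline exposes.

```ruby
require "open3"

# Assumed threshold: fewer non-whitespace characters than this is
# treated as "no usable text layer".
MIN_TEXT_CHARS = 20

def text_layer_usable?(text, min_chars: MIN_TEXT_CHARS)
  text.gsub(/\s+/, "").length >= min_chars
end

# Ask pdftotext for the embedded text; "-" sends the output to stdout.
# Returns an empty string if the tool is missing or the call fails.
def extract_text_layer(pdf_path)
  out, _err, status = Open3.capture3("pdftotext", "-layout", pdf_path, "-")
  status.success? ? out : ""
rescue Errno::ENOENT
  ""
end

# Returns [source, text], where source is :text_layer or :ocr.
# `ocr` is any callable that takes the PDF path and returns text.
def read_bill(pdf_path, ocr:, extract: method(:extract_text_layer))
  text = extract.call(pdf_path)
  return [:text_layer, text] if text_layer_usable?(text)

  [:ocr, ocr.call(pdf_path)]
end
```

Returning the source alongside the text would let us log how many bills actually need OCR, which may help decide whether the image path is worth keeping for scanned bills only.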