clemenshelm / chillbill-recognizer


Read PDF bills directly #62

Open clemenshelm opened 7 years ago

clemenshelm commented 7 years ago

Many PDF bills contain text, but at the moment we convert them to images and run OCR on them. Extracting the text directly from the PDF would be much more reliable.
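A rough sketch of what direct extraction could look like, assuming the poppler `pdftotext` command-line tool is installed on the machine. The helper names `extract_pdf_text` and `text_layer_usable?` and the `min_chars` threshold are made up for illustration and are not part of the recognizer:

```ruby
require 'open3'

# Hypothetical check: treat the text layer as usable only if it contains
# at least `min_chars` non-whitespace characters (threshold is an assumption).
def text_layer_usable?(text, min_chars: 20)
  text.scrub.gsub(/\s+/, '').length >= min_chars
end

# Hypothetical helper: returns the embedded text of a PDF, or nil when
# pdftotext is missing, fails, or finds no meaningful text layer.
def extract_pdf_text(pdf_path, min_chars: 20)
  # "pdftotext -layout <file> -" writes the extracted text to stdout.
  stdout, _stderr, status = Open3.capture3('pdftotext', '-layout', pdf_path, '-')
  return nil unless status.success?

  text_layer_usable?(stdout, min_chars: min_chars) ? stdout : nil
rescue Errno::ENOENT
  nil # pdftotext is not installed
end
```

A caller could then use the returned text when it is not nil and fall back to the existing image + OCR path otherwise, so scanned bills without a text layer keep working.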

Thomas1e commented 7 years ago

There may be a problem in the process of converting the PDF to an image. Maybe the text layer still exists, and because of this tesseract has problems reading it. In the bill FQzFuH8puMb7Tb2DC.pdf (mySugar), the prices inside the table are not detected. I took a screenshot of it and ran tesseract locally (version 3.04.01), and everything worked perfectly. So if we really convert the PDF to an image, it should work.

tamacodechi commented 7 years ago

3EagyvJYF2RJhNTQC.pdf is another example.

It would be good to test whether taking a screenshot and running it through the recognizer produces the same results.
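One way to make that comparison less manual is to diff the two OCR outputs token by token. This is only a sketch: `compare_ocr` is a hypothetical helper, and it assumes both outputs are already available as plain strings (for example, one from the current pipeline and one from the screenshot):

```ruby
# Hypothetical helper: reports whitespace-separated tokens that appear in
# only one of two OCR outputs, e.g. prices that one run failed to detect.
def compare_ocr(pipeline_text, screenshot_text)
  pipeline_tokens = pipeline_text.split
  screenshot_tokens = screenshot_text.split
  {
    only_in_pipeline: pipeline_tokens - screenshot_tokens,
    only_in_screenshot: screenshot_tokens - pipeline_tokens
  }
end
```

If `only_in_screenshot` contains the table prices while `only_in_pipeline` is empty, that would support the idea that the PDF-to-image step is what loses them.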

Thomas1e commented 7 years ago

The bill ErGHMMEzkorsEiLPQ.pdf is an accessible PDF, so it should not be rotated by rmagick. The text_box dimensions are > 1 BEFORE #332. So the command "img.deskew" with the parameter "auto_crop_width" in rmagick seems to solve a problem we had. Maybe it deletes the text layer. It would be interesting to see what the new version (with #332) does with mySugar bills :)