VikParuchuri / surya

OCR, layout analysis, reading order, line detection in 90+ languages
https://www.datalab.to
GNU General Public License v3.0
9.77k stars 631 forks

Edge high quality case (skips numbers) #153

Open mmacvicar opened 1 month ago

mmacvicar commented 1 month ago

I am dealing with an edge case where an excellent-quality image is recognized almost perfectly, but surya skips the digit 4 every single time. I tested different bounding boxes, zoom levels, and types of noise, but none made any difference. The detector works fine, but the recognizer reads R$33,00 instead of R$334,00 if I specify only Portuguese. If I specify both Portuguese and English, it comes out correctly.

How would you reason about the influence of adding English when recognizing numbers? Is there anything particular in the training data that would make it reasonable to always add English?

[Images: 1_z0, 1_z0_0_bbox]

surya_ocr 1_z0.png --images --langs pt

{"1_z0": [{"text_lines": [{"polygon": [[0.0, 35.0], [1409.0, 35.0], [1409.0, 80.0], [0.0, 80.0]], "confidence": 0.9638850092887878, "text": " Valor: R$33,00 (TREZENTOS E TRINTA E QUATRO REAIS) ", "bbox": [0.0, 35.0, 1409.0, 80.0]}], "languages": ["pt"], "image_bbox": [0.0, 0.0, 1411.0, 116.0], "page": 1}]}

[Images: 1_z1, 1_z1_0_bbox]

surya_ocr 1_z1.png --images --langs pt

{"1_z1": [{"text_lines": [{"polygon": [[1.0, 22.0], [605.0, 28.0], [597.0, 93.0], [0.0, 87.0]], "confidence": 0.9626762866973877, "text": "Valor: R$33,00", "bbox": [1.0, 22.0, 605.0, 93.0]}], "languages": ["pt"], "image_bbox": [0.0, 0.0, 605.0, 131.0], "page": 1}]}
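
Note that the confidence on both lines is above 0.96, so the score alone does not flag the dropped digit. For the first image the amount is also spelled out in words, which allows a rough cross-check. The sketch below is only an illustration, not part of surya: the helpers `words_to_int` and `check_line`, the word table, and the `R$...,.. (...)` line pattern are all assumptions, and the word table only covers amounts below 1000.

```python
import json
import re

# Recognizer output for 1_z0.png with --langs pt, copied from the report above.
RAW = r'''{"1_z0": [{"text_lines": [{"polygon": [[0.0, 35.0], [1409.0, 35.0], [1409.0, 80.0], [0.0, 80.0]], "confidence": 0.9638850092887878, "text": " Valor: R$33,00 (TREZENTOS E TRINTA E QUATRO REAIS) ", "bbox": [0.0, 35.0, 1409.0, 80.0]}], "languages": ["pt"], "image_bbox": [0.0, 0.0, 1411.0, 116.0], "page": 1}]}'''

# Hypothetical lookup table of Portuguese number words below 1000 (not part of surya).
WORDS = {
    "UM": 1, "DOIS": 2, "TRES": 3, "TRÊS": 3, "QUATRO": 4, "CINCO": 5,
    "SEIS": 6, "SETE": 7, "OITO": 8, "NOVE": 9, "DEZ": 10, "ONZE": 11,
    "DOZE": 12, "TREZE": 13, "QUATORZE": 14, "CATORZE": 14, "QUINZE": 15,
    "DEZESSEIS": 16, "DEZESSETE": 17, "DEZOITO": 18, "DEZENOVE": 19,
    "VINTE": 20, "TRINTA": 30, "QUARENTA": 40, "CINQUENTA": 50,
    "SESSENTA": 60, "SETENTA": 70, "OITENTA": 80, "NOVENTA": 90,
    "CEM": 100, "CENTO": 100, "DUZENTOS": 200, "TREZENTOS": 300,
    "QUATROCENTOS": 400, "QUINHENTOS": 500, "SEISCENTOS": 600,
    "SETECENTOS": 700, "OITOCENTOS": 800, "NOVECENTOS": 900,
}


def words_to_int(phrase):
    """Sum the known number words; return None if an unknown word appears."""
    total = 0
    for word in phrase.upper().split():
        if word in ("E", "REAL", "REAIS"):
            continue  # connective and currency name carry no value
        if word not in WORDS:
            return None
        total += WORDS[word]
    return total


def check_line(text):
    """Return (digit_amount, spelled_amount) for 'R$334,00 (... REAIS)', else None."""
    m = re.search(r"R\$\s*(\d[\d.]*),(\d{2})\s*\(([^)]*)\)", text)
    if m is None:
        return None
    digits = int(m.group(1).replace(".", ""))  # drop thousands separators
    return digits, words_to_int(m.group(3))


result = json.loads(RAW)
for line in result["1_z0"][0]["text_lines"]:
    pair = check_line(line["text"])
    if pair is not None and pair[0] != pair[1]:
        print(f"mismatch: digits={pair[0]} words={pair[1]} "
              f"confidence={line['confidence']:.3f}")
```

On the output above this reports a mismatch between 33 and 334. The second crop (1_z1) has no spelled-out amount, so a check like this cannot help there.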

hvaz commented 1 month ago

I am having the same issue

israelsaba commented 1 month ago

Me too