VikParuchuri / surya

OCR, layout analysis, reading order, line detection in 90+ languages
https://www.datalab.to
GNU General Public License v3.0
9.77k stars, 632 forks

Tamil Language support needs considerable improvement #84

Status: Open (opened by venkatarangan 4 months ago)

venkatarangan commented 4 months ago

Thanks for this release. Here is a screenshot of the output from a test run on a printed Tamil-language page. The result is unusable.

[Screenshot: SURYA-Tamil-OCR]

VikParuchuri commented 4 months ago

How does it look if you don't pick English?

venkatarangan commented 4 months ago

Issue: I tried two sample printed pages from different sources, with Languages set to Tamil only. The second test page was a PDF generated by Microsoft Word. In both cases, the Tamil OCR results were unusable.

Possible bug: The problem seems to be mainly in the preview image output. The Tamil text in the JSON output (see the JSON-File.JPG screenshot below) looks better; I will test its accuracy later.

Issue (my guess): The bug in the preview output may lie in how it handles non-Latin Unicode strings; text shaping is probably not being applied.
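To illustrate why missing text shaping would garble Tamil in a rendered preview while the JSON text stays correct, here is a minimal sketch using only Python's standard library. It does not reproduce surya's preview code; the helper names `describe` and `needs_shaping` are hypothetical and exist only for this illustration. Tamil is stored in logical order, and many vowel signs are combining marks that a shaping engine has to attach to, or reorder around, the base consonant. A renderer that draws one code point at a time from left to right skips that step.

```python
import unicodedata


def describe(text):
    """List each code point with its Unicode name and general category."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))
        for ch in text
    ]


def needs_shaping(text):
    """Return True if the text contains combining marks (Mn or Mc).

    Such marks cannot be drawn as independent glyphs; a shaping engine
    must position or reorder them relative to their base letter.
    """
    return any(unicodedata.category(ch) in ("Mn", "Mc") for ch in text)


# "Tamil" written in Tamil script, stored as five code points.
word = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"

# Consonant KA followed by VOWEL SIGN E. In memory the vowel sign comes
# second, but a correctly shaped rendering draws it to the LEFT of KA.
syllable = "\u0b95\u0bc6"

if __name__ == "__main__":
    for code_point, name, category in describe(word) + describe(syllable):
        print(code_point, category, name)
    print("needs shaping:", needs_shaping(word), needs_shaping(syllable))
```

If the JSON contains code points in this logical order and only the preview image looks wrong, that is consistent with the guess above: the recognized text is intact and the drawing step is what lacks shaping.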

Test Page 1:

Output1

Test Page 2:

Output2

Test Page 2 - JSON Output:

json-file

Thanks