Tamil Language support needs considerable improvement

venkatarangan commented 4 months ago

Thanks for this release. I would like to share a screenshot of the output from a test run on a printed Tamil-language page. The result is unusable. (Screenshot: SURYA-Tamil-OCR)

VikParuchuri commented 4 months ago

How does it look if you don't pick English?

venkatarangan commented 4 months ago

Issue: I tried two sample printed pages from different sources, with Languages set to Tamil only. The second test page was a PDF generated from Microsoft Word. In both cases, the Tamil OCR results were unusable.

Possible Bug: The problem seems to be mainly in the preview image output. The Tamil text in the JSON output (see JSON-File.JPG below) looks better; I will test its accuracy later.

Issue (my guess): The bug in the preview output may lie in how it handles non-Latin Unicode strings; text shaping is probably not being applied.
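To illustrate why missing text shaping would hurt Tamil in particular, here is a minimal sketch using only Python's standard `unicodedata` module. It does not touch surya's preview code; the helper names `combining_marks` and `has_combining_marks` are hypothetical and exist only for this example. It lists the code points in a Tamil word that are combining marks, which a renderer has to position relative to a base consonant rather than draw one after another.

```python
import unicodedata

# Unicode general categories for combining marks:
# Mn = nonspacing mark, Mc = spacing combining mark.
MARK_CATEGORIES = {"Mn", "Mc"}


def combining_marks(text):
    """Return (code point, Unicode name, category) for each combining mark in text."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"), unicodedata.category(ch))
        for ch in text
        if unicodedata.category(ch) in MARK_CATEGORIES
    ]


def has_combining_marks(text):
    """Rough heuristic (an assumption, not a full shaping test):
    text containing combining marks is likely to need a shaping-aware renderer."""
    return any(unicodedata.category(ch) in MARK_CATEGORIES for ch in text)


# The word "Tamil" written in Tamil script, given as escapes so the
# example does not depend on the file's encoding.
sample = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"

for code_point, name, category in combining_marks(sample):
    print(code_point, name, category)

print("Likely needs shaping:", has_combining_marks(sample))
```

If the preview image is drawn glyph by glyph without a shaping engine, marks like these can end up misplaced even when the recognized string in the JSON is correct, which would match the difference between the two outputs described above. Whether that is what actually happens in the preview code is still only a guess.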

Test Page 1:

Test Page 2:

Test Page 2 - JSON Output:

Thanks

VikParuchuri / surya

Tamil Language support needs considerable improvement #84