VikParuchuri / surya

OCR, layout analysis, reading order, table recognition in 90+ languages
https://www.datalab.to
GNU General Public License v3.0

Urdu Text Does Not Get Detected #6

Closed Aeyxen closed 10 months ago

Aeyxen commented 10 months ago

First things first: sincere appreciation for your outstanding work in developing this incredible AI-driven OCR library. It's a fantastic tool that holds immense potential for digital humanities, a subject I am a student of.

I started my testing with some old Urdu historical documents, and unfortunately no bounding boxes (bboxes) were detected for the Urdu text in those documents.

Subsequently, I tested it with an image that contains a mix of Hindi, English, and Urdu text. To my delight, it successfully detected the Hindi and English portions of the text. However, it only recognized one line of the Urdu text, which was less than expected. I have attached the image for your reference so that you can better understand the scenario.

[Attached image: image5-602w291h_0_bbox]

VikParuchuri commented 10 months ago

Try the new code/model - pip install -U surya

VikParuchuri commented 10 months ago

This seems to work

image

and

image

You may need to experiment with the threshold settings to detect more text (see the README).
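To illustrate why a threshold changes how much text gets detected, here is a minimal, generic sketch. It is not surya's implementation and does not use surya's API: the function `find_text_rows`, the parameter `text_threshold`, and the scores are all hypothetical, made up for illustration. It assumes a detector that assigns each image row a confidence score and keeps rows at or above a cutoff.

```python
from typing import List, Tuple


def find_text_rows(row_scores: List[float], text_threshold: float) -> List[Tuple[int, int]]:
    """Group consecutive rows scoring >= text_threshold into (start, end) spans."""
    spans: List[Tuple[int, int]] = []
    start = None
    for i, score in enumerate(row_scores):
        if score >= text_threshold:
            if start is None:
                start = i  # a new text line begins here
        elif start is not None:
            spans.append((start, i - 1))  # the current text line ended on the previous row
            start = None
    if start is not None:
        spans.append((start, len(row_scores) - 1))  # line runs to the last row
    return spans


# Hypothetical per-row confidences: a strong line (rows 1-2) and a faint one (rows 5-6).
row_scores = [0.05, 0.90, 0.85, 0.10, 0.05, 0.45, 0.40, 0.05]

strict = find_text_rows(row_scores, text_threshold=0.6)   # keeps only the strong line
lenient = find_text_rows(row_scores, text_threshold=0.3)  # also keeps the faint line

print(strict)
print(lenient)
```

In this toy setup, lowering the cutoff recovers the faint line that the stricter setting misses, at the cost of being more likely to pick up noise. The actual setting names and defaults for surya are the ones documented in its README.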

Aeyxen commented 8 months ago

Noted with thanks