HassamChundrigar / Urdu-Ocr

Urdu Text Line OCR

Urdu-Ocr doesn't deal with the numbers. #2

Open zeromas opened 4 years ago

zeromas commented 4 years ago

I am trying to OCR images that contain numbers as well. Can you guide me on how to include them?

HassamChundrigar commented 4 years ago

I have included numbers in the training images as well, but they are not sufficient, because Urdu is written right-to-left (RTL) while numbers are written left-to-right (LTR).

UBISOFT-1 commented 3 years ago

Why not build an Urdu number classifier that finds the numbers in test.jpeg along with their locations, and train a separate model just for numbers? Once the numbers are recognised, we put them back at the locations recorded earlier.
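The "put the numbers back" step of this proposal could look roughly like the sketch below. Everything here is hypothetical and not part of Urdu-Ocr: the `Segment` type, the `merge_rtl` helper, and the assumption that both models report the left pixel edge of each region they recognised.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """Hypothetical container: one recognised region of a text line."""
    x_left: int  # left edge of the region in the line image, in pixels
    text: str    # text recognised for that region


def merge_rtl(text_segments: List[Segment], number_segments: List[Segment]) -> str:
    """Join text and number regions in reading order for an RTL line.

    Reading starts at the right edge, so regions are sorted by
    descending x position before being joined with spaces.
    """
    ordered = sorted(
        list(text_segments) + list(number_segments),
        key=lambda s: s.x_left,
        reverse=True,
    )
    return " ".join(s.text for s in ordered)
```

Only the order of the regions is decided by position here; each number string stays exactly as the number model produced it.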

HassamChundrigar commented 3 years ago

@UBISOFT-1 A better approach is to reverse every number that appears in the dataset (in the text only) and then train again.
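A minimal sketch of that label preprocessing, assuming the ground-truth text is held as plain Python strings. The helper name `reverse_digit_runs` and the choice of digit ranges (Western 0-9, Arabic-Indic U+0660 to U+0669, Extended Arabic-Indic U+06F0 to U+06F9) are assumptions made for illustration, not code from the repository.

```python
import re

# One or more consecutive digits from any of the three assumed ranges.
DIGIT_RUN = re.compile(r"[0-9\u0660-\u0669\u06F0-\u06F9]+")


def reverse_digit_runs(label: str) -> str:
    """Reverse each run of consecutive digits; leave all other characters in place."""
    return DIGIT_RUN.sub(lambda m: m.group(0)[::-1], label)
```

Because reversing is its own inverse, the same function could be applied to the model's predictions to restore the usual digit order. Separators are not treated as part of a number in this sketch, so "12.5" is handled as two independent runs.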

UBISOFT-1 commented 3 years ago

@HassamChundrigar Yes, that is indeed a better approach. Why don't you retrain on the dataset that way, and maybe add support for multi-line OCR as well?

HassamChundrigar commented 3 years ago

Thanks for highlighting this. The text data was mainly extracted from magazine stories, so it contains only a few examples of numbers, which is not enough to train on numerals. There are also multiple formats for writing numerals: some use Arabic characters, and some mix Arabic and English characters. Multi-line OCR needs the text lines to be segmented from the document first, so it may become a separate module.
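One way to reduce the number of numeral formats before training would be to map every digit in the labels to a single script. The sketch below assumes that "Arabic" and "English" refer to the Unicode Arabic-Indic digit blocks and the Western digits 0-9. The name `normalize_digits` is hypothetical.

```python
# Map Arabic-Indic (U+0660..U+0669) and Extended Arabic-Indic
# (U+06F0..U+06F9) digits to their Western equivalents.
TO_WESTERN = {
    base + i: str(i)
    for base in (0x0660, 0x06F0)
    for i in range(10)
}


def normalize_digits(text: str) -> str:
    """Return text with all Arabic-script digits replaced by 0-9."""
    return text.translate(TO_WESTERN)
```

This only unifies the digit characters. Numbers spelled out in words are left untouched.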