Training Opensource Tesseract OCR for Tamil

balajijagadesh commented 5 years ago

Tesseract (https://github.com/tesseract-ocr/tesseract) is an open-source OCR engine that Wikisource uses for several languages, such as English, Polish, French, and Bengali.

Tesseract OCR was recently tested for Tamil on Tamil Wikisource. Anyone can try it by adding the following code to their common.js page on Tamil Wikisource.

mw.loader.load( '//wikisource.org/w/index.php?title=User:Putnik/TesseractOCR.js&action=raw&ctype=text/javascript' );

An example of adding this code is shown here: https://ta.wikisource.org/w/index.php?title=%E0%AE%AA%E0%AE%AF%E0%AE%A9%E0%AE%B0%E0%AF%8D:Balajijagadesh/common.js&oldid=1013534

After adding this code, a Tesseract OCR button appears in the edit mode of the Page namespace (பக்கம் பெயர்வெளி) on Tamil Wikisource. An example is shown in the image below.

An initial test showed less-than-satisfactory results from Tesseract when compared with the Google OCR output.

The good news is that Tesseract is open source and can be trained. The training process is documented at https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00

It would be useful to train this OCR with a lot more data.

Moreover, Google OCR does not handle old Tamil letter forms such as றை, னை, றா, ணா etc. well. If we can train Tesseract on these old forms, it would be useful for proofreading old Tamil texts on Tamil Wikisource.

Also, once the OCR is improved, anyone can use it and build applications on top of it.

ravi-annaswamy commented 5 years ago

OCR of an entire book using the Google Cloud Vision text API. It can be used as ground truth for training if needed.

Original book: https://ta.wikisource.org/s/1jz9

OCR results: valliappa_neela_mala-ocred.txt

ravi-annaswamy commented 5 years ago

Shree, if needed I can share some page images of this book and the reviewed page text on my git, for evaluation purposes as ground truth.

Shreeshrii commented 5 years ago

> Shree, if needed I can share some page images of this book and the reviewed page text on my git, for evaluation purposes as ground truth.

That would be great.
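
For anyone who wants to put a number on such an evaluation, here is a minimal sketch of a character error rate (CER) calculation between a reviewed ground-truth page and an OCR output. The function names `editDistance` and `characterErrorRate` are hypothetical, not part of Tesseract or any Wikisource gadget. The sketch counts edits at the Unicode code point level, so a Tamil letter such as றை (two code points) can contribute up to two edits.

```javascript
// Levenshtein edit distance over Unicode code points.
// Assumption: code-point granularity is good enough for a rough comparison;
// a stricter evaluation might compare grapheme clusters instead.
function editDistance(a, b) {
  const s = Array.from(a);
  const t = Array.from(b);
  // prev[j] = distance between the first i-1 symbols of s and the first j of t
  let prev = Array.from({ length: t.length + 1 }, (_, j) => j);
  for (let i = 1; i <= s.length; i++) {
    const curr = [i];
    for (let j = 1; j <= t.length; j++) {
      const cost = s[i - 1] === t[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // deletion
        curr[j - 1] + 1,    // insertion
        prev[j - 1] + cost  // substitution or match
      );
    }
    prev = curr;
  }
  return prev[t.length];
}

// Character error rate: edits needed to turn the OCR output into the
// ground truth, divided by the ground-truth length (in code points).
function characterErrorRate(groundTruth, ocrOutput) {
  // Normalize both sides so equivalent Unicode sequences compare equal.
  const ref = groundTruth.normalize("NFC");
  const hyp = ocrOutput.normalize("NFC");
  const refLength = Array.from(ref).length;
  if (refLength === 0) {
    return hyp.length === 0 ? 0 : 1;
  }
  return editDistance(ref, hyp) / refLength;
}
```

Running `characterErrorRate(reviewedText, ocrText)` once per engine on the same page would give a figure for comparing Tesseract, the wiki gadget, and the Cloud Vision output, with lower being better.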

Shreeshrii commented 5 years ago

Can someone try Google OCR on that?

If you give the exact URL/page number then I can try it.

If you have a Wikipedia user account, you can add Google OCR to your user common.js.

Replace XXX with your username in the link below:

https://ta.wikisource.org/w/index.php?title=%E0%AE%AA%E0%AE%AF%E0%AE%A9%E0%AE%B0%E0%AF%8D:XXX/common.js

and add the following:


//Google OCR
mw.loader.load('//wikisource.org/w/index.php?title=MediaWiki:GoogleOCR.js&action=raw&ctype=text/javascript');

Shreeshrii commented 5 years ago

I OCRed this page using the Google Cloud text API and found no errors. So my guess is that if we switch to Google OCR, we should be good even now. I think Google OCR has improved since this page was last OCRed.

Ravi, I saw the page number (16) on the earlier post and tried Google OCR just now. For comparison, I also posted your OCR results. There are differences between the two. Please see:

https://ta.wikisource.org/w/index.php?title=%E0%AE%AA%E0%AE%95%E0%AF%8D%E0%AE%95%E0%AE%AE%E0%AF%8D%3A%E0%AE%A8%E0%AF%80%E0%AE%B2%E0%AE%BE_%E0%AE%AE%E0%AE%BE%E0%AE%B2%E0%AE%BE.pdf%2F16&type=revision&diff=1037068&oldid=1037067

ravi-annaswamy commented 5 years ago

The red is better and flawless.

Is that my Google Cloud result, or is it from GoogleOCR.js?

Shreeshrii commented 5 years ago

The Google Cloud text is the red one.

So my assumption that GoogleOCR.js was the same as the Google Vision API is incorrect, or maybe they use different versions.

One question: did you preprocess the images in any way before using Google Cloud?

I looked at the image on Wikisource. It is 150 dpi and the text is grainy (not solid black), though it looks much clearer in the PDF viewer.

ravi-annaswamy commented 5 years ago

Thanks Shree.

It is very likely that Google Cloud OCR is better than the Google OCR available on Wikimedia. I do not know whether the Google Docs OCR is the same as the Cloud version or the Wikimedia version. I would guess the former.

I did not do any preprocessing. In fact, I have tried it even with bad scans, and the Cloud API results are pretty good. I rarely see errors.

I upload a PDF and have code that calls their API in asynchronous mode. That is, I issue the command to do PDF OCR; once it is done, I retrieve the JSON results, and I have written some code to extract the text portions.

The JSON result has letter-by-letter location, recognition, and confidence, as well as bounding boxes at the letter, word, paragraph, and block levels.
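
As an illustration of that extraction step (not Ravi's actual code), here is a minimal sketch that walks the nested structure and rebuilds the text. It assumes the `fullTextAnnotation` layout returned by Cloud Vision document text detection (pages, then blocks, paragraphs, words, and symbols, with an optional `detectedBreak` after a symbol) and that each asynchronous output file holds a `responses` array. The names `extractText`, `extractFromOutputFile`, and the `minConfidence` parameter are hypothetical.

```javascript
// Separator implied by each break type after a symbol.
// Assumption: other break types (for example HYPHEN) are ignored in this sketch.
const BREAK_TEXT = {
  SPACE: " ",
  SURE_SPACE: " ",
  EOL_SURE_SPACE: "\n",
  LINE_BREAK: "\n",
};

// Rebuild plain text from one fullTextAnnotation object and collect
// words whose reported confidence falls below minConfidence.
function extractText(fullTextAnnotation, minConfidence = 0) {
  const lowConfidenceWords = [];
  let text = "";
  for (const page of fullTextAnnotation.pages || []) {
    for (const block of page.blocks || []) {
      for (const paragraph of block.paragraphs || []) {
        for (const word of paragraph.words || []) {
          let wordText = "";
          let separator = "";
          for (const symbol of word.symbols || []) {
            wordText += symbol.text;
            const detectedBreak =
              symbol.property && symbol.property.detectedBreak;
            if (detectedBreak) {
              separator = BREAK_TEXT[detectedBreak.type] || "";
            }
          }
          text += wordText + separator;
          if (
            typeof word.confidence === "number" &&
            word.confidence < minConfidence
          ) {
            lowConfidenceWords.push({
              text: wordText,
              confidence: word.confidence,
            });
          }
        }
      }
    }
  }
  return { text, lowConfidenceWords };
}

// Assumption: a parsed output file from asynchronous PDF OCR looks like
// { responses: [ { fullTextAnnotation: {...} }, ... ] }, one entry per page.
function extractFromOutputFile(outputJson, minConfidence = 0) {
  return (outputJson.responses || [])
    .filter((response) => response.fullTextAnnotation)
    .map((response) => extractText(response.fullTextAnnotation, minConfidence));
}
```

The annotation also carries a top-level `text` field with the full page text, so the walk above is mainly useful when the per-word confidence is needed, for example to flag doubtful words for proofreaders.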

You can test the result returned by the Cloud API by dragging and dropping a JPG onto this page:

https://cloud.google.com/vision/

There is a "Try the API" box that you can drag the image into.

Thanks Ravi

Shreeshrii commented 5 years ago

The Google OCR tool adds a Page-namespace toolbar button that will derive text from the current page's image, via Google's Cloud Vision API OCR service.

ref:

@tshrinivasan @balajijagadesh Ravi is getting better OCR results from the Vision API than from the wiki gadget. Please check whether the API version or any options need to be changed or updated for the gadget.

G-cell-coder commented 2 years ago

It was very easy to extract the text using the following service offered by GCP: https://cloud.google.com/vision/. One challenge is that the input has to be in an image file format. I believe this can be further enhanced to read text input from an IP camera, as seen in one implementation, "[Real-time-OCR-Text-To-Speech-with-Tesseract]", where text from an IP camera is extracted and converted to voice. This has wide scope for the development of many applications.

KaniyamFoundation / ProjectIdeas

Training Opensource Tesseract OCR for Tamil #71