Create python script to run OCR using tesseract

KaniyamFoundation / ProjectIdeas

A Place to write down the project ideas and to plan them

Create python script to run OCR using tesseract #94

Closed: Ananthsubray closed this issue 4 years ago

Ananthsubray commented 5 years ago

Tesseract OCR currently gives good output for Indian languages. With the help of others, we were able to develop a JS script that OCRs a single page on Wiki. It would be good to have a Python script that runs OCR with Tesseract on pages in bulk, similar to OCR4Wikisource, which uses Google OCR.
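One possible shape for such a bulk script, as a minimal sketch only: it assumes the `tesseract` command-line tool is installed and on the PATH, that the pages are already available as image files in one directory, and that the Tamil language pack (`tam`) is wanted. The function names, the `*.png` pattern, and the directory names are hypothetical and not taken from any existing script.

```python
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base, lang="tam"):
    """Return the argument list for one tesseract CLI call.

    tesseract writes its result to <output_base>.txt.
    The default language "tam" is an assumption; pass any installed pack.
    """
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def ocr_directory(input_dir, output_dir, lang="tam", pattern="*.png"):
    """Run tesseract on every matching image in input_dir.

    Returns the list of text files that tesseract was asked to write.
    """
    input_dir = Path(input_dir)
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    written = []
    for image_path in sorted(input_dir.glob(pattern)):
        output_base = output_dir / image_path.stem
        # check=True raises CalledProcessError if tesseract fails on a page.
        subprocess.run(
            build_tesseract_command(image_path, output_base, lang),
            check=True,
        )
        written.append(Path(str(output_base) + ".txt"))
    return written


# Hypothetical usage: OCR every PNG under ./pages into ./ocr_output
# for text_file in ocr_directory("pages", "ocr_output"):
#     print(text_file)
```

Calling the CLI through `subprocess` keeps the script free of extra Python dependencies; a wrapper library would be an equally reasonable choice.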

arunlouie commented 4 years ago

We should split the implementation into specific modules. I see many features that could be reused by other projects as well. My suggestion is to have modules for:

Loading different file types
Reading the words from the documents
Curation: making some automatic corrections
Maintaining a dictionary for curation
etc.
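The module split suggested above could look roughly like the following. This is a sketch under assumptions: the function names, the set of supported file extensions, and the tab-separated "wrong, then correct" dictionary format are all hypothetical, chosen only to illustrate the idea of separate loading, reading, and curation steps.

```python
from pathlib import Path

# Assumed set of image types; extend as needed.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def classify_input(path):
    """Loader module: decide how a file should be handled by its extension."""
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        return "pdf"
    if suffix in IMAGE_SUFFIXES:
        return "image"
    raise ValueError(f"unsupported file type: {suffix!r}")


def split_words(text):
    """Reader module: break OCR output into whitespace-separated words."""
    return text.split()


def load_corrections(lines):
    """Dictionary module: parse 'wrong<TAB>correct' lines into a mapping.

    Blank lines and lines starting with '#' are ignored.
    """
    corrections = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        wrong, correct = line.split("\t", 1)
        corrections[wrong] = correct
    return corrections


def curate(words, corrections):
    """Curation module: replace known OCR mistakes, keep other words as-is."""
    return [corrections.get(word, word) for word in words]
```

Keeping the dictionary as plain data means contributors can extend the corrections without touching the code.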

balajijagadesh commented 4 years ago

Wonderful suggestions

rajeshkumargp commented 4 years ago

Can we make use of these existing repos? 1) https://github.com/madmaze/pytesseract (updated frequently) 2) https://github.com/ratazzi/tesseract-ocr

@Ananthsubray and @tshrinivasan, can you please elaborate on the requirement?
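For reference, using pytesseract for this could look something like the sketch below. It assumes the `pytesseract` and `Pillow` packages plus the Tesseract binary are installed, and that the Tamil pack (`tam`) is the one wanted. `ocr_image` and `ocr_pages` are hypothetical helper names; only `pytesseract.image_to_string` and `PIL.Image.open` are library calls.

```python
from pathlib import Path


def ocr_image(image_path, lang="tam"):
    """OCR one image with pytesseract and return the recognised text."""
    # Imported here so the module still loads where the packages are missing.
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as image:
        return pytesseract.image_to_string(image, lang=lang)


def ocr_pages(image_paths, lang="tam", ocr=ocr_image):
    """OCR several pages and return a mapping of file name to text.

    The ocr argument allows a different backend to be swapped in.
    """
    return {Path(path).name: ocr(path, lang) for path in image_paths}


# Hypothetical usage:
# texts = ocr_pages(["pages/page_001.png", "pages/page_002.png"])
```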

tshrinivasan commented 4 years ago

Here is a Linux version to OCR a given PDF file: https://gist.github.com/tshrinivasan/0aaf78e5808ee29490928614882cded0

Here is a Windows GUI version: https://github.com/Parathantl/tesseract_gui/releases

Demo video in Tamil: https://www.youtube.com/watch?v=363DGNL-rUw

Detailed notes are here https://goinggnu.wordpress.com/2020/05/23/tesseract-ocr-gui-for-windows/

Thanks to @Parathantl for the Windows version.