Closed by Ananthsubray 4 years ago
We should split the implementation into specific modules. I see that many of the features could be reused in other projects as well. My suggestion is to have separate modules.
Wonderful suggestions
On Sun, 8 Mar, 2020, 2:45 PM arunlouie, notifications@github.com wrote:
We should split the implementation into specific modules. I see that many of the features could be reused in other projects as well. My suggestion is to have modules for:
- Loading different file types
- Reading the words from the documents
- Curation: applying some automatic corrections
- Maintaining a dictionary for curation
- etc.
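For illustration, the module split suggested above could be sketched roughly as follows. This is only a sketch under assumptions, not the project's actual design: the names `LOADERS`, `register_loader`, `load_document`, `extract_words` and `curate` are hypothetical, only plain-text files are handled, and words are split on whitespace (rather than with a regex such as `\w+`) so that combining vowel signs in Indian scripts stay attached to their words.

```python
import string
from pathlib import Path
from typing import Callable, Dict, List

# Hypothetical registry: file extension -> function that returns the file's text.
LOADERS: Dict[str, Callable[[Path], str]] = {}


def register_loader(extension: str):
    """Decorator that registers a loader for one file extension."""
    def decorator(func: Callable[[Path], str]) -> Callable[[Path], str]:
        LOADERS[extension.lower()] = func
        return func
    return decorator


@register_loader(".txt")
def load_text(path: Path) -> str:
    return path.read_text(encoding="utf-8")


def load_document(path) -> str:
    """Module 1: load a document using the loader registered for its type."""
    path = Path(path)
    loader = LOADERS.get(path.suffix.lower())
    if loader is None:
        raise ValueError(f"no loader registered for {path.suffix!r}")
    return loader(path)


def extract_words(text: str) -> List[str]:
    """Module 2: split on whitespace and trim ASCII punctuation."""
    words = []
    for token in text.split():
        word = token.strip(string.punctuation)
        if word:
            words.append(word)
    return words


def curate(words: List[str], corrections: Dict[str, str]) -> List[str]:
    """Modules 3 and 4: replace words found in the curation dictionary."""
    return [corrections.get(word, word) for word in words]
```

Other file types (PDF, images via OCR, and so on) would each add one more function decorated with `register_loader`, and the curation dictionary could be kept in a shared file so that other projects can reuse it.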
Can we make use of these existing repos? 1) https://github.com/madmaze/pytesseract (receives frequent updates) 2) https://github.com/ratazzi/tesseract-ocr
@Ananthsubray and @tshrinivasan, can you please elaborate on the requirement?
Here is a Linux version to OCR a given PDF file: https://gist.github.com/tshrinivasan/0aaf78e5808ee29490928614882cded0
Here is a Windows GUI version: https://github.com/Parathantl/tesseract_gui/releases
Demo video in Tamil: https://www.youtube.com/watch?v=363DGNL-rUw
Detailed notes are here: https://goinggnu.wordpress.com/2020/05/23/tesseract-ocr-gui-for-windows/
Thanks to @Parathantl for the Windows version.
Tesseract OCR currently gives good output for Indian languages. With the help of others, we were able to develop a JS script to OCR a single page on Wiki. It would be good to have a Python script that runs OCR with Tesseract on pages in bulk, similar to OCR4Wikisource, which uses Google OCR.
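As a starting point for the bulk script, here is a minimal sketch. It assumes the pages are already available as image files in a local folder, that the Tesseract binary and the Tamil language data (`tam`) are installed, and that the `pytesseract` wrapper mentioned earlier in this thread is used. The function names `tesseract_ocr` and `ocr_pages` and the folder layout are hypothetical, and uploading the text to Wikisource is not covered.

```python
from pathlib import Path
from typing import Callable, List


def tesseract_ocr(image_path: Path, lang: str = "tam") -> str:
    """OCR one image with Tesseract through pytesseract."""
    # Imported here so the rest of the script can be used (and tested)
    # on machines where pytesseract is not installed.
    import pytesseract
    return pytesseract.image_to_string(str(image_path), lang=lang)


def ocr_pages(
    image_dir,
    out_dir,
    ocr: Callable[[Path], str] = tesseract_ocr,
    pattern: str = "*.png",
) -> List[Path]:
    """OCR every matching image in image_dir, writing one .txt file per page."""
    image_dir = Path(image_dir)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    written = []
    # Sort so that pages are processed in a predictable order.
    for image_path in sorted(image_dir.glob(pattern)):
        text = ocr(image_path)
        out_path = out_dir / (image_path.stem + ".txt")
        out_path.write_text(text, encoding="utf-8")
        written.append(out_path)
    return written
```

Calling `ocr_pages("pages", "text")` would then write `text/<page name>.txt` for every PNG in `pages`. The OCR function is passed in as a parameter so that a different engine can be swapped in without changing the loop.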