Anish-M-code / pdftotext

A simple pdftotext conversion tool for Windows 8.1/10/11 and Fedora/Ubuntu/Debian/Arch-based Linux distros, using poppler-utils and Google's tesseract-ocr.
MIT License
13 stars, 3 forks

Add support for non-English languages. #3

Open Anish-M-code opened 2 years ago

Anish-M-code commented 2 years ago

Currently, pdftotext supports only English. Potential contributors may try to add non-English languages, simplify installation and uninstallation of additional language packs, and add code to support these features on both Linux and Windows.
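
As a starting point for contributors, here is a minimal sketch of how language selection might be wired in. It relies on two things the tesseract CLI does provide: the `-l` option (several languages can be joined with `+`) and `--list-langs`. The function names (`build_ocr_command`, `parse_list_langs`, `installed_languages`, `missing_languages`) are hypothetical and not part of this repository, and the header format of the `--list-langs` output is an assumption.

```python
import shutil
import subprocess


def build_ocr_command(image_path, output_base, languages):
    """Build a tesseract command line for one or more language codes."""
    if not languages:
        raise ValueError("at least one language code is required")
    # Tesseract accepts several languages joined with '+', e.g. "eng+hin".
    return ["tesseract", image_path, output_base, "-l", "+".join(languages)]


def parse_list_langs(output):
    """Extract language codes from `tesseract --list-langs` output."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    # Assumption: the listing starts with a header line beginning with
    # "List of available languages"; every other line is a language code.
    return [line for line in lines if not line.lower().startswith("list of")]


def installed_languages():
    """Return installed language codes, or [] if tesseract is not on PATH."""
    if shutil.which("tesseract") is None:
        return []
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
        check=False,
    )
    # Read both streams in case the listing is not written to stdout.
    return parse_list_langs(result.stdout + "\n" + result.stderr)


def missing_languages(requested, installed):
    """Return requested codes that have no installed language pack."""
    return [code for code in requested if code not in installed]
```

A caller could check `missing_languages(["eng", "hin"], installed_languages())` before running OCR and, if the result is non-empty, tell the user which language packs to install. How the packs get installed would differ between Windows and the supported Linux distros, so that part is left out here.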

pravincoder commented 2 years ago

@Anish-M-code I can do it, but I think I might need your help!

Anish-M-code commented 2 years ago

Sure @pravincoder, feel free to contribute and open a pull request. I can provide guidance if you need any.

chirag4862 commented 2 weeks ago

Hi! The package currently uses Tesseract for OCR. For multi-language support, could I introduce an ONNX model as an alternative to Tesseract? With the models bundled in the package, it would work on both Windows and Linux without requiring Tesseract to be installed on the system.

Anish-M-code commented 2 weeks ago

Sure @chirag4862, we can try other models as well.