LeoFCardoso / pdf2pdfocr

A free tool to OCR a PDF and add a text "layer" to the original file, making a searchable PDF. Uses only open source tools. Please tip!
Apache License 2.0
266 stars, 33 forks

pdf2pdfocr changing languages #36

Cragsand closed 1 year ago

Cragsand commented 1 year ago

This wasn't included in the README, so here is some info for anyone else who is lost. You can change which language model gets downloaded by editing this command:
aria2c "https://github.com/tesseract-ocr/tessdata/blob/main/por.traineddata?raw=true" --dir="%TESSDATA_PREFIX%"
Change the language prefix ("por") to the language you want, as long as it's available in the tesseract tessdata repo. For example, Swedish is "swe".

Further info here: https://github.com/tesseract-ocr/tesseract/blob/main/doc/tesseract.1.asc#LANGUAGES
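For anyone who would rather script the download than edit the aria2c line, here is a minimal Python sketch of the same idea. It only swaps the language prefix in the URL quoted above and saves the file into the folder named by TESSDATA_PREFIX. The function names `tessdata_url` and `download_traineddata` are made up for this example and are not part of pdf2pdfocr.

```python
import os
import urllib.request
from pathlib import Path
from typing import Optional

# URL pattern taken from the aria2c command above, with the language code as a placeholder.
TESSDATA_RAW = "https://github.com/tesseract-ocr/tessdata/blob/main/{lang}.traineddata?raw=true"


def tessdata_url(lang: str) -> str:
    """Return the download URL for one Tesseract language code, e.g. "swe"."""
    return TESSDATA_RAW.format(lang=lang)


def download_traineddata(lang: str, dest_dir: Optional[str] = None) -> Path:
    """Download <lang>.traineddata into dest_dir (default: the TESSDATA_PREFIX folder)."""
    dest = Path(dest_dir or os.environ["TESSDATA_PREFIX"])
    dest.mkdir(parents=True, exist_ok=True)
    target = dest / f"{lang}.traineddata"
    urllib.request.urlretrieve(tessdata_url(lang), str(target))
    return target


# Example (performs a real network download, so it is left commented out):
# download_traineddata("swe")
```

This assumes TESSDATA_PREFIX is already set in the environment, as it is in the original command.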

To change the default language, edit pdf2pdfocr.py on line 548 and replace Portuguese + English ("por+eng") with whichever languages you want. I use Swedish + English ("swe+eng"), so I changed
self.tess_langs = "por+eng" # Default
to
self.tess_langs = "swe+eng" # Default
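Before changing the default, it may help to check that every language in the "+"-separated string actually has a traineddata file installed. The following is a rough sketch of such a check, assuming the files are named `<code>.traineddata` and sit directly in the tessdata folder. `missing_languages` is a hypothetical helper, not something pdf2pdfocr provides.

```python
from pathlib import Path
from typing import List


def missing_languages(spec: str, tessdata_dir: str) -> List[str]:
    """Return the codes in spec (e.g. "swe+eng") that have no .traineddata file."""
    # Tesseract language strings join codes with "+", so split on that.
    codes = [code for code in spec.split("+") if code]
    folder = Path(tessdata_dir)
    return [code for code in codes if not (folder / f"{code}.traineddata").is_file()]
```

For example, `missing_languages("swe+eng", os.environ["TESSDATA_PREFIX"])` would list whichever of the two models still needs to be downloaded.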


LeoFCardoso commented 1 year ago

Thank you. "por+eng" is the default because I'm from Brazil. You can also use the "-l" flag to set the language for each execution.
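As a sketch of the per-run approach, the helper below builds a command line that passes the language through "-l" instead of editing the script. Only the "-l" flag is confirmed in this thread. The "-i" input option and the `build_command` name are assumptions for illustration, so check the script's help output before relying on them.

```python
import sys
from typing import List, Optional


def build_command(input_pdf: str, langs: Optional[str] = None,
                  script: str = "pdf2pdfocr.py") -> List[str]:
    """Build the argument list for one pdf2pdfocr run."""
    # "-i" for the input file is an assumption, not stated in this thread.
    cmd = [sys.executable, script, "-i", input_pdf]
    if langs:
        # "-l" sets the Tesseract languages for this run only, e.g. "swe+eng".
        cmd += ["-l", langs]
    return cmd
```

The resulting list can be handed to `subprocess.run(build_command("scan.pdf", "swe+eng"), check=True)`, which leaves the "por+eng" default in the script untouched.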