LeoFCardoso / pdf2pdfocr

A free tool to OCR a PDF and add a text "layer" to the original file, making it a searchable PDF. Uses only open source tools. Please tip!
Apache License 2.0

Error Message by OCR via GUI #27

Closed dempfma closed 2 years ago

dempfma commented 2 years ago

Dear Leonardo, I get an error when I try to OCR a PDF file. Maybe you can help me?
I use Windows 10 21H1 in VirtualBox, with 4 cores and 16 GB of memory for this VM. The message is:

[2022-02-19 18:44:06.020876] [DEBUG] Tesseract can 'textonly_pdf': True
[2022-02-19 18:44:06.050413] [DEBUG] Tesseract version: 5
[2022-02-19 18:44:06.050413] [DEBUG] cuneiform not available
[2022-02-19 18:44:06.282093] [DEBUG] Pdftoppm version: 22.01.0
[2022-02-19 18:44:06.391073] [DEBUG] Qpdf version: 10.6.2
[2022-02-19 18:44:06.391073] [DEBUG] Temp dir is C:\Users\Martin\AppData\Local\Temp\pdf2pdfocr_ONWZ5\
[2022-02-19 18:44:06.391073] [DEBUG] Prefix is ONWZ5
[2022-02-19 18:44:06.391073] [DEBUG] Script dir is C:\Users\Martin\pdf2pdfocr-venv\Scripts\
[2022-02-19 18:44:06.391073] [DEBUG] Parallel operations will use 4 CPUs
[2022-02-19 18:44:06.507230] [LOG] Welcome to pdf2pdfocr version 1.9.1 marapurense - https://github.com/LeoFCardoso/pdf2pdfocr
[2022-02-19 18:44:06.641453] [LOG] Input file C:\Users\Martin\Desktop\471685214.pdf: type is application/pdf
[2022-02-19 18:44:06.641453] [DEBUG] Conversion params:
[2022-02-19 18:44:06.641453] [DEBUG] Output file: C:\Users\Martin\Desktop\471685214-OCR.pdf for PDF and C:\Users\Martin\Desktop\471685214-OCR.pdf.txt for TXT
[2022-02-19 18:44:06.641453] [LOG] Converting input file to images...
[2022-02-19 18:44:06.903005] [LOG] Starting OCR with tesseract...
[2022-02-19 18:44:07.365611] [LOG] OCR completed
[2022-02-19 18:44:07.365611] [DEBUG] We have 0 ocr'ed files
No PDF files generated after OCR. This is not expected. Aborting.

Do I need cuneiform? I read your install_windows.txt file and understood it to be optional, but maybe I'm wrong.
It's an interesting tool and would be a perfect fit for creating a database of my private papers. Thanks a lot, Martin

LeoFCardoso commented 2 years ago

Hello, thank you for the issue. Cuneiform really is optional, and it is not causing this issue. I reviewed the install_windows.txt file, and the tesseract languages download is broken due to a URL change. I will upload a fixed install_windows.txt file.

Can you please try this in CMD?

scoop install --arch 64bit tesseract-languages

And try again?

Please note that "por" (Portuguese) is the default language in the GUI. You can change this option in the "Advanced options" tab. Even without installing languages with the above command, the "eng" option should work.

LeoFCardoso commented 2 years ago

"tesseract-languages" looks buggy in scoop. Please try a manual download of your preferred languages, for example for Portuguese and Spanish:

aria2c "https://github.com/tesseract-ocr/tessdata/blob/main/por.traineddata?raw=true" --dir="%TESSDATA_PREFIX%"
aria2c "https://github.com/tesseract-ocr/tessdata/blob/main/spa.traineddata?raw=true" --dir="%TESSDATA_PREFIX%"
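
For anyone without aria2c, the same download can be sketched in Python. This is only an illustration, not part of pdf2pdfocr: the function names `traineddata_url` and `download_language` are hypothetical, the URL pattern is copied from the commands above, and the code assumes that TESSDATA_PREFIX is set and that network access is available. Writing to an explicit file name replaces an existing file instead of creating a renamed copy.

```python
import os
import urllib.request
from pathlib import Path

# URL pattern copied from the aria2c commands in this thread.
TESSDATA_URL = (
    "https://github.com/tesseract-ocr/tessdata/blob/main/"
    "{lang}.traineddata?raw=true"
)


def traineddata_url(lang: str) -> str:
    """Build the download URL for one language code (hypothetical helper)."""
    return TESSDATA_URL.format(lang=lang)


def download_language(lang: str, dest_dir=None, overwrite: bool = True) -> Path:
    """Download <lang>.traineddata into dest_dir (default: TESSDATA_PREFIX).

    With overwrite=False an existing file is left untouched and no
    network request is made.
    """
    if dest_dir is None:
        # Assumption: TESSDATA_PREFIX points at the tessdata directory.
        dest_dir = os.environ["TESSDATA_PREFIX"]
    target = Path(dest_dir) / f"{lang}.traineddata"
    if target.exists() and not overwrite:
        return target
    # Saving under an explicit name overwrites any previous copy.
    urllib.request.urlretrieve(traineddata_url(lang), str(target))
    return target
```

Calling `download_language("por")` and `download_language("spa")` would mirror the two commands above.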

dempfma commented 2 years ago

Dear Leonardo, thanks a lot for the fast support. It now works fine with the English interface. Now I have to find out how to get successful search results, but that's another topic. I tried to install the German language again, but it says it is already installed and renames the file to deu1. If I change the language to deu or deu1 in the advanced settings, the conversion ends with an error. Is it necessary to use the correct language for the OCR function? If not, then there is nothing to fix for me :-) Thanks, Martin

LeoFCardoso commented 2 years ago

You're welcome!

Please try to list installed languages with: tesseract --list-langs

The trained data language files are stored in the directory that the TESSDATA_PREFIX environment variable points to.

You can list all language files with: dir %TESSDATA_PREFIX%

Tesseract works better with the correct language definition, and you can use '+' to combine multiple languages (for example, eng+deu for English and German).
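
A small Python sketch of the same check, for illustration only: it lists the `*.traineddata` files in a tessdata directory and reports which parts of a '+' language string have no matching file. The helper names `installed_languages` and `missing_languages` are hypothetical, and the sketch assumes one `<lang>.traineddata` file per language in the directory named by TESSDATA_PREFIX. The output of tesseract --list-langs remains the authoritative answer.

```python
import os
from pathlib import Path


def installed_languages(tessdata_dir=None):
    """Return the language codes that have a <lang>.traineddata file."""
    if tessdata_dir is None:
        # Assumption: TESSDATA_PREFIX points at the tessdata directory.
        tessdata_dir = os.environ["TESSDATA_PREFIX"]
    # Path.stem drops only the final ".traineddata" suffix.
    return sorted(p.stem for p in Path(tessdata_dir).glob("*.traineddata"))


def missing_languages(spec, tessdata_dir=None):
    """Return the parts of a spec such as 'eng+deu' that are not installed."""
    available = set(installed_languages(tessdata_dir))
    return [lang for lang in spec.split("+") if lang and lang not in available]
```

For example, `missing_languages("eng+deu")` returns an empty list when both files are present, and the names of the missing languages otherwise.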

Please try deleting the "deu" and "deu.1" traineddata files and downloading the language again. It should work.
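
To spot such renamed copies before deleting them, something like the following Python sketch could help. It is an assumption-laden illustration: the thread does not show the exact name of the renamed file, so the hypothetical helper `stray_copies` guesses at a few plausible patterns (a number before or after the `.traineddata` suffix) and only reports matches, leaving deletion to the user.

```python
import re
from pathlib import Path


def stray_copies(lang, tessdata_dir):
    """List files that look like numbered duplicates of <lang>.traineddata.

    Guessed patterns (not confirmed by the thread):
      deu.1.traineddata, deu1.traineddata, deu.traineddata.1
    The regular file deu.traineddata itself is never reported.
    """
    pattern = re.compile(
        rf"^{re.escape(lang)}(\.?\d+\.traineddata|\.traineddata\.\d+)$"
    )
    return sorted(
        p.name
        for p in Path(tessdata_dir).iterdir()
        if p.is_file() and pattern.match(p.name)
    )
```

Any file names it returns for "deu" are candidates for removal, together with deu.traineddata itself, before downloading the language again.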

If not, please send me a public PDF file in German for testing.

Thank you!

LeoFCardoso commented 2 years ago

Marking as closed with new URLs for language download.