LeoFCardoso / pdf2pdfocr

A free tool to OCR a PDF and add a text "layer" to the original file, making a searchable PDF. Uses only open source tools. Please tip!
Apache License 2.0

Zero OCR'ed files #38

Closed PatrikHlebecStor closed 1 year ago

PatrikHlebecStor commented 1 year ago

File: D:\Google_drive_sola\Sola\2022-2023\ROP - Reologija polimerov\RLP - Reologija polimerov.pdf
[2023-01-14 19:20:35.717707] [DEBUG] Tesseract can 'textonly_pdf': True
[2023-01-14 19:20:35.733704] [DEBUG] Tesseract version: 5
[2023-01-14 19:20:35.736704] [DEBUG] cuneiform not available
[2023-01-14 19:20:35.781705] [DEBUG] Pdftoppm version: 22.12.0
[2023-01-14 19:20:35.811712] [DEBUG] Qpdf version: 11.2.0
[2023-01-14 19:20:35.811712] [DEBUG] Temp dir is C:\Users\ADMINI~1\AppData\Local\Temp\pdf2pdfocr_L3VRF\
[2023-01-14 19:20:35.811712] [DEBUG] Prefix is L3VRF
[2023-01-14 19:20:35.811712] [DEBUG] Script dir is c:\Users\Administrator\anaconda3\Scripts\
[2023-01-14 19:20:35.812712] [DEBUG] Parallel operations will use 20 CPUs
[2023-01-14 19:20:35.861715] [LOG] Welcome to pdf2pdfocr version 1.12.0 marapurense - https://github.com/LeoFCardoso/pdf2pdfocr
[2023-01-14 19:20:35.903716] [LOG] Input file D:\Google_drive_sola\Sola\2022-2023\ROP - Reologija polimerov\RLP - Reologija polimerov.pdf: type is application/pdf
[2023-01-14 19:20:35.918716] [DEBUG] User conversion params: best
[2023-01-14 19:20:35.918716] [DEBUG] Output file: D:\Google_drive_sola\Sola\2022-2023\ROP - Reologija polimerov\RLP - Reologija polimerov-OCR.pdf for PDF and D:\Google_drive_sola\Sola\2022-2023\ROP - Reologija polimerov\RLP - Reologija polimerov-OCR.pdf.txt for TXT
[2023-01-14 19:20:35.918716] [LOG] Converting input file to images...
[2023-01-14 19:20:43.633767] [LOG] Checking blank pages
C:\Users\Administrator\anaconda3\lib\site-packages\PIL\Image.py:3074: DecompressionBombWarning: Image size (105023996 pixels) exceeds limit of 89478485 pixels, could be decompression bomb DOS attack.
warnings.warn(
[2023-01-14 19:20:44.652767] [LOG] Starting OCR with tesseract...
[2023-01-14 19:20:45.154768] [LOG] OCR completed
[2023-01-14 19:20:45.155767] [DEBUG] We have 0 ocr'ed files
Error: No PDF files generated after OCR. This is not expected. Aborting.

LeoFCardoso commented 1 year ago

Can you please share input file?

PatrikHlebecStor commented 1 year ago

https://drive.google.com/open?id=1bjsNURMOBqGr-fpm3HT1XfmFVOKRTueF&authuser=ph6912%40student.uni-lj.si&usp=drive_fs

Just out of curiosity, does the installation look OK?

The PDF was exported from the note-taking app Inkodo, from the Microsoft Store.

LeoFCardoso commented 1 year ago

Hello @PatrikHlebecStor.

Your PDF worked for me. :(

Please try adding "-r 200" to the command line (this decreases the image resolution and should resolve the DecompressionBombWarning).

Can other PDF files be OCRed with your installation?
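
For reference, here is a rough sketch of why a lower resolution should silence the warning. It assumes the page was rasterized at 300 DPI (the log does not state the resolution, so that value is a guess) and that the pixel count scales with the square of the DPI. The limit of 89478485 is the one reported in the warning above; the function names are made up for illustration and are not part of pdf2pdfocr or Pillow.

```python
import math

# Limit reported in the DecompressionBombWarning above (89478485 pixels).
PIXEL_LIMIT = 1024 * 1024 * 1024 // 4 // 3

# Pixel count reported in the warning for the offending page image.
REPORTED_PIXELS = 105023996

# ASSUMPTION: the image was rendered at 300 DPI. The log does not say so.
ASSUMED_DPI = 300


def pixels_at_dpi(pixels, from_dpi, to_dpi):
    """Estimate the pixel count after re-rendering the same page at to_dpi.

    Width and height both scale linearly with DPI, so the area scales
    with the square of the DPI ratio.
    """
    return pixels * (to_dpi / from_dpi) ** 2


def max_dpi_under_limit(pixels, from_dpi, limit=PIXEL_LIMIT):
    """Largest whole-number DPI whose estimated pixel count stays within limit."""
    return math.floor(from_dpi * math.sqrt(limit / pixels))


if __name__ == "__main__":
    estimate = pixels_at_dpi(REPORTED_PIXELS, ASSUMED_DPI, 200)
    print(f"Estimated pixels at 200 DPI: {estimate:.0f}")
    print(f"Under the limit: {estimate <= PIXEL_LIMIT}")
    print(f"Highest DPI under the limit: {max_dpi_under_limit(REPORTED_PIXELS, ASSUMED_DPI)}")
```

Under that assumption, "-r 200" leaves the image well below the limit. Note that this only addresses the warning; whether it also fixes the "0 ocr'ed files" error is not confirmed anywhere in this thread.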

LeoFCardoso commented 1 year ago

Closing due to inactivity