LeoFCardoso / pdf2pdfocr

A free tool to OCR a PDF and add a text "layer" in the original file, making a searchable PDF. Use only open source tools. Please tip!
Apache License 2.0

File doesn't pass PDF/A validation after OCR #39

Closed by solis-eduardo 1 year ago

solis-eduardo commented 1 year ago

File used: PDF A-1b.pdf
Site for validation: VeraPDF Demo
Terminal output:

eduardo@000563-desk:~/Área\ de Trabalho/testepdf$ python3 ~/Área\ de\ Trabalho/pdf2pdfocr/pdf2pdfocr.py -w -o pdfabr.pdf -v -l por -i ~/Documentos/PDF\ A-1b.pdf
File: /home/eduardo/Documentos/PDF A-1b.pdf
[2023-02-13 10:25:11.326759] [DEBUG] Tesseract can 'textonly_pdf': True
[2023-02-13 10:25:11.329506] [DEBUG] Tesseract version: 4
[2023-02-13 10:25:11.329598] [DEBUG] cuneiform not available
[2023-02-13 10:25:11.336831] [DEBUG] Pdftoppm version: 22.2.0
[2023-02-13 10:25:11.340303] [DEBUG] Qpdf version: 10.6.3
[2023-02-13 10:25:11.340382] [DEBUG] Temp dir is /tmp/pdf2pdfocr_O4M39/
[2023-02-13 10:25:11.340396] [DEBUG] Prefix is O4M39
[2023-02-13 10:25:11.340413] [DEBUG] Script dir is /home/eduardo/Área de Trabalho/pdf2pdfocr/
[2023-02-13 10:25:11.340442] [DEBUG] Parallel operations will use 8 CPUs
[2023-02-13 10:25:11.349594] [LOG] Welcome to pdf2pdfocr version 1.12.0 marapurense  - https://github.com/LeoFCardoso/pdf2pdfocr
[2023-02-13 10:25:11.350756] [LOG] Input file /home/eduardo/Documentos/PDF A-1b.pdf: type is application/pdf
[2023-02-13 10:25:11.352185] [DEBUG] User conversion params: 
[2023-02-13 10:25:11.352209] [DEBUG] Output file: pdfabr.pdf for PDF and pdfabr.pdf.txt for TXT
[2023-02-13 10:25:11.352249] [LOG] Converting input file to images
[2023-02-13 10:25:11.427845] [LOG] Checking blank pages
[2023-02-13 10:25:11.928593] [LOG] Starting OCR with tesseract...
[2023-02-13 10:25:13.430893] [LOG] OCR completed
[2023-02-13 10:25:13.431167] [DEBUG] We have 1 ocr'ed files
[2023-02-13 10:25:13.432973] [DEBUG] Joined ocr'ed PDF files
[2023-02-13 10:25:13.433220] [LOG] Created final text file
[2023-02-13 10:25:13.433241] [DEBUG] Merging with OCR
[2023-02-13 10:25:13.445310] [DEBUG] Autorotate skipped
[2023-02-13 10:25:13.445371] [DEBUG] Editing producer
[2023-02-13 10:25:13.458038] [DEBUG] Output file created
[2023-02-13 10:25:13.466554] [LOG] Success in 2.117 seconds!

Validation output: (screenshot, not reproduced here)

LeoFCardoso commented 1 year ago

Hello, thank you for opening this issue. For now, producing PDF/A-compliant output is not a feature of pdf2pdfocr.
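
Since PDF/A output is not something pdf2pdfocr provides, one possible workaround is to post-process the OCR'ed file with Ghostscript, which has a PDF/A mode in its pdfwrite device. The following is only a minimal sketch, not part of pdf2pdfocr: the helper names `build_pdfa_command` and `convert_to_pdfa` are hypothetical, the file names are taken from the command in the report, and the optional `pdfa_def` argument assumes you have a `PDFA_def.ps` definition file (with a valid ICC profile) adapted for your system. Without a suitable definition file the result may still fail validation, so the output should be checked again with VeraPDF.

```python
import shutil
import subprocess
from typing import List, Optional


def build_pdfa_command(
    src: str,
    dst: str,
    pdfa_def: Optional[str] = None,  # assumption: path to a PDFA_def.ps file you prepared
    gs_binary: str = "gs",
) -> List[str]:
    """Build a Ghostscript command line that asks pdfwrite for PDF/A-1 output."""
    cmd = [
        gs_binary,
        "-dPDFA=1",                       # request PDF/A-1
        "-dBATCH",
        "-dNOPAUSE",
        "-sDEVICE=pdfwrite",
        "-sColorConversionStrategy=RGB",  # convert colors to a single device space
        "-dPDFACompatibilityPolicy=1",    # drop features that violate PDF/A
        f"-sOutputFile={dst}",
    ]
    if pdfa_def:
        cmd.append(pdfa_def)              # the definition file must precede the input PDF
    cmd.append(src)
    return cmd


def convert_to_pdfa(src: str, dst: str, pdfa_def: Optional[str] = None) -> None:
    """Run Ghostscript; raises if gs is missing or exits with an error."""
    if shutil.which("gs") is None:
        raise RuntimeError("Ghostscript (gs) was not found on PATH")
    subprocess.run(build_pdfa_command(src, dst, pdfa_def), check=True)
```

For the file from this report, the call would look like `convert_to_pdfa("pdfabr.pdf", "pdfabr_pdfa.pdf", "PDFA_def.ps")`. Keeping command construction separate from execution makes it easy to print and inspect the exact Ghostscript invocation before running it.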