LeoFCardoso / pdf2pdfocr

A free tool to OCR a PDF and add a text "layer" to the original file, making a searchable PDF. Uses only open source tools. Please tip!
Apache License 2.0

Bad insertion text on PDF #40

Closed FloLaco closed 1 year ago

FloLaco commented 1 year ago

Hi

PDF source: Module 3 - .v2.pdf

I'm trying to OCR the text in my PDF for personal use. I've checked the generated TXT file, and it is correct (I see the proper text). But when I open the OCR'd PDF, searching for text does not work, and text copied from the PDF comes out garbled:

22tropnosseccaevitartsinimdatimrepdluowspuorgeerhtllA.puorgrevresnoitacilppaehtotylnonepo
erucesylhgihyolpednacuoy,msinahcemsihthtiW.krowtenetaroprocs’remotsucehtmorfylnotub .snoitacilppa
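
For what it's worth, the pasted text looks like the expected sentence with its characters in reverse order and the spaces dropped. A quick check in Python, using only the first pasted line above (the variable names are just for illustration):

```python
# First line of the text pasted above, exactly as copied from the PDF.
pasted = "22tropnosseccaevitartsinimdatimrepdluowspuorgeerhtllA.puorgrevresnoitacilppaehtotylnonepo"

# Reversing the character order yields readable English (minus the spaces).
recovered = pasted[::-1]
print(recovered)
```

The output begins with `openonlytotheapplicationservergroup.`, which suggests the text layer is there but its characters come out in reverse order.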

LeoFCardoso commented 1 year ago

Hello @FloLaco, thank you for the issue. I could reproduce it.

I'll check it out. For now, you can try the "-f -g smart" flags as a workaround.

pdf2pdfocr -i "./Module 3 - .v2.pdf" -f -g smart

LeoFCardoso commented 1 year ago

@FloLaco I opened an issue in the qpdf project, as this looks like a bug there.

I also used Ghostscript to attempt a "repair" of your source PDF file, as illustrated in https://superuser.com/questions/278562/how-can-i-fix-repair-a-corrupted-pdf-file

Page 46

The following warnings were encountered at least once while processing this file:
    encountered more q than Q

   **** This file had errors that were repaired or ignored.
   **** The file was produced by: 
   **** >>>> PDFKit <<<<
   **** Please notify the author of the software that produced this
   **** file that it does not conform to Adobe's published PDF
   **** specification.

After the repair, pdf2pdfocr (and qpdf) worked fine.
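
The exact repair command is not shown above. As a minimal sketch, assuming Ghostscript is installed as `gs` and that a plain rewrite through its `pdfwrite` device is enough, the step could be scripted like this. The helper names and file names are hypothetical and are not part of pdf2pdfocr:

```python
import shutil
import subprocess


def build_repair_command(src, dst, gs_binary="gs"):
    """Return the Ghostscript argv that rewrites src into dst (hypothetical helper)."""
    return [
        gs_binary,
        "-o", dst,            # output file; -o also implies -dBATCH -dNOPAUSE
        "-sDEVICE=pdfwrite",  # re-emit the document as a fresh PDF
        src,
    ]


def repair_pdf(src, dst, gs_binary="gs"):
    """Run Ghostscript; raise if the binary is missing or exits non-zero."""
    if shutil.which(gs_binary) is None:
        raise FileNotFoundError(f"{gs_binary} not found on PATH")
    subprocess.run(build_repair_command(src, dst, gs_binary), check=True)
```

Calling `repair_pdf("Module 3 - .v2.pdf", "repaired.pdf")` would write a rewritten copy. Passing the arguments as a list avoids shell quoting problems with the spaces in the file name.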

Please consider checking the structure of your source PDF file.

LeoFCardoso commented 1 year ago

Confirmed as a qpdf bug, fixed in version 11.3.0. I'm closing this. Thank you for reporting.
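
To check whether a local qpdf already includes the fix, a small sketch follows. It assumes that `qpdf --version` prints a line of the form `qpdf version X.Y.Z`; the function names are hypothetical:

```python
import re

FIXED_IN = (11, 3, 0)  # qpdf release named above as containing the fix


def parse_qpdf_version(output):
    """Extract (major, minor, patch) from `qpdf --version` style output."""
    match = re.search(r"qpdf version (\d+)\.(\d+)\.(\d+)", output)
    if match is None:
        raise ValueError("could not find a qpdf version string")
    return tuple(int(part) for part in match.groups())


def has_fix(output):
    """True if the reported version is 11.3.0 or newer."""
    return parse_qpdf_version(output) >= FIXED_IN
```

The text to pass in could come from `subprocess.run(["qpdf", "--version"], capture_output=True, text=True).stdout`.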