Bad insertion text on PDF

FloLaco commented 1 year ago

Hi

I'm trying to OCR the text in my PDF for personal use. I've checked the generated TXT file, and it works (I see the proper text). But when I open the OCRed PDF file and search for text, the search does not work. If I copy/paste text from the PDF, I get weird text:

22tropnosseccaevitartsinimdatimrepdluowspuorgeerhtllA.puorgrevresnoitacilppaehtotylnonepo
erucesylhgihyolpednacuoy,msinahcemsihthtiW.krowtenetaroprocs’remotsucehtmorfylnotub .snoitacilppa
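
The pasted text appears to be the original sentence with its characters in reverse order and the spaces dropped. A quick way to check this, sketched in Python on a fragment copied from the first pasted line (the variable names are illustrative only):

```python
# Fragment copied verbatim from the first pasted line above.
pasted = "puorgrevresnoitacilppaehtotylnonepo"

# Reverse the character order to see whether readable words come out.
recovered = pasted[::-1]
print(recovered)  # openonlytotheapplicationservergroup
```

If the reversed string reads as normal words, the OCR text itself is fine and only the order in which it was written into the PDF text layer is wrong.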

LeoFCardoso commented 1 year ago

Hello @FloLaco, thank you for the issue. I could reproduce it.

I'll check it out. For now, you can try the "-f -g smart" flags as a workaround.

pdf2pdfocr -i "./Module 3 - .v2.pdf" -f -g smart

LeoFCardoso commented 1 year ago

@FloLaco I opened an issue in the qpdf project, as this looks like a bug in that project.

I also used Ghostscript to try a "repair" of your source PDF file, as illustrated in https://superuser.com/questions/278562/how-can-i-fix-repair-a-corrupted-pdf-file

Page 46

The following warnings were encountered at least once while processing this file:
    encountered more q than Q

   **** This file had errors that were repaired or ignored.
   **** The file was produced by: 
   **** >>>> PDFKit <<<<
   **** Please notify the author of the software that produced this
   **** file that it does not conform to Adobe's published PDF
   **** specification.

After the repair, pdf2pdfocr (and qpdf) worked fine.

Please consider checking the structure of your source PDF file.
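
For anyone who wants to try the same kind of repair pass, here is a minimal sketch driven from Python. The exact command used above is not shown in this thread, so the flags below are an assumption: they follow the commonly cited Ghostscript pdfwrite rewrite from the linked superuser question. The helper names and file names are hypothetical.

```python
import subprocess


def build_gs_repair_command(source_pdf, repaired_pdf):
    """Build a Ghostscript command that rewrites source_pdf into repaired_pdf."""
    return [
        "gs",
        "-o", repaired_pdf,         # output file; -o also implies -dBATCH -dNOPAUSE
        "-sDEVICE=pdfwrite",        # re-emit the document through the PDF writer
        "-dPDFSETTINGS=/prepress",  # quality preset (assumed, not from this thread)
        source_pdf,
    ]


def repair_pdf(source_pdf, repaired_pdf):
    """Run the repair; raises CalledProcessError if Ghostscript exits with an error."""
    subprocess.run(build_gs_repair_command(source_pdf, repaired_pdf), check=True)
```

Calling repair_pdf("source.pdf", "repaired.pdf") requires Ghostscript (gs) on the PATH. Warnings such as the ones quoted above are printed by Ghostscript itself while it rewrites the file.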

LeoFCardoso commented 1 year ago

Confirmed as a qpdf bug, fixed in qpdf version 11.3.0. I'm closing this. Thank you for reporting.
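
To check whether an installed qpdf already includes the fix, something like the following sketch may help. It assumes that "qpdf --version" prints a line containing a dotted version number such as "qpdf version 11.3.0"; the function names are hypothetical.

```python
import re
import subprocess

FIXED_IN = (11, 3, 0)  # qpdf release that contains the fix, per the comment above


def parse_qpdf_version(text):
    """Return the first X.Y.Z version found in text as a tuple of ints, or None."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def has_fix(version_output):
    """True if the reported version is at least FIXED_IN."""
    version = parse_qpdf_version(version_output)
    return version is not None and version >= FIXED_IN


def installed_qpdf_has_fix():
    """Ask the qpdf binary on PATH for its version and compare it to FIXED_IN."""
    result = subprocess.run(
        ["qpdf", "--version"], capture_output=True, text=True, check=True
    )
    return has_fix(result.stdout)
```

If installed_qpdf_has_fix() returns False, upgrading qpdf or repairing the source PDF first are the two options discussed in this thread.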

LeoFCardoso / pdf2pdfocr

Bad insertion text on PDF #40