freedomofpress / dangerzone

Take potentially dangerous PDFs, office documents, or images and convert them to safe PDFs
https://dangerzone.rocks/
GNU Affero General Public License v3.0
3.66k stars 172 forks

The font is unreadable and it looks as if the last character of each line cannot be marked #457

Closed sudwhiwdh closed 1 year ago

sudwhiwdh commented 1 year ago

[Image 1: Marked PDF text before conversion by dangerzone with OCR]

[Image 2: Marked PDF text after conversion by dangerzone with OCR]

The font is unreadable and it looks as if the last character of each line cannot be marked. That is not actually the case, though: when I copy the marked text into a text editor using keyboard shortcuts, all the characters are there.

The document language and the OCR language are both English.

dangerzone version: 0.4.1

I would expect the font to be readable, and I would expect to be able to visually mark all characters of a converted document.

apyrgio commented 1 year ago

Quick question: are you using Evince to read the PDF? I've experienced the same, but when opening the OCRed document in Firefox (which uses pdf.js), I don't see an issue. My guess is that the rendering engine that Evince uses (poppler) is the one that displays the text incorrectly. They had a similar issue a few years ago, which reinforces my suspicion.

sudwhiwdh commented 1 year ago

Yes, it's Evince 44.3.

Unfortunately, I cannot open your link. Would it make sense to create another issue for this in Evince or Poppler?

It does work when I open the PDF in a browser, in my case LibreWolf. What unfortunately does not work for me right now is selecting the browser under "Open safe documents after converting": I only have Document Viewer (Evince) and GNU Image Manipulation Program to choose from. Being able to pick the browser there would simplify the workaround.

Unfortunately, the last letter/character of a line is not marked with this solution either. Yet when I copy the text with Ctrl+C and paste it with Ctrl+V into a text document, the character is there. This is probably another known issue, isn't it?

deeplow commented 1 year ago

I think this is a side effect of Tesseract, the Optical Character Recognition (OCR) tool we use, combined with the renderer, as @apyrgio was pointing out. Tesseract essentially adds a text layer on top of the image of each page that contains text. Most of the time, that text layer uses a different font with a different width. This would explain why it seems not to select the last character: in reality it is selecting the same text, but because the font is narrower, the highlightable layer ends first.
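
To make the width argument concrete, here is a rough sketch in Python. It is only a toy model, not how Tesseract or any PDF viewer actually computes selection boxes: it assumes both fonts are monospaced, and the advance widths (6.0 and 5.7 points) and the helper names are made up for illustration.

```python
# Toy model: why a narrower hidden text layer seems to "miss" the last character.
# All widths below are hypothetical values chosen for illustration.

def line_width(text, advance):
    """Width of a line when every glyph has the same advance (monospaced assumption)."""
    return len(text) * advance


def visually_covered_chars(text, visible_advance, layer_advance):
    """Count how many glyphs of the page image the highlight rectangle covers.

    The highlight is as wide as the hidden text layer, while the glyphs the
    reader sees are as wide as the font rendered in the page image.
    """
    highlight = line_width(text, layer_advance)
    covered = int(highlight // visible_advance)
    return min(len(text), covered)


line = "The quick brown fox"
visible_advance = 6.0  # assumed glyph advance in the page image
layer_advance = 5.7    # assumed (narrower) glyph advance in the text layer

covered = visually_covered_chars(line, visible_advance, layer_advance)
selected_text = line   # the selection itself still contains every character

print(f"characters selected: {len(selected_text)}")
print(f"characters that look highlighted: {covered}")
```

With these assumed numbers, the highlight stops one glyph short of the end of the line even though the copied text is complete, which matches what was reported above.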

deeplow commented 1 year ago

Closing this since it's just a side effect of the tool and doesn't actually impact the selection. We could reconsider Tesseract at some point if there's a better tool.