The font is unreadable and it looks as if the last character of each line cannot be marked

freedomofpress / dangerzone

Take potentially dangerous PDFs, office documents, or images and convert them to safe PDFs

https://dangerzone.rocks/

GNU Affero General Public License v3.0


The font is unreadable and it looks as if the last character of each line cannot be marked #457

Closed sudwhiwdh closed 1 year ago

sudwhiwdh commented 1 year ago

Marked PDF text before conversion by dangerzone with OCR

Marked PDF text after conversion by dangerzone with OCR

The font is unreadable, and it looks as if the last character of each line cannot be marked (selected). That is not actually the case: when I copy the marked text into a text editor using keyboard shortcuts, all the characters are there.

The document language and the OCR language are both English.

dangerzone version: 0.4.1

I would expect the font to be readable, and I would expect to be able to visibly mark all characters of a converted document.

apyrgio commented 1 year ago

Quick question: are you using Evince to read the PDF? I've experienced the same, but when opening the OCRed document in Firefox (which uses pdf.js), I don't see an issue. My guess is that the rendering engine that Evince uses (poppler) is the one that displays the text incorrectly. They had a similar issue a few years ago, which reinforces my suspicion.

sudwhiwdh commented 1 year ago

Yes, it's Evince 44.3.

Unfortunately, I cannot open your link. Would it make sense to create another issue for this in Evince or Poppler?

It actually works when I open the PDF in a browser, in my case LibreWolf. What unfortunately does not work for me right now is enabling "Open safe documents after converting" and then selecting the browser: I only have Document Viewer (Evince) and GNU Image Manipulation Program to choose from. Being able to pick the browser there would simplify the workaround.

The last letter/character of a line is unfortunately not marked with this workaround either. But when I copy the text with Ctrl+C and paste it with Ctrl+V into a text document, the character is there. This is probably another known issue, isn't it?

deeplow commented 1 year ago

I think this is a side-effect of Tesseract, the Optical Character Recognition (OCR) tool we use, combined with the renderer, as @apyrgio pointed out. Tesseract essentially adds a text layer on top of the image of each page that contains text. Most of the time, that text layer uses a different font, with different character widths, than the text in the image. This would explain why it seems not to select the last character: in reality the whole text is selected, but because the layer's font is narrower, the highlightable area ends before the visible line does.
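
To make the width mismatch concrete, here is a rough sketch of the arithmetic. It is only an illustration, not how Tesseract or poppler actually lay out text: the function name `uncovered_characters` and the glyph widths are made up, and it assumes every character in a line has the same width.

```python
def uncovered_characters(line_text, image_glyph_width, layer_glyph_width):
    """Estimate how many visible characters at the end of a line are left
    outside the selection highlight.

    Simplifying assumption: every glyph has the same advance width, both in
    the page image and in the invisible OCR text layer.
    """
    n = len(line_text)
    image_end = n * image_glyph_width   # where the visible line ends
    layer_end = n * layer_glyph_width   # where the highlight ends
    shortfall = max(0.0, image_end - layer_end)
    return shortfall / image_glyph_width


# Hypothetical numbers: the text-layer font is slightly narrower per glyph.
line = "dangerzone converts documents"
missing = uncovered_characters(line, image_glyph_width=6.0, layer_glyph_width=5.8)
print(f"about {missing:.2f} characters of the visible line are not highlighted")
```

With these made-up widths, a difference of only 0.2 points per glyph adds up over a 29-character line to roughly one full character at the end that the highlight does not reach, even though the underlying text (and therefore copy and paste) is complete.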

deeplow commented 1 year ago

Closing this, since it's just a side-effect of the tool and doesn't actually impact the selection. We could reconsider Tesseract at some point if there's a better tool.