freedomofpress / dangerzone

Take potentially dangerous PDFs, office documents, or images and convert them to safe PDFs
https://dangerzone.rocks/
GNU Affero General Public License v3.0
3.56k stars, 166 forks

Embedded hyperlinks break on line breaks #158

Open huertanix opened 2 years ago

huertanix commented 2 years ago

After converting a PDF file that contains an embedded hyperlink, the safe copy splits any hyperlink that wraps across a line break into separate links, rather than keeping the link intact across the break.

e.g. a line like this:

tktktktktktktktktktktktk https://datatracker.ie tf.org/doc/draft-knodel-e2ee-definition/.

...creates two links: one "https://datatracker.ie/" link on the first line, and a different "http://tf.org/doc/draft-knodel-e2ee-definition/" link on the second line.

To reproduce:

  1. Process a PDF with a long, multi-line hyperlink. Example used can be found here: https://eprint.iacr.org/2022/449
  2. Open the safe PDF and find the same link
  3. Click the link on the first line of text that the link appears in, and then click the link on the second line

gmarmstrong commented 2 years ago

What PDF viewer are you using? Dangerzone doesn't create hyperlinks in that document on my end (at least, not as of https://github.com/freedomofpress/dangerzone/pull/161), nor in any other document I've tried. Could the PDF viewer be inferring that tf.org/doc/draft-knodel-e2ee-definition/ is a link of its own because of the .org?
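To illustrate the viewer-inference hypothesis, here is a minimal sketch of a per-line URL detector. The pattern and the `detect_links` helper are hypothetical, not taken from any real PDF viewer; they only assume a viewer that scans each line of text on its own and treats a bare domain ending in a common TLD as a link.

```python
import re

# Hypothetical heuristic: an explicit http(s) URL, or a bare domain that ends
# in a common TLD, optionally followed by a path. Not any real viewer's rule.
URL_PATTERN = re.compile(
    r"(?:https?://[^\s]+|\b[\w-]+(?:\.[\w-]+)*\.(?:org|com|net)\b(?:/[^\s]*)?)"
)


def detect_links(lines):
    """Return (line_number, matched_text) pairs, scanning each line in isolation."""
    found = []
    for number, line in enumerate(lines, start=1):
        for match in URL_PATTERN.finditer(line):
            # Drop a trailing sentence period from the match.
            found.append((number, match.group().rstrip(".")))
    return found


# The example from the report, wrapped the same way as in the document.
wrapped = [
    "tktktktktktktktktktktktk https://datatracker.ie",
    "tf.org/doc/draft-knodel-e2ee-definition/.",
]

print(detect_links(wrapped))
print(detect_links(["".join(wrapped)]))
```

Under these assumptions, the wrapped text yields two separate matches (one per line), while the same text joined into a single line yields one. That would match the reported behavior even if the safe PDF contains no link annotations at all.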

EDIT: Here's something to investigate: a StackExchange comment regarding using ps2pdf for PDF compression (which we do) alludes to this issue:

"Despite the fact that this one approach became my favorite solution to compress pdf files, it breaks up url links the document may have [...]"

If that's what's happening, maybe we could provide an option to skip file compression.
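A rough sketch of what such an option could look like follows. The `finalize_pdf` function, its `compress` flag, and the injectable `run` parameter are all hypothetical names for illustration, not Dangerzone's actual code; the only external assumption is the standard `ps2pdf input output` invocation.

```python
import shutil
import subprocess
from pathlib import Path


def finalize_pdf(src, dst, compress=True, run=subprocess.run):
    """Write the final safe PDF to dst, optionally compressing it with ps2pdf.

    Hypothetical helper: with compress=False the file is copied unchanged,
    so the ps2pdf step (the suspected cause of split links) is skipped.
    """
    src, dst = Path(src), Path(dst)
    if compress:
        # Assumes the ps2pdf executable is available on PATH.
        run(["ps2pdf", str(src), str(dst)], check=True)
    else:
        shutil.copyfile(src, dst)
    return dst
```

Comparing the output of `finalize_pdf(..., compress=True)` and `finalize_pdf(..., compress=False)` on the document from the report would also be a quick way to confirm whether ps2pdf is what splits the links.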