Preserve Table of Contents

rien333 commented 5 years ago

What a great little app!

However, after processing a PDF with krop, the table of contents "metadata" seems to be deleted. Is there any way to retain it? (This matters for e-readers too: navigating to a specific part is especially cumbersome on slower and smaller devices if you can't just select a chapter.)

kupiqu commented 4 years ago

I second this!

arminstraub commented 4 years ago

It would be great to be able to preserve tables of contents and links within the PDF (it's been on my personal wishlist for a long time). Unfortunately, I don't think this is possible using the PDF library that krop currently uses. If anyone has an idea how to possibly implement this, I would love to learn about it!

rien333 commented 4 years ago

> Unfortunately, I don't think this is possible using the PDF library that krop currently uses

That would be poppler, right? Also, I still really enjoy using krop. Simple, but well executed.

arminstraub commented 4 years ago

Thank you for the kind words! I'm glad you find krop useful despite this shortcoming. Poppler is used for displaying the PDF, but the cropping is done using PyPDF2.

chrthi commented 4 years ago

After looking a bit through PyPDF2, I was able to preserve links within a PDF with this change:

diff --git a/krop/mainwindow.py b/krop/mainwindow.py
index fd1ae32..e8adadf 100644
--- a/krop/mainwindow.py
+++ b/krop/mainwindow.py
@@ -413,6 +413,7 @@ class MainWindow(QKMainWindow):
             pdf = PdfFile()
             pdf.loadFromFile(inputFileName)
             cropper = PdfCropper()
+            cropper.copyDocumentRoot(pdf)
             for nr in pages:
                 c = self.viewer.cropValues(nr)
                 cropper.addPageCropped(pdf, nr, c, alwaysinclude, rotation)
diff --git a/krop/pdfcropper.py b/krop/pdfcropper.py
index 679c6fc..21a0df1 100644
--- a/krop/pdfcropper.py
+++ b/krop/pdfcropper.py
@@ -56,6 +56,9 @@ class AbstractPdfCropper:
     def addPageCropped(self, pdffile, pagenumber, croplist, rotate=0):
         pass

+    def copyDocumentRoot(self, pdffile):
+        pass
+

 class PyPdfFile(AbstractPdfFile):
     """Implementation of PdfFile using pyPdf"""
@@ -110,6 +113,15 @@ class PyPdfCropper(AbstractPdfCropper):
         if rotate != 0:
             page.rotateClockwise(rotate)

+    def copyDocumentRoot(self, pdffile):
+        # Sounds promising in PyPDF2 (see PdfFileWriter.cloneDocumentFromReader),
+        # but doesn't seem to produce a readable PDF:
+        # self.output.cloneReaderDocumentRoot(pdffile.reader)
+        # Instead, this copies at least the named destinations for links:
+        for dest in pdffile.reader.namedDestinations.values():
+            self.output.addNamedDestinationObject(dest)
+
+
 def optimizePdfGhostscript(oldfilename, newfilename):
     import subprocess
     subprocess.check_call(('gs', '-sDEVICE=pdfwrite', '-sOutputFile=' + newfilename,

PyPDF2 seems to have a dedicated method for copying all such metadata at once, cloneReaderDocumentRoot, but it gave me a document with a lot of empty pages and only the links. So copying just the named destinations for links was the best I could come up with for now. If you would like to experiment further, I suggest a Python debugger or using ptpython interactively in a script like this:

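#!/usr/bin/env python3
from PyPDF2 import PdfFileReader
from ptpython.repl import embed

if __name__ == "__main__":
    # Keep the file open while the REPL runs: the reader loads objects lazily.
    with open("test.pdf", "rb") as infile:
        pdf = PdfFileReader(infile)
        # Start the interactive session with `pdf` and `infile` in scope.
        embed(globals(), locals())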

This prepares a reader and then drops you into an interactive REPL with useful autocompletion.
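As a starting point for such a session, here is a minimal sketch for listing a PDF's table of contents. It assumes the legacy PyPDF2 1.x API used elsewhere in this thread, where `PdfFileReader.getOutlines()` returns a list whose items are either entries with a `"/Title"` key or nested lists holding the children of the preceding entry. The names `flatten_outline` and `print_outline` are hypothetical helpers, not part of krop or PyPDF2.

```python
def flatten_outline(outline, level=0):
    """Yield (level, title) for each entry of a nested outline.

    Assumes the shape returned by PyPDF2 1.x's PdfFileReader.getOutlines():
    a list whose items are either dict-like entries exposing "/Title" or
    nested lists containing the children of the preceding entry.
    """
    for item in outline:
        if isinstance(item, list):
            # A nested list holds the children of the previous entry.
            yield from flatten_outline(item, level + 1)
        else:
            yield level, item["/Title"]


def print_outline(path):
    """Hypothetical helper: print the indented outline of the PDF at `path`."""
    # Imported lazily so flatten_outline can be used without PyPDF2 installed.
    from PyPDF2 import PdfFileReader  # legacy 1.x name (assumption)

    with open(path, "rb") as infile:
        reader = PdfFileReader(infile)
        for level, title in flatten_outline(reader.getOutlines()):
            print("  " * level + str(title))
```

Running `print_outline` on both the input and the cropped output would show whether the outline survived processing.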

edumco commented 3 years ago

@arminstraub the previous comment seems to solve this issue. Am I wrong?

arminstraub commented 3 years ago

@chrthi Thank you so much for offering this solution to preserving links! It didn't work on all the PDFs that I tested it with, but it was definitely better than nothing. It should be part of the next release of krop.

Once I have some time (hard these days...), I am planning to add support for pikepdf to krop, which will hopefully make these sorts of things easier to work with.
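For illustration only, and not krop's actual implementation: a rough sketch of what cropping with pikepdf might look like, assuming a recent pikepdf release where `pdf.pages` yields `Page` objects with a `mediabox` property and an `obj` attribute. The idea is that the file is modified in place (only a `/CropBox` is set on each page), so the rest of the document, including any outline and link destinations, is left as it was. The function names and the margin convention (points trimmed from left, bottom, right, top) are assumptions made for this example.

```python
def shrink_box(box, left, bottom, right, top):
    """Return [x0, y0, x1, y1] with the given margins (in points) trimmed.

    `box` is any 4-item sequence in PDF order: lower-left x/y, upper-right x/y.
    """
    x0, y0, x1, y1 = (float(v) for v in box)
    return [x0 + left, y0 + bottom, x1 - right, y1 - top]


def crop_in_place(src, dst, margins):
    """Hypothetical helper: set a CropBox on every page of `src`, save to `dst`.

    `margins` is a (left, bottom, right, top) tuple in points.
    """
    # Imported lazily so shrink_box can be used without pikepdf installed.
    import pikepdf  # third-party library (assumption: recent release)

    with pikepdf.open(src) as pdf:
        for page in pdf.pages:
            new_box = shrink_box(page.mediabox, *margins)
            # Only the page dictionary is touched; the document catalog
            # (and with it any outline) is not rewritten by this code.
            page.obj["/CropBox"] = pikepdf.Array(new_box)
        pdf.save(dst)
```

A call such as `crop_in_place("in.pdf", "out.pdf", (36, 36, 36, 36))` would trim half an inch from each side; the file names here are placeholders.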

64kramsystem commented 2 years ago

> After looking a bit through PyPDF2, I was able to preserve links within a PDF with this change:
>
> diff --git a/krop/mainwindow.py b/krop/mainwindow.py
> ...

I just tried this, but unfortunately it doesn't work: the links are not preserved.

arminstraub / krop

Preserve Table of Contents #22