coherentgraphics / cpdf-binaries

PDF Command Line Tools binaries for Linux, Mac, Windows
GNU Affero General Public License v3.0

Removing existing object streams #58

Closed TiffanyNerd closed 2 years ago

TiffanyNerd commented 2 years ago

Hello,

I’ve just discovered cpdf when I stumbled upon this discussion: https://gist.github.com/hubgit/6078384

Cpdf is absolutely amazing!!!

In order to achieve all the modifications I need done to PDF files, I usually use Infix Pro, Acrobat X Pro, BeCyPDFMetaEdit, qpdf, Exif Tools, pdftk, and probably something else I cannot recall!

None of the above mentioned can modify the original File ID, and I’ve just discovered that cpdf can do this along with many other interesting things, and so this is very exciting!

But I’ve encountered a strange issue with one of my modified PDF files. I had used Infix Pro to modify some text in the PDF file, and that works great. Except that Infix Pro leaves a lot of traces. If I open my PDF file in Notepad, I can see all the object streams, one after the other, documenting all the Infix changes:

0 obj << /AcroForm 3 0 R /Infix << /Changes [ 4 0 R 5 0 R 6 0 R 7 0 R … etc

This is soon followed by an endless list of object streams that mention the date/time stamp of each modification and my name, that’s the user’s name, for example:

0 obj << /ModDate (D:20181110085910) /Pages (1) /User (my name)

endobj4

My only solution to "sanitizing" and thus removing this information is to open my modified PDF file in Adobe Acrobat Reader and then simply Print as Adobe PDF. This creates a new PDF file that inherits zero object streams from my modified PDF, and also comes with a new File ID (DocumentID and InstanceID identical). The downside to this “Print as Adobe PDF” method is that sometimes the rendered quality is not good enough, even if I set all the possible printing quality options to the best possible, with no image compressions etc.

I think that I’ve tried all possible solutions through cpdf, but I’m unable to permanently remove the object streams that had been injected by Infix. I've tried many commands described in the cpdf manual, such as garbage collection, not preserving object streams, creating and not preserving object streams, removing metadata, copying File ID, creating new PDF through cpdf then merging with my modified PDF...

At one stage, I thought that some manipulation had worked, because I opened the cpdf output file in Notepad, and all I could see is some type of Chinese script, it was total gibberish but at least it was totally unreadable! However, I then opened this output PDF file in BeCyPDFMetaEdit, entered all the meta data I needed on there, such as Author, Creation Date, etc, saved it. Then I opened it again in Notepad, and all the Infix object streams had resurfaced, and the Chinese script was totally gone!

Does anyone have an explanation for this, or a solution? I would like to continue using BeCyPDFMetaEdit as the very last step of the modification process, as it’s much faster to type all the metadata modifications into the little GUI (so more user-friendly). And even if I don't use the BeCy GUI, I would still like to be reassured that the object streams are gone for good and cannot be recovered as easily as by running the file through BeCy.

Thanks very much for your help!

johnwhitington commented 2 years ago

It looks like the text you have added to the PDF is in the form of text annotations on top of the page, rather than actual text on the page. We should be able to find a way to remove the /User part with -remove-dict-entry from the last chapter of the cpdf manual.

Are you able to send an example file? If so, please send it to john at coherentgraphics dot co dot uk...

TiffanyNerd commented 2 years ago

Hi John,

Thanks so very much for your rapid reply.

Your suggestion worked a treat! I managed to remove all the elements I didn’t want to see in there by using -remove-dict-entry and without damaging the actual file. I must say that it was like performing surgery: I had to remove each entry separately one by one, such as /ModDate then /Pages then /User. And same for /AcroForm then /Infix then /Changes … etc… etc…
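For anyone repeating this one-key-at-a-time procedure, here is a minimal Python sketch that builds the chain of cpdf calls, one per key. Only the -remove-dict-entry flag comes from this thread. The file names, the helper names, and the intermediate-file naming scheme are hypothetical, and the `key in.pdf -o out.pdf` argument order is my assumption about cpdf's usual command-line shape. As far as I understand the manual, the entry is removed from every dictionary that has that key, so a generic key such as /Pages deserves extra care; the sketch sticks to /Infix and /User.

```python
import subprocess


def build_commands(keys, src, dst, tool="cpdf"):
    """Return one argv list per key, chaining through intermediate files.

    Each step reads the previous step's output; the last step writes dst.
    The ".stepN.pdf" naming is a hypothetical choice for this sketch.
    """
    commands = []
    current = src
    for i, key in enumerate(keys):
        is_last = i == len(keys) - 1
        target = dst if is_last else f"{dst}.step{i}.pdf"
        # Assumed shape: cpdf -remove-dict-entry <key> <in> -o <out>
        commands.append([tool, "-remove-dict-entry", key, current, "-o", target])
        current = target
    return commands


def run_all(commands):
    """Run each command in order, stopping at the first failure."""
    for argv in commands:
        subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for argv in build_commands(["/Infix", "/User"], "in.pdf", "out.pdf"):
        print(" ".join(argv))
```

Calling `run_all(build_commands(...))` would execute the chain, provided cpdf is on the PATH and the input file exists. It is worth keeping a backup of the original file, since each step rewrites the document.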

I then ran the output file through BeCyPDFMetaEdit as a complete rewrite and added all my metadata. The file came out with no issues, no reversals or resurfacing of the removed dictionary entries.
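The Notepad check can also be scripted. The sketch below counts literal occurrences of the key names in the raw bytes of a file; the function name and the sample data are made up for illustration. It has the same limitation as Notepad: it only sees content stored uncompressed, so entries sitting inside a compressed object stream will not show up. A count of zero is therefore only meaningful on a file written without compression.

```python
def find_visible_keys(data: bytes, keys):
    """Count how often each key name appears literally in the raw bytes.

    This is a plain substring count, so "/User" would also match a longer
    name such as "/UserUnit". It does not parse the PDF or decompress
    anything.
    """
    return {key: data.count(key.encode("ascii")) for key in keys}


if __name__ == "__main__":
    # Made-up sample resembling the entry quoted earlier in the thread.
    sample = b"<< /ModDate (D:20181110085910) /Pages (1) /User (my name) >>"
    for key, count in find_visible_keys(sample, ["/Infix", "/User"]).items():
        print(key, count)
```

To check a real file, read it in binary mode, for example `open("out.pdf", "rb").read()`, and pass the bytes to the function.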

Re: annotations. I checked the file for annotations using -list-annotations, but there are none. I doubt that the text I modified in the PDF is in the form of text annotations; it’s just Infix Pro adding an entry for every single modification to the file. I think that I read something about this procedure in their manual. They do explain that once the text modifications are saved through Infix, there is no going back to the previous version (unless I had created a backup previously).

Thank you very much for your help! And thank you for creating cpdf, it’s a lifesaver, it’s simply the best! :)

johnwhitington commented 2 years ago

Thanks for the detailed feedback.