Optimal command chain to get smallest file size losslessly?

coherentgraphics / cpdf-binaries

PDF Command Line Tools binaries for Linux, Mac, Windows

GNU Affero General Public License v3.0


Optimal command chain to get smallest file size losslessly? #60

Closed necros2k7 closed 1 year ago

necros2k7 commented 2 years ago

Are these the right flags, in the right order, to get the smallest file losslessly?

decompress>compress>removeID>removeMeta>dedup>clean>squeeze?

Even after that, my output file was reduced further by pdfsizeopt, a dead utility (using GS?):

"info: eliminated 2 unused objs in 2 classes
info: compressed 127 streams, kept 127 of them uncompressed"

johnwhitington commented 2 years ago

You can skip decompress, compress, dedup and clean, I think, since squeeze does all of that.

Are you able to provide an example file I can look at? I'm not familiar with pdfsizeopt, but I can look into it.

necros2k7 commented 2 years ago

Actually no, I tested this chain and it squeezed out a few more bytes. I will upload a sample later. pdfsizeopt is here: https://github.com/pts/pdfsizeopt
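One way to check a claim like this is to write each candidate chain to its own output file and compare byte counts. The sketch below is only an illustration: the file names and sizes in the usage note are hypothetical, and the helper names (`file_sizes`, `pick_smallest`, `saved_percent`) are made up for this example.

```python
from pathlib import Path


def file_sizes(paths):
    """Map each existing file path to its size in bytes."""
    return {str(p): Path(p).stat().st_size for p in paths}


def pick_smallest(sizes):
    """Return the name with the smallest byte count from a {name: size} mapping."""
    if not sizes:
        raise ValueError("no candidates given")
    return min(sizes, key=sizes.get)


def saved_percent(original_size, new_size):
    """Percentage of bytes saved relative to the original size."""
    if original_size <= 0:
        raise ValueError("original_size must be positive")
    return 100.0 * (original_size - new_size) / original_size
```

For example, `pick_smallest({"squeeze_only.pdf": 1200, "full_chain.pdf": 1180})` picks `"full_chain.pdf"`, and `saved_percent(1200, 1180)` reports how much the longer chain gained over `-squeeze` alone.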

necros2k7 commented 2 years ago

sample: tst.pdf.gz

info: eliminated 2 unused objs in 2 classes
info: compressed 3 streams, kept 3 of them uncompressed
info: saving PDF with 13 objs to: tst2.pdf
info: generated object stream of 789 bytes in 9 objects (25%)
info: generated 115232 bytes (67%)

johnwhitington commented 2 years ago

On this file, since it only uses Standard 14 fonts (i.e. Times New Roman), you can use -remove-fonts to get it down to about 70k. The ISO standardisation people will get grumpy, but the reality is that every PDF viewer will always have the 14 standard fonts built in.

The rest of the file is then just the image. Our squeezer, being non-lossy, won't touch that.

necros2k7 commented 2 years ago

How can I tell whether a PDF uses only Standard 14 fonts? As for images, they can be reduced further losslessly, for example by stripping their metadata and recompressing PNGs and JPGs. An option to strip embedded files during squeezing would also be very useful, and the same goes for stripping Standard fonts. Also, where in the command chain above is the best place to put -remove-fonts? And is -remove-fonts on its own, without -squeeze, lossless for the other objects (pictures)?

johnwhitington commented 2 years ago

You can use cpdf -list-fonts. You would have to build a list of font names which correspond to the 14 standard fonts, and remember to strip subsetting prefixes.
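As a rough sketch of that check in Python: the code below assumes you have already pulled the base font names out of the `cpdf -list-fonts` output yourself (its exact column layout is not shown here, so no parsing is attempted). The set holds only the 14 canonical names; aliases that some producers write, such as TimesNewRoman, are not included and would have to be added if you want to treat them as equivalent. The function names are hypothetical.

```python
import re

# Canonical base font names of the 14 standard PDF fonts.
STANDARD_14 = frozenset({
    "Times-Roman", "Times-Bold", "Times-Italic", "Times-BoldItalic",
    "Helvetica", "Helvetica-Bold", "Helvetica-Oblique", "Helvetica-BoldOblique",
    "Courier", "Courier-Bold", "Courier-Oblique", "Courier-BoldOblique",
    "Symbol", "ZapfDingbats",
})

# A subsetting prefix is six uppercase letters plus '+', e.g. "ABCDEF+Times-Roman".
_SUBSET_PREFIX = re.compile(r"^[A-Z]{6}\+")


def strip_subset_prefix(name):
    """Drop a leading '/' (PDF name syntax) and any subsetting prefix."""
    return _SUBSET_PREFIX.sub("", name.lstrip("/"))


def is_standard_14(name):
    """True if the base font name is one of the 14 canonical names."""
    return strip_subset_prefix(name) in STANDARD_14


def only_standard_14(names):
    """True if at least one font is given and every one is a Standard 14 font."""
    names = list(names)
    return bool(names) and all(is_standard_14(n) for n in names)
```

If `only_standard_14(...)` is true for every base font in the file, the file is a candidate for -remove-fonts.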

-remove-fonts just removes the actual font file from the PDF, leaving the PDF font metadata. It should be used before -squeeze, at any point in the order.
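A minimal sketch of that ordering, driven from Python as two separate cpdf invocations: -remove-fonts first, then -squeeze on the result. It assumes `cpdf` is on your PATH and uses the usual `-o` output flag; the file names and the helper names `build_commands` and `run_commands` are placeholders invented for this example.

```python
import subprocess


def build_commands(src, tmp, dst):
    """Return the two cpdf command lines, with -remove-fonts before -squeeze."""
    return [
        ["cpdf", "-remove-fonts", str(src), "-o", str(tmp)],
        ["cpdf", "-squeeze", str(tmp), "-o", str(dst)],
    ]


def run_commands(commands):
    """Run each command in order, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

Calling `run_commands(build_commands("in.pdf", "nofonts.pdf", "out.pdf"))` would write the intermediate file `nofonts.pdf` and the final `out.pdf`. Only use the first step on files known to contain nothing but Standard 14 fonts.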

At some point in the future, cpdf will gain the ability to process images through external processes, but it doesn't have it yet. I made a feature request here: https://github.com/johnwhitington/cpdf-source/issues/244

necros2k7 commented 2 years ago

Can we extract the graphics from a PDF losslessly, optimize them, and then reinsert them into the source PDF without losing any text data?

johnwhitington commented 2 years ago

No, that sort of round-tripping is what I suggest in https://github.com/johnwhitington/cpdf-source/issues/244