Support lossless compression of images inside PDF files

fulldecent commented 10 years ago

Similar to compression of images inside ZIP files

kornelski commented 10 years ago

Do you know any tool that can take PDF apart and then put it back together with new images?

fulldecent commented 10 years ago

PDFBox can:

https://pdfbox.apache.org/ Note: https://stackoverflow.com/questions/17423665/replace-image-in-pdf-with-another-image-pdf-box

Other links:

Can extract images: pdfimages http://cgit.freedesktop.org/poppler/poppler/tree/utils
Can sanitize PDF (lossy): http://www.decalage.info/python/origapy
Ghostscript can rewrite PDF
Adobe Acrobat has "replace image" feature http://helpx.adobe.com/acrobat/using/edit-images-or-objects-pdf.html
Can replace images: http://www.enfocus.com/en/pdf-editing/place-and-replace-images-in-a-pdf/

kornelski commented 10 years ago

Cool, it might be possible :) http://esec-lab.sogeti.com/pages/Origami

fulldecent commented 10 years ago

Upstream issue with Origami https://code.google.com/p/origami-pdf/issues/detail?id=24&thanks=24&ts=1403971155

Jellyfrog commented 10 years ago

+1

fulldecent commented 9 years ago

Please star these upstream issues to keep up with progress on this issue (and to help raise awareness in those projects):

Also, I just reached out to Philippe Lagadec from http://www.decalage.info/python/origapy#attachments

fulldecent commented 9 years ago

Another option here might be https://github.com/itext/itextpdf

boazsegev commented 9 years ago

I wrote a Ruby gem, called combine_pdf, that takes PDF files apart and puts them back together (usually after injecting more data such as pages, watermarks, etc. into the PDF).

Adding image manipulation (inflate, deflate using new filters and update the filters applied in the :Filter object data) shouldn't be too difficult...

It is a simple matter of collecting all the PDF Image objects and updating their :raw_stream_content and :Filter properties

require 'combine_pdf'
# load the pdf - can also be done from memory using the #parse method
pdf = CombinePDF.load 'original_file.pdf'

# iterate over all the PDF objects and update any Image objects
pdf.objects.each do |obj|
  if obj[:Subtype] == :Image
    # do stuff here to the obj[:raw_stream_content] and obj[:Filter] data...
  end
end

# save the data - can also be done to memory using the #to_pdf method
pdf.save 'new_file.pdf'

...but my gem is still a bit limited regarding compression and encryption, which I know very little about. So I doubt it will work with any PDF out there (we can increase support for decryption and deflation together, or you could patch it in your own code if you want).

fulldecent commented 9 years ago

Would you be able to help edit this script so that it can extract all the images with:

./script sample.pdf unarchive outputdir/

Then we modify all the images in outputdir/ and then combine everything back together with:

./script sample-new.pdf archive outputdir/

To reconstitute the PDF using the new images?

Ideally the program would return non-zero if it encountered a situation it can't handle, rather than creating a broken file. We are very interested in being a safe program. Would rather leave the file untouched than decrypt something we can't re-encrypt.
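The "leave the file untouched" requirement can be sketched with a write-to-temp-then-rename pattern. This is a minimal sketch using only the Ruby standard library; `replace_file_safely` is a hypothetical helper name, not part of combine_pdf, and it assumes the new PDF bytes can be produced in memory by the block.

```ruby
require 'tempfile'

# Hypothetical helper (not part of combine_pdf): builds the new file content
# in a temporary file next to the target and only renames it over the target
# once everything has succeeded. Any error leaves the original untouched.
def replace_file_safely(path)
  tmp = Tempfile.create(['rewrite', '.tmp'], File.dirname(path))
  tmp.binmode
  tmp.write(yield)              # the block returns the new file content
  tmp.close
  File.rename(tmp.path, path)   # atomic when both paths share a filesystem
  true
rescue StandardError => e
  warn "Leaving #{path} untouched: #{e.message}"
  if tmp
    tmp.close unless tmp.closed?
    File.delete(tmp.path) if File.exist?(tmp.path)
  end
  false
end
```

A caller could wrap the save step as `replace_file_safely(path) { pdf.to_pdf }` and exit with a non-zero status when it returns false.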

boazsegev commented 9 years ago

I wrote this for you, look at the code and the output and tell me what you might need to use it.

Please notice that there is limited inflation support (so the script will fail with some PDF files). It would be nice to get some help with that, as I know very little about data compression.

Also notice that the deflation support is missing for now (deflate_object will have no effect). Let me know what you might want the method to do. I can implement a basic zlib deflation quite easily...
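For reference, a basic zlib deflation of a stream could look roughly like the sketch below. It works on a plain Ruby hash that uses the same keys as in this thread (:raw_stream_content, :Filter) plus :Length; the helper names are hypothetical and this is not how combine_pdf implements its own filters. It also assumes no predictor or DecodeParms are involved.

```ruby
require 'zlib'

# Hypothetical helpers operating on a plain hash, not combine_pdf's own API.
# PDF's FlateDecode filter uses the zlib format, which Zlib::Deflate produces.
def deflate_stream!(obj)
  return obj if obj[:Filter]                      # already filtered, leave it alone
  raw = obj[:raw_stream_content]
  packed = Zlib::Deflate.deflate(raw, Zlib::BEST_COMPRESSION)
  return obj if packed.bytesize >= raw.bytesize   # no gain, keep the original
  obj[:raw_stream_content] = packed
  obj[:Filter] = :FlateDecode
  obj[:Length] = packed.bytesize
  obj
end

def inflate_stream!(obj)
  return obj unless obj[:Filter] == :FlateDecode  # only the simple case is handled
  raw = Zlib::Inflate.inflate(obj[:raw_stream_content])
  obj[:raw_stream_content] = raw
  obj.delete(:Filter)
  obj[:Length] = raw.bytesize
  obj
end
```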

Lastly, I will point out that not all encrypted PDF files are currently supported (I wouldn't mind some help with that too).

These issues apply to the minority of situations, but I think it's good to know.

Just save the following code to a file called script and use chmod +x ./script to update its permissions. It should work as per your specifications.

#!/usr/bin/env ruby
# encoding: UTF-8

require 'combine_pdf'
require 'json'

if ARGV.length < 3 || ARGV[0][0..1] == '-h' || ARGV[0][0] == '?' || !( File.exist?(ARGV[0]) && !File.directory?(ARGV[0]) && File.exist?(ARGV[2]) && File.directory?(ARGV[2]) )
    puts 'use:'
    puts './script sample.pdf unarchive outputdir/'
    puts 'or:'
    puts './script sample-new.pdf archive outputdir/'
    puts ''
    puts 'please notice:'
    puts '- pdf file must exist for both unarchive and archive commands.'
    puts "- outputdir must be an existing folder. its content will be overwritten."
    exit -1
end

# load the pdf - can also be done from memory using the #parse method
begin
   pdf = CombinePDF.load ARGV[0]
rescue Exception => e
   puts "Couldn't open file - #{e.message}"
   exit -1   
end

i = 0
# iterate over all the PDF objects and update any Image objects
pdf.objects.each do |obj|
  if obj[:Subtype] == :Image
    # archive script
    if ARGV[1][0] == 'a'

        begin
            # clear existing data - as all data will be loaded from the existing meta-data file
            obj.clear
            # rewrite the raw stream
            obj[:raw_stream_content] = IO.binread File.join(ARGV[2], "image_#{i = i + 1}.data")
            # rewrite the meta-data in the exported ruby meta-data file (in case the data was updated).
            obj.update eval(IO.read(File.join(ARGV[2], "image_#{i}.meta.rb")))
            ## unsupported just yet, but tell me if you feel it's important
            ## or if you will do it yourself (editing the meta-data)
            CombinePDF::PDFFilter.deflate_object obj
        rescue Exception => e
            puts "Sorry, an unknown error has occurred: #{e.message}"
            exit -1
        end

    # unarchive script
    elsif ARGV[1][0] == 'u'

        begin
            CombinePDF::PDFFilter.inflate_object obj
            IO.binwrite File.join(ARGV[2], "image_#{i = i + 1}.data"), obj.delete(:raw_stream_content)
            IO.binwrite File.join(ARGV[2], "image_#{i}.meta.rb"), obj.to_s
        rescue Exception => e
            puts 'Sorry, an error has occurred.'
            puts "It is possible that we couldn't unarchive some of the images or that the file is encrypted."
            puts "Error message: #{e.message}"
            exit -1
        end

    end
  end
end

# save the data, if archiving
if ARGV[1][0] == 'a'
    begin
        pdf.save ARGV[0]
    rescue Exception => e
        puts "Sorry, couldn't write file - #{e.message}"
        exit -1
    end
end

exit 0

fulldecent commented 9 years ago

This looks wonderful. @pornel any feedback?

kornelski commented 9 years ago

This looks cool. Thanks!

Right now I'm busy with development of mozjpeg and JPEG XT.

When I'm done with these I plan to look into the architecture of ImageOptim to make such optimizations possible (currently ImageOptim assumes one optimization = one file, but for PDF and such it needs to be one-to-many, possibly a tree), and then I'll be adding PDF support.

boazsegev commented 9 years ago

Cool, Good luck with your projects.

You can always get back to me with any questions or requests...

joelkesler commented 7 years ago

Hi @pornel - any news if this is still something you'd be interested in pursuing for ImageOptim?

kornelski commented 7 years ago

In general, yes. I've started a bit of refactoring, but there's much more work to do on that.

joelkesler commented 7 years ago

Understandable. I tried @boazsegev's suggestion above, and while it did let me remove the assets so I could run them through ImageOptim, when I printed the optimized PDF, everything went magenta. Not sure why; perhaps color profile information was lost or corrupted. Or our printer is the devil. The PDFs displayed fine, however.

boazsegev commented 7 years ago

@joelkesler - that's interesting. Did you update the color profile and metadata in the image_XXX.meta.rb files? Did you format the data in a PDF friendly manner?

I'd guess there was a change from CMYK to RGB (or vice versa) during the image optimization?
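One way to test that guess is to compare the number of color components in the JPEG before and after optimization. The sketch below reads the component count from the JPEG frame header (SOF marker); `jpeg_component_count` is a hypothetical helper name. As a rule of thumb, 1 component is grayscale, 3 is RGB/YCbCr and 4 is CMYK/YCCK, so a change in this number between the two .data files would point to a color model conversion.

```ruby
# Frame header markers (SOF0..SOF15), excluding DHT (C4), JPG (C8) and DAC (CC).
SOF_MARKERS = ((0xC0..0xCF).to_a - [0xC4, 0xC8, 0xCC]).freeze

# Hypothetical helper: returns the number of color components declared in the
# JPEG frame header, or nil when the data does not look like a JPEG.
def jpeg_component_count(bytes)
  data = bytes.b
  return nil unless data.byteslice(0, 2) == "\xFF\xD8".b   # SOI marker
  pos = 2
  while pos + 4 <= data.bytesize
    return nil unless data.getbyte(pos) == 0xFF
    marker = data.getbyte(pos + 1)
    length = data.byteslice(pos + 2, 2).unpack1('n')
    # Layout after the length: precision (1), height (2), width (2), components (1)
    return data.getbyte(pos + 9) if SOF_MARKERS.include?(marker)
    return nil if marker == 0xDA                            # scan data reached
    pos += 2 + length
  end
  nil
end
```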

P.S.

I'm assuming your printer is fine... but I might be wrong ;-)

joelkesler commented 7 years ago

Hmm, I looked at the images before and after ImageOptim went through them and I did not find any oddness, but I may not know where to look.

Attached is a zip file with the before PDF, the after PDF, the script, and the output from the script after being run through ImageOptim but before being re-PDF'd.

PDF ImageOptim Experiment.zip

If you do not have time to look at it, no worries :)

boazsegev commented 7 years ago

Yeah, I'm a bit swamped with my projects, but I had a quick look.

The PDF image color space is unchanged between the Before.pdf and After.pdf. It's ICC based (meaning it's detailed in an attached file).

I don't think the metadata in the metadata ruby files was updated in response to the image format change... I only saw an update to the "Length" field.

The image was totally different though, the compressed one being JFIF (JPEG) compressed (I couldn't tell what the original was, but it was different).

JPEG doesn't define a colorspace, which is often device dependent unless colorspace metadata (an ICC profile) is attached. JPEG2000 has data attached, but I'm not sure if that's true of this format.

To fix the metadata, the ICC data should be separated from the JPEG image and used to replace the existing ICC metadata.
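Separating the ICC data from a JPEG could be sketched as below. JPEG files carry embedded ICC profiles in APP2 segments that start with the identifier "ICC_PROFILE" followed by a NUL byte, a 1-based chunk number and the total chunk count. `extract_jpeg_icc` is a hypothetical helper name, and the sketch assumes a well-formed file with no padding bytes between segments.

```ruby
ICC_TAG = "ICC_PROFILE\x00".b

# Hypothetical helper: collects the ICC profile chunks stored in APP2 segments
# and returns the reassembled profile bytes, or nil when none is embedded.
def extract_jpeg_icc(bytes)
  data = bytes.b
  return nil unless data.byteslice(0, 2) == "\xFF\xD8".b
  chunks = {}
  pos = 2
  while pos + 4 <= data.bytesize && data.getbyte(pos) == 0xFF
    marker = data.getbyte(pos + 1)
    break if marker == 0xDA                       # image data starts here
    length = data.byteslice(pos + 2, 2).unpack1('n')
    break if length < 2                           # malformed segment
    payload = data.byteslice(pos + 4, length - 2)
    if marker == 0xE2 && payload.bytesize > 14 && payload.start_with?(ICC_TAG)
      sequence = payload.getbyte(12)              # 1-based chunk number
      chunks[sequence] = payload.byteslice(14, payload.bytesize - 14)
    end
    pos += 2 + length
  end
  return nil if chunks.empty?
  chunks.sort.map { |_, chunk| chunk }.join       # reassemble in chunk order
end
```

The returned bytes could then be used as the stream content of the :ICCBased object in place of the old profile.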

Another option is to remove the ICC metadata altogether and utilize only a device dependent interpretation, using an appropriate colorspace keyword such as DeviceRGB, DeviceCMYK, DeviceGray, etc. (see section 8.6.3 of the PDF specification).

I'm not very good with image formats, but if the image has JPEG2000 data, the best option might be to remove the ColorSpace data altogether and set the :Filter to :JPXDecode, as I understand from section 8.9.5.1 of the specification:

ColorSpace

name or array

(Required for images, except those that use the JPXDecode filter; not allowed for image masks) The colour space in which image samples shall be specified; it can be any type of colour space except Pattern.

If the image uses the JPXDecode filter, this entry may be present:

• If ColorSpace is present, any colour space specifications in the JPEG2000 data shall be ignored.

• If ColorSpace is absent, the colour space specifications in the JPEG2000 data shall be used. The Decode array shall also be ignored unless ImageMask is true.
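The presence rules quoted above can be expressed as a small check. This is only a sketch over a plain hash with symbol keys like the ones used in the meta files in this thread; `colorspace_problem` is a hypothetical helper name and it covers only the rules in the quoted passage.

```ruby
# Hypothetical helper: returns a description of the problem, or nil when the
# image dictionary satisfies the ColorSpace presence rules quoted above.
def colorspace_problem(image)
  filters = Array(image[:Filter])   # :Filter may be a single name or an array
  if image[:ImageMask] == true
    return 'ColorSpace is not allowed for image masks' if image.key?(:ColorSpace)
  elsif !image.key?(:ColorSpace) && !filters.include?(:JPXDecode)
    return 'ColorSpace is required unless the JPXDecode filter is used'
  end
  nil
end
```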

Anyway, that's all I've got for now.

kornelski commented 7 years ago

I don't know specifics of PDF, but I'd suggest always avoiding Device* profiles. They will make images look awful on wide gamut displays. If you have an option to set sRGB for everything, that will be much more likely to work well.

boazsegev commented 7 years ago

I'm the least knowledgeable among us when it comes to image data, but I can lend a hand when dealing with the PDF format.

From your suggestion it seems that the best option is overwriting the ColorSpace array with the new ICCBased data.

Another thought... The PDF format doesn't specify a DevicesRGB (as far as I know), but it does provide a CalRGB for calibrated tone output.

This ColorSpace keyword object would require extra metadata, specifically the WhitePoint metadata (which is required, the rest is optional).

Also, similar to the way ICC data is a linked object, so should the CalRGB be used, i.e. :ColorSpace => [:CalRGB, {is_reference_only: true, referenced_object: {WhitePoint: [x, 1, z]} }] (x and z are the diffused white point coordinates).

Having said that:

From what I understand, ColorSpace defines the number of bits per color (pixel / pixel group) and the way they are interpreted. I think RGB has 24 bits (3 bytes) per color, CMYK has 32 bits (4 bytes) per color, etc.

I wonder whether setting a similar ColorSpace for everything (i.e. an ICC profile for sRGB, which I think should be 36 or 48 bits per color) could cause errors when reading the data. What do you think?

kornelski commented 7 years ago

Colorspace definitions are for specific kinds of color spaces, usually RGB or CMYK or Gray. Colorspace of the profile has to be compatible with the image data, i.e. you can't use an RGB profile for a CMYK JPEG, or CMYK profile for an RGB PNG.

But color profiles are independent of bit depth (8/16-bit per channel) and presence of alpha channel.
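The compatibility rule can be checked mechanically, since an ICC profile declares its data color space in its header (a 4-byte signature at byte offset 16, with the "acsp" file signature at offset 36). The sketch below maps that signature to a component count; the helper names are hypothetical, and only the Gray, RGB and CMYK cases are handled.

```ruby
ICC_COMPONENTS = { 'GRAY'.b => 1, 'RGB '.b => 3, 'CMYK'.b => 4 }.freeze

# Hypothetical helper: number of color components the profile describes,
# or nil for non-ICC data and for color spaces not listed above.
def icc_component_count(icc_bytes)
  header = icc_bytes.b
  return nil unless header.bytesize >= 128 && header.byteslice(36, 4) == 'acsp'.b
  ICC_COMPONENTS[header.byteslice(16, 4)]
end

# True when the profile can describe image data with that many components,
# e.g. an RGB profile for 3-component data but not for 4-component CMYK data.
def profile_fits_image?(icc_bytes, image_components)
  icc_component_count(icc_bytes) == image_components
end
```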

BTW, I'll need to test whether ImageOptim handles CMYK gracefully (I'm afraid it doesn't).

CalRGB looks incomplete. To define a colorspace you need all of:

White point (which is almost always D65)
Gamma curve
Coordinates of R, G, and B points in the XYZ color space

Attaching an sRGB profile as an ICC file may be an option too. The minimal one is 540 bytes.

Also can PDF work with no profile defined at all? Does it fall back to what the image contains? (e.g. sRGB chunk in PNGs?)

boazsegev commented 7 years ago

Thank you for explaining a bit about what a color profile is.

The ColorSpace metadata is required by the specification for all images "except those that use the JPXDecode filter".

I assume most (some?) readers will fall back gracefully when encountering a non-conforming PDF file (if the ColorSpace is missing)...

...however, I think this is both more risky and less desirable than using a specification conforming solution, even if it means using a generic ICC file that's "good enough".

Now that I understand a bit more about color profiles, I assume it would be a simple thing to attach an sRGB color profile, replacing the existing :ICCBased ColorSpace object. As long as we know all the optimized images are (s)RGB, it should probably work fine if the same ICC data is used for all images.

As for CalRGB, the PDF object allows for:

WhitePoint (required)
BlackPoint (optional)
Gamma (optional)
Matrix (optional).

I assume this covers everything you mentioned. Default values are provided for any missing property (i.e., the RGB XYZ Matrix defaults to: [1, 0, 0, 0, 1, 0, 0, 0, 1]).

The D65 WhitePoint should translate to: [ 0.9505, 1.0, 1.089]... but I'm not sure about this - it's the example value for D65 in the PDF specification. According to the same example, Gamma is set at 1.8, resulting in a [1.8, 1.8, 1.8] value... but the example includes an RGB XYZ matrix for "Sony Trinitron phosphor chromaticities", whatever that may be.
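For comparison, a CalRGB entry that approximates sRGB might look like the sketch below. The WhitePoint is the D65 value quoted above. The Matrix uses the commonly published XYZ coordinates of the sRGB primaries and the Gamma of 2.2 is the usual approximation of the sRGB curve; both are assumptions that are not taken from this thread. The dictionary is written as a direct hash rather than the linked-object layout, and whether combine_pdf serializes it this way is untested.

```ruby
# XYZ coordinates of the sRGB primaries (D65), as commonly published.
SRGB_RED   = [0.4124, 0.2126, 0.0193].freeze
SRGB_GREEN = [0.3576, 0.7152, 0.1192].freeze
SRGB_BLUE  = [0.1805, 0.0722, 0.9505].freeze

# Hypothetical helper: a CalRGB color space array approximating sRGB.
# Matrix order follows the PDF convention [XA YA ZA XB YB ZB XC YC ZC].
def approximate_srgb_calrgb
  [:CalRGB, {
    WhitePoint: [0.9505, 1.0, 1.089],
    Gamma: [2.2, 2.2, 2.2],
    Matrix: SRGB_RED + SRGB_GREEN + SRGB_BLUE
  }]
end
```

A quick sanity check on such a matrix is that the three primaries add up to the white point on each axis, which holds for the values above.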

kornelski commented 7 years ago

OK, so attaching ICC may be the way to go. Here's a small profile file you can use: srgb.icc

Complete settings for CalRGB make sense, although the gamma given as a number is slightly less precise than what the ICC file can define (it'd have to be x^2.2 for sRGB, but sRGB's curve is a bit different in dark areas).

boazsegev commented 7 years ago

Okay... I updated the script to use the attached ICC file, but I have no idea if it's really working (no printer and all).

This was just a quick hack.

You'll notice I placed the ICC data in a global variable in the script. This way we're using the same object (instead of having multiple copies of the ICC profile).

This should save memory during the processing, but CombinePDF should optimize duplicates anyway, so we're getting the same result as if the optimization was performed, just without the intermediate objects and memory cost.

I'm not sure this is the way to go, since I have no idea if all the images are optimized or only some. If a new copy of the ICC file is placed for each processed image, it would allow a more graceful fallback when some images are left unprocessed.

You can find the code I used here: PDF.ImageOptim.Experiment2.zip

The After.pdf file should already contain the updated ICC data... I think.

boazsegev commented 7 years ago

Ooops... I didn't save the script before compressing the folder....

Here's the script:

#!/usr/bin/env ruby
# encoding: UTF-8

# https://github.com/ImageOptim/ImageOptim/issues/49

require 'combine_pdf'
require 'json'

if ARGV.length < 3 || ARGV[0][0..1] == '-h' || ARGV[0][0] == '?' || !( File.exist?(ARGV[0]) && !File.directory?(ARGV[0]) && File.exist?(ARGV[2]) && File.directory?(ARGV[2]) )
    puts 'use:'
    puts './script sample.pdf unarchive outputdir/'
    puts 'or:'
    puts './script sample-new.pdf archive outputdir/'
    puts ''
    puts 'please notice:'
    puts '- pdf file must exist for both unarchive and archive commands.'
    puts "- outputdir must be an existing folder. its content will be overwritten."
    exit -1
end

# load the pdf - can also be done from memory using the #parse method
begin
   pdf = CombinePDF.load ARGV[0]
rescue Exception => e
   puts "Couldn't open file - #{e.message}"
   exit -1
end

i = 0
def get_icc_colorspace
  return $icc_data if $icc_data
  icc_file_dump = IO.binread 'imageoptim-srgb.icc'
  $icc_data = [:ICCBased, {:is_reference_only=>true,
                          :referenced_object=> { :Length => icc_file_dump.bytesize,
                                                  :Alternate=>:DeviceRGB,
                                                  :raw_stream_content=> icc_file_dump}]
end
# iterate over all the PDF objects and update any Image objects
pdf.objects.each do |obj|
  if obj[:Subtype] == :Image
    # archive script
    if ARGV[1][0] == 'a'

        begin
            # clear existing data - as all data will be loaded from the existing meta-data file
            obj.clear
            # rewrite the raw stream
            obj[:raw_stream_content] = IO.binread File.join(ARGV[2], "image_#{i = i + 1}.data")
            # rewrite the meta-data in the exported ruby meta-data file (in case the data was updated).
            obj.update eval(IO.read(File.join(ARGV[2], "image_#{i}.meta.rb")))
            obj[:ColorSpace] = get_icc_colorspace
            ## unsupported just yet, but tell me if you feel it's important
            ## or if you will do it yourself (editing the meta-data)
            CombinePDF::PDFFilter.deflate_object obj
        rescue Exception => e
            puts "Sorry, an unknown error has occurred: #{e.message}"
            exit -1
        end

    # unarchive script
    elsif ARGV[1][0] == 'u'

        begin
            CombinePDF::PDFFilter.inflate_object obj
            IO.binwrite File.join(ARGV[2], "image_#{i = i + 1}.data"), obj.delete(:raw_stream_content)
            IO.binwrite File.join(ARGV[2], "image_#{i}.meta.rb"), obj.to_s
        rescue Exception => e
            puts 'Sorry, an error has occurred.'
            puts "It is possible that we couldn't unarchive some of the images or that the file is encrypted."
            puts "Error message: #{e.message}"
            exit -1
        end

    end
  end
end

# save the data, if archiving
if ARGV[1][0] == 'a'
    begin
        pdf.save ARGV[0]
    rescue Exception => e
        puts "Sorry, couldn't write file - #{e.message}"
        exit -1
    end
end

exit 0

joelkesler commented 7 years ago

Thanks again for your help @boazsegev and @pornel.

I tried the updated version of the script and found two issues. First, the updated code with color profile support was missing a closing '}' on line 38, after :raw_stream_content=> icc_file_dump}, so I added that in:

def get_icc_colorspace
  return $icc_data if $icc_data
  icc_file_dump = IO.binread 'imageoptim-srgb.icc'
  $icc_data = [:ICCBased, {:is_reference_only=>true,
                          :referenced_object=> { :Length => icc_file_dump.bytesize,
                                                  :Alternate=>:DeviceRGB,
                                                  :raw_stream_content=> icc_file_dump}}]
end

Second, once I did that, PDFs built/archived from the script resulted in blank white pages, both on screen and when printed.
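One detail that may be worth checking: the PDF specification lists an N entry (the number of color components: 1, 3 or 4) as required in an ICCBased stream dictionary, and the snippet above does not set it. Whether that explains the blank pages is untested. A sketch of the same structure with N included, reusing the hash layout from the script, could look like this; `icc_based_colorspace` is a hypothetical helper name and how combine_pdf serializes the extra key is an assumption.

```ruby
# Device color space to fall back on for each allowed component count.
ALTERNATES = { 1 => :DeviceGray, 3 => :DeviceRGB, 4 => :DeviceCMYK }.freeze

# Hypothetical helper: builds the ICCBased color space array in the same
# layout as the script above, adding the N entry. Raises KeyError when the
# component count is not 1, 3 or 4.
def icc_based_colorspace(icc_bytes, components)
  [:ICCBased, {
    is_reference_only: true,
    referenced_object: {
      N: components,
      Alternate: ALTERNATES.fetch(components),
      Length: icc_bytes.bytesize,
      raw_stream_content: icc_bytes
    }
  }]
end
```

For the sRGB profile used here the call would be `icc_based_colorspace(IO.binread('imageoptim-srgb.icc'), 3)`.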

Attached are the results for each variation of the test and the test scripts themselves.

PDF Experiment 2 Results.zip

After 1 - Unarchived and Archived - No Colour Profile and No Changes - Prints out Normally.pdf
After 2 - Unarchived and Archived - No Colour Profile and ImageOptim Lossless Compression - Prints out Pink.pdf
After 3 - boazsegev version from above attachment with addition of Colour Profiles? - Prints out pink.pdf
After 4 - Unarchived and Archived - Colour Profile Added and ImageOptim Lossless Compression - Nothing but White.pdf
After 4 - Unarchived and Archived - Colour Profile Added and No Changes - Nothing but White.pdf

ylluminate commented 5 years ago

Did this ever get integrated?
