coherentgraphics / cpdf-binaries

PDF Command Line Tools binaries for Linux, Mac, Windows

Error when creating a file from a JPG/Exif image #83

Closed 2bllk closed 7 months ago

2bllk commented 7 months ago

In the manual, paragraph 17.3 «Make a PDF from a PNG or JPEG image» states the following:

Almost any JPEG file may be used:

cpdf -jpeg image.jpg -o out.pdf

When I tried to run this command with a random JPEG image, I got the following error:

jpeg_dimensions: Not a valid SOI header

This error occurs for any images that use the Exif standard rather than JFIF. Most likely, the program code is looking for the JFIF format signature in the file, and if it does not find it, it creates an error.

When I changed the Exif application segment to a JFIF segment in the binary file of an image that failed to convert to PDF, the conversion was successful.

Most likely the image validation in the utility code looks something like this:

"""
Source: https://stackoverflow.com/questions/2517854/getting-image-size-of-jpeg-from-its-binary#:~:text=The%20header%20of%20a%20JPEG,Of%20Frame%2C%20type%20N).
"""
def get_jpeg_size(data):
   """
   Gets the JPEG size from the array of data passed to the function, file reference: http://www.obrador.com/essentialjpeg/headerinfo.htm
   """
   data_size=len(data)
   #Check for valid JPEG image
   i=0   # Keeps track of the position within the file
   if(data[i] == 0xFF and data[i+1] == 0xD8 and data[i+2] == 0xFF and data[i+3] == 0xE0): 
   # Check for valid JPEG header (null terminated JFIF)
      i += 4
      if(data[i+2] == ord('J') and data[i+3] == ord('F') and data[i+4] == ord('I') and data[i+5] == ord('F') and data[i+6] == 0x00):
         #Retrieve the block length of the first block since the first block will not contain the size of file
         block_length = data[i] * 256 + data[i+1]
         while (i<data_size):
            i+=block_length               #Increase the file index to get to the next block
            if(i >= data_size): return False;   #Check to protect against segmentation faults
            if(data[i] != 0xFF): return False;   #Check that we are truly at the start of another block
            if(data[i+1] == 0xC0):          #0xFFC0 is the "Start of frame" marker which contains the image size
               #The structure of the 0xFFC0 block is quite simple [0xFFC0][ushort length][uchar precision][ushort height][ushort width]
               height = data[i+5]*256 + data[i+6];
               width = data[i+7]*256 + data[i+8];
               return height, width
            else:
               i+=2;                              #Skip the block marker
               block_length = data[i] * 256 + data[i+1]   #Go to the next block
         return False                   #If this point is reached then no size was found
      else:
         return False                  #Not a valid JFIF string
   else:
      return False                     #Not a valid SOI header

with open('path/to/file.jpg','rb') as handle:
   data = handle.read()

h, w = get_jpeg_size(data)
print(h, w)

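In this code, validation first checks the signature 0xFF 0xD8 0xFF 0xE0, where 0xFF 0xD8 is the SOI (Start of Image) marker and 0xFF 0xE0 is the APP0 marker that introduces a JFIF segment. It then checks for the null-terminated identifier JFIF.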

When I changed the values of 0xFF 0xE1 (Exif segment APP1 marker) to 0xFF 0xE0 (JFIF segment APP0 marker) in the image, and also changed the value of Exif to JFIF at offset 0x06, the image passed the validation in your program.
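
The byte-level change described above can be reproduced with a few lines of Python. This is only a sketch of the diagnostic hack, assuming the file starts with SOI followed directly by an Exif APP1 segment; the function name `patch_exif_to_jfif` is made up for illustration and is not part of cpdf.

```python
def patch_exif_to_jfif(data: bytes) -> bytes:
    """Relabel a leading Exif APP1 segment as a JFIF APP0 segment.

    Diagnostic only: the segment payload is left untouched, so the result
    is not a meaningful JFIF header. It merely changes the bytes that a
    strict signature check looks at.
    """
    if data[0:4] != b"\xFF\xD8\xFF\xE1" or data[6:11] != b"Exif\x00":
        raise ValueError("file does not start with SOI + Exif APP1")
    patched = bytearray(data)
    patched[3] = 0xE0          # APP1 marker (0xFFE1) becomes APP0 (0xFFE0)
    patched[6:10] = b"JFIF"    # identifier at offset 0x06
    return bytes(patched)
```

Because the Exif payload is still sitting under a JFIF label, this is useful only to confirm that the signature check is what rejects the file.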

Please add the possibility to work with JPEG/Exif images. Most likely it will be enough to change the validation process.

johnwhitington commented 7 months ago

Coincidentally, I noticed this yesterday with a MacOS screenshot, I think it was.

Thanks for all the details. Looks like an easy fix.

The code is here: https://github.com/johnwhitington/cpdf-source/blob/master/cpdfjpeg.ml

It's a transliteration into OCaml of the well-known snippet of code you quote, I believe.

2bllk commented 7 months ago

Yes, it is the same code. The comment in your code has the same author as the site listed in the answer on Stack Overflow.

I tweaked the Python code mentioned earlier. The validation works correctly.

def validate_jpeg(data):
    i = 0
    if(data[i] == 0xFF and data[i+1] == 0xD8):
        if (data[i+2] == 0xFF and data[i+3] == 0xE0):
            if (data[i+6] == ord('J') and data[i+7] == ord('F') and data[i+8] == ord('I') and data[i+9] == ord('F') and data[i+10] == 0x00):
                return True # Valid JFIF string
            else:
                return False # Not a valid JFIF string
        else:
            if (data[i+2] == 0xFF and data[i+3] == 0xE1):
                if (data[i+6] == ord('E') and data[i+7] == ord('x') and data[i+8] == ord('i') and data[i+9] == ord('f') and data[i+10] == 0x00):
                    return True # Valid Exif string
                else:
                    return False # Not a valid Exif string
            else:
                return False # No valid JFIF or Exif block
    else:
        return False #Not a valid SOI header

def get_jpeg_size(data):
    """
    Gets the JPEG size from the array of data passed to the function, file reference: http://www.obrador.com/essentialjpeg/headerinfo.htm
    """
    #Check for valid JPEG image
    if(not validate_jpeg(data)):
        return False

    data_size=len(data)

    i=4   # Keeps track of the position within the file
    #Retrieve the block length of the first block since the first block will not contain the size of file
    block_length = data[i] * 256 + data[i+1]
    while (i<data_size):
        i+=block_length               #Increase the file index to get to the next block
        if(i >= data_size): return False;   #Check to protect against segmentation faults
        if(data[i] != 0xFF): return False;   #Check that we are truly at the start of another block
        if(data[i+1] == 0xC0):          #0xFFC0 is the "Start of frame" marker which contains the image size
            #The structure of the 0xFFC0 block is quite simple [0xFFC0][ushort length][uchar precision][ushort height][ushort width]
            height = data[i+5]*256 + data[i+6];
            width = data[i+7]*256 + data[i+8];
            return height, width
        else:
            i+=2;                              #Skip the block marker
            block_length = data[i] * 256 + data[i+1]   #Go to the next block
    return False                   #If this point is reached then no size was found
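
One possible way to relax the check further, sketched here under assumptions: skip the JFIF/Exif identifier test entirely, require only the SOI marker, and walk the segments until a Start of Frame marker turns up. The name `jpeg_dimensions_relaxed` and the choice to accept every SOF marker (not only 0xFFC0) are illustrative and not necessarily what cpdf does.

```python
import struct

# Markers 0xC0-0xCF carry frame dimensions, except DHT (0xC4),
# JPG (0xC8) and DAC (0xCC), which reuse that range for other purposes.
SOF_MARKERS = set(range(0xC0, 0xD0)) - {0xC4, 0xC8, 0xCC}

def jpeg_dimensions_relaxed(data):
    """Return (height, width), or None if no frame header is found."""
    if data[:2] != b"\xFF\xD8":              # SOI must come first
        return None
    i = 2
    while i + 4 <= len(data):
        if data[i] != 0xFF:                  # not at the start of a segment
            return None
        marker = data[i + 1]
        if marker == 0xFF:                   # fill byte before a marker
            i += 1
            continue
        if marker in (0xD9, 0xDA):           # EOI or SOS: no frame header ahead
            return None
        (length,) = struct.unpack(">H", data[i + 2:i + 4])
        if marker in SOF_MARKERS:
            if i + 9 > len(data):
                return None
            # [marker][ushort length][uchar precision][ushort height][ushort width]
            height, width = struct.unpack(">HH", data[i + 5:i + 9])
            return height, width
        i += 2 + length                      # length counts its own two bytes
    return None
```

Like the code above, it returns height first; unlike it, it does not care which application segments precede the frame header or in what order.
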
johnwhitington commented 7 months ago

Xref https://github.com/coherentgraphics/cpdf-binaries/issues/83

johnwhitington commented 7 months ago

Observation: loading an Exif JPEG into Adobe Acrobat and saving as a PDF rewrites the JPEG file to have a JFIF header:

ÿØÿàJFIFÿá+ExifMM*’˜®¶(1¾2Æ<Ú‡iðxAppleiPad (7th generation)HH17.0.32023:11:10 10:38:17iPad (7th generation)

Is it valid, therefore, to include the Exif file without rewriting its header?

Off topic, Acrobat also creates a nice metadata stream for the image:

<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 8.0-c001 79.328f76e, 2022/08/01-19:10:29        ">
   <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
      <rdf:Description rdf:about=""
            xmlns:tiff="http://ns.adobe.com/tiff/1.0/"
            xmlns:xmp="http://ns.adobe.com/xap/1.0/"
            xmlns:exif="http://ns.adobe.com/exif/1.0/">
         <tiff:Orientation>1536</tiff:Orientation>
         <tiff:YCbCrPositioning>256</tiff:YCbCrPositioning>
         <tiff:XResolution>1207959552/16777216</tiff:XResolution>
         <tiff:YResolution>1207959552/16777216</tiff:YResolution>
         <tiff:ResolutionUnit>512</tiff:ResolutionUnit>
         <tiff:Make>Apple</tiff:Make>
         <tiff:Model>iPad (7th generation)</tiff:Model>
         <xmp:ModifyDate>2023-11-10T10:38:17Z</xmp:ModifyDate>
         <xmp:CreatorTool>17.0.3</xmp:CreatorTool>
         <xmp:CreateDate>2023-11-10T10:38:17.026Z</xmp:CreateDate>
         <exif:ExifVersion>0.2.3.2</exif:ExifVersion>
2bllk commented 7 months ago

I may have understood what you mean, but not completely. Let me try to explain.

A JPEG is made up of segments. Each segment consists of a header (includes segment type and segment size) and content. Exif and JFIF are also stored as separate segments. These segments are called application segments. They are not involved in the rendering of the image and are only needed for storing and reading data by certain applications that support these segments. For example, Photoshop Save For Web saves the data it needs in the segment with the 0xFFEC marker. Exif is saved in the segment with the 0xFFE1 marker. JFIF is saved in the segment with marker 0xFFE0. There are many such segments, with markers ranging from 0xFFE0 to 0xFFEF.
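
The segment layout described above can be inspected with a short script. A minimal sketch, assuming well-formed input; `list_segments` is a hypothetical helper name, and it stops at the SOS marker because entropy-coded image data follows it.

```python
import struct

def list_segments(data):
    """Return [(marker, identifier, payload_length)] for segments up to SOS."""
    if data[:2] != b"\xFF\xD8":
        raise ValueError("missing SOI marker")
    segments = []
    i = 2
    while i + 4 <= len(data):
        if data[i] != 0xFF:
            raise ValueError(f"expected a marker at offset {i}")
        marker = data[i + 1]
        # The two-byte big-endian length includes the length field itself.
        (length,) = struct.unpack(">H", data[i + 2:i + 4])
        payload = data[i + 4:i + 2 + length]
        ident = ""
        if 0xE0 <= marker <= 0xEF:
            # Application segments conventionally begin with a
            # NUL-terminated identifier such as "JFIF" or "Exif".
            ident = payload.split(b"\x00", 1)[0].decode("ascii", "replace")
        segments.append((marker, ident, len(payload)))
        if marker == 0xDA:
            break
        i += 2 + length
    return segments
```

Printing `[(hex(m), ident) for m, ident, _ in list_segments(data)]` for a few files shows which application segments a given encoder writes and in what order.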

Both JFIF and Exif require their segment to come immediately after the SOI (Start of Image) marker. I believe the image is valid in each of these cases: SOI followed by the JFIF segment, SOI followed by the Exif segment, or SOI followed by JFIF and then Exif. There are also programs that write both segments in the other order: first Exif, then JFIF. cpdf currently rejects such images. Please read this Wikipedia article.

An example of an image downloaded from the internet that has the Exif (highlighted) after the SOI. ![image](https://github.com/coherentgraphics/cpdf-binaries/assets/45295926/1bec110a-00b2-46bd-bcf3-b802b7a57bb9)
Example of an image downloaded from the internet with JFIF (highlighted) after SOI. ![image](https://github.com/coherentgraphics/cpdf-binaries/assets/45295926/f4480d72-25ae-4bba-bf35-7a36c7802e5d)
Example of an image downloaded from the internet with JFIF (highlighted) after SOI followed by Exif. ![image](https://github.com/coherentgraphics/cpdf-binaries/assets/45295926/b569d8c2-c508-4825-92b9-a28c9c053215)
An example of an image taken by my smartphone camera, with SOI followed by Exif (highlighted in red) and followed by JFIF (highlighted in blue). ![image](https://github.com/coherentgraphics/cpdf-binaries/assets/45295926/7ed4f70c-2bbc-4d60-a82f-e8c5fb16ed6d)
johnwhitington commented 7 months ago

Right, but the question here is this: if I alter the code to accept these JPEGs, will PDFs with these JPEGs embedded in them be valid PDFs? The PDF standard merely says:

"The DCTDecode filter decodes grayscale or colour image data that has been encoded in the JPEG baseline format in accordance with ISO/IEC 10918 (all parts)."

From what you have said, and what I have read, it seems that both JFIF and EXIF count as sub-formats of ISO 10918, so I think we're ok.

What concerns me is that Adobe Acrobat, as described above, re-processes an EXIF JPEG before including it.

Note that I'm not worried about whether most PDF viewers will show such JPEGs - of course they will - I'm worried about technical adherence to the standard.
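
For context, a JPEG is placed in a PDF as an image XObject whose stream carries the JPEG bytes unchanged and names DCTDecode as its filter. A hedged sketch of what such an object looks like follows; `image_xobject` is a hypothetical helper, the colour space is assumed to be DeviceRGB with 8 bits per component, and none of this is taken from the cpdf source.

```python
def image_xobject(jpeg_bytes, width, height, colorspace="/DeviceRGB"):
    """Serialise a PDF image XObject stream that wraps JPEG data as-is."""
    dictionary = (
        "<< /Type /XObject /Subtype /Image "
        f"/Width {width} /Height {height} "
        f"/ColorSpace {colorspace} /BitsPerComponent 8 "
        f"/Filter /DCTDecode /Length {len(jpeg_bytes)} >>\n"
    ).encode("ascii")
    # The JPEG file, including any JFIF or Exif segments, goes in verbatim.
    return dictionary + b"stream\n" + jpeg_bytes + b"\nendstream"
```

Since the whole file is embedded, whatever application segments follow SOI end up inside the stream, which is why the question of standard conformance arises at all.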

johnwhitington commented 7 months ago

macOS Preview does not reprocess when producing a PDF, so I think I'm happy to implement now, on balance.

johnwhitington commented 7 months ago

Fixed in trunk. On the file I tried it on, the result came out sideways - but we'll leave obeying the Exif metadata until a later time.

Thanks for the report, and the detailed comments.
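
A sideways result is typical when the Exif Orientation tag (0x0112) asks for a rotation and the embedding tool ignores it. Reading that tag could look roughly like the sketch below; `exif_orientation` is a hypothetical helper that takes the payload of the APP1 segment (the bytes after the two-byte length), and only IFD0 is searched.

```python
import struct

def exif_orientation(app1_payload):
    """Return the Exif Orientation value (1-8), or None if it is absent."""
    if app1_payload[:6] != b"Exif\x00\x00":
        return None
    tiff = app1_payload[6:]                 # embedded TIFF structure
    if len(tiff) < 8:
        return None
    if tiff[:2] == b"II":
        endian = "<"                        # little-endian (Intel)
    elif tiff[:2] == b"MM":
        endian = ">"                        # big-endian (Motorola)
    else:
        return None
    magic, ifd0 = struct.unpack(endian + "HI", tiff[2:8])
    if magic != 42 or ifd0 + 2 > len(tiff):
        return None
    (count,) = struct.unpack(endian + "H", tiff[ifd0:ifd0 + 2])
    for n in range(count):
        start = ifd0 + 2 + 12 * n           # each IFD entry is 12 bytes
        entry = tiff[start:start + 12]
        if len(entry) < 12:
            return None
        tag, typ, cnt = struct.unpack(endian + "HHI", entry[:8])
        if tag == 0x0112 and typ == 3 and cnt == 1:   # SHORT, one value
            (value,) = struct.unpack(endian + "H", entry[8:10])
            return value
    return None
```

A value of 1 means no rotation is needed, 3 means rotate by 180 degrees, and 6 and 8 mean rotate by 90 and 270 degrees clockwise respectively for display; the remaining values add a mirror flip.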