KBNLresearch / pdfquad

Apache License 2.0

Image-level vs PDF object level characteristics #7

Open bitsgalore opened 1 month ago

bitsgalore commented 1 month ago

Several characteristics (resolution, ICC profiles) can be defined at either the image level (e.g. an ICC profile embedded in a JPEG) or the PDF object level, and the two definitions are not necessarily the same.

Might be helpful to do a detailed breakdown of a few examples to get a better grip on this. E.g.:

Examples could then be included in documentation, or a blog post.

bitsgalore commented 1 month ago

Comparison PDF objects vs embedded image data

ICC profiles

BKT-ecur002glas01_01.pdf

Pdfimages:

pdfimages -list BKT-ecur002glas01_01.pdf

Output (edited down to one image at page 10):

page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
  10     9 image    1556  2400  icc     3   8  jpeg   no        79  0   150   150  260K 2.4%

The value icc in the color column indicates that an ICC profile is defined at the PDF object level.
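For checking many images at once, a row of this listing can be split into named fields. The following is only a minimal sketch: the function name parse_pdfimages_row is made up for illustration, and it assumes every row has exactly the 16 whitespace-separated values shown in the listing above (with "object" and "ID" as two separate values).

```python
# Column names taken from the pdfimages -list header shown above.
FIELDS = ["page", "num", "type", "width", "height", "color", "comp", "bpc",
          "enc", "interp", "object", "ID", "x-ppi", "y-ppi", "size", "ratio"]
INT_FIELDS = {"page", "num", "width", "height", "comp", "bpc",
              "object", "ID", "x-ppi", "y-ppi"}


def parse_pdfimages_row(line):
    """Split one data row of 'pdfimages -list' output into a dict.

    Hypothetical helper: assumes exactly one whitespace-separated
    value per column, as in the rows quoted in this issue.
    """
    values = line.split()
    if len(values) != len(FIELDS):
        raise ValueError("unexpected number of columns: %d" % len(values))
    row = dict(zip(FIELDS, values))
    for key in INT_FIELDS:
        row[key] = int(row[key])
    return row


row = parse_pdfimages_row(
    "  10     9 image    1556  2400  icc     3   8  jpeg   no"
    "        79  0   150   150  260K 2.4%"
)
print(row["color"], row["object"])
```

The color and object values are the two that matter for the next step: the first says whether to expect an ICC profile at the PDF level, the second says which object to extract.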

So let's extract the Image XObject that represents this image, using its object number (79):

mutool show BKT-ecur002glas01_01.pdf 79 > 79.dat

Result:

79 0 obj
<<
  /Width 1556
  /BitsPerComponent 8
  /Name /Im0
  /Height 2400
  /Subtype /Image
  /Filter [ /DCTDecode ]
  /Length 265866
  /ColorSpace 77 0 R
  /Type /XObject
>>
stream
...
endstream
endobj

Notice that ColorSpace is defined through a referenced object (77). So let's extract this object as well:

mutool show BKT-ecur002glas01_01.pdf 77 > 77.dat

Result:

77 0 obj
[ /ICCBased 78 0 R ]
endobj

As per 8.6.5.5 (ICCBased Colour Spaces) of ISO 32000-1, this indicates an ICCBased colour space, where the stream (defined by object 78) contains the ICC profile. So let's extract this:

mutool show BKT-ecur002glas01_01.pdf 78 > 78.dat

Result:

78 0 obj
<<
  /Filter /ASCII85Decode
  /N 3
  /Alternate /DeviceRGB
  /Length 513
>>
stream
...
endstream
endobj
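The chain followed so far (object 79 refers to 77, which is an ICCBased array pointing at stream 78) can be expressed as a small classifier for ColorSpace values. This is a rough sketch that works on the text mutool prints, not on a parsed PDF; the function name describe_colorspace and the returned keys are assumptions made for illustration, and only the three forms that occur in this issue are recognised.

```python
import re


def describe_colorspace(value):
    """Classify a ColorSpace value as printed by 'mutool show'.

    Hypothetical helper covering three forms only:
      "77 0 R"                -> indirect reference, follow it
      "[ /ICCBased 78 0 R ]"  -> ICCBased, profile is in that stream
      "/DeviceRGB"            -> plain name, no profile at PDF level
    """
    value = value.strip()
    match = re.fullmatch(r"\[\s*/ICCBased\s+(\d+)\s+\d+\s+R\s*\]", value)
    if match:
        return {"kind": "ICCBased", "profile_object": int(match.group(1))}
    match = re.fullmatch(r"(\d+)\s+\d+\s+R", value)
    if match:
        return {"kind": "reference", "object": int(match.group(1))}
    match = re.fullmatch(r"/(\w+)", value)
    if match:
        return {"kind": "name", "name": match.group(1)}
    return {"kind": "other", "raw": value}


print(describe_colorspace("77 0 R"))
print(describe_colorspace("[ /ICCBased 78 0 R ]"))
```

Other colour space families (Indexed, Separation and so on) would fall through to "other" here and need their own handling.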

We can then extract the ICC profile using:

mutool show -b -o 78-stream.dat BKT-ecur002glas01_01.pdf 78

Then use ExifTool to inspect its properties:

exiftool -X 78-stream.dat > 78-stream.xml

Result:

<?xml version='1.0' encoding='UTF-8'?>
<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'>

<rdf:Description rdf:about='78-stream.dat'
  xmlns:et='http://ns.exiftool.org/1.0/' et:toolkit='Image::ExifTool 12.60'
  xmlns:ExifTool='http://ns.exiftool.org/ExifTool/1.0/'
  xmlns:System='http://ns.exiftool.org/File/System/1.0/'
  xmlns:File='http://ns.exiftool.org/File/1.0/'
  xmlns:ICC-header='http://ns.exiftool.org/ICC_Profile/ICC-header/1.0/'
  xmlns:ICC_Profile='http://ns.exiftool.org/ICC_Profile/ICC_Profile/1.0/'>
 <ExifTool:ExifToolVersion>12.60</ExifTool:ExifToolVersion>
 <System:FileName>78-stream.dat</System:FileName>
 <System:Directory>.</System:Directory>
 <System:FileSize>560 bytes</System:FileSize>
 <System:FileModifyDate>2024:09:26 14:10:21+00:00</System:FileModifyDate>
 <System:FileAccessDate>2024:09:26 14:10:29+00:00</System:FileAccessDate>
 <System:FileInodeChangeDate>2024:09:26 14:10:21+00:00</System:FileInodeChangeDate>
 <System:FilePermissions>-rw-rw-r--</System:FilePermissions>
 <File:FileType>ICC</File:FileType>
 <File:FileTypeExtension>icc</File:FileTypeExtension>
 <File:MIMEType>application/vnd.iccprofile</File:MIMEType>
 <ICC-header:ProfileCMMType>Little CMS</ICC-header:ProfileCMMType>
 <ICC-header:ProfileVersion>2.1.0</ICC-header:ProfileVersion>
 <ICC-header:ProfileClass>Display Device Profile</ICC-header:ProfileClass>
 <ICC-header:ColorSpaceData>RGB </ICC-header:ColorSpaceData>
 <ICC-header:ProfileConnectionSpace>XYZ </ICC-header:ProfileConnectionSpace>
 <ICC-header:ProfileDateTime>2000:08:11 19:51:59</ICC-header:ProfileDateTime>
 <ICC-header:ProfileFileSignature>acsp</ICC-header:ProfileFileSignature>
 <ICC-header:PrimaryPlatform>Microsoft Corporation</ICC-header:PrimaryPlatform>
 <ICC-header:CMMFlags>Not Embedded, Independent</ICC-header:CMMFlags>
 <ICC-header:DeviceManufacturer>none</ICC-header:DeviceManufacturer>
 <ICC-header:DeviceModel></ICC-header:DeviceModel>
 <ICC-header:DeviceAttributes>Reflective, Glossy, Positive, Color</ICC-header:DeviceAttributes>
 <ICC-header:RenderingIntent>Perceptual</ICC-header:RenderingIntent>
 <ICC-header:ConnectionSpaceIlluminant>0.9642 1 0.82491</ICC-header:ConnectionSpaceIlluminant>
 <ICC-header:ProfileCreator>Little CMS</ICC-header:ProfileCreator>
 <ICC-header:ProfileID>0</ICC-header:ProfileID>
 <ICC_Profile:ProfileCopyright>Copyright 2000 Adobe Systems Incorporated</ICC_Profile:ProfileCopyright>
 <ICC_Profile:ProfileDescription>Adobe RGB (1998)</ICC_Profile:ProfileDescription>
 <ICC_Profile:MediaWhitePoint>0.95045 1 1.08905</ICC_Profile:MediaWhitePoint>
 <ICC_Profile:MediaBlackPoint>0 0 0</ICC_Profile:MediaBlackPoint>
 <ICC_Profile:RedTRC>(Binary data 14 bytes, use -b option to extract)</ICC_Profile:RedTRC>
 <ICC_Profile:GreenTRC>(Binary data 14 bytes, use -b option to extract)</ICC_Profile:GreenTRC>
 <ICC_Profile:BlueTRC>(Binary data 14 bytes, use -b option to extract)</ICC_Profile:BlueTRC>
 <ICC_Profile:RedMatrixColumn>0.60974 0.31111 0.01947</ICC_Profile:RedMatrixColumn>
 <ICC_Profile:GreenMatrixColumn>0.20528 0.62567 0.06087</ICC_Profile:GreenMatrixColumn>
 <ICC_Profile:BlueMatrixColumn>0.14919 0.06322 0.74457</ICC_Profile:BlueMatrixColumn>
</rdf:Description>
</rdf:RDF>
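Since most of this output describes the file rather than the profile, it can help to keep only the ICC-related tags. Below is a minimal sketch using Python's standard library; the function name icc_tags is made up, and it assumes the ExifTool namespace URIs look like the ones in the output above (everything under http://ns.exiftool.org/ICC_Profile/).

```python
import xml.etree.ElementTree as ET

ICC_NS_ROOT = "http://ns.exiftool.org/ICC_Profile/"


def icc_tags(xml_data):
    """Return {"group:tag": value} for all ICC tags in 'exiftool -X' output.

    Hypothetical helper; xml_data is the content of the XML file
    (bytes or str). Tags outside the ICC_Profile namespaces (System,
    File, JFIF, ...) are ignored.
    """
    root = ET.fromstring(xml_data)
    tags = {}
    for elem in root.iter():
        if not elem.tag.startswith("{"):
            continue
        namespace, name = elem.tag[1:].split("}", 1)
        if namespace.startswith(ICC_NS_ROOT):
            # e.g. ".../ICC_Profile/ICC-header/1.0/" -> "ICC-header"
            group = namespace.rstrip("/").split("/")[-2]
            tags[group + ":" + name] = (elem.text or "").strip()
    return tags


# Usage (file name as used in this issue):
# with open("78-stream.xml", "rb") as f:
#     print(icc_tags(f.read())["ICC_Profile:ProfileDescription"])
```

Two such dictionaries can be compared with ==, which leaves out fields like file name and timestamps that always differ between files.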

Now let's have a look at the actual JPEG file that is embedded as part of object 79. First we extract the raw datastream from the Image XObject:

mutool show -be -o 79-stream.dat BKT-ecur002glas01_01.pdf 79

The resulting file 79-stream.dat is actually a JPEG image, so let's analyse that with ExifTool:

exiftool -X 79-stream.dat > 79-stream.xml

Result:

<?xml version='1.0' encoding='UTF-8'?>
<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'>

<rdf:Description rdf:about='79-stream.dat'
  xmlns:et='http://ns.exiftool.org/1.0/' et:toolkit='Image::ExifTool 12.60'
  xmlns:ExifTool='http://ns.exiftool.org/ExifTool/1.0/'
  xmlns:System='http://ns.exiftool.org/File/System/1.0/'
  xmlns:File='http://ns.exiftool.org/File/1.0/'
  xmlns:JFIF='http://ns.exiftool.org/JFIF/JFIF/1.0/'
  xmlns:ICC-header='http://ns.exiftool.org/ICC_Profile/ICC-header/1.0/'
  xmlns:ICC_Profile='http://ns.exiftool.org/ICC_Profile/ICC_Profile/1.0/'
  xmlns:Composite='http://ns.exiftool.org/Composite/1.0/'>
 <ExifTool:ExifToolVersion>12.60</ExifTool:ExifToolVersion>
 <System:FileName>79-stream.dat</System:FileName>
 <System:Directory>.</System:Directory>
 <System:FileSize>266 kB</System:FileSize>
 <System:FileModifyDate>2024:09:26 14:02:04+00:00</System:FileModifyDate>
 <System:FileAccessDate>2024:09:26 14:02:04+00:00</System:FileAccessDate>
 <System:FileInodeChangeDate>2024:09:26 14:02:04+00:00</System:FileInodeChangeDate>
 <System:FilePermissions>-rw-rw-r--</System:FilePermissions>
 <File:FileType>JPEG</File:FileType>
 <File:FileTypeExtension>jpg</File:FileTypeExtension>
 <File:MIMEType>image/jpeg</File:MIMEType>
 <File:ImageWidth>1556</File:ImageWidth>
 <File:ImageHeight>2400</File:ImageHeight>
 <File:EncodingProcess>Baseline DCT, Huffman coding</File:EncodingProcess>
 <File:BitsPerSample>8</File:BitsPerSample>
 <File:ColorComponents>3</File:ColorComponents>
 <File:YCbCrSubSampling>YCbCr4:4:4 (1 1)</File:YCbCrSubSampling>
 <JFIF:JFIFVersion>1.01</JFIF:JFIFVersion>
 <JFIF:ResolutionUnit>None</JFIF:ResolutionUnit>
 <JFIF:XResolution>150</JFIF:XResolution>
 <JFIF:YResolution>150</JFIF:YResolution>
 <ICC-header:ProfileCMMType>Little CMS</ICC-header:ProfileCMMType>
 <ICC-header:ProfileVersion>2.1.0</ICC-header:ProfileVersion>
 <ICC-header:ProfileClass>Display Device Profile</ICC-header:ProfileClass>
 <ICC-header:ColorSpaceData>RGB </ICC-header:ColorSpaceData>
 <ICC-header:ProfileConnectionSpace>XYZ </ICC-header:ProfileConnectionSpace>
 <ICC-header:ProfileDateTime>2000:08:11 19:51:59</ICC-header:ProfileDateTime>
 <ICC-header:ProfileFileSignature>acsp</ICC-header:ProfileFileSignature>
 <ICC-header:PrimaryPlatform>Microsoft Corporation</ICC-header:PrimaryPlatform>
 <ICC-header:CMMFlags>Not Embedded, Independent</ICC-header:CMMFlags>
 <ICC-header:DeviceManufacturer>none</ICC-header:DeviceManufacturer>
 <ICC-header:DeviceModel></ICC-header:DeviceModel>
 <ICC-header:DeviceAttributes>Reflective, Glossy, Positive, Color</ICC-header:DeviceAttributes>
 <ICC-header:RenderingIntent>Perceptual</ICC-header:RenderingIntent>
 <ICC-header:ConnectionSpaceIlluminant>0.9642 1 0.82491</ICC-header:ConnectionSpaceIlluminant>
 <ICC-header:ProfileCreator>Little CMS</ICC-header:ProfileCreator>
 <ICC-header:ProfileID>0</ICC-header:ProfileID>
 <ICC_Profile:ProfileCopyright>Copyright 2000 Adobe Systems Incorporated</ICC_Profile:ProfileCopyright>
 <ICC_Profile:ProfileDescription>Adobe RGB (1998)</ICC_Profile:ProfileDescription>
 <ICC_Profile:MediaWhitePoint>0.95045 1 1.08905</ICC_Profile:MediaWhitePoint>
 <ICC_Profile:MediaBlackPoint>0 0 0</ICC_Profile:MediaBlackPoint>
 <ICC_Profile:RedTRC>(Binary data 14 bytes, use -b option to extract)</ICC_Profile:RedTRC>
 <ICC_Profile:GreenTRC>(Binary data 14 bytes, use -b option to extract)</ICC_Profile:GreenTRC>
 <ICC_Profile:BlueTRC>(Binary data 14 bytes, use -b option to extract)</ICC_Profile:BlueTRC>
 <ICC_Profile:RedMatrixColumn>0.60974 0.31111 0.01947</ICC_Profile:RedMatrixColumn>
 <ICC_Profile:GreenMatrixColumn>0.20528 0.62567 0.06087</ICC_Profile:GreenMatrixColumn>
 <ICC_Profile:BlueMatrixColumn>0.14919 0.06322 0.74457</ICC_Profile:BlueMatrixColumn>
 <Composite:ImageSize>1556x2400</Composite:ImageSize>
 <Composite:Megapixels>3.7</Composite:Megapixels>
</rdf:Description>
</rdf:RDF>

This shows that the JPEG data contains an embedded ICC profile.

To summarise, the ICC profile is defined twice here: once for the Image XObject that represents the image at the PDF level, and once at the level of the embedded JPEG. Comparing the two ExifTool outputs shows that all reported ICC header fields and tags are identical in both cases.
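ExifTool only reports a selection of profile properties, so a stricter check would be to compare the profile bytes themselves. In a JPEG, an ICC profile is stored in one or more APP2 segments that start with the identifier "ICC_PROFILE" followed by a zero byte, a chunk number and a chunk count. The sketch below pulls those chunks out with the standard library only; the function name extract_jpeg_icc is made up, and the parser is deliberately simple (it stops at the first start-of-scan marker and does no validation of the chunk count).

```python
import struct

ICC_IDENTIFIER = b"ICC_PROFILE\x00"


def extract_jpeg_icc(data):
    """Return the ICC profile embedded in JPEG bytes, or None.

    Hypothetical helper: walks the marker segments before the image
    data and joins all APP2 "ICC_PROFILE" chunks in sequence order.
    """
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG stream (missing SOI marker)")
    chunks = {}
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError("expected marker at offset %d" % pos)
        marker = data[pos + 1]
        if marker == 0xFF:            # fill byte, skip it
            pos += 1
            continue
        if marker in (0xDA, 0xD9):    # start of scan / end of image
            break
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        payload = data[pos + 4:pos + 2 + length]
        if marker == 0xE2 and payload.startswith(ICC_IDENTIFIER):
            sequence_number = payload[12]
            chunks[sequence_number] = payload[14:]
        pos += 2 + length
    if not chunks:
        return None
    return b"".join(chunks[n] for n in sorted(chunks))


# Usage (file names as used in this issue):
# jpeg_profile = extract_jpeg_icc(open("79-stream.dat", "rb").read())
# pdf_profile = open("78-stream.dat", "rb").read()
# print(jpeg_profile == pdf_profile)
```

Whether a byte-for-byte comparison succeeds for these particular files is untested here; it also depends on 78-stream.dat containing the decoded profile stream.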

kort004mult01_01_50.pdf

Pdfimages:

pdfimages -list kort004mult01_01_50.pdf

Output (edited down to one image at page 5):

page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   5     4 image    1961  2884  rgb     3   8  jpeg   no        12  0   301   301  108K 0.7%

The value rgb in the color column indicates that there is no ICC profile at the PDF Image XObject level. So let's have a look at the object using:

mutool show kort004mult01_01_50.pdf 12 > 12.dat

Result:

12 0 obj
<<
  /BitsPerComponent 8
  /ColorSpace /DeviceRGB
  /Filter [ /DCTDecode ]
  /Height 2884
  /Length 110562
  /Subtype /Image
  /Type /XObject
  /Width 1961
>>
stream
...
endstream
endobj

So the colour space is defined as "DeviceRGB", without any reference to an ICC profile. Extract the object stream data again:

mutool show -be -o 12-stream.dat kort004mult01_01_50.pdf 12

Analyse with ExifTool:

exiftool -X 12-stream.dat > 12-stream.xml

Result:

<?xml version='1.0' encoding='UTF-8'?>
<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'>

<rdf:Description rdf:about='12-stream.dat'
  xmlns:et='http://ns.exiftool.org/1.0/' et:toolkit='Image::ExifTool 12.60'
  xmlns:ExifTool='http://ns.exiftool.org/ExifTool/1.0/'
  xmlns:System='http://ns.exiftool.org/File/System/1.0/'
  xmlns:File='http://ns.exiftool.org/File/1.0/'
  xmlns:JFIF='http://ns.exiftool.org/JFIF/JFIF/1.0/'
  xmlns:ICC-header='http://ns.exiftool.org/ICC_Profile/ICC-header/1.0/'
  xmlns:ICC_Profile='http://ns.exiftool.org/ICC_Profile/ICC_Profile/1.0/'
  xmlns:Composite='http://ns.exiftool.org/Composite/1.0/'>
 <ExifTool:ExifToolVersion>12.60</ExifTool:ExifToolVersion>
 <System:FileName>12-stream.dat</System:FileName>
 <System:Directory>.</System:Directory>
 <System:FileSize>111 kB</System:FileSize>
 <System:FileModifyDate>2024:09:26 15:32:52+00:00</System:FileModifyDate>
 <System:FileAccessDate>2024:09:26 15:32:55+00:00</System:FileAccessDate>
 <System:FileInodeChangeDate>2024:09:26 15:32:52+00:00</System:FileInodeChangeDate>
 <System:FilePermissions>-rw-rw-r--</System:FilePermissions>
 <File:FileType>JPEG</File:FileType>
 <File:FileTypeExtension>jpg</File:FileTypeExtension>
 <File:MIMEType>image/jpeg</File:MIMEType>
 <File:ImageWidth>1961</File:ImageWidth>
 <File:ImageHeight>2884</File:ImageHeight>
 <File:EncodingProcess>Baseline DCT, Huffman coding</File:EncodingProcess>
 <File:BitsPerSample>8</File:BitsPerSample>
 <File:ColorComponents>3</File:ColorComponents>
 <File:YCbCrSubSampling>YCbCr4:2:0 (2 2)</File:YCbCrSubSampling>
 <JFIF:JFIFVersion>1.02</JFIF:JFIFVersion>
 <JFIF:ResolutionUnit>inches</JFIF:ResolutionUnit>
 <JFIF:XResolution>300</JFIF:XResolution>
 <JFIF:YResolution>300</JFIF:YResolution>
 <ICC-header:ProfileCMMType>Adobe Systems Inc.</ICC-header:ProfileCMMType>
 <ICC-header:ProfileVersion>2.4.0</ICC-header:ProfileVersion>
 <ICC-header:ProfileClass>Display Device Profile</ICC-header:ProfileClass>
 <ICC-header:ColorSpaceData>RGB </ICC-header:ColorSpaceData>
 <ICC-header:ProfileConnectionSpace>XYZ </ICC-header:ProfileConnectionSpace>
 <ICC-header:ProfileDateTime>2007:03:02 10:07:41</ICC-header:ProfileDateTime>
 <ICC-header:ProfileFileSignature>acsp</ICC-header:ProfileFileSignature>
 <ICC-header:PrimaryPlatform>Unknown ()</ICC-header:PrimaryPlatform>
 <ICC-header:CMMFlags>Not Embedded, Independent</ICC-header:CMMFlags>
 <ICC-header:DeviceManufacturer></ICC-header:DeviceManufacturer>
 <ICC-header:DeviceModel></ICC-header:DeviceModel>
 <ICC-header:DeviceAttributes>Reflective, Glossy, Positive, Color</ICC-header:DeviceAttributes>
 <ICC-header:RenderingIntent>Perceptual</ICC-header:RenderingIntent>
 <ICC-header:ConnectionSpaceIlluminant>0.9642 1 0.82491</ICC-header:ConnectionSpaceIlluminant>
 <ICC-header:ProfileCreator>basICColor GmbH</ICC-header:ProfileCreator>
 <ICC-header:ProfileID>0</ICC-header:ProfileID>
 <ICC_Profile:ProfileCopyright>Copyright (C) 2007 by Color Solutions, All Rights Reserved. License details can be found on: http://www.eci.org/eci/en/eciRGB.php</ICC_Profile:ProfileCopyright>
 <ICC_Profile:ProfileDescription>eciRGB v2</ICC_Profile:ProfileDescription>
 <ICC_Profile:MediaWhitePoint>0.9642 1 0.82491</ICC_Profile:MediaWhitePoint>
 <ICC_Profile:RedTRC>(Binary data 1412 bytes, use -b option to extract)</ICC_Profile:RedTRC>
 <ICC_Profile:GreenTRC>(Binary data 1412 bytes, use -b option to extract)</ICC_Profile:GreenTRC>
 <ICC_Profile:BlueTRC>(Binary data 1412 bytes, use -b option to extract)</ICC_Profile:BlueTRC>
 <ICC_Profile:RedMatrixColumn>0.65027 0.32028 0</ICC_Profile:RedMatrixColumn>
 <ICC_Profile:GreenMatrixColumn>0.17804 0.60205 0.06783</ICC_Profile:GreenMatrixColumn>
 <ICC_Profile:BlueMatrixColumn>0.13588 0.07767 0.75708</ICC_Profile:BlueMatrixColumn>
 <Composite:ImageSize>1961x2884</Composite:ImageSize>
 <Composite:Megapixels>5.7</Composite:Megapixels>
</rdf:Description>
</rdf:RDF>

This shows that the JPEG contains an embedded ICC profile (eciRGB v2).

So in this case, the ICC profile is only embedded at the JPEG level, and not at the PDF (Image XObject) level.
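The two files give two of the possible outcomes (profile at both levels, profile at the JPEG level only). A sketch of how a check could label all of them, assuming the profile bytes for each level are already available (or None when absent); the function name icc_location and the labels are made up for illustration:

```python
def icc_location(pdf_profile, jpeg_profile):
    """Describe where an image's ICC profile is defined.

    Hypothetical helper. Both arguments are the raw profile bytes,
    or None if no profile is present at that level.
    """
    if pdf_profile and jpeg_profile:
        if pdf_profile == jpeg_profile:
            return "both levels, identical"
        return "both levels, different"
    if pdf_profile:
        return "PDF object level only"
    if jpeg_profile:
        return "JPEG level only"
    return "no profile"


print(icc_location(None, b"example profile bytes"))
```

The "both levels, different" case is the one flagged at the top of this issue as a possibility; neither example file shows it.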

Other properties

Using kort004mult01_01_50.pdf as an example again. Pdfimages output for one image:

page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   5     4 image    1961  2884  rgb     3   8  jpeg   no        12  0   301   301  108K 0.7%

And (again) the corresponding Image XObject:

12 0 obj
<<
  /BitsPerComponent 8
  /ColorSpace /DeviceRGB
  /Filter [ /DCTDecode ]
  /Height 2884
  /Length 110562
  /Subtype /Image
  /Type /XObject
  /Width 1961
>>
stream
...
endstream
endobj

Most of the properties reported by pdfimages follow directly from the Image XObject's dictionary entries (see ISO 32000-1, section 8.9.5.1):

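Pdfimages property   Dictionary entry
------------------   ----------------
width                Width
height               Height
color                ColorSpace
bpc                  BitsPerComponent
enc                  Filter
interp               Interpolate
object ID            (none: object number and generation of the Image XObject, i.e. "12 0" in "12 0 obj")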
It's not entirely clear to me what the comp (number of color components) value is based on, as there's no corresponding Image Dictionary entry. The same is true for the x-ppi and y-ppi values.

From the source code it seems that Pdfimages calculates x-ppi and y-ppi from the image dimensions relative to the page size (although it's not entirely clear to me what the code does exactly).
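One plausible reading of such a calculation, which I have not verified against the Pdfimages source, is pixel count divided by the size at which the image is displayed, with the displayed size converted from PDF units to inches (default user space units are 1/72 inch). A minimal sketch under that assumption; the function name and the displayed width of 470.64 units are made up for illustration:

```python
POINTS_PER_INCH = 72.0  # default PDF user space unit is 1/72 inch


def effective_ppi(pixels, displayed_points):
    """Pixels per inch for an image dimension drawn at a given size.

    Hypothetical helper: 'pixels' is the Width or Height entry of the
    image dictionary, 'displayed_points' the size at which that
    dimension is drawn on the page, in default user space units.
    """
    if displayed_points <= 0:
        raise ValueError("displayed size must be positive")
    return pixels * POINTS_PER_INCH / displayed_points


# Assumed example: the 1961 pixel wide image drawn 470.64 units wide.
print(round(effective_ppi(1961, 470.64), 2))
```

Under this reading the value depends on how the image is placed on the page rather than on anything stored in the image data, so it does not have to agree with the resolution fields in the JPEG header.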

Also worth mentioning that in this case the x-ppi and y-ppi values reported by Pdfimages (301) are marginally different from the resolution values in the JFIF header of the embedded JPEG (300):

 <JFIF:XResolution>300</JFIF:XResolution>
 <JFIF:YResolution>300</JFIF:YResolution>