Closed by bitsgalore 4 weeks ago
Image extraction + ExifTool analysis done:
https://github.com/KBNLresearch/pdfbatchqa/commit/b9013e86d86ff69212aa91d02baf1c6013d39986
TODO: add Schematron rules. Also, this slows things down quite a bit, and the ExifTool output is quite bulky, so perhaps restrict the output to a limited number of properties? E.g. the command below restricts the output to the ICC-header and ICC_Profile groups:
exiftool -X -ICC-header:all -ICC_Profile:all ./kort004mult01_01_50/-034.jpg
Also: perhaps make image extraction optional?
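A minimal sketch of how the restricted ExifTool call could be wrapped in Python. The function names and the `groups` parameter are made up for illustration and are not part of pdfbatchqa; only the `-X` and `-GROUP:all` options are taken from the command above.

```python
import subprocess


def exiftool_command(image_path, groups=("ICC-header", "ICC_Profile")):
    """Build an ExifTool argument list restricted to the given tag groups."""
    args = ["exiftool", "-X"]
    # One "-GROUP:all" option per group, as in the command line above
    args += [f"-{group}:all" for group in groups]
    args.append(image_path)
    return args


def run_exiftool(image_path, groups=("ICC-header", "ICC_Profile")):
    """Run ExifTool (assumed to be on the PATH) and return its XML output."""
    result = subprocess.run(
        exiftool_command(image_path, groups),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Making image extraction optional could then be a matter of only calling `run_exiftool` when a (hypothetical) command-line switch is set.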
Also note: the ExifTool output for each extracted image can be matched with the corresponding pdfimages output using the num value in the pdfimages output. This number corresponds to the numerical value in the extracted file name. E.g. this (pdfimages):
<image>
<page>6</page>
<num>5</num>
<type>image</type>
<width>1986</width>
<height>2895</height>
<color>rgb</color>
<comp>3</comp>
<bpc>8</bpc>
<enc>jpeg</enc>
<interp>no</interp>
<object>15</object>
<ID>0</ID>
<x-ppi>301</x-ppi>
<y-ppi>301</y-ppi>
<size>105K</size>
<ratio>0.6%</ratio>
</image>
Corresponds with (ExifTool):
<rdf:Description xmlns:et="http://ns.exiftool.org/1.0/" xmlns:ExifTool="http://ns.exiftool.org/ExifTool/1.0/" xmlns:System="http://ns.exiftool.org/File/System/1.0/" xmlns:File="http://ns.exiftool.org/File/1.0/" xmlns:JFIF="http://ns.exiftool.org/JFIF/JFIF/1.0/" xmlns:ICC-header="http://ns.exiftool.org/ICC_Profile/ICC-header/1.0/" xmlns:ICC_Profile="http://ns.exiftool.org/ICC_Profile/ICC_Profile/1.0/" xmlns:Composite="http://ns.exiftool.org/Composite/1.0/" rdf:about="/tmp/tmpu6j209r7/crap-005.jpg" et:toolkit="Image::ExifTool 12.60">
<ExifTool:ExifToolVersion>12.60</ExifTool:ExifToolVersion>
<System:FileName>crap-005.jpg</System:FileName>
<System:Directory>/tmp/tmpu6j209r7</System:Directory>
<System:FileSize>108 kB</System:FileSize>
<System:FileModifyDate>2024:09:24 17:32:07+00:00</System:FileModifyDate>
<System:FileAccessDate>2024:09:24 17:32:07+00:00</System:FileAccessDate>
<System:FileInodeChangeDate>2024:09:24 17:32:07+00:00</System:FileInodeChangeDate>
<System:FilePermissions>-rw-rw-r--</System:FilePermissions>
<File:FileType>JPEG</File:FileType>
<File:FileTypeExtension>jpg</File:FileTypeExtension>
<File:MIMEType>image/jpeg</File:MIMEType>
<File:ImageWidth>1986</File:ImageWidth>
<File:ImageHeight>2895</File:ImageHeight>
<File:EncodingProcess>Baseline DCT, Huffman coding</File:EncodingProcess>
<File:BitsPerSample>8</File:BitsPerSample>
<File:ColorComponents>3</File:ColorComponents>
<File:YCbCrSubSampling>YCbCr4:2:0 (2 2)</File:YCbCrSubSampling>
<JFIF:JFIFVersion>1.02</JFIF:JFIFVersion>
<JFIF:ResolutionUnit>inches</JFIF:ResolutionUnit>
<JFIF:XResolution>300</JFIF:XResolution>
<JFIF:YResolution>300</JFIF:YResolution>
<ICC-header:ProfileCMMType>Adobe Systems Inc.</ICC-header:ProfileCMMType>
<ICC-header:ProfileVersion>2.4.0</ICC-header:ProfileVersion>
<ICC-header:ProfileClass>Display Device Profile</ICC-header:ProfileClass>
<ICC-header:ColorSpaceData>RGB </ICC-header:ColorSpaceData>
<ICC-header:ProfileConnectionSpace>XYZ </ICC-header:ProfileConnectionSpace>
<ICC-header:ProfileDateTime>2007:03:02 10:07:41</ICC-header:ProfileDateTime>
<ICC-header:ProfileFileSignature>acsp</ICC-header:ProfileFileSignature>
<ICC-header:PrimaryPlatform>Unknown ()</ICC-header:PrimaryPlatform>
<ICC-header:CMMFlags>Not Embedded, Independent</ICC-header:CMMFlags>
<ICC-header:DeviceManufacturer/>
<ICC-header:DeviceModel/>
<ICC-header:DeviceAttributes>Reflective, Glossy, Positive, Color</ICC-header:DeviceAttributes>
<ICC-header:RenderingIntent>Perceptual</ICC-header:RenderingIntent>
<ICC-header:ConnectionSpaceIlluminant>0.9642 1 0.82491</ICC-header:ConnectionSpaceIlluminant>
<ICC-header:ProfileCreator>basICColor GmbH</ICC-header:ProfileCreator>
<ICC-header:ProfileID>0</ICC-header:ProfileID>
<ICC_Profile:ProfileCopyright>Copyright (C) 2007 by Color Solutions, All Rights Reserved. License details can be found on: http://www.eci.org/eci/en/eciRGB.php</ICC_Profile:ProfileCopyright>
<ICC_Profile:ProfileDescription>eciRGB v2</ICC_Profile:ProfileDescription>
<ICC_Profile:MediaWhitePoint>0.9642 1 0.82491</ICC_Profile:MediaWhitePoint>
<ICC_Profile:RedTRC>(Binary data 1412 bytes, use -b option to extract)</ICC_Profile:RedTRC>
<ICC_Profile:GreenTRC>(Binary data 1412 bytes, use -b option to extract)</ICC_Profile:GreenTRC>
<ICC_Profile:BlueTRC>(Binary data 1412 bytes, use -b option to extract)</ICC_Profile:BlueTRC>
<ICC_Profile:RedMatrixColumn>0.65027 0.32028 0</ICC_Profile:RedMatrixColumn>
<ICC_Profile:GreenMatrixColumn>0.17804 0.60205 0.06783</ICC_Profile:GreenMatrixColumn>
<ICC_Profile:BlueMatrixColumn>0.13588 0.07767 0.75708</ICC_Profile:BlueMatrixColumn>
<Composite:ImageSize>1986x2895</Composite:ImageSize>
<Composite:Megapixels>5.7</Composite:Megapixels>
</rdf:Description>
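The matching on num could look roughly like the sketch below. It assumes that extracted file names always end in a dash followed by the image number (as in crap-005.jpg), and that each pdfimages record is a dict with a "num" key; both function names are hypothetical.

```python
import re
from pathlib import Path


def image_number(file_name):
    """Return the trailing number in an extracted image file name, or None."""
    # "crap-005.jpg" -> stem "crap-005" -> 5
    match = re.search(r"-(\d+)$", Path(file_name).stem)
    return int(match.group(1)) if match else None


def match_images(pdfimages_records, extracted_files):
    """Pair each pdfimages record with the extracted file that has the same num."""
    by_num = {image_number(name): name for name in extracted_files}
    return [(rec, by_num.get(int(rec["num"]))) for rec in pdfimages_records]
```

A record without a matching file is paired with None, which would make missing extractions easy to spot.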
So we could use this to check if an ICC profile is defined at either the PDF object level (pdfimages output) or the JPEG level (ExifTool output).
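A rough sketch of such a check, using only the standard library. It assumes the ExifTool XML is wrapped in the usual rdf:RDF root element, and that pdfimages reports "icc" in its color field for ICC-based color spaces (worth verifying against the Poppler version in use). Function names and the sample string are illustrative only.

```python
import xml.etree.ElementTree as ET

# Namespace of the ICC_Profile group, as it appears in the ExifTool output above
ICC_NS = "http://ns.exiftool.org/ICC_Profile/ICC_Profile/1.0/"


def jpeg_has_icc(exiftool_xml):
    """True if the ExifTool XML contains at least one ICC_Profile element."""
    root = ET.fromstring(exiftool_xml)
    return any(el.tag.startswith("{" + ICC_NS + "}") for el in root.iter())


def pdf_object_has_icc(pdfimages_record):
    """True if pdfimages reports an ICC-based color space (assumed value: 'icc')."""
    return pdfimages_record.get("color") == "icc"


# Trimmed-down sample for illustration, not real tool output
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description xmlns:ICC_Profile="http://ns.exiftool.org/ICC_Profile/ICC_Profile/1.0/" rdf:about="crap-005.jpg">
<ICC_Profile:ProfileDescription>eciRGB v2</ICC_Profile:ProfileDescription>
</rdf:Description>
</rdf:RDF>"""
```

For the example image above, `pdf_object_has_icc` would be false (color is rgb) while `jpeg_has_icc` would be true (eciRGB v2 is embedded in the JPEG).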
Refinement: instead of extracting all images in one go, we can also do this one at a time. E.g. to extract the image at page 10:
pdfimages -f 10 -l 10 -all BKT-ecur002glas01_01.pdf ./images/
We can then create 1 parent element for each image, and add the pdfimages and ExifTool output inside that. This will also make it possible to create Schematron rules that combine properties from both tools.
BUT results will be unexpected if a page contains more than one image, since -f and -l define page numbers, not image numbers! But we could check that by running the pdfimages list command one page at a time as well:
pdfimages -f 10 -l 10 -list BKT-ecur002glas01_01.pdf
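The per-page check could be sketched as follows. This assumes the usual pdfimages -list layout (one header line, one line of dashes, then one whitespace-separated row per image); the sample text and function names are illustrative, and the column layout may differ between Poppler versions.

```python
# Illustrative sample in the assumed "pdfimages -list" layout
SAMPLE_LIST = """page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   6     5 image    1986  2895  rgb     3   8  jpeg   no        15  0   301   301  105K 0.6%
"""


def parse_pdfimages_list(text):
    """Parse 'pdfimages -list' output into a list of dicts keyed by column name."""
    lines = text.splitlines()
    if len(lines) < 2:
        return []
    header = lines[0].split()
    records = []
    # Skip the header and the line of dashes below it
    for line in lines[2:]:
        fields = line.split()
        if len(fields) == len(header):
            records.append(dict(zip(header, fields)))
    return records


def images_on_page(records, page):
    """Return the records that belong to the given page number."""
    return [rec for rec in records if rec["page"] == str(page)]
```

If `images_on_page` returns more than one record for a page, the extracted files for that page would need to be matched on num rather than assumed to be a single image.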
Perhaps then subdivide the output into "page" elements, which can contain one or more "image" elements. E.g. something like this:
<?xml version='1.0' encoding='UTF-8'?>
<pdfbatchqa>
<file>
<filePath></filePath>
<fileSize></fileSize>
<pdfinfo>
</pdfinfo>
<pages>
<page>
<image>
<pdfimages>
</pdfimages>
<exifTool>
</exifTool>
</image>
</page>
</pages>
</file>
</pdfbatchqa>
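A minimal sketch of how this nested structure could be assembled with ElementTree. The function name and the shape of the `pages` argument are assumptions made for illustration; the element names follow the outline above.

```python
import xml.etree.ElementTree as ET


def build_file_element(file_path, file_size, pages):
    """Build a <file> element.

    pages maps a page number to a list of (pdfimages_props, exiftool_element)
    tuples, where pdfimages_props is a dict and exiftool_element is an
    Element (or None if no ExifTool output is available).
    """
    file_el = ET.Element("file")
    ET.SubElement(file_el, "filePath").text = file_path
    ET.SubElement(file_el, "fileSize").text = str(file_size)
    ET.SubElement(file_el, "pdfinfo")  # left empty in this sketch
    pages_el = ET.SubElement(file_el, "pages")
    for page_no in sorted(pages):
        page_el = ET.SubElement(pages_el, "page")
        for pdfimages_props, exiftool_el in pages[page_no]:
            image_el = ET.SubElement(page_el, "image")
            pdfimages_el = ET.SubElement(image_el, "pdfimages")
            for name, value in pdfimages_props.items():
                ET.SubElement(pdfimages_el, name).text = str(value)
            wrapper = ET.SubElement(image_el, "exifTool")
            if exiftool_el is not None:
                wrapper.append(exiftool_el)
    return file_el
```

Appending the result to a `pdfbatchqa` root element and serialising it with `ET.tostring` would give output along the lines of the outline above.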
For this we could:
This way we could e.g. check for ICC profiles that are embedded in the JPEG (pdfimages only detects ICC profiles that are embedded at the PDF object level).