KBNLresearch / pdfquad

Apache License 2.0

Extract images for more in-depth image analysis #3

Closed bitsgalore closed 4 weeks ago

bitsgalore commented 1 month ago

For this we could:

  1. Use pdfimages to extract all images to a temporary folder
  2. Analyse the images with e.g. ExifTool or ImageMagick.

This way we could e.g. check for ICC profiles that are embedded in the JPEG (pdfimages only detects ICC profiles that are embedded at the PDF object level).
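
A minimal Python sketch of the two steps above, assuming the `pdfimages` (Poppler) and `exiftool` executables are on the PATH. The function names and the `img` file name prefix are illustrative choices, not necessarily what pdfquad does.

```python
import subprocess
import tempfile
from pathlib import Path


def pdfimages_command(pdf_path, prefix):
    """Command that extracts all images in their native format (-all)."""
    return ["pdfimages", "-all", str(pdf_path), str(prefix)]


def exiftool_command(image_path):
    """Command that reports all metadata of one image as RDF/XML (-X)."""
    return ["exiftool", "-X", str(image_path)]


def analyse_images(pdf_path):
    """Return a dict that maps each extracted file name to ExifTool's XML output."""
    results = {}
    with tempfile.TemporaryDirectory() as tmp_dir:
        # pdfimages writes files named <prefix>-NNN.<ext>; the prefix "img" is arbitrary
        prefix = Path(tmp_dir) / "img"
        subprocess.run(pdfimages_command(pdf_path, prefix), check=True)
        for image_path in sorted(Path(tmp_dir).iterdir()):
            completed = subprocess.run(
                exiftool_command(image_path),
                capture_output=True, text=True, check=True,
            )
            results[image_path.name] = completed.stdout
    return results
```

Because the temporary folder is removed when the `with` block ends, the ExifTool output is collected as strings before the images disappear.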

bitsgalore commented 1 month ago

Image extraction + ExifTool analysis done:

https://github.com/KBNLresearch/pdfbatchqa/commit/b9013e86d86ff69212aa91d02baf1c6013d39986

TODO: add Schematron rules. Also, this does slow things down quite a bit and the ExifTool output is quite bulky, so perhaps restrict the output to a limited number of properties? E.g. the command below restricts the output to the ICC-header and ICC_Profile groups:

exiftool -X -ICC-header:all -ICC_Profile:all ./kort004mult01_01_50/-034.jpg

Also: perhaps make image extraction optional?
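
Both ideas could look roughly like this in Python. This is only a sketch: the `--extract-images` switch and the function names are hypothetical, and the default groups are simply the ones from the command above.

```python
import argparse


def restricted_exiftool_command(image_path, groups=("ICC-header", "ICC_Profile")):
    """Build an ExifTool call that only reports the given tag groups as XML."""
    group_args = [f"-{group}:all" for group in groups]
    return ["exiftool", "-X", *group_args, str(image_path)]


def build_parser():
    """Command-line parser with a hypothetical opt-in switch for image extraction."""
    parser = argparse.ArgumentParser(description="PDF quality assessment (sketch)")
    parser.add_argument("pdf", help="input PDF file")
    parser.add_argument(
        "--extract-images",
        action="store_true",
        help="extract images and analyse them with ExifTool (slower)",
    )
    return parser
```

Making the switch opt-in keeps the default run fast, and the slower image-level analysis only happens when asked for.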

bitsgalore commented 1 month ago

Also note: the extracted image files can be matched with the corresponding pdfimages output using the num value in the pdfimages output. This number corresponds to the numerical value in the extracted file name. E.g. this (pdfimages):

<image>
    <page>6</page>
    <num>5</num>
    <type>image</type>
    <width>1986</width>
    <height>2895</height>
    <color>rgb</color>
    <comp>3</comp>
    <bpc>8</bpc>
    <enc>jpeg</enc>
    <interp>no</interp>
    <object>15</object>
    <ID>0</ID>
    <x-ppi>301</x-ppi>
    <y-ppi>301</y-ppi>
    <size>105K</size>
    <ratio>0.6%</ratio>
</image>

Corresponds to (ExifTool):

<rdf:Description xmlns:et="http://ns.exiftool.org/1.0/" xmlns:ExifTool="http://ns.exiftool.org/ExifTool/1.0/" xmlns:System="http://ns.exiftool.org/File/System/1.0/" xmlns:File="http://ns.exiftool.org/File/1.0/" xmlns:JFIF="http://ns.exiftool.org/JFIF/JFIF/1.0/" xmlns:ICC-header="http://ns.exiftool.org/ICC_Profile/ICC-header/1.0/" xmlns:ICC_Profile="http://ns.exiftool.org/ICC_Profile/ICC_Profile/1.0/" xmlns:Composite="http://ns.exiftool.org/Composite/1.0/" rdf:about="/tmp/tmpu6j209r7/crap-005.jpg" et:toolkit="Image::ExifTool 12.60">
 <ExifTool:ExifToolVersion>12.60</ExifTool:ExifToolVersion>
 <System:FileName>crap-005.jpg</System:FileName>
 <System:Directory>/tmp/tmpu6j209r7</System:Directory>
 <System:FileSize>108 kB</System:FileSize>
 <System:FileModifyDate>2024:09:24 17:32:07+00:00</System:FileModifyDate>
 <System:FileAccessDate>2024:09:24 17:32:07+00:00</System:FileAccessDate>
 <System:FileInodeChangeDate>2024:09:24 17:32:07+00:00</System:FileInodeChangeDate>
 <System:FilePermissions>-rw-rw-r--</System:FilePermissions>
 <File:FileType>JPEG</File:FileType>
 <File:FileTypeExtension>jpg</File:FileTypeExtension>
 <File:MIMEType>image/jpeg</File:MIMEType>
 <File:ImageWidth>1986</File:ImageWidth>
 <File:ImageHeight>2895</File:ImageHeight>
 <File:EncodingProcess>Baseline DCT, Huffman coding</File:EncodingProcess>
 <File:BitsPerSample>8</File:BitsPerSample>
 <File:ColorComponents>3</File:ColorComponents>
 <File:YCbCrSubSampling>YCbCr4:2:0 (2 2)</File:YCbCrSubSampling>
 <JFIF:JFIFVersion>1.02</JFIF:JFIFVersion>
 <JFIF:ResolutionUnit>inches</JFIF:ResolutionUnit>
 <JFIF:XResolution>300</JFIF:XResolution>
 <JFIF:YResolution>300</JFIF:YResolution>
 <ICC-header:ProfileCMMType>Adobe Systems Inc.</ICC-header:ProfileCMMType>
 <ICC-header:ProfileVersion>2.4.0</ICC-header:ProfileVersion>
 <ICC-header:ProfileClass>Display Device Profile</ICC-header:ProfileClass>
 <ICC-header:ColorSpaceData>RGB </ICC-header:ColorSpaceData>
 <ICC-header:ProfileConnectionSpace>XYZ </ICC-header:ProfileConnectionSpace>
 <ICC-header:ProfileDateTime>2007:03:02 10:07:41</ICC-header:ProfileDateTime>
 <ICC-header:ProfileFileSignature>acsp</ICC-header:ProfileFileSignature>
 <ICC-header:PrimaryPlatform>Unknown ()</ICC-header:PrimaryPlatform>
 <ICC-header:CMMFlags>Not Embedded, Independent</ICC-header:CMMFlags>
 <ICC-header:DeviceManufacturer/>
 <ICC-header:DeviceModel/>
 <ICC-header:DeviceAttributes>Reflective, Glossy, Positive, Color</ICC-header:DeviceAttributes>
 <ICC-header:RenderingIntent>Perceptual</ICC-header:RenderingIntent>
 <ICC-header:ConnectionSpaceIlluminant>0.9642 1 0.82491</ICC-header:ConnectionSpaceIlluminant>
 <ICC-header:ProfileCreator>basICColor GmbH</ICC-header:ProfileCreator>
 <ICC-header:ProfileID>0</ICC-header:ProfileID>
 <ICC_Profile:ProfileCopyright>Copyright (C) 2007 by Color Solutions, All Rights Reserved. License details can be found on: http://www.eci.org/eci/en/eciRGB.php</ICC_Profile:ProfileCopyright>
 <ICC_Profile:ProfileDescription>eciRGB v2</ICC_Profile:ProfileDescription>
 <ICC_Profile:MediaWhitePoint>0.9642 1 0.82491</ICC_Profile:MediaWhitePoint>
 <ICC_Profile:RedTRC>(Binary data 1412 bytes, use -b option to extract)</ICC_Profile:RedTRC>
 <ICC_Profile:GreenTRC>(Binary data 1412 bytes, use -b option to extract)</ICC_Profile:GreenTRC>
 <ICC_Profile:BlueTRC>(Binary data 1412 bytes, use -b option to extract)</ICC_Profile:BlueTRC>
 <ICC_Profile:RedMatrixColumn>0.65027 0.32028 0</ICC_Profile:RedMatrixColumn>
 <ICC_Profile:GreenMatrixColumn>0.17804 0.60205 0.06783</ICC_Profile:GreenMatrixColumn>
 <ICC_Profile:BlueMatrixColumn>0.13588 0.07767 0.75708</ICC_Profile:BlueMatrixColumn>
 <Composite:ImageSize>1986x2895</Composite:ImageSize>
 <Composite:Megapixels>5.7</Composite:Megapixels>
</rdf:Description>

So we could use this to check if an ICC profile is defined at either the PDF object level (pdfimages output) or the JPEG level (ExifTool output).
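
A rough sketch of that matching and check in Python, using the element names from the two snippets above. Two assumptions are baked in: that pdfimages reports `icc` in its color field for an object-level ICC profile, and that the presence of an `ICC_Profile:ProfileDescription` element is a good enough sign of a JPEG-level profile. The function names are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

SYSTEM_NS = "http://ns.exiftool.org/File/System/1.0/"
ICC_PROFILE_NS = "http://ns.exiftool.org/ICC_Profile/ICC_Profile/1.0/"


def image_number_from_name(file_name):
    """Return the number in a name like 'crap-005.jpg' (5), or None if absent."""
    match = re.search(r"-(\d+)\.[^.]+$", file_name)
    return int(match.group(1)) if match else None


def same_image(pdfimages_elt, exiftool_elt):
    """True if the pdfimages <num> equals the number in ExifTool's file name."""
    file_name = exiftool_elt.findtext(f"{{{SYSTEM_NS}}}FileName", default="")
    return image_number_from_name(file_name) == int(pdfimages_elt.findtext("num"))


def has_icc_profile(pdfimages_elt, exiftool_elt):
    """True if a profile is found at the PDF object level or the JPEG level."""
    # Assumption: pdfimages reports "icc" as color for object-level profiles
    at_pdf_level = pdfimages_elt.findtext("color") == "icc"
    # Assumption: a profile description implies an embedded JPEG-level profile
    description = exiftool_elt.find(f"{{{ICC_PROFILE_NS}}}ProfileDescription")
    return at_pdf_level or description is not None
```

For the example above, `same_image` would pair `<num>5</num>` with `crap-005.jpg`, and `has_icc_profile` would be true because of the JPEG-level eciRGB v2 profile, even though pdfimages reports `rgb`.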

bitsgalore commented 1 month ago

Refinement: instead of extracting all images in one go, we can also do this one at a time. E.g. to extract the image at page 10:

pdfimages -f 10 -l 10 -all BKT-ecur002glas01_01.pdf ./images/

We can then create one parent element for each image, and add the pdfimages and ExifTool output inside that. This will also make it possible to create Schematron rules that combine properties from both tools.

BUT results will be unexpected if a page contains more than one image, since -f and -l define page numbers, not image numbers! But we could check that by running the pdfimages list command one page at a time as well:

pdfimages -f 10 -l 10 -list BKT-ecur002glas01_01.pdf
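
A sketch of how the per-page list output could be parsed in Python to see how many images a page holds. It assumes the usual layout of `pdfimages -list` (a header row, a row of dashes, then one row per image); the function names are illustrative.

```python
def list_command(pdf_path, page):
    """Command that lists the images on a single page."""
    return ["pdfimages", "-f", str(page), "-l", str(page), "-list", str(pdf_path)]


def parse_pdfimages_list(list_output):
    """Turn 'pdfimages -list' text into one dict per image row."""
    lines = [line for line in list_output.splitlines() if line.strip()]
    if len(lines) < 2:
        return []
    header = lines[0].split()
    # lines[1] is the row of dashes under the header, so data starts at lines[2]
    return [dict(zip(header, line.split())) for line in lines[2:]]
```

If `len(parse_pdfimages_list(output))` is greater than 1 for a page, the extraction for that page yields several files, and each row's `num` value tells which file belongs to which row.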

Perhaps then subdivide the output into "page" elements, which can contain one or more "image" elements. E.g. something like this:

<?xml version='1.0' encoding='UTF-8'?>
<pdfbatchqa>
  <file>
    <filePath></filePath>
    <fileSize></fileSize>
    <pdfinfo>
    </pdfinfo>
    <pages>
      <page>
        <image>
          <pdfimages>
          </pdfimages>
          <exifTool>
          </exifTool>
        </image>
      </page>
    </pages>
  </file>
</pdfbatchqa>
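
A minimal sketch of how this structure could be assembled with Python's `xml.etree.ElementTree`. The element names follow the outline above. The input format (per page, a list of pairs of property dicts) and the plain, prefix-free property names are assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def add_properties(parent, properties):
    """Add one child element per property (name -> value) to parent."""
    for name, value in properties.items():
        ET.SubElement(parent, name).text = str(value)


def build_report(file_path, file_size, pages):
    """Build the report tree; pages is a list (one entry per page) of lists
    of (pdfimages_properties, exiftool_properties) dict pairs."""
    root = ET.Element("pdfbatchqa")
    file_elt = ET.SubElement(root, "file")
    ET.SubElement(file_elt, "filePath").text = str(file_path)
    ET.SubElement(file_elt, "fileSize").text = str(file_size)
    ET.SubElement(file_elt, "pdfinfo")
    pages_elt = ET.SubElement(file_elt, "pages")
    for images in pages:
        page_elt = ET.SubElement(pages_elt, "page")
        for pdfimages_props, exiftool_props in images:
            # One parent element per image, holding the output of both tools
            image_elt = ET.SubElement(page_elt, "image")
            add_properties(ET.SubElement(image_elt, "pdfimages"), pdfimages_props)
            add_properties(ET.SubElement(image_elt, "exifTool"), exiftool_props)
    return root
```

With both tools' output under one "image" parent, a single Schematron rule with "image" as its context can test properties from either child. `ET.indent(root)` (Python 3.9 or later) followed by `ET.tostring(root, encoding="unicode")` gives a readable serialisation.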