drewnoakes / metadata-extractor

Extracts Exif, IPTC, XMP, ICC and other metadata from image, video and audio files
Apache License 2.0

schema namespaces not being observed across multiple xpackets #435

Open akash1810 opened 5 years ago

akash1810 commented 5 years ago

When an image has multiple xpackets that declare the same schema under different namespace prefixes, the prefix declared in each packet is not honoured.

The attached image is a cropped version of an original image supplied by Getty. The image has three xpackets (attached).

In the first packet, the last rdf:Description binds the http://xmp.gettyimages.com/gift/1.0/ schema to the prefix prefix0; the second and third packets use the GettyImagesGIFT prefix for the same schema.

When running the image through ImageMetadataReader, the AssetId in the last packet comes out as prefix0:AssetId rather than GettyImagesGIFT:AssetId, as written in the packet.

java -cp xmpcore-6.0.6.jar:metadata-extractor-2.12.0.jar com.drew.imaging.ImageMetadataReader cech.jpg -markdown

| Directory | Tag Id | Tag Name | Extracted Value |
|---|---|---|---|
| JPEG | 0xfffffffd | Compression Type | Baseline |
| JPEG | 0x0 | Data Precision | 8 bits |
| JPEG | 0x1 | Image Height | 8 pixels |
| JPEG | 0x3 | Image Width | 8 pixels |
| JPEG | 0x5 | Number of Components | 3 |
| JPEG | 0x6 | Component 1 | Y component: Quantization table 0, Sampling factors 1 horiz/1 vert |
| JPEG | 0x7 | Component 2 | Cb component: Quantization table 1, Sampling factors 1 horiz/1 vert |
| JPEG | 0x8 | Component 3 | Cr component: Quantization table 1, Sampling factors 1 horiz/1 vert |
| JpegComment | 0x0 | JPEG Comment | Generated by IJG JPEG Library |
| JpegComment | 0x0 | JPEG Comment | Generated by IJG JPEG Library |
| JFIF | 0x5 | Version | 1.2 |
| JFIF | 0x7 | Resolution Units | inch |
| JFIF | 0x8 | X Resolution | 300 dots |
| JFIF | 0xa | Y Resolution | 300 dots |
| JFIF | 0xc | Thumbnail Width Pixels | 0 |
| JFIF | 0xd | Thumbnail Height Pixels | 0 |
| XMP | 0xffff | XMP Value Count | 29 |
| XMP | | photoshop:AuthorsPosition | Staff |
| XMP | | prefix0:Personality[1] | Petr Cech |
| XMP | | dc:description[1]/xml:lang | x-default |
| XMP | | photoshop:SupplementalCategories[1] | FOC |
| XMP | | photoshop:DateCreated | 2008-08-20 |
| XMP | | Iptc4xmpCore:CountryCode | GBR |
| XMP | | photoshop:Credit | Getty Images |
| XMP | | photoshop:CaptionWriter | jm |
| XMP | | prefix0:CameraMakeModel | Canon EOS-1D Mark III |
| XMP | | photoshop:City | London |
| XMP | | dc:description[1] | LONDON - AUGUST 20: Czech Republic goalkeeper Petr Cech in action during the international friendly match between England and the Czech Republic at Wembley Stadium on August 20, 2008 in London, England. (Photo by Phil Cole/Getty Images) |
| XMP | | photoshop:Headline | England v Czech Republic - International Friendly |
| XMP | | dc:title[1]/xml:lang | x-default |
| XMP | | dc:rights[1]/xml:lang | x-default |
| XMP | | photoshop:TransmissionReference | 81774706 |
| XMP | | photoshop:Source | Getty Images Europe |
| XMP | | prefix0:CameraFilename | 8R8Z0144.JPG |
| XMP | | photoshop:Category | S |
| XMP | | dc:title[1] | 81774706JM148_England_v_Cze |
| XMP | | prefix0:OriginalFilename | 2008208_81774706JM148_England_v_Cze.jpg |
| XMP | | prefix0:OriginalCreateDateTime | 2008-08-20T21:25:49+01:00 |
| XMP | | dc:rights[1] | 2008 Getty Images |
| XMP | | prefix0:TimeShot | 212019+0200 |
| XMP | | photoshop:Country | United Kingdom |
| XMP | | prefix0:Composition | Full Length |
| XMP | | prefix0:ImageRank | 3 |
| XMP | | xmpMM:InstanceID | uuid:faf5bdd5-ba3d-11da-ad31-d33d75182f1b |
| XMP | | dc:creator[1] | Phil Cole |
| XMP | | prefix0:CameraSerialNumber | 0000571198 |
| XMP | 0xffff | XMP Value Count | 43 |
| XMP | | photoshop:AuthorsPosition | Staff |
| XMP | | prefix0:Personality[1] | Petr Cech |
| XMP | | dc:description[1]/xml:lang | x-default |
| XMP | | dc:subject[12] | Soccer |
| XMP | | photoshop:SupplementalCategories[1] | FOC |
| XMP | | dc:subject[14] | UK |
| XMP | | dc:Rights | 2008 Getty Images |
| XMP | | photoshop:SupplementalCategories[3] | SOC |
| XMP | | dc:subject[9] | Activity |
| XMP | | photoshop:DateCreated | 2008-08-20T00:00:00 +00:00 |
| XMP | | prefix0:CallForImage | False |
| XMP | | dc:subject[10] | Wembley Stadium |
| XMP | | dc:subject[7] | Czech Republic |
| XMP | | Iptc4xmpCore:CountryCode | GBR |
| XMP | | photoshop:Credit | Getty Images |
| XMP | | dc:subject[5] | Petr Cech |
| XMP | | dc:subject[3] | Full Body Isolated |
| XMP | | dc:subject[16] | Club Soccer |
| XMP | | dc:subject[1] | England |
| XMP | | dc:description[1] | (FILE PHOTO) Petr Cech has announced he is retiring at the end of the season LONDON - AUGUST 20: Czech Republic goalkeeper Petr Cech in action during the international friendly match between England and the Czech Republic at Wembley Stadium on August 20, 2008 in London, England. (Photo by Phil Cole/Getty Images) |
| XMP | | photoshop:City | London |
| XMP | | dc:title[1]/xml:lang | x-default |
| XMP | | photoshop:Headline | Petr Cech Retiring At End of the Season England v Czech Republic - International Friendly |
| XMP | | dc:subject[11] | Stadium |
| XMP | | photoshop:Source | Getty Images Europe |
| XMP | | plus:ImageSupplierImageId | 82486881 |
| XMP | | prefix0:AssetId | 82486881 |
| XMP | | dc:subject[13] | Friendly Match |
| XMP | | photoshop:Category | S |
| XMP | | photoshop:SupplementalCategories[2] | SPO |
| XMP | | prefix0:OriginalFilename | 55531478 |
| XMP | | dc:title[1] | 82486881 |
| XMP | | prefix0:OriginalCreateDateTime | 0001-01-01T00:00:00 +00:00 |
| XMP | | dc:subject[8] | Full Length |
| XMP | | dc:subject[6] | Vertical |
| XMP | | dc:subject[4] | Sport |
| XMP | | dc:subject[15] | London - England |
| XMP | | dc:subject[2] | Motion |
| XMP | | dc:subject[17] | Goalie |
| XMP | | prefix0:ExclusiveCoverage | False |
| XMP | | photoshop:Country | United Kingdom |
| XMP | | prefix0:ImageRank | 1 |
| XMP | | dc:creator[1] | Phil Cole |
| Huffman | 0x1 | Number of Tables | 4 Huffman tables |
| File Type | 0x1 | Detected File Type Name | JPEG |
| File Type | 0x2 | Detected File Type Long Name | Joint Photographic Experts Group |
| File Type | 0x3 | Detected MIME Type | image/jpeg |
| File Type | 0x4 | Expected File Name Extension | jpg |
| File | 0x1 | File Name | cech.jpg |
| File | 0x2 | File Size | 11104 bytes |
| File | 0x3 | File Modified Date | Thu Sep 05 18:01:45 +01:00 2019 |

It looks like XMPMetaFactory* keeps a cache of previously seen namespaces, and that this is what causes the issue. That is, after parsing the first packet, XMPMetaFactory.schema has an entry of 'prefix0' -> 'http://xmp.gettyimages.com/gift/1.0/'. When the second packet is parsed, although it declares a different prefix (GettyImagesGIFT) for the http://xmp.gettyimages.com/gift/1.0/ schema, its properties are read as prefix0. The same happens for the third packet.

One possible fix for this issue would be to call .reset() at the start of each extraction; however, I'm not sure what impact that would have on performance, and it wouldn't be safe for parallel processing.

Another option could be to seed the XMPSchemaRegistry with known namespaces upfront, before reading any files. This is what ExifTool does and what we've started doing in Grid.
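To make the seeding idea concrete, here is a minimal sketch using a toy registry in which the first prefix registered for a namespace URI wins. ToyRegistry, register and parse are hypothetical names for illustration only and are not xmpcore's implementation. In xmpcore itself the corresponding seeding call would, as far as I can tell, be XMPMetaFactory.getSchemaRegistry().registerNamespace(uri, prefix).

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PrefixRegistryDemo {

    /** Toy stand-in for a process-wide schema registry; NOT xmpcore's implementation. */
    static final class ToyRegistry {
        // namespace URI -> prefix; the first prefix registered for a URI wins
        private final Map<String, String> uriToPrefix = new LinkedHashMap<>();

        /** Registers a prefix unless the URI is already known; returns the prefix in effect. */
        String register(String namespaceUri, String suggestedPrefix) {
            return uriToPrefix.computeIfAbsent(namespaceUri, uri -> suggestedPrefix);
        }

        /** "Parses" a property: the declared prefix is replaced by the registered one. */
        String parse(String namespaceUri, String declaredPrefix, String localName) {
            return register(namespaceUri, declaredPrefix) + ":" + localName;
        }
    }

    static final String GIFT = "http://xmp.gettyimages.com/gift/1.0/";

    public static void main(String[] args) {
        // Unseeded: packet 1 declares prefix0, so later packets inherit it.
        ToyRegistry unseeded = new ToyRegistry();
        unseeded.parse(GIFT, "prefix0", "ImageRank");
        System.out.println(unseeded.parse(GIFT, "GettyImagesGIFT", "AssetId"));

        // Seeded before any file is read: the known prefix wins for every packet.
        ToyRegistry seeded = new ToyRegistry();
        seeded.register(GIFT, "GettyImagesGIFT");
        seeded.parse(GIFT, "prefix0", "ImageRank");
        System.out.println(seeded.parse(GIFT, "GettyImagesGIFT", "AssetId"));
    }
}
```

The trade-off in this model is that seeding makes the output stable regardless of ingestion order, but properties from the first packet are then also reported under the seeded prefix rather than the prefix0 written in that packet.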

What do you think?

*An official repo for Adobe xmpcore isn't available on GitHub, so I'm linking to your copy.

Assets cech

cech-xpacket-1.txt cech-xpacket-2.txt cech-xpacket-3.txt

akash1810 commented 5 years ago

Additionally, if you're running a long-lived process, such as a server, the state of the XMPSchemaRegistry impacts all images, as the XMPMetaFactory is a singleton.

That is, if the first image to be ingested was the attached cech.jpg, any following images that use the Getty schema with the GettyImagesGIFT prefix will appear as prefix0:, which is incorrect. This is noted (with tests) in the Grid PR.