drewnoakes / metadata-extractor

Extracts Exif, IPTC, XMP, ICC and other metadata from image, video and audio files
Apache License 2.0
2.55k stars 480 forks source link

Work around JPEG segments with invalid lengths? #308

Open egouge opened 6 years ago

egouge commented 6 years ago

I'm having problems reading xmp metadata from a jpg file. buffalo_cleaned_metadata

The xmp metadata is readable by other tools. For example the output from exiftool:

$exiftool -xmp:* buffalo_cleaned_metadata.jpg
XMP Toolkit                     : XMP Core 5.1.2
Camera Station                  : 1
Species                         : LIVESTOCK
Group Size                      : 1
Gender                          : MALE
...

However, when I try to read the xmp metadata using the metadata-extractor tools no xmp metadata is returned (the following prints nothing):

Metadata metadata = ImageMetadataReader.readMetadata(new File("buffalo_cleaned_metadata.jpg"));
XmpDirectory xmp = metadata.getFirstDirectoryOfType(XmpDirectory.class);
XMPMeta xmpMeta = xmp.getXMPMeta();
XMPIterator itr = xmpMeta.iterator();
while (itr.hasNext()) {
     XMPPropertyInfo pi = (XMPPropertyInfo) itr.next();
     System.out.println(pi);
}

I did a bunch of debugging and tracked this down to a potential issue with the xmp metadata segment length. The metadata-extractor reads the xmp xml to parse as the following:

<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?> <x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="XMP Core 5.1.2"> 
....
</x:xmpmeta> <?xpacket end="w"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      

As you can see this xml string is missing the last two characters - "?>" - necessary to make this xml valid. So the xml parsing is failing and no xmp tags are returned.

From what I can tell the metadata-extractor assumes the length provided for the xmp segment includes the two bytes required to define the length, whereas the file I have assumes the length is the content length and does not include these two bytes. If I hack up the binary and increase the segment length by two the metadata-extractor happily reads the xmp metadata and reports it to me. xmp_segment_length

I have no idea what the standards are for this. The files I have may be invalid, but is there any way to make this library flexible to deal with these cases?

payton commented 6 years ago

ExifTool also extracts the same XMP Packet (missing the '?>') from the file. Unlike ExifTool, Metadata-Extractor uses xmpcore for extracting the XMP metadata. I feel like this file should be considered invalid due to the incorrect segment size. The XMP documentation on page 10 states the following about XMP packet wrappers:

screen shot 2017-11-28 at 6 56 12 pm

Since there aren't two 'complete' markers, we are not able to identify the bounds of the XMP. I think ExifTool just doesn't check for a packet wrapper ending and reads the XMP data as it scans initially. Ultimately, unless this is a common issue, I think it should probably be considered invalid as the jpeg segment is correctly extracted and it doesn't follow the XMP specifications.

That is my interpretation of the documentation, but feel free to suggest another way to go about this.

drewnoakes commented 6 years ago

Interesting one!

This appears to be a bug in the file. The relevant bytes include the two before the ones highlighted:

FF E1 0B 0E

Where:

The bytes s1 and s2 are taken together to represent a big-endian 16-bit integer specifying the length of the following "data bytes" plus the 2 bytes used to represent the length.

Jumping forward 2958 bytes from 0B takes us here:

image

In other words, the structure of the JPEG file is wrong. As identified, the bytes given to the XMP parser have a truncated packer wrapper close tag.

Metadata Extractor overcomes the problem as the JPEG level. When it doesn't see the required FF byte followed by a non-00/FF byte, it scans forward in the hope of finding it. This catches a few of these cases. However this scanning doesn't provide the extended bytes to the segment handler (XmpReader in this case).

There are always trade-offs to adding workarounds for recoverably invalid data streams. In general I like doing it when it's safe to do so. So how can we do this, and would it introduce other problems? The sample image database will be useful in evaluating prospective solutions.

Approaches I can see:

  1. When scanning forward to find the next JPEG segment, include skipped bytes in the segment data (the general solution)
  2. Modify the XMP handling code to deal with truncated closing packets

The first approach seems more attractive due to its generality, though there are some unanswered questions:

  1. Will it cause breakages on other images? The extra bytes received by segment handlers could be either garbage or treasure.
  2. Will it prohibit moving to the newer JPEG parsing approach that's being explored in the .NET library? Namely that JPEG data isn't copied in bulk to byte[] objects on the heap. This approach should reduce GC pressure by processing data incrementally. Instead of arrays, the segment handlers are given a stream to read from, and a length.

So, some research is required.

drewnoakes commented 6 years ago

I tried out the first approach. It made the code in JpegSegmentReader quite a bit more complex (maybe there's a nicer way to implement it). The image here parses correctly. The only other changes were to images from the JPEG ImageTestSuite, most of which are known to be invalid anyway.

egouge commented 6 years ago

Thanks for feedback. If there is a possibility your investigation/fix can be integrated into the library without causing other problems, that would be great. If not, we'll have to come up with a workaround. I can do some testing or provide additional files for testing if that would help.

I appreciate the quick response. Thanks.

Nadahar commented 6 years ago

@egouge If you have more of these, I guess it would be interesting to know what the source is (what camera, device or software that produces these).

egouge commented 6 years ago

My understanding is that they are produced by a customized camera trap used for wildlife monitoring. I do not know the details, but I did inform our contact with that project of the issue with the file. Hopefully eventually they can fix it but in the meantime we have a collection of files whose metadata we'd like to be able to read.

drewnoakes commented 6 years ago

Here's the diff to JpegSegmentReader that extends the byte array before passing it on.

diff --git a/Source/com/drew/imaging/jpeg/JpegSegmentReader.java b/Source/com/drew/imaging/jpeg/JpegSegmentReader.java
index 55e6bd01..c95be822 100644
--- a/Source/com/drew/imaging/jpeg/JpegSegmentReader.java
+++ b/Source/com/drew/imaging/jpeg/JpegSegmentReader.java
@@ -28,6 +28,8 @@ import com.drew.lang.annotations.Nullable;
 import java.io.File;
 import java.io.FileInputStream;
 import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Arrays;
 import java.util.HashSet;
 import java.util.Set;

@@ -112,6 +114,9 @@ public class JpegSegmentReader

         JpegSegmentData segmentData = new JpegSegmentData();

+        byte lastSegmentType = 0;
+        byte[] segmentBytes = null;
+
         do {
             // Find the segment marker. Markers are zero or more 0xFF bytes, followed
             // by a 0xFF and then a byte not equal to 0x00 or 0xFF.
@@ -120,11 +125,28 @@ public class JpegSegmentReader
             byte segmentType = reader.getInt8();

             // Read until we have a 0xFF byte followed by a byte that is not 0xFF or 0x00
+            ArrayList<Byte> extendedBy = null;
             while (segmentIdentifier != SEGMENT_IDENTIFIER || segmentType == SEGMENT_IDENTIFIER || segmentType == 0) {
+                if (extendedBy == null)
+                    extendedBy = new ArrayList<Byte>();
+                extendedBy.add(segmentIdentifier);
                segmentIdentifier = segmentType;
                segmentType = reader.getInt8();
             }

+            if (segmentBytes != null) {
+                if (extendedBy != null) {
+                    byte[] extendedBytes = Arrays.copyOf(segmentBytes, segmentBytes.length + extendedBy.size());
+                    for (int i = 0; i < extendedBy.size(); i++) {
+                        extendedBytes[segmentBytes.length + i] = extendedBy.get(i);
+                    }
+                    segmentBytes = extendedBytes;
+                }
+
+                segmentData.addSegment(lastSegmentType, segmentBytes);
+                segmentBytes = null;
+            }
+
             if (segmentType == SEGMENT_SOS) {
                 // The 'Start-Of-Scan' segment's length doesn't include the image data, instead would
                 // have to search for the two bytes: 0xFF 0xD9 (EOI).
@@ -148,9 +170,10 @@ public class JpegSegmentReader

             // Check whether we are interested in this segment
             if (segmentTypeBytes == null || segmentTypeBytes.contains(segmentType)) {
-                byte[] segmentBytes = reader.getBytes(segmentLength);
+                segmentBytes = reader.getBytes(segmentLength);
                 assert (segmentLength == segmentBytes.length);
-                segmentData.addSegment(segmentType, segmentBytes);
+                lastSegmentType = segmentType;
             } else {
                 // Skip this segment
                 if (!reader.trySkip(segmentLength)) {

I'm just capturing it here for now. Still not sure what the best approach is in this case.

egouge commented 6 years ago

I just got an update from the generator of the source of the files and they have updated their code to fix this issue so I guess this is not a bit issue for us anymore. But if the fix is useful to the library we can still do some testing. Thanks again for all your help.

drewnoakes commented 6 years ago

Thanks for following up. That definitely lowers the priority, but I'll leave this open.