Open egouge opened 6 years ago
ExifTool also extracts the same XMP Packet (missing the '?>') from the file. Unlike ExifTool, Metadata-Extractor uses xmpcore for extracting the XMP metadata. I feel like this file should be considered invalid due to the incorrect segment size. The XMP documentation on page 10 states the following about XMP packet wrappers:
Since there aren't two 'complete' markers, we are not able to identify the bounds of the XMP. I think ExifTool just doesn't check for a packet wrapper ending and instead reads the XMP data as it scans. Ultimately, unless this is a common issue, I think the file should probably be considered invalid: the JPEG segment is correctly extracted, and its contents don't follow the XMP specification.
That is my interpretation of the documentation, but feel free to suggest another way to go about this.
Interesting one!
This appears to be a bug in the file. The relevant bytes include the two before the ones highlighted:
FF E1 0B 8E

Where:

- `FF` is the JPEG segment start byte
- `E1` is the JPEG segment type (APP1)
- `0B 8E` is the length of the segment (s1 s2), 2958 decimal

The bytes s1 and s2 are taken together to represent a big-endian 16-bit integer specifying the length of the following "data bytes" plus the 2 bytes used to represent the length itself.

Jumping forward 2958 bytes from `0B` takes us here:
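The length decoding described above can be sketched in Java. This is a hypothetical helper for illustration, not part of the library:

```java
// Sketch: decoding the length bytes of a JPEG segment header.
// Hypothetical helper, not metadata-extractor code.
public class SegmentLengthDemo {
    /** Combines the two length bytes (s1 s2) into a big-endian 16-bit value. */
    static int segmentLength(byte s1, byte s2) {
        return ((s1 & 0xFF) << 8) | (s2 & 0xFF);
    }

    public static void main(String[] args) {
        // FF E1 0B 8E -> APP1 segment, length bytes 0x0B 0x8E
        int length = segmentLength((byte) 0x0B, (byte) 0x8E);
        System.out.println(length);      // 2958
        // The length includes its own two bytes, so the payload is 2 shorter.
        System.out.println(length - 2);  // 2956
    }
}
```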
In other words, the structure of the JPEG file is wrong. As identified, the bytes given to the XMP parser have a truncated packet wrapper close tag.
Metadata Extractor works around the problem at the JPEG level. When it doesn't see the required `FF` byte followed by a non-`00`/`FF` byte, it scans forward in the hope of finding it. This catches a few of these cases. However, this scanning doesn't provide the extended bytes to the segment handler (`XmpReader` in this case).
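The scan-forward behaviour just described can be sketched roughly as follows. This is an illustration of the idea only, not the actual `JpegSegmentReader` implementation:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Sketch of the marker scan described above: skip bytes until we see
// 0xFF followed by a byte that is neither 0x00 nor 0xFF.
// Hypothetical illustration, not the actual JpegSegmentReader code.
public class MarkerScanDemo {
    /** Returns the next segment type byte, scanning past any junk bytes. */
    static int nextSegmentType(InputStream in) throws IOException {
        int identifier = in.read();
        int type = in.read();
        // Keep sliding forward one byte at a time until a valid marker appears.
        while (identifier != 0xFF || type == 0xFF || type == 0x00) {
            identifier = type;
            type = in.read();
            if (type == -1)
                throw new IOException("No segment marker found");
        }
        return type;
    }

    public static void main(String[] args) throws IOException {
        // Two junk bytes, then a valid APP1 marker (FF E1).
        byte[] junkThenMarker = {0x12, 0x34, (byte) 0xFF, (byte) 0xE1};
        int type = nextSegmentType(new ByteArrayInputStream(junkThenMarker));
        System.out.printf("Found segment type 0x%02X%n", type); // 0xE1 (APP1)
    }
}
```

Note that in this sketch the junk bytes are simply discarded, which mirrors the problem described above: the preceding segment never receives them.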
There are always trade-offs to adding workarounds for recoverably invalid data streams. In general I like doing it when it's safe to do so. So how can we do this, and would it introduce other problems? The sample image database will be useful in evaluating prospective solutions.
Approaches I can see:
The first approach seems more attractive due to its generality, though there are some unanswered questions:
`byte[]` objects on the heap. This approach should reduce GC pressure by processing data incrementally: instead of arrays, the segment handlers are given a stream to read from, and a length.

So, some research is required.
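The stream-based idea mentioned above could look roughly like this. Both the `SegmentHandler` interface and `BoundedInputStream` here are hypothetical names invented for the sketch, not the actual metadata-extractor API:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical handler interface: receives a length-bounded stream rather
// than a fully buffered byte[]. Not the actual metadata-extractor API.
interface SegmentHandler {
    void readSegment(byte segmentType, InputStream segmentStream, int length) throws IOException;
}

// Wraps a stream so a handler can read at most `length` bytes of it.
class BoundedInputStream extends InputStream {
    private final InputStream in;
    private int remaining;

    BoundedInputStream(InputStream in, int length) {
        this.in = in;
        this.remaining = length;
    }

    @Override
    public int read() throws IOException {
        if (remaining <= 0)
            return -1;  // segment boundary reached
        remaining--;
        return in.read();
    }
}

public class StreamingSegmentsDemo {
    public static void main(String[] args) throws IOException {
        InputStream file = new ByteArrayInputStream(new byte[]{1, 2, 3, 4, 5});
        // Hand a handler only the first 3 bytes, as if they were one segment.
        InputStream segment = new BoundedInputStream(file, 3);
        int b;
        while ((b = segment.read()) != -1)
            System.out.println(b);  // prints 1, 2, 3
    }
}
```

One open question with this design is exactly the one raised in this thread: a bounded stream fixes the buffering cost, but the reader would still need a way to hand any "extended" bytes found during the marker scan to the previous segment's handler.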
I tried out the first approach. It made the code in `JpegSegmentReader` quite a bit more complex (maybe there's a nicer way to implement it). The image here parses correctly. The only other changes were to images from the JPEG ImageTestSuite, most of which are known to be invalid anyway.
Thanks for feedback. If there is a possibility your investigation/fix can be integrated into the library without causing other problems, that would be great. If not, we'll have to come up with a workaround. I can do some testing or provide additional files for testing if that would help.
I appreciate the quick response. Thanks.
@egouge If you have more of these, I guess it would be interesting to know what the source is (what camera, device or software that produces these).
My understanding is that they are produced by a customized camera trap used for wildlife monitoring. I do not know the details, but I did inform our contact with that project of the issue with the file. Hopefully eventually they can fix it but in the meantime we have a collection of files whose metadata we'd like to be able to read.
Here's the diff to `JpegSegmentReader` that extends the byte array before passing it on.
diff --git a/Source/com/drew/imaging/jpeg/JpegSegmentReader.java b/Source/com/drew/imaging/jpeg/JpegSegmentReader.java
index 55e6bd01..c95be822 100644
--- a/Source/com/drew/imaging/jpeg/JpegSegmentReader.java
+++ b/Source/com/drew/imaging/jpeg/JpegSegmentReader.java
@@ -28,6 +28,8 @@ import com.drew.lang.annotations.Nullable;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
@@ -112,6 +114,9 @@ public class JpegSegmentReader
JpegSegmentData segmentData = new JpegSegmentData();
+ byte lastSegmentType = 0;
+ byte[] segmentBytes = null;
+
do {
// Find the segment marker. Markers are zero or more 0xFF bytes, followed
// by a 0xFF and then a byte not equal to 0x00 or 0xFF.
@@ -120,11 +125,28 @@ public class JpegSegmentReader
byte segmentType = reader.getInt8();
// Read until we have a 0xFF byte followed by a byte that is not 0xFF or 0x00
+ ArrayList<Byte> extendedBy = null;
while (segmentIdentifier != SEGMENT_IDENTIFIER || segmentType == SEGMENT_IDENTIFIER || segmentType == 0) {
+ if (extendedBy == null)
+ extendedBy = new ArrayList<Byte>();
+ extendedBy.add(segmentIdentifier);
segmentIdentifier = segmentType;
segmentType = reader.getInt8();
}
+ if (segmentBytes != null) {
+ if (extendedBy != null) {
+ byte[] extendedBytes = Arrays.copyOf(segmentBytes, segmentBytes.length + extendedBy.size());
+ for (int i = 0; i < extendedBy.size(); i++) {
+ extendedBytes[segmentBytes.length + i] = extendedBy.get(i);
+ }
+ segmentBytes = extendedBytes;
+ }
+
+ segmentData.addSegment(lastSegmentType, segmentBytes);
+ segmentBytes = null;
+ }
+
if (segmentType == SEGMENT_SOS) {
// The 'Start-Of-Scan' segment's length doesn't include the image data, instead would
// have to search for the two bytes: 0xFF 0xD9 (EOI).
@@ -148,9 +170,10 @@ public class JpegSegmentReader
// Check whether we are interested in this segment
if (segmentTypeBytes == null || segmentTypeBytes.contains(segmentType)) {
- byte[] segmentBytes = reader.getBytes(segmentLength);
+ segmentBytes = reader.getBytes(segmentLength);
assert (segmentLength == segmentBytes.length);
- segmentData.addSegment(segmentType, segmentBytes);
+ lastSegmentType = segmentType;
} else {
// Skip this segment
if (!reader.trySkip(segmentLength)) {
I'm just capturing it here for now. Still not sure what the best approach is in this case.
I just got an update from the producer of the files: they have updated their code to fix this issue, so I guess it's not a big issue for us anymore. But if the fix is useful to the library, we can still do some testing. Thanks again for all your help.
Thanks for following up. That definitely lowers the priority, but I'll leave this open.
I'm having problems reading XMP metadata from a JPG file.

The XMP metadata is readable by other tools. For example, the output from exiftool:

However, when I try to read the XMP metadata using metadata-extractor, no XMP metadata is returned (the following prints nothing):

I did a bunch of debugging and tracked this down to a potential issue with the XMP segment length. metadata-extractor reads the following XMP XML for parsing:

As you can see, this XML string is missing the last two characters, "?>", needed to make it valid, so the XML parsing fails and no XMP tags are returned.
From what I can tell, metadata-extractor assumes the length provided for the XMP segment includes the two bytes used to encode the length itself, whereas the file I have treats the length as the content length only, excluding those two bytes. If I hack up the binary and increase the segment length by two, metadata-extractor happily reads the XMP metadata and reports it to me.
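The "increase the segment length by two" hack could be sketched like this. The offset of the length bytes is an assumption for the demo; a real patch would first scan the file for the `FF E1` marker:

```java
// Sketch of the workaround described above: bump the big-endian 16-bit
// length field of a segment by two. Offsets here are hypothetical.
public class LengthPatchDemo {
    /** Increments the 16-bit big-endian length stored at `offset` by 2. */
    static void bumpLength(byte[] jpeg, int offset) {
        int length = ((jpeg[offset] & 0xFF) << 8) | (jpeg[offset + 1] & 0xFF);
        length += 2;
        jpeg[offset] = (byte) (length >> 8);
        jpeg[offset + 1] = (byte) length;
    }

    public static void main(String[] args) {
        // FF E1 0B 8E: APP1 marker followed by length bytes (2958 decimal).
        byte[] header = {(byte) 0xFF, (byte) 0xE1, 0x0B, (byte) 0x8E};
        bumpLength(header, 2);  // the length bytes follow FF E1
        System.out.printf("%02X %02X%n",
                header[2] & 0xFF, header[3] & 0xFF);  // 0B 90 (2960 decimal)
    }
}
```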
I have no idea what the standards are for this. The files I have may be invalid, but is there any way to make this library flexible to deal with these cases?