Interpret UserComment character code

drewnoakes / metadata-extractor-dotnet

Extracts Exif, IPTC, XMP, ICC and other metadata from image, video and audio files

Other

939 stars 167 forks source link

Interpret UserComment character code #277

Closed patricksadowski closed 3 years ago

patricksadowski commented 3 years ago

Reads first 8 bytes for character encoding recognition to comply with EXIF standard.

It's not necessary to read 10 bytes at the beginning. Reading these two extra bytes could lead to missing bytes and the end. Switching the byte order for UNICODE to big-endian solves the issue with 8 and 10 bytes reads.

Only for ASCII NULL termination is done, "since the type is not ASCII, NULL termination is not necessary".

Closes #173

patricksadowski commented 3 years ago

If a run MetadataExtractor.Tools.FileProcessor on metadata-extractor-images I get following changes:

jpg/metadata/dotnet/Issue 308.jpg.txt
@@ -34 +34 @@ TYPE: JPEG
-[Exif SubIFD - 0x9286] User Comment = ASCII
+[Exif SubIFD - 0x9286] User Comment =
b/jpg/metadata/dotnet/Samsung SM-C101 (Galaxy S4 Zoom).jpg.txt
@@ -71 +71 @@ TYPE: JPEG
-[GPS - 0x001b] GPS Processing Method = ASCII
+[GPS - 0x001b] GPS Processing Method =

and the sample image of #173 works too.

drewnoakes commented 3 years ago

Thanks, this looks great. I'll add the image from 173 to the library too, as it's a more interesting example than the ones in the diff above.

Seems tests are failing. I don't think they're all related to this PR, so I'll fix those up after merging this.

drewnoakes commented 3 years ago

I patched up the unit tests.

With respect the UNICODE byte order (i.e. change to big-endian in this PR), it appears that the byte order of this UNICODE data is determined by the JPEG file's own byte order (i.e. II/MM). So this change isn't technically correct, but neither was the previous code.

patricksadowski commented 3 years ago

Seems tests are failing.

Sorry, I forgot the tests. Next time I start with the tests. Thanks for fixing.

So this change isn't technically correct

I'm aware of this. It's a lot closer to the specification than before (8 bytes vs. 10 bytes). The specification is not clear about the endianess. I can't find any hint. Maybe I find a solution.