drewnoakes / metadata-extractor-dotnet

Extracts Exif, IPTC, XMP, ICC and other metadata from image, video and audio files

Encoding problems in Canon makernote "Owner Name" string #19

Open drewnoakes opened 9 years ago

drewnoakes commented 9 years ago
diff --git a/jpg/metadata/canon ixus 750.jpg.txt b/jpg/metadata/canon ixus 750.jpg.txt
index 2fe3e71..55fcb6d 100644
--- a/jpg/metadata/canon ixus 750.jpg.txt   
+++ b/jpg/metadata/canon ixus 750.jpg.txt   
@@ -137,7 +137,7 @@ FILE: Canon IXUS 750.jpg
 [Canon Makernote - 0x0006] Image Type = IMG:DIGITAL IXUS 750 JPEG
 [Canon Makernote - 0x0007] Firmware Version = Firmware Version 1.00
 [Canon Makernote - 0x0008] Image Number = 1003892
-[Canon Makernote - 0x0009] Owner Name = Renè Glanzer
+[Canon Makernote - 0x0009] Owner Name = Ren� Glanzer
 [Canon Makernote - 0x000d] Camera Info Array = [94 values]
 [Canon Makernote - 0x0010] Canon Model ID = 26214400
diff --git a/jpg/metadata/canon powershot g11.jpg.txt b/jpg/metadata/canon powershot g11.jpg.txt
index 98eb3fb..5f23627 100644
--- a/jpg/metadata/canon powershot g11.jpg.txt  
+++ b/jpg/metadata/canon powershot g11.jpg.txt  
@@ -141,7 +141,7 @@ FILE: Canon PowerShot G11.jpg
 [Canon Makernote - 0x0006] Image Type = IMG:PowerShot G11 JPEG
 [Canon Makernote - 0x0007] Firmware Version = Firmware Version 1.00
 [Canon Makernote - 0x0008] Image Number = 4120954
-[Canon Makernote - 0x0009] Owner Name = Balázs Iván József +36304028290
+[Canon Makernote - 0x0009] Owner Name = Bal�zs Iv�n J�zsef +36304028290
 [Canon Makernote - 0x000d] Camera Info Array = [171 values]
 [Canon Makernote - 0x0010] Canon Model ID = 40894464
 [Canon Makernote - 0x0026] AF Info Array 2 = [48 values]
kwhopper commented 9 years ago

I think this happens because the C# and Java versions use different default encodings when reading null-terminated strings that have no character set designated in the Exif data. Without a designator, it's impossible to know which encoding was intended.

In C#, for the method IndexedReader.GetNullTerminatedString, you're presuming UTF8 on a byte array without a known character set:

return Encoding.UTF8.GetString(bytes, 0, length);

In Java, the method RandomAccessReader.getNullTerminatedString uses a String constructor that doesn't designate a character set. It's my understanding that Java then uses the platform's default character set, which varies by system (UTF-8, windows-1252, etc.):

return new String(bytes, 0, length);
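To make the implicit behaviour visible, here is a minimal sketch (the class and method names are hypothetical, not part of metadata-extractor). It compares the charset-less constructor with one that passes Charset.defaultCharset() explicitly; on a given JVM the two should produce the same string, which is why the output depends on where the code runs.

```java
import java.nio.charset.Charset;

public class DefaultCharsetDemo {
    // Same call shape as the snippet above: no charset is named,
    // so the JVM falls back to its default charset.
    static String decodeImplicit(byte[] bytes) {
        return new String(bytes, 0, bytes.length);
    }

    // Spells out what the implicit version does.
    static String decodeExplicitDefault(byte[] bytes) {
        return new String(bytes, 0, bytes.length, Charset.defaultCharset());
    }

    public static void main(String[] args) {
        // 0xE8 is a non-ASCII byte; how it decodes depends on the charset in use.
        byte[] sample = {'R', 'e', 'n', (byte) 0xE8, ' ', 'G'};
        System.out.println("Default charset: " + Charset.defaultCharset());
        System.out.println("Implicit == explicit default: "
                + decodeImplicit(sample).equals(decodeExplicitDefault(sample)));
    }
}
```

Note that since JDK 18 the default charset is UTF-8 unless overridden; on older JVMs it is derived from the operating system's locale settings.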

The only way to make the two versions match is to designate the same character set explicitly in both. I don't think it really matters which; it's up to you how much should be considered "readable" overall given that the byte arrays do not carry a set designator. UTF8 is the most common, and what was really intended for Exif (regardless of whether people coded to spec). Java would be:

return new String(bytes, 0, length, Charset.forName("UTF-8"));

OTOH, I changed the C# GetNullTerminatedString to this and it matched Java (at least on the platform where you ran the test):

return Encoding.GetEncoding(1252).GetString(bytes, 0, length);

If you use that one, the Java explicit equivalent is something like (may not be exact):

return new String(bytes, 0, length, Charset.forName("windows-1252"));

Up to you - particularly if it causes side effects with other fields that use these same methods. Thanks
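For illustration, a small sketch of how the same bytes give the two outputs seen in the diff. It assumes the camera stored the owner name in a single-byte code page where 0xE8 is 'è' (true for windows-1252); the class name and byte array are made up for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class OwnerNameDecoding {
    // Assumed raw bytes of "Renè Glanzer" in a single-byte code page.
    static final byte[] OWNER_NAME = {
        'R', 'e', 'n', (byte) 0xE8, ' ', 'G', 'l', 'a', 'n', 'z', 'e', 'r'
    };

    // Decode with an explicitly named charset instead of a default.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, 0, bytes.length, charset);
    }

    public static void main(String[] args) {
        // windows-1252 maps 0xE8 to 'è'.
        String cp1252 = decode(OWNER_NAME, Charset.forName("windows-1252"));
        // In UTF-8, 0xE8 starts a multi-byte sequence; the following space
        // is not a continuation byte, so the decoder emits U+FFFD instead.
        String utf8 = decode(OWNER_NAME, StandardCharsets.UTF_8);

        System.out.println(cp1252);
        System.out.println(utf8);
        System.out.println("UTF-8 result contains U+FFFD: " + (utf8.indexOf('\uFFFD') >= 0));
    }
}
```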

drewnoakes commented 9 years ago

Thanks for your rigour on this. I'll take a look through the various camera makernote samples and see whether there's consistent use of encoding 1252. If a manufacturer uses consistent encoding in their makernotes, then there's more opportunity to do the right thing in both the Java and .NET versions by being explicit.

In general I'd like to move towards a design where the bytes are not decoded at extraction time, but rather when the user asks for the description/string. In that way the user may specify an encoding in cases where they know best.
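A rough sketch of what such a deferred-decoding value could look like, in Java. RawStringValue and its members are hypothetical names for illustration only, not an existing type in either library; the UTF-8 fallback is likewise just an assumption.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public final class RawStringValue {
    private final byte[] bytes;
    private final Charset hint; // null when the source gives no encoding

    public RawStringValue(byte[] bytes, Charset hint) {
        this.bytes = bytes.clone(); // defensive copy keeps the value immutable
        this.hint = hint;
    }

    // Raw bytes exactly as extracted, for callers who want to decode themselves.
    public byte[] getBytes() {
        return bytes.clone();
    }

    // Decoding happens only here, with the charset the caller chooses.
    public String toString(Charset charset) {
        return new String(bytes, charset);
    }

    // Falls back to the hint, or to UTF-8 (an assumed default) if there is none.
    @Override
    public String toString() {
        return toString(hint != null ? hint : StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] raw = {'R', 'e', 'n', (byte) 0xE8, ' ', 'G'};
        RawStringValue value = new RawStringValue(raw, null);
        // A caller who knows the camera's code page can override the default.
        System.out.println(value.toString(Charset.forName("windows-1252")));
    }
}
```

Because the bytes are kept intact, a wrong guess at extraction time no longer destroys information; the caller can always decode again with a different charset.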

There's also some code that attempts to guess at the encoding, which might be worth looking at in this case too.
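As one possible shape for such a guess (not the existing code referred to above, just a common heuristic): try a strict UTF-8 decode first, and fall back to a single-byte code page only if the bytes are not valid UTF-8. The class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class EncodingGuess {
    // Strict UTF-8 first; if the bytes are malformed, use the fallback charset.
    static String decodeWithFallback(byte[] bytes, Charset fallback) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            // Not valid UTF-8, so assume a legacy single-byte encoding.
            return new String(bytes, fallback);
        }
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");
        // "Balázs" stored as single-byte code page bytes (0xE1 = 'á').
        byte[] legacy = {'B', 'a', 'l', (byte) 0xE1, 'z', 's'};
        // The same name stored as UTF-8.
        byte[] utf8 = "Bal\u00E1zs".getBytes(StandardCharsets.UTF_8);

        System.out.println(decodeWithFallback(legacy, cp1252));
        System.out.println(decodeWithFallback(utf8, cp1252));
    }
}
```

This works reasonably well because non-ASCII text in a single-byte code page is rarely valid UTF-8 by accident, but it remains a guess: the fallback charset still has to be chosen by someone.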

drewnoakes commented 8 years ago

http://www.i18nqa.com/debug/utf8-debug.html