Open drewnoakes opened 9 years ago
I think this happens because the C# and Java versions use different default encodings when reading null-terminated strings that have no character set designated in the Exif data. It's impossible to know what encoding was intended.
In C#, for the method IndexedReader.GetNullTerminatedString, you're presuming UTF8 on a byte array without a known character set:
return Encoding.UTF8.GetString(bytes, 0, length);
In Java, the method RandomAccessReader.getNullTerminatedString uses a String constructor that doesn't designate a character set. It's my understanding that this constructor falls back to the platform's default character set, which depends on the local system and could be UTF8, UTF16, 1252, etc:
return new String(bytes, 0, length);
The only way to make the two versions match is to designate the same character set explicitly in both. I don't think it really matters which; it's up to you how much should be considered "readable" overall given that the byte arrays do not carry a set designator. UTF8 is the most common, and what was really intended for Exif (regardless of whether people coded to spec). Java would be:
return new String(bytes, 0, length, Charset.forName("UTF-8"));
OTOH, I changed the C# GetNullTerminatedString to this and it matched Java (at least on the platform where you ran the test):
return Encoding.GetEncoding(1252).GetString(bytes, 0, length);
If you use that one, the Java explicit equivalent is something like (may not be exact):
return new String(bytes, 0, length, Charset.forName("windows-1252"));
Up to you - particularly if it causes side effects with other fields that use these same methods. Thanks
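To illustrate why the two candidate charsets give different results, here is a minimal Java sketch. The class name CharsetComparison and the sample bytes are made up for illustration and are not taken from the library; it also assumes the JVM ships the windows-1252 charset, which is typical but not guaranteed.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the library.
class CharsetComparison {
    // Decode with an explicit charset so the result does not depend
    // on the platform default.
    static String decode(byte[] bytes, int offset, int length, Charset charset) {
        return new String(bytes, offset, length, charset);
    }

    public static void main(String[] args) {
        // 0xA9 is the copyright sign in windows-1252, but on its own it is
        // a malformed sequence in UTF-8 and gets replaced by U+FFFD.
        byte[] bytes = { 'A', 'B', (byte) 0xA9, 0 };
        int length = 3; // exclude the null terminator

        String utf8 = decode(bytes, 0, length, StandardCharsets.UTF_8);
        String cp1252 = decode(bytes, 0, length, Charset.forName("windows-1252"));

        System.out.println("Same result: " + utf8.equals(cp1252));
    }
}
```

Any byte above 0x7F can trigger this kind of mismatch, while pure ASCII content decodes identically under both charsets.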
Thanks for your rigour on this. I'll take a look through the various camera makernote samples and see whether there's consistent use of encoding 1252. If a manufacturer uses consistent encoding in their makernotes, then there's more opportunity to do the right thing in both the Java and .NET versions by being explicit.
In general I'd like to move towards a design where the bytes are not decoded at extraction time, but rather when the user asks for the description/string. In that way the user may specify an encoding in cases where they know best.
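A rough sketch of what deferred decoding could look like in Java follows. RawStringValue is a hypothetical name and not an existing class in either library, and the UTF-8 default is only an assumption for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical holder that keeps the extracted bytes undecoded.
final class RawStringValue {
    private final byte[] bytes;

    RawStringValue(byte[] source, int offset, int length) {
        // Copy so later changes to the source buffer don't affect this value.
        this.bytes = Arrays.copyOfRange(source, offset, offset + length);
    }

    // Callers who know the right encoding can ask for it explicitly.
    String toString(Charset charset) {
        return new String(bytes, charset);
    }

    // Assumed default for this sketch; the real default is a design decision.
    @Override
    public String toString() {
        return toString(StandardCharsets.UTF_8);
    }

    // Expose a copy of the raw bytes for callers who want to decode themselves.
    byte[] getBytes() {
        return bytes.clone();
    }
}
```

With this shape, extraction never loses information, and the choice of charset moves to the point where a description or string is requested.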
There's also some code that attempts to guess at the encoding, which might be worth looking at in this case too.
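For reference, one common guessing approach is to try strict UTF-8 first and fall back to a single-byte charset when the bytes are not valid UTF-8. The sketch below is not the library's existing guessing code; the class EncodingGuess and the windows-1252 fallback are assumptions chosen for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the library.
final class EncodingGuess {
    static String decodeWithFallback(byte[] bytes, int offset, int length) {
        // A strict decoder throws on invalid input instead of silently
        // substituting U+FFFD.
        CharsetDecoder strictUtf8 = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return strictUtf8.decode(ByteBuffer.wrap(bytes, offset, length)).toString();
        } catch (CharacterCodingException e) {
            // Not valid UTF-8: assume windows-1252 (an assumption of this sketch).
            return new String(bytes, offset, length, Charset.forName("windows-1252"));
        }
    }
}
```

This heuristic can still guess wrong, since some windows-1252 byte sequences happen to be valid UTF-8, so it complements rather than replaces letting the user specify an encoding.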