UNICODE EXIF UserComment tag read as BigEndian Unicode results in incorrect decoding #423

drewnoakes / metadata-extractor-dotnet

Extracts Exif, IPTC, XMP, ICC and other metadata from image, video and audio files

Open RupertAvery opened 1 month ago

RupertAvery commented 1 month ago

The file's metadata has a UserComment tag in the Exif SubIFD directory that contains UNICODE-encoded text (a JSON payload). The existing code decodes this text as BigEndianUnicode, which produces incorrect output.

If the UNICODE entry of the encodingMap in TagDescriptor is set to Encoding.Unicode (little-endian UTF-16), the text decodes properly.

Should this be just Unicode? Is there a discriminator that determines which endianness it should use?
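
To illustrate the symptom, here is a minimal sketch that decodes the same UTF-16 bytes with both byte orders. The class and method names are hypothetical and the payload is a stand-in, not the JSON from the reported file.

```csharp
using System;
using System.Text;

public static class EndiannessDemo
{
    // Decodes the same UTF-16 bytes once as little-endian and once as big-endian.
    public static (string LittleEndian, string BigEndian) DecodeBoth(byte[] utf16Bytes)
    {
        return (Encoding.Unicode.GetString(utf16Bytes),
                Encoding.BigEndianUnicode.GetString(utf16Bytes));
    }

    // Produces little-endian UTF-16 bytes, as the file in this report appears to contain.
    public static byte[] EncodeLittleEndian(string text)
    {
        return Encoding.Unicode.GetBytes(text);
    }
}
```

Calling `EndiannessDemo.DecodeBoth(EndiannessDemo.EncodeLittleEndian("{\"a\":1}"))` returns the original text in `LittleEndian`, while `BigEndian` holds unrelated code points, because the two bytes of every character are read in swapped order.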

RupertAvery commented 1 month ago

MetadataExtractor/TagDescriptor.cs @ L.370

            // TODO use ByteTrie here
            // Someone suggested "ISO-8859-1".
            var encodingMap = new Dictionary<string, Encoding>
            {
                ["ASCII"] = Encoding.ASCII,
                ["UTF8"] = Encoding.UTF8,
#pragma warning disable SYSLIB0001 // Type or member is obsolete
                ["UTF7"] = Encoding.UTF7,
#pragma warning restore SYSLIB0001 // Type or member is obsolete
                ["UTF32"] = Encoding.UTF32,
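                // Affected code: the existing mapping is Encoding.BigEndianUnicode;
                // changing it to Encoding.Unicode (little-endian) decodes this file correctly.
                ["UNICODE"] = Encoding.Unicode,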
            };

drewnoakes commented 3 weeks ago

It's a good question. Unfortunately, there might not be one true answer. Perhaps the endianness of the TIFF data stream should be used. However, I doubt that different cameras and software handle this consistently.
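
As a rough sketch of the "follow the TIFF stream" idea: a TIFF header starts with the byte-order mark "II" (little-endian) or "MM" (big-endian), so a decoder could pick the UTF-16 variant from it. The helper below is hypothetical and not part of the library; whether real files follow this convention is exactly the open question.

```csharp
using System;
using System.Text;

public static class TiffByteOrder
{
    // Hypothetical helper: maps the 2-byte TIFF byte-order mark to a UTF-16 encoding.
    public static Encoding Utf16EncodingFor(byte[] tiffHeader)
    {
        if (tiffHeader == null || tiffHeader.Length < 2)
            throw new ArgumentException("TIFF header too short.", nameof(tiffHeader));

        // "II" (0x49 0x49): Intel byte order, little-endian.
        if (tiffHeader[0] == 0x49 && tiffHeader[1] == 0x49)
            return Encoding.Unicode;

        // "MM" (0x4D 0x4D): Motorola byte order, big-endian.
        if (tiffHeader[0] == 0x4D && tiffHeader[1] == 0x4D)
            return Encoding.BigEndianUnicode;

        throw new FormatException("Unrecognized TIFF byte-order mark.");
    }
}
```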

Generally in this case I run the code before/after on the regression test suite to see whether it helps more than it hurts.

A workaround is to extract the comment's raw bytes (from its StringValue) and decode them yourself with an explicit encoding.
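
A minimal sketch of that workaround, assuming the raw UserComment bytes are already in hand (for example, the bytes carried by the tag's StringValue) and that they start with the 8-byte Exif character code such as "UNICODE\0". `UserCommentDecoder` is a hypothetical name, and treating every non-UNICODE code as ASCII is a simplification.

```csharp
using System;
using System.Text;

public static class UserCommentDecoder
{
    // Exif prefixes the comment with an 8-byte character code,
    // e.g. "UNICODE\0" or "ASCII\0\0\0".
    private const int HeaderLength = 8;

    // Decodes the comment, using the caller-chosen encoding for UNICODE payloads.
    public static string Decode(byte[] userComment, Encoding unicodeEncoding)
    {
        if (userComment == null || userComment.Length < HeaderLength)
            return string.Empty;

        string code = Encoding.ASCII
            .GetString(userComment, 0, HeaderLength)
            .TrimEnd('\0', ' ');

        // Simplification: anything other than UNICODE is read as ASCII.
        Encoding encoding = code == "UNICODE" ? unicodeEncoding : Encoding.ASCII;

        int count = userComment.Length - HeaderLength;
        return encoding.GetString(userComment, HeaderLength, count).TrimEnd('\0');
    }
}
```

For the file in this report, passing `Encoding.Unicode` as the second argument would correspond to the little-endian decoding that was observed to work.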