drewnoakes / metadata-extractor

Extracts Exif, IPTC, XMP, ICC and other metadata from image, video and audio files
Apache License 2.0

Wrong charset when CodedCharacterSet=ESC - A #614

Closed kenwa closed 1 year ago

kenwa commented 1 year ago

According to https://en.wikipedia.org/wiki/ISO/IEC_2022#cite_note-14.3.2-90 ISO-8859-1 should be used both when CodedCharacterSet is

Currently, only the first two syntaxes are supported.

The fix seems to be as simple as adding a new constant to Iso2022Converter

private static final byte MINUS_SIGN = 0x2D;

and adding an extra if clause to com.drew.metadata.iptc.Iso2022Converter#convertISO2022CharsetToJavaCharset

if (bytes.length > 2 && bytes[0] == ESC && bytes[1] == MINUS_SIGN && bytes[2] == LATIN_CAPITAL_A) return ISO_8859_1;

The Iso2022ConverterTest.java should also be extended with

assertEquals("ISO-8859-1", Iso2022Converter.convertISO2022CharsetToJavaCharset(new byte[]{0x1B, (byte)0x2D, (byte)0x41}));

A pull request has been created: https://github.com/drewnoakes/metadata-extractor/pull/615
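
For readers without the library source at hand, the proposed check can be sketched as a standalone class. This is only an illustration, not the actual Iso2022Converter: the class name CodedCharacterSetSketch and the method toJavaCharsetName are hypothetical, and the UTF-8 branch (ESC % G, the standard ISO 2022 designation for UTF-8) is included for contrast rather than taken from this thread.

```java
// Hypothetical standalone sketch; NOT the library's Iso2022Converter.
final class CodedCharacterSetSketch {
    private static final byte ESC = 0x1B;
    private static final byte MINUS_SIGN = 0x2D;      // '-'
    private static final byte PERCENT_SIGN = 0x25;    // '%'
    private static final byte LATIN_CAPITAL_A = 0x41; // 'A'
    private static final byte LATIN_CAPITAL_G = 0x47; // 'G'

    /**
     * Maps the raw CodedCharacterSet bytes to a Java charset name,
     * or returns null when the escape sequence is not recognised here.
     */
    static String toJavaCharsetName(byte[] bytes) {
        // Every sequence handled below is ESC plus two more bytes.
        if (bytes == null || bytes.length < 3 || bytes[0] != ESC) {
            return null;
        }
        // ESC - A: the case reported in this issue.
        if (bytes[1] == MINUS_SIGN && bytes[2] == LATIN_CAPITAL_A) {
            return "ISO-8859-1";
        }
        // ESC % G: UTF-8 (assumption: shown for contrast only).
        if (bytes[1] == PERCENT_SIGN && bytes[2] == LATIN_CAPITAL_G) {
            return "UTF-8";
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(toJavaCharsetName(new byte[]{0x1B, 0x2D, 0x41}));
    }
}
```

Returning null for unknown sequences leaves the fallback policy to the caller, which is a design choice of this sketch and may differ from what the library does.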

drewnoakes commented 1 year ago

Thanks for the bug report and for the PR to fix it.

Are you able to share an image that reproduces this issue, so that we can add it to the public regression test data set?

kenwa commented 1 year ago

Sure! Due to access rights I cannot share the image where I found the issue, but created an image with a similar problem.

The image has CodedCharacterSet=ESC - A and a headline containing some French characters: Headline=l'Affiche présentait étaient.

(attached image: test)
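
To illustrate why the charset matters for a headline like this one, here is a small sketch of the symptom. It assumes the headline bytes are stored as ISO-8859-1 and that a reader which does not recognise the escape sequence decodes them as UTF-8; that fallback is an assumption made for the demonstration, not a statement about what the library does. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo of the symptom; not code from the library.
final class HeadlineDecodingDemo {
    // Headline from the comment above; \u00e9 is the accented 'e'.
    static final String HEADLINE = "l'Affiche pr\u00e9sentait \u00e9taient";

    // The bytes as they would be stored when written in ISO-8859-1.
    static byte[] storedBytes() {
        return HEADLINE.getBytes(StandardCharsets.ISO_8859_1);
    }

    static String decodeAsLatin1(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    // Assumed wrong fallback: a lone 0xE9 byte is not valid UTF-8,
    // so Java substitutes the replacement character U+FFFD.
    static String decodeAsUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] raw = storedBytes();
        System.out.println(decodeAsLatin1(raw).equals(HEADLINE));    // true
        System.out.println(decodeAsUtf8(raw).contains("\uFFFD"));    // true
    }
}
```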

drewnoakes commented 1 year ago

Thanks very much! I ported your fix to the .NET implementation in https://github.com/drewnoakes/metadata-extractor-dotnet/pull/335 and added your sample image in https://github.com/drewnoakes/metadata-extractor-images/commit/31209ed64b24fa593cc3538e7997c29c09d6ed51.