Exiv2 doesn't correctly handle characters outside the Basic Multilingual Plane

Carnildo commented 4 years ago

Describe the bug: When setting an EXIF comment ("Exif.Photo.UserComment") to a value that contains characters outside the Basic Multilingual Plane, the bytes are written to the file unconverted, which produces incorrect Unicode in the file.

To Reproduce: The following code, tested with Exiv2 0.27.3, demonstrates the problem.

#include <exiv2/exiv2.hpp>
#include <string>

int main()
{
    // Open the image and load its existing metadata.
    auto exivImage = Exiv2::ImageFactory::open("img.jpg", false);
    exivImage->readMetadata();
    auto exif = exivImage->exifData();
    // The letters "EX" followed by U+1F600 (a smiley) as four UTF-8 bytes.
    exif["Exif.Photo.UserComment"] = std::string("charset=\"Unicode\" EX\xf0\x9f\x98\x80");
    exivImage->setExifData(exif);
    exivImage->writeMetadata();
    return 0;
}

It will produce the following output on the command line:

Warning: iconv: Invalid or incomplete multibyte or wide character (errno = 84) inbytesleft = 4

Windows 10 File Explorer and exiftool agree that the comment that was written to the file is three Unicode characters: "塅鿰肘"
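
One plausible reading of that result, sketched in Python for easy checking: if the six raw UTF-8 bytes end up in the tag unchanged and a reader decodes them as little-endian UTF-16, exactly three 16-bit code units come out. The byte order and the variable names are assumptions for illustration, not Exiv2 API. This would also fit the warning's "inbytesleft = 4", i.e. the four emoji bytes being left unconverted.

```python
# Bytes handed to Exiv2 by the test program: "EX" plus U+1F600 encoded as UTF-8.
raw = b"EX\xf0\x9f\x98\x80"

# Assumption: the bytes were stored unconverted, and the reader decodes the
# comment body as little-endian UTF-16 (two bytes per code unit).
misread = raw.decode("utf-16-le")

code_points = [f"U+{ord(ch):04X}" for ch in misread]
print(code_points)  # ['U+5845', 'U+9FF0', 'U+8098']
```

The first code point, U+5845, is simply the ASCII bytes "E" (0x45) and "X" (0x58) read as one 16-bit unit.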

Example 2x2-pixel image with incorrect metadata:

Expected behavior: Either:

The string gets converted to UTF-16 and stored in the EXIF data, or
An exception is thrown, since Unicode code points outside the BMP are not valid UCS-2.
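
The two options can be sketched in Python as follows. This is a minimal illustration of the behaviours being asked for, not Exiv2 code; the function names are hypothetical and little-endian byte order is assumed.

```python
def encode_utf16le(text: str) -> bytes:
    """Option 1: store UTF-16, using a surrogate pair for code points above U+FFFF."""
    return text.encode("utf-16-le")


def encode_ucs2le_strict(text: str) -> bytes:
    """Option 2: refuse any code point that does not fit in one 16-bit unit."""
    for ch in text:
        if ord(ch) > 0xFFFF:
            raise ValueError(f"U+{ord(ch):X} is outside the BMP; not valid UCS-2")
    # For BMP-only text, UTF-16 and UCS-2 produce the same bytes.
    return text.encode("utf-16-le")


comment = "EX\U0001F600"
utf16 = encode_utf16le(comment)
print(utf16.hex())  # 450058003dd800de  (E, X, then surrogates D83D DE00)
```

Calling encode_ucs2le_strict(comment) raises ValueError, while encode_ucs2le_strict("EX") succeeds.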

Desktop:

OS: Linux
Compiler & Version: GCC 9.3.0

clanmills commented 4 years ago

This is very helpful. Thank you very much. As a native English speaker, I have no feel for Unicode, code pages and the other magic involved. I understand 7-bit ASCII and little else.

I'm very pleased to say that @LeoHsiao1 has recently done a lot of work on our test suite. As he is Chinese, I hope he'll be able to comment and investigate this matter.

From your use of terms such as "Basic Multilingual Plane", you sound knowledgeable about this topic. I would very much appreciate help with this matter. The code involved isn't complicated, as Exiv2 delegates to the iconv library.

With the right people on this task, I am confident of success.

clanmills commented 4 years ago

Some comments about this. I think the syntax is:

exif["Exif.Photo.UserComment"] = std::string("charset=Unicode EX\xf0\x9f\x98\x80");

659 rmills@rmillsmbp:~/temp $ curl -LO --silent https://clanmills.com/Stonehenge.jpg
660 rmills@rmillsmbp:~/temp $ exiv2 -g Comment Stonehenge.jpg 
Exif.Photo.UserComment                       Undefined  44  charset=Ascii                                     
661 rmills@rmillsmbp:~/temp $ exiv2 -M"set Exif.Photo.UserComment charset=Unicode EX\xf0\x9f\x98\x80" Stonehenge.jpg 
662 rmills@rmillsmbp:~/temp $ exiv2 -g Comment Stonehenge.jpg 
Exif.Photo.UserComment                       Undefined  44  charset=Unicode EX\xf0\x9f\x98\x80
663 rmills@rmillsmbp:~/temp $ exiftool Stonehenge.jpg | grep -i comment
User Comment                    : EX\xf0\x9f\x98\x80
664 rmills@rmillsmbp:~/temp $

Please don't regard this as a denial that there could be problems with our Unicode and other charset handling.

I think it has encoded the 18-byte string "EX\xf0\x9f\x98\x80" (the \x escapes were stored as literal text, not interpreted) as 36 Unicode bytes + 8 bytes for the charset definition. I can dump the "raw data" with the program tvisitor (which is in my book) https://clanmills.com/exiv2/book/

665 rmills@rmillsmbp:~/temp $ tvisitor -pR Stonehenge.jpg | grep -i comment
         428 | 0x9286 Exif.Photo.UserComment       | UNDEFINED |       44 |       820 | UNICODE_E_X_\_x_f_0_\_x_9_f_\_x_9_8_ +++
666 rmills@rmillsmbp:~/temp $
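
The size of 44 can be checked with a short Python sketch. It assumes, as the exiv2 output above suggests, that the \x escapes reached Exiv2 as literal characters and that each one was then stored as a 16-bit unit after an 8-byte charset header.

```python
# The escapes arrive as literal text: backslash, 'x', and two hex digits each.
literal = r"EX\xf0\x9f\x98\x80"

header_len = 8                          # charset identifier preceding the comment body
body = literal.encode("utf-16-le")      # every ASCII character becomes two bytes

print(len(literal), len(body), header_len + len(body))  # 18 36 44
```

So this test exercised plain ASCII text rather than the smiley from the original report.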

I have no expertise in working with character sets. We'll need a specialist to help with this. Perhaps @LeoHsiao1 knows what's involved.

clanmills commented 4 years ago

It's not hopeless. Something is working when I cut'n'paste your bamboo poles 塅

672 rmills@rmillsmbp:~/temp $ exiv2 -M"set Exif.Photo.UserComment charset=Unicode 塅" Stonehenge.jpg 
673 rmills@rmillsmbp:~/temp $ 
673 rmills@rmillsmbp:~/temp $ exiv2 -g Comment Stonehenge.jpg 
Exif.Photo.UserComment                       Undefined  10  charset=Unicode 塅
674 rmills@rmillsmbp:~/temp $ tvisitor -pR Stonehenge.jpg | grep -i comment
         428 | 0x9286 Exif.Photo.UserComment       | UNDEFINED |       10 |       820 | UNICODE_EX
675 rmills@rmillsmbp:~/temp $
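
The "UNICODE_EX" in the raw dump follows from the code point itself, as this small Python sketch shows. It assumes the pasted character is U+5845 and that the file is little-endian, which the "II" markers in the tvisitor output elsewhere in this thread suggest.

```python
ideograph = "\u5845"                     # assumed code point of the pasted character
stored = ideograph.encode("utf-16-le")   # low byte first: 0x45 ('E'), then 0x58 ('X')

print(stored)            # b'EX'
print(8 + len(stored))   # 10
```

The character sits inside the Basic Multilingual Plane, so it fits in one 16-bit unit and does not exercise the reported problem.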

As documented in the man page exiv2.1, I can use this to encode "Robin" in Unicode:

charset=Unicode \u0052\u006f\u0062\u0069\u006e

677 rmills@rmillsmbp:~/temp $ exiv2 -g Comment Stonehenge.jpg 
Exif.Photo.UserComment                       Undefined  10  charset=Unicode 饕
678 rmills@rmillsmbp:~/temp $ exiftool Stonehenge.jpg | grep -i comment
User Comment                    : 饕
679 rmills@rmillsmbp:~/temp $

clanmills commented 4 years ago

Everything seems to be working OK. Unicode \u2103 is the Degree Celsius sign (℃). https://en.wikipedia.org/wiki/Degree_symbol

736 rmills@rmillsmbp:~/temp $ exiv2 -M"set Exif.Photo.UserComment charset=Unicode It's 18 ℃ outside" Stonehenge.jpg ;exiv2 -g Comment Stonehenge.jpg 
Exif.Photo.UserComment                       Undefined  42  charset=Unicode It's 18 ℃ outside

I am using the program dmpf from my book to confirm that macOS Terminal did insert the correct Unicode.

737 rmills@rmillsmbp:~/temp $ tvisitor -pR Stonehenge.jpg  | grep -e Comment -e Stonehenge.jpg 
STRUCTURE OF JPEG FILE (II): Stonehenge.jpg
  STRUCTURE OF TIFF FILE (II): Stonehenge.jpg:12->15286
    STRUCTURE OF TIFF FILE (II): Stonehenge.jpg:12->15286
      STRUCTURE OF TIFF FILE (II): Stonehenge.jpg:12->15286:830->3142
      END: Stonehenge.jpg:12->15286:830->3142
         428 | 0x9286 Exif.Photo.UserComment       | UNDEFINED |       42 |      3972 | UNICODE_I_t_'_s_ _1_8_ _.! _o_u_t_s_ +++
    END: Stonehenge.jpg:12->15286
    STRUCTURE OF TIFF FILE (II): Stonehenge.jpg:12->15286
    END: Stonehenge.jpg:12->15286
  END: Stonehenge.jpg:12->15286
  STRUCTURE OF 8BIM FILE (MM): Stonehenge.jpg:17928->78
    STRUCTURE OF IPTC FILE (MM): Stonehenge.jpg:17928->78:12->39
    END: Stonehenge.jpg:17928->78:12->39
  END: Stonehenge.jpg:17928->78
END: Stonehenge.jpg
738 rmills@rmillsmbp:~/temp $ dmpf Stonehenge.jpg --skip=$((12+3972)) --count=60 --width=20 bs=2
   0xf90     3984: UNICODE_I_t_'_s_ _1_  ->  4e55 4349 444f   45   49   74   27   73   20   31
   0xfa4     4004: 8_ _.! _o_u_t_s_i_d_  ->    38   20 2103   20   6f   75   74   73   69   64
                                                       ----
   0xfb8     4024: e_._.__....___.___09  ->    65    2    2  100  201    1    0    1    0 3930

I can use \u2103 to put more Degrees Celsius and \u2109 for Degrees Fahrenheit.

739 rmills@rmillsmbp:~/temp $ exiv2 -M"set Exif.Photo.UserComment charset=Unicode 1\u2103 degreesC == 1.8\u2109 degreesF" Stonehenge.jpg ;exiv2 -g Comment Stonehenge.jpg 
Exif.Photo.UserComment                       Undefined  64  charset=Unicode 1℃ degreesC == 1.8℉ degreesF
740 rmills@rmillsmbp:~/temp $
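
The reported sizes of 42 and 64 are consistent with one 16-bit unit per character plus an 8-byte header, as this Python sketch checks. The 8-byte header and little-endian order are taken from the dumps above; the variable names are illustrative.

```python
comments = ["It's 18 \u2103 outside", "1\u2103 degreesC == 1.8\u2109 degreesF"]

# Each BMP character occupies exactly two bytes in UTF-16.
sizes = [8 + len(text.encode("utf-16-le")) for text in comments]
print(sizes)  # [42, 64]

# The degree Celsius sign is the single unit 0x2103, stored low byte first.
celsius = "\u2103".encode("utf-16-le")
print(celsius.hex())  # 0321
```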

This appears to work well. I'll wait for @LeoHsiao1 to comment before closing this.

Carnildo commented 4 years ago

You've only been testing with Basic Multilingual Plane codes (U+FFFF and lower). The problem arises with the supplementary planes (U+10000 and above). Most of the characters in this range are obscure languages or little-used Chinese ideographs, but the range U+1F300 to U+1FAFF contains emoji and other symbols.
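
The distinction can be made concrete with a few lines of Python. The helper names are hypothetical; the emoji range is the one quoted in the comment above.

```python
def plane(code_point: int) -> int:
    """Unicode plane number: 0 is the BMP, 1 and above are supplementary planes."""
    return code_point >> 16


def in_quoted_emoji_range(code_point: int) -> bool:
    """True for the U+1F300..U+1FAFF range mentioned above."""
    return 0x1F300 <= code_point <= 0x1FAFF


samples = {"E": ord("E"), "celsius": 0x2103, "smiley": 0x1F600}
planes = {name: plane(cp) for name, cp in samples.items()}
print(planes)  # {'E': 0, 'celsius': 0, 'smiley': 1}
```

Only the smiley lands outside plane 0, which is why the earlier tests with ℃ and ℉ passed.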

clanmills commented 4 years ago

I really don't know if I can help with this as it's totally beyond my limited skills in this area. Let's hear what @LeoHsiao1 has to say.

I did try unsuccessfully to use the emoji characters this afternoon. Then I thought of messing with ℃ and that worked.

If this concerns "obscure languages", why are you reporting this? What is your use case?

Carnildo commented 4 years ago

I'm working on a photo organizer, and discovered the bug while testing sticking a smiley face into a photo description.

clanmills commented 4 years ago

OK. Let's hear what @LeoHsiao1 has to say. Exiv2 is about metadata. Exiv2 delegates unicode to libiconv.

tester0077 commented 4 years ago

As this topic is of great interest to me, I did a few tests to verify my own understanding of this field and its content.

From reading the specs, I had concluded that, for the field to contain anything other than plain ASCII, it had to start with the string "UNICODE" in plain ASCII, followed by the UTF-16 Unicode string.

My big problem was that I had been unable to find images with such Unicode comments, especially images which I felt confident enough that they actually would contain valid user comments.

My test with the given 'image' was to open it in a hex editor and edit the string of interest until WPMeta, the one utility I trust for this at this stage, showed the expected output.

FWIW, for the modified file it shows 2 smilies, because that is what the bytes I entered represent, while for the original file it shows similar characters (Chinese, I assume) as in the original post. At this stage, I am not 100% sure this is the 'correct' way, partly because Exiftool 11.63 shows nothing for the UserComment of either the original or my modified image, while for my modified image Exiv2 0.27.3 gives

Copyright : Exif comment : charset=Unicode Ôÿ¦Ôÿ¦

As for the test program shown by Carnildo, it obviously assumes that Exiv2 will expect a UTF-8 string.

To be continued, I am sure :-)

Attached are my modified image as well as a screenshot of the hex editor data for the relevant section.

Arnold

Carnildo commented 4 years ago

A literal reading of the EXIF 2.3 standard implies that it uses the Unicode standard from 1991, which would be version 1.0. We really, really don't want to follow the standard in that regard: Unicode 1.0 doesn't support Chinese, has different encodings for Korean and Tibetan, doesn't support BiDi formatting, and is largely incompatible with later versions of Unicode.

The question becomes how far Exiv2 should go with ignoring the standard, which in turn becomes a question of how far other EXIF software has gone with ignoring the standard. From a practical standpoint, this comes down to encoding Unicode text as UCS-2 versus encoding it as UTF-16.

UTF-16 permits the full range of Unicode characters, but may be incompatible with readers that expect UCS-2 (if there are any -- UTF-16 was introduced with Unicode 2.0, in 1996), or with poorly-written software that assumes that one 16-bit Unicode value equals one character.

UCS-2 should be compatible with everything that supports Unicode, but only permits the first 65,536 Unicode characters.
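
What a reader that assumes "one 16-bit value equals one character" would see can be sketched in Python. This illustrates the compatibility concern only; it does not describe any particular EXIF reader.

```python
import struct

data = "Q\U0001F600".encode("utf-16-le")

# Split the stored bytes into 16-bit units, as a UCS-2-only reader would.
units = struct.unpack(f"<{len(data) // 2}H", data)
print([hex(u) for u in units])  # ['0x51', '0xd83d', '0xde00']

# Units in D800..DFFF are surrogates: halves of a pair, not characters.
surrogates = [u for u in units if 0xD800 <= u <= 0xDFFF]
print(len(units), len(surrogates))  # 3 2
```

A UTF-16 aware reader combines the two surrogates back into the single code point U+1F600.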

clanmills commented 4 years ago

Thanks, Arnold. Your observation "to be continued" makes me nervous! We need to recruit an expert in this field.

clanmills commented 4 years ago

Enough! This matter is closed. In 12 years of working on Exiv2, this is the first time I have closed an issue because it is outside the scope of the project.

The Exiv2 project is a cross-platform C++ library for 4 metadata standards in about 20 image formats. It supports Unicode for UserComment (and two other tags) by delegating to iconv.

Without an expert to investigate other scenarios and "obscure" languages, nothing further can be done.

Carnildo commented 4 years ago

This is solidly within the scope: the function call exif["Exif.Photo.UserComment"] = std::string("charset=\"Unicode\" Q\xf0\x9f\x98\x80"); causes the Exiv2 library to produce a file that is not valid under any reading of the EXIF 2.3 standard. I've proposed two fixes (either throw an error, or store the string as UTF-16 rather than UCS-2); I don't know which is better.

clanmills commented 4 years ago

Are you offering to work on the code to deal with this?

clanmills commented 4 years ago

I've fixed it as follows. However iconv issues a warning:

773 rmills@rmillsmbp:~/gnu/github/exiv2/0.27-maintenance/build $ bin/exiv2 -M"set Exif.Photo.UserComment charset=Unicode Smile: 😀" ~/temp/Stonehenge.jpg ;exiv2 -g Comment ~/temp/Stonehenge.jpg 
Warning: iconv: Invalid argument (errno = 22) inbytesleft = 1
Exif.Photo.UserComment                       Undefined  19  Warning: iconv: Invalid argument (errno = 22) inbytesleft = 1
charset=Unicode Smile: 😀
774 rmills@rmillsmbp:~/gnu/github/exiv2/0.27-maintenance/build $

774 rmills@rmillsmbp:~/gnu/github/exiv2/0.27-maintenance/build $ git diff
diff --git a/src/value.cpp b/src/value.cpp
index 5bd815e2..ab21d5be 100644
--- a/src/value.cpp
+++ b/src/value.cpp
@@ -511,7 +511,7 @@ namespace Exiv2 {
         }
         if (charsetId == unicode) {
             const char* to = byteOrder_ == littleEndian ? "UCS-2LE" : "UCS-2BE";
-            convertStringCharset(c, "UTF-8", to);
+            convertStringCharset(c, "UTF-16", to);
         }
         const std::string code(CharsetInfo::code(charsetId), 8);
         return StringValueBase::read(code + c);
775 rmills@rmillsmbp:~/gnu/github/exiv2/0.27-maintenance/build $
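
One plausible reading of the warning, sketched in Python: the patch changes the source encoding label, so the converter is told that the incoming bytes are already UTF-16, while the shell still delivers UTF-8. The example assumes little-endian interpretation and only mimics the conversion step; it is not the Exiv2 or iconv code path.

```python
text = "Smile: \U0001F600"
utf8 = text.encode("utf-8")          # what the command line actually passes in
print(len(utf8), 8 + len(utf8))      # 11 19

# Assumption: with the source labelled UTF-16, the bytes are read two at a
# time; an odd byte count leaves one byte that cannot form a code unit.
try:
    utf8.decode("utf-16-le")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False
print(decoded_ok, len(utf8) % 2)     # False 1
```

The leftover byte would match "inbytesleft = 1", and 8 + 11 = 19 matches the reported size, which suggests the conversion failed and the UTF-8 bytes were stored unchanged. The alternative proposed earlier in the thread is to keep UTF-8 as the source and change the target from UCS-2 to UTF-16.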

As I said yesterday, I need specialist help to deal with this.

clanmills commented 4 years ago

I am reopening this issue, removing it from the 0.27.4 milestone and I am no longer assigned to this issue.

tester0077 commented 4 years ago

@Carnildo: things are not as simple as you seem to think. Exiv2 does not live in its own world where it can do whatever the coders (or some users) think appropriate. Even though the Exif 2.3 spec may be ambiguous, Robin, and whoever else modifies the code, has to stick to the spec because other users depend on that commitment. Perhaps the documentation does not spell out the restrictions imposed by this particular field, but in that case it is the documentation which needs updating.

LeoHsiao1 commented 4 years ago

I just saw this issue today. I am not familiar with the Basic Multilingual Plane. As a Chinese user, the encoding format I use most often is UTF-8. Exiv2 supports UTF-8 characters, which has satisfied almost all my needs. Otherwise I wouldn't have continued to use exiv2.

clanmills commented 4 years ago

Thank You @LeoHsiao1 and Thank You @tester0077 for your feedback and input on this issue.

There was a discussion last month concerning charset=Unicode and Chinese. I believe the comments here by @LeoHsiao1 confirm that our support works adequately. https://github.com/Exiv2/exiv2/issues/1258#issuecomment-669916199

LeoHsiao1 commented 4 years ago

I developed my own version of pyexiv2, which converts Unicode strings to bytes and then saves them to the image.

C++ principle code:

std::string key = py::bytes("something");
std::string value = py::bytes("something");
exifData[key] = value;

The Python example:

>>> import pyexiv2
>>> img = pyexiv2.Image(r'D:\test.png')
>>> img.modify_exif({'Exif.Image.ImageDescription': 'test-中文-'}, encoding='UTF-8')
>>> img.read_exif(encoding='UTF-8')
{'Exif.Image.ImageDescription': 'test-中文-'}

tester0077 commented 4 years ago

As far as I understand things now, the problem reported really only relates to UserComment, which the Exif spec treats very differently.

Another aspect of this report relates to a question of how one should present input strings to Exiv2 when they are intended for that field.

FWIW, I am a bit confused as to where and how to respond best to this and related issues, because discussions relating to it/them, overlap several threads.

As well, the more I read and work with these issues, the more nuances become apparent. The issue raised in #1258, related to Exif.Image.Artist, IMO, also needs to take into account the software which (presumably) added the string.

What I find curious is that not one of these 'editors' included the Software string, except the Android version, which announces itself as "Picasa".

Only some of them include the ExifVersion, though whether they apply the spec accordingly, I am still unsure.

Then again, if the originating software does not write the expected string in the correct format, it is unfair, though somehow flattering, to blame Exiv2 for making that apparent, even if, perchance, its rendition is also flawed.

At this point I have taken to viewing the data mainly in hex format and find the differences 'interesting', though I have not really been able to come to any final conclusion, because my understanding of what there ought to be is evolving as I see more data.

@Robin: you refer to tvisitor in several places in the correspondence as well as in the document you are writing (which lists some of the functions used in the app). Is there a compiled or compilable version of the full program available?

Arnold

clanmills commented 4 years ago

Arnold. There are three "Comment" Exif tags and they are UserComment, GPSProcessingMethod and GPSAreaInformation. They are in fact stored as an undefined byte stream. The first 8 bytes define the charset, followed by the encoded byte stream.

For example:

1143 rmills@rmillsmbp:~/temp $ curl -OL --silent https://clanmills.com/Stonehenge.jpg
1145 rmills@rmillsmbp:~/temp $ exiv2 -M'set Exif.Photo.UserComment charset=Unicode Robin' Stonehenge.jpg 
1146 rmills@rmillsmbp:~/temp $ exiv2 -g UserComment Stonehenge.jpg 
Exif.Photo.UserComment                       Undefined  18  charset=Unicode Robin   # 18 = 8 + 5x2
1147 rmills@rmillsmbp:~/temp $ exiv2 -M'set Exif.Photo.UserComment charset=Ascii Robin' Stonehenge.jpg 
1148 rmills@rmillsmbp:~/temp $ exiv2 -g UserComment Stonehenge.jpg 
Exif.Photo.UserComment                       Undefined  13  charset=Ascii Robin # 13 = 8 + 5
1149 rmills@rmillsmbp:~/temp $
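
The layout just described can be sketched in Python. The identifier byte values are my reading of the Exif specification's character code table, the helper name is hypothetical, and little-endian order is assumed for the Unicode case.

```python
# 8-byte character code identifiers, padded with NUL bytes (as I read the Exif spec).
CHARSET_IDS = {
    "Ascii": b"ASCII\x00\x00\x00",
    "Unicode": b"UNICODE\x00",
}


def user_comment_payload(charset: str, text: str) -> bytes:
    """Build header + body for a comment tag (illustrative, BMP text only)."""
    if charset == "Unicode":
        body = text.encode("utf-16-le")   # two bytes per BMP character
    else:
        body = text.encode("ascii")       # one byte per character
    return CHARSET_IDS[charset] + body


print(len(user_comment_payload("Unicode", "Robin")))  # 18
print(len(user_comment_payload("Ascii", "Robin")))    # 13
```

Both lengths agree with the sizes exiv2 printed above.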

The point that has been made by @Carnildo about accepting smileys and other Unicode strings outside the Basic Multilingual Plane may be worthy of attention; however, I don't intend to work on this, as I know almost nothing about Unicode, JIS, code pages and related technology. To make progress with this, I believe we need a specialist on the team.

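The Exif.Image.Artist tag is defined in the Adobe TIFF 6.0 specification as follows (screenshot): https://user-images.githubusercontent.com/529982/92511632-7d598b00-f205-11ea-9a3f-0029931742c5.png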

Exiv2 allows you to enter non 7-bit ascii characters into "Ascii" tags. For example:

1155 rmills@rmillsmbp:~/temp $ exiv2 -M'set Exif.Image.Artist Copyright © 2020' Stonehenge.jpg 
1156 rmills@rmillsmbp:~/temp $ exiv2 -g UserComment Stonehenge.jpg 
Exif.Photo.UserComment                       Undefined  13  charset=Ascii Robin
1157 rmills@rmillsmbp:~/temp $ exiv2 -g Artist Stonehenge.jpg 
Exif.Image.Artist                            Ascii      18  Copyright © 2020
1158 rmills@rmillsmbp:~/temp $

1159 rmills@rmillsmbp:~/temp $
1160 rmills@rmillsmbp:~/temp $ echo © | dmpf -
       0        0: ...                               ->  c2 a9 0a
1161 rmills@rmillsmbp:~/temp $

The smiley (a 4 byte character) survives similar treatment:

1162 rmills@rmillsmbp:~/temp $ exiv2 -M'set Exif.Image.Artist Smile 😀 please' Stonehenge.jpg 
1163 rmills@rmillsmbp:~/temp $ exiv2 -g Artist Stonehenge.jpg 
Exif.Image.Artist                            Ascii      18  Smile 😀 please
1164 rmills@rmillsmbp:~/temp $
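
Both Artist values report a size of 18, which is consistent with the UTF-8 bytes being stored unchanged plus one terminating NUL. The Python sketch below checks that arithmetic; the trailing NUL is an assumption based on how TIFF ASCII values are normally counted.

```python
artists = ["Copyright \u00a9 2020", "Smile \U0001F600 please"]

for artist in artists:
    utf8 = artist.encode("utf-8")
    # characters, UTF-8 bytes, bytes including an assumed NUL terminator
    print(len(artist), len(utf8), len(utf8) + 1)
# 16 17 18
# 14 17 18

copyright_sign = "\u00a9".encode("utf-8")
print(copyright_sign.hex())  # c2a9
```

No conversion takes place here, so the smiley survives as its four UTF-8 bytes, unlike the UserComment case where a conversion to 16-bit units is attempted.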

The code for tvisitor.cpp (and utilities dmpf.cpp and args.cpp) is documented in my book and available (source only) from svn://dev.exiv2.org/svn/team/book The book is available at https://clanmills.com/exiv2/book Please beware those materials are "work in progress" and change frequently. Your feedback is welcome and appreciated, however I don't provide support and hope to avoid discussion while the book is in development. These utilities are written in C++11 and I believe they build on most desktop platforms. They are "single file" programs with no dependencies.

tester0077 commented 4 years ago

Hi Robin,

all of this seems to be a work in progress, for myself in any case.

On 2020-09-08 11:22 AM, Robin Mills wrote:

Arnold. There are three "Comment" Exif tags and they are UserComment, GPSProcessingMethod and GPSAreaInformation. They are in fact stored as an undefined byte stream. The first 8 bytes define the charset, followed by the encoded byte stream.

Understood, though for now, my focus is mostly on UserComment. The other 2 I am not as familiar with as I'd like, though from what I see in the Exif 2 spec, they seem to be of the same kind.

FWIW, testing the results of Exiv2 modifying data in a file using the output from Exiv2 is not really conclusive of anything but the fact that Exiv2's reading and writing of metadata is consistent. :-(

The point that has been made by @Carnildo about accepting smileys and other Unicode strings outside the Basic Multilingual Plane may be worthy of attention; however, I don't intend to work on this, as I know almost nothing about Unicode, JIS, code pages and related technology. To make progress with this, I believe we need a specialist on the team.

Understood as well.

FWIW & IMO, the 'Artist' field is not intended (going by the spec you quote) to receive anything but ASCII. If some app allows the user to enter characters from an extended character set, then the problem is really with that app not following the spec, and certainly not with Exiv2.

The Exif.Image.Artist is defined in the Adobe Tiff 6.0 specification as follows: screenshot_31 https://user-images.githubusercontent.com/529982/92511632-7d598b00-f205-11ea-9a3f-0029931742c5.png

Exiv2 allows you to enter non 7-bit ascii characters into "Ascii" tags. For example:

1155 rmills@rmillsmbp:~/temp $ exiv2 -M'set Exif.Image.Artist Copyright © 2020' Stonehenge.jpg
1156 rmills@rmillsmbp:~/temp $ exiv2 -g UserComment Stonehenge.jpg
Exif.Photo.UserComment                       Undefined  13  charset=Ascii Robin
1157 rmills@rmillsmbp:~/temp $ exiv2 -g Artist Stonehenge.jpg
Exif.Image.Artist                            Ascii      18  Copyright © 2020
1158 rmills@rmillsmbp:~/temp $

Again, Exiv2 may allow this sort of data entry and reproduce it on output, but that does not show that this is what the spec intended; rather, it confirms that some apps interpret the spec according to (the best of) their understanding or preference. The images provided by norbertj42 https://github.com/norbertj42 give ample evidence of that notion.

The code for tvisitor.cpp (and utilities dmpf.cpp and args.cpp) is documented in my book and available (source only) from svn://dev.exiv2.org/svn/team/book The book is available at https://clanmills.com/exiv2/book Please beware those materials are "work in progress" and change frequently. Your feedback is welcome and appreciated, however I don't provide support and hope to avoid discussion while the book is in development. These utilities are written in C++11 and I believe they build on most desktop platforms. They are "single file" programs with no dependencies.

Thank you, Robin.

Found the code and have compiled tvisitor under MSVC 2019 without much fuss. Had to do some casting and ignore a bunch of warnings, but it runs OK. Still have to sort out how to use the options, but that will come with time.

Arnold

clanmills commented 4 years ago

@tester0077 I've updated dmpf.cpp to build without warnings using msvc2019. I believe the other programs tvisitor/args/visitor are already "warning free".

parse.cpp produces many many warnings. It's fine to ignore those warnings because you should have no reason to use parse.exe. That is Dave Coffin's code and is currently in the repos (and built by Xcode) as I am using it to understand and document CRW. It will be removed before the book is finished. Dave's code is parse.c and mentioned in the book.

clanmills commented 3 years ago

I'm closing this issue. I don't believe we have the necessary skills to pursue this matter.

I'm delighted to say that Exiv2 has a team of 8 enthusiastic contributors and I am working on a plan to release v1.00 on 2021-12-15. We don't have the skills in the team to work with 'characters outside the Basic Multilingual Plane'.

Exiv2 doesn't correctly handle characters outside the Basic Multilingual Plane #1279