drewnoakes / metadata-extractor

Extracts Exif, IPTC, XMP, ICC and other metadata from image, video and audio files
Apache License 2.0

Simple Exif XPTitle read corrupted. #528

Closed garretwilson closed 3 years ago

garretwilson commented 3 years ago

I'm using metadata-extractor 2.15.0. I have a small input JPEG image containing no metadata sections whatsoever. I use Apache Commons Imaging 1.0-alpha2 to add two simple Exif IFD0 properties using ExifRewriter().updateExifMetadataLossy().

metadata-extractor correctly reads the Copyright value, but the value of XPTitle is parsed as "慇整愠摮吠牵敲�"

This is disheartening, as this is nearly the most simple test case possible.

Note that IrfanView 4.54 can read the XPTitle just fine! However ExifTool 12.16 (via ExifToolGUI 5.16.0.0) reads the value as "慇整愠摮吠牵敲t". So something fishy is going on.

But who is correct? And what is it about this file that causes the problem to begin with? I've attached the test case image below.

[attached test image: gate-turret-exif-bad-title]

drewnoakes commented 3 years ago

What does the Windows file property dialog show for the tag?

garretwilson commented 3 years ago

What does the Windows file property dialog show for the tag?

The Windows file property dialog details panel shows nothing at all for the title.

garretwilson commented 3 years ago

I've confirmed that Metadata++ 1.22.14 also shows the same corrupted value, so I suspect that Apache Commons Imaging is simply broken. I've filed ticket IMAGING-281.

Still, I find it odd that IrfanView reads the correct value. Perhaps there is some common misinterpretation of the Exif format. It might be useful for metadata-extractor to know about this type of malformed data. And of course I'd like to verify exactly what the problem is.

drewnoakes commented 3 years ago

If there is a safe heuristic we can use to recover from this kind of corruption, we can apply it. By "safe", I mean we must reliably detect this situation so that we do not degrade handling of values that work correctly today.

garretwilson commented 3 years ago

Well, I'm interested in finding out specifically what the problem is first. I'm not a big fan of a library trying to make guesses and correct bad data, unless it's a situation that is common across tools and known for the format. I presented it here for more visibility (and also because at the beginning I had no idea which tool might be at fault). So we should probably learn more about what is happening before deciding whether metadata-extractor should try to "recover" in any way. Maybe it should produce an error. Or maybe it is reading the exact data that Apache Commons Imaging stored, in which case it is behaving correctly (although the case of IrfanView still puzzles me).

Nadahar commented 3 years ago

I'm not ready to dig into the Exif standard so that I can decipher the data manually, but I'm assuming that there are pointers to the data blocks where this data is supposed to be found. What I see immediately when opening the above image in a hex editor is that both texts exist very early in the file, encoded mostly using ASCII characters:

43 6F 70 79 72 69 67 68 74 20 C2 A9 20 32 30 30 39 20 47 61 72 72 65 74 20 57 69 6C 73 6F 6E 00 47 61 74 65 20 61 6E 64 20 54 75 72 72 65 74 00

This is merely two null-terminated strings following each other, with no separation other than the null terminator. The copyright field comes first. The big question is whether the pointer to the "second" field, XPTitle, actually points to the correct location, or whether the strings are just supposed to appear sequentially.

By accident, it is revealed that at least the copyright field is written using UTF-8. The reason is the © character, which is the only non-ASCII character used. It is encoded as C2 A9, which only makes sense as UTF-8: U+00A9.

Since there is no such character in the XPTitle field, it could be ASCII, UTF-8, Latin-1, or a whole host of other ASCII-based code pages. Most likely it's UTF-8 as well, though, and this could easily be tested by encoding a non-ASCII character in the title.

However, when I search the file for the UTF-8 encoding of the text that metadata-extractor and others parse, I fail to find it. The 慇 character would be encoded as E6 85 87, and this byte combination isn't found anywhere in the file. This undermines my hypothesis of an erroneous pointer, so something else must be going on.
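
For reference, both byte sequences are easy to double-check programmatically; a minimal sketch:

fun main() {
    fun utf8Hex(s: String) =
        s.toByteArray(Charsets.UTF_8).joinToString(" ") { "%02X".format(it.toInt() and 0xFF) }
    println(utf8Hex("©"))  // prints: C2 A9    (U+00A9)
    println(utf8Hex("慇")) // prints: E6 85 87 (U+6147)
}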

I guess one must dig into the parsing itself to figure it out. Off the top of my head, I'm thinking potential explanations could be that the field is decoded as if it were some Eastern code page, or that there is some endianness issue.

garretwilson commented 3 years ago

By accident, it is revealed that at least the copyright field is written using UTF-8. …

Let me give you a couple of tips about Exif to make understanding this whole encoding business a little easier. The very short summary is that the (old) Exif standard says that strings must be ASCII, but hardly anyone follows that, because that would be silly and hardly useful in modern times. In fact, a consortium of industry leaders (including Adobe) got together and gave recommendations, saying that really we should be using UTF-8 for Exif strings. And Apache Commons Imaging should be using UTF-8 for Exif strings. Microsoft Windows 10 Explorer uses UTF-8 for Exif strings. Metadata++ uses UTF-8 for Exif strings. TwelveMonkeys defaults to UTF-8 for Exif strings. The GUI for ExifTool defaults to UTF-8 for Exif strings. In 2021 we should all be using UTF-8 in Exif by default, even if some obscure program (and I'm sure there is one somewhere) uses something else.

However, when I search the file for the UTF-8 encoding of the text that metadata-extractor and others parse, I fail to find it.

You probably won't find it because of a much bigger problem in metadata-extractor, described in #270. metadata-extractor doesn't actually specify the charset explicitly, but uses whatever the default charset is (on the platform on which it happens to be running, something completely arbitrary and usually unrelated to the image) for converting bytes to a string (violating one of the foundational rules of converting bytes to characters). So instead just look for where metadata-extractor converts bytes to a string; don't look for any explicit indication of UTF-8, because the lack of such an indication is the problem. (And when you find where metadata-extractor makes this conversion, please let me know so I can fix the blocker issue #270 and file a pull request.)
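
To illustrate the pitfall concretely, here's a minimal sketch (the bytes are the start of the copyright string from the hex dump above):

import java.nio.charset.Charset

fun main() {
    // "Copyright ©" as UTF-8 bytes, per the hex dump earlier in this thread.
    val bytes = byteArrayOf(0x43, 0x6F, 0x70, 0x79, 0x72, 0x69, 0x67, 0x68, 0x74,
            0x20, 0xC2.toByte(), 0xA9.toByte())
    println(String(bytes, Charset.defaultCharset())) // result depends on the platform default
    println(String(bytes, Charsets.UTF_8))           // always "Copyright ©"
}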

The problem for this ticket, as you mentioned, is probably some sort of pointer error, missing separator, phase error, misalignment, or something similar that causes the data to be interpreted mid-stream in the UTF-8 stream. That's just a guess. But I'm almost certain it's a problem with Apache Commons Imaging. What's interesting is that when the Exif Description tag is used instead of XPTitle, it works fine. I notice that the tag ID of XPTitle is significantly higher; I wonder if that causes some sort of shifting or overflow or something.

Off the top of my head, I'm thinking potential explanations could be that the field is decoded as if it were some Eastern code page, or that there is some endianness issue.

I seriously doubt this is the issue. Surely Apache Commons Imaging uses common code for storing strings in Exif, and I just passed normal strings to a simple Apache Commons Imaging method to add Exif metadata to this image. Some sort of misaligned offset or corrupt data structure is more likely.

Nadahar commented 3 years ago

As to the issue in #270, this conversion doesn't take place in just one place; it's found all over the library, as I've pointed out in that issue. Finding these is very easy: just install FindBugs (older Java versions) or SpotBugs (newer Java versions) and let it scan the code. It will report all instances where the "default" charset is used in a conversion. Properly fixing them is a bit more work, though: one must know which charset to use explicitly in each situation.

If the rule is as you describe for Exif (basically: use UTF-8), that takes care of all the Exif conversions. Once that is determined, the fix is extremely simple: all of these conversion methods accept an additional parameter that should specify UTF-8. StandardCharsets.UTF_8 is the natural choice, but it was introduced in Java 7, so if Java 6 must be supported, another way is needed; Charset.forName("UTF-8") can probably be used in Java 6 as well.
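
Something like this (a trivial sketch just to show the two spellings):

import java.nio.charset.Charset
import java.nio.charset.StandardCharsets

fun main() {
    val bytes = "Gate and Turret".toByteArray(Charsets.UTF_8)
    println(String(bytes, StandardCharsets.UTF_8))   // Java 7+
    println(String(bytes, Charset.forName("UTF-8"))) // works on Java 6 as well
}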

Metadata-extractor will use Java's "default" charset for decoding, which should be the same for all fields. Since it successfully extracts the copyright field, which does contain a multi-byte UTF-8 character in your test, I would expect it to use UTF-8 to decode the XPTitle field as well. That's why I searched the file for the would-be UTF-8 encoding of the other text that was extracted. If a pointer were merely wrong, even misaligned, I should be able to find the sequence of bytes somewhere in the file. This is what puzzles me, and it indicates that something else must be going on.

What is the "default" codepage on the computer on which you did the extraction?

garretwilson commented 3 years ago

It looks like I found the source of the problem, and in this case it's because of something I was doing. Really, Microsoft is to blame, but I'll explain.

Reading the ExifTool FAQ on charsets, I found a little note:

The EXIF "XP" tags (XPTitle, XPComment, etc) are always stored internally as little-endian Unicode (UCS‑2) … .

UCS-2 is a two-byte encoding of Unicode. And sure enough, internally metadata-extractor's getWindowsTitleDescription() (like the methods for the other Microsoft XP tags) uses getUnicodeDescription(), which reads the value as UTF-16LE (technically not UCS-2, but "better", because it recognizes surrogate pairs and thus supports Unicode code points over 16 bits).

I went back to Apache Commons Imaging and sure enough, they have a TagInfoXpString tag info type:

Windows XP onwards store some tags using UTF-16LE, but the field type is byte - here we deal with this.

I had been using the Apache Commons Imaging TagInfoAscii type, which, as we discussed above, had been storing the value as UTF-8. I haven't taken the time to do so, but I am confident that if you try to read the UTF-8 encoding of "Gate and Turret" as UTF-16LE (i.e. taking two bytes at a time), you'll get the corrupted string we were seeing.
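
For anyone who wants to check it, a minimal sketch:

fun main() {
    // The UTF-8/ASCII bytes of "Gate and Turret" plus the null terminator (16 bytes),
    // decoded two bytes at a time as UTF-16LE.
    val bytes = "Gate and Turret\u0000".toByteArray(Charsets.UTF_8)
    println(String(bytes, Charsets.UTF_16LE)) // prints 慇整愠摮吠牵敲t — the corrupted string ExifTool showed
}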

Apparently, when Microsoft added these XP tags, it decided that the way to support characters above ASCII was to use UCS-2. Maybe UTF-8 wasn't popular yet. Or maybe Microsoft was just being Microsoft.

This answers all our questions except why IrfanView can read the title in UTF-8. Maybe IrfanView goes out of its way to detect UTF-8 before trying UTF-16LE. Or maybe IrfanView is coded incorrectly and only supports UTF-8. (I haven't tested it.)

In any case, it seems that both metadata-extractor and Apache Commons Imaging are working correctly; I was the one coding something incorrectly. That makes me happy, because it's easy enough to fix. I'll close this ticket, and you can mark it as invalid if you like.
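
For the record, the corrected write looks roughly like this (just a sketch, not my actual code; the file names are placeholders):

import java.io.File
import org.apache.commons.imaging.formats.jpeg.exif.ExifRewriter
import org.apache.commons.imaging.formats.tiff.constants.MicrosoftTagConstants
import org.apache.commons.imaging.formats.tiff.constants.TiffTagConstants
import org.apache.commons.imaging.formats.tiff.write.TiffOutputSet

fun main() {
    val outputSet = TiffOutputSet()
    val ifd0 = outputSet.getOrCreateRootDirectory()
    // TIFF_TAG_COPYRIGHT is a TagInfoAscii; EXIF_TAG_XPTITLE is a TagInfoXpString,
    // so Commons Imaging stores the title as UTF-16LE, as the XP tags expect.
    ifd0.add(TiffTagConstants.TIFF_TAG_COPYRIGHT, "Copyright © 2009 Garret Wilson")
    ifd0.add(MicrosoftTagConstants.EXIF_TAG_XPTITLE, "Gate and Turret")
    File("gate-turret-fixed.jpg").outputStream().use { out ->
        ExifRewriter().updateExifMetadataLossy(File("gate-turret.jpg"), out, outputSet)
    }
}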

Nadahar commented 3 years ago

UCS-2/UTF-16 makes a lot of sense to me. It's used much more than you probably think. "Wide" strings in Windows use it, and I think even Java stores strings in memory using it. There's a long and complicated history, but as far as I remember, several solutions were considered, and UCS-2 was preferred because it had a predictable size (just like the old ANSI or ASCII strings). Down at the low level, there's a lot of buffer allocation taking place, and not being able to determine how many bytes you need to reserve before doing the actual encoding (which you do directly into said buffer) is a real hassle.

Back when UCS-2 was "in the wind", all Unicode characters could be represented using two bytes. This made it predictable and thus very useful. There was also an argument that, for many Asian languages for example, it needs less space to store most texts than UTF-8 does. English and most other "Latin" languages are more efficiently encoded in UTF-8, though.

When Unicode grew and the 2 byte barrier was breached, UTF-16 was "born" to make it possible to encode more characters using a similar method to UTF-8. This made the big argument for using UCS-2 largely moot, as UTF-16 doesn't have a predictable length either. If you want predictable length today, you need to use UTF-32, and I suspect that it's being used in processing/memory in many situations today for that reason.

Regardless, many decisions had already been made that made it very difficult to switch away from using two bytes for storage, which is why it's still very broadly used today. UTF-16 kind of became a "drop-in replacement" for UCS-2, since that was the easiest way to allow code that already operated on UCS-2 to support the extended Unicode characters. I don't know how the details have been worked out when it comes to buffer reservation etc., but I suspect that many things running today can still fail if you use the extended code points. The UCS-2 range still covers a lot, so most situations will seem to work.

There's really no way to "detect" UTF-8 or UTF-16 unless you know what the text is supposed to be, so I doubt IrfanView does this. Some byte combinations are invalid in UTF-8, so you could say that "as long as all the bytes form valid UTF-8, we consider it UTF-8", but that would still get it wrong in the majority of cases if the source is really UTF-16. My guess is that IrfanView has simply made an incorrect assumption that turned out to "work" in this case.
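
To be concrete about what such a check would look like, here is a minimal sketch (the helper name is mine):

import java.nio.ByteBuffer
import java.nio.charset.CharacterCodingException
import java.nio.charset.CodingErrorAction

// Returns true only if the bytes form strictly valid UTF-8 — necessary but not
// sufficient evidence that the text actually is UTF-8, as argued above.
fun isValidUtf8(bytes: ByteArray): Boolean = try {
    Charsets.UTF_8.newDecoder()
        .onMalformedInput(CodingErrorAction.REPORT)
        .onUnmappableCharacter(CodingErrorAction.REPORT)
        .decode(ByteBuffer.wrap(bytes))
    true
} catch (e: CharacterCodingException) {
    false
}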

I did try to decode the content of the example file both as UTF-16LE and UTF-16BE without getting the same text, but I probably did something wrong. I was doing it in a hex editor, and it's easy to get wrong. Still, this led me to exclude that explanation.

karlingen commented 1 year ago

Microsoft Windows 10 Explorer uses UTF-8 for Exif strings.

@garretwilson Do you know if this might have changed in Windows 11?

When I modify the title in Windows 11 and try to parse it using metadata-extractor version 2.18.0, I get the following output if I don't specify the encoding:

84 0 101 0 115 0 116 0 32 0 84 0 105 0 116 0 108 0 101 0 0 0

However, if I add UTF-16LE as a parameter to the getString() method, I get the following output:

Test Title??

This is the image I'm testing this with:

[attached test image: foster-lake]

And this is the code I'm using (Kotlin):

ImageMetadataReader.readMetadata(file)
    ?.getFirstDirectoryOfType(ExifIFD0Directory::class.java)
    ?.getString(ExifDirectoryBase.TAG_WIN_TITLE, "UTF-16LE")
garretwilson commented 1 year ago

Do you know if this might have changed in Windows 11?

Sorry, I have no idea. I haven't taken the plunge to Windows 11 yet.

Please make sure you read this thread carefully—it's very intricate. I haven't touched this in a while, and I myself would have to reread the thread to remember exactly which encoding Microsoft uses in which field.

karlingen commented 1 year ago

I got it working by not using getString() but rather getByteArray(), and then decoding the byte array as UTF-8.

garretwilson commented 1 year ago

If it is true that Windows 11 has switched from UCS‑2 to UTF-8 for the "XP" tags such as XPTitle, then that would be significant. The metadata-extractor library would need to be updated to first try to decode using UTF-8, and then, if that failed, to try UCS‑2. (Although I wouldn't be surprised if Microsoft added another field to indicate the encoding of the "XP" tags, if they changed what it has traditionally been.)
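
Something along these lines, perhaps (a rough sketch of the fallback idea, not actual metadata-extractor code; the helper and the embedded-NUL check are my own):

import java.nio.ByteBuffer
import java.nio.charset.CharacterCodingException
import java.nio.charset.CodingErrorAction

fun decodeXpTag(raw: ByteArray): String {
    val bytes = raw.dropLastWhile { it == 0.toByte() }.toByteArray() // strip null padding
    // ASCII text stored as UTF-16LE has 00 high bytes, which a strict UTF-8 decoder
    // would happily accept, so check for embedded NULs before trying UTF-8.
    if (bytes.any { it == 0.toByte() }) return String(bytes, Charsets.UTF_16LE)
    return try {
        Charsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .decode(ByteBuffer.wrap(bytes))
            .toString()
    } catch (e: CharacterCodingException) {
        String(bytes, Charsets.UTF_16LE)
    }
}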

All this sounds unexpected, though. Something doesn't sound quite right. (Although the same could be said about lots of things Microsoft does.) ~@karlingen can you confirm exactly which fields you're referring to?~ (Ah, never mind, I see it in your comment.)

garretwilson commented 1 year ago

Windows 11 seems perpetually in beta (which is one reason why I haven't switched). Maybe some new developer working on Windows 11 isn't versed in the arcane history of Exif field encodings and assumed (as I did when I opened this ticket) that these fields used ASCII or UTF-8. It's a shame the "XP" tags didn't use UTF-8 from the start, but since they didn't, the last thing we need is more confusion from switching back and forth (seeing that there is so much data in the wild already).

I recommend trying to file a bug report with Microsoft. If you do, by all means please report back.