AOMediaCodec / libavif

libavif - Library for encoding and decoding .avif files
Other
1.53k stars 195 forks source link

An example of XMP metadata that triggers the error "Error writing PNG: XMP metadata contains an invalid null character" in avifpng.c #1333

Open wantehchang opened 1 year ago

wantehchang commented 1 year ago

Yannis: Please take a look at the XMP metadata in this image and see if avifpng.c is rejecting it correctly.

Steps to reproduce:

  1. Download moon.jpg from https://crbug.com/1405727.
  2. avifenc moon.jpg moon.avif
  3. avifdec moon.avif moon.png

The avifdec command fails with the error message:

Error writing PNG: XMP metadata contains an invalid null character

wantehchang commented 1 year ago

Running avifdec in the debugger, I found that the null character is at the end of the avif->xmp.data buffer.

wantehchang commented 1 year ago

The XMP specification says "When XMP is encoded as UTF-8, there are no zero bytes in the XMP Packet." It seems that the XMP metadata in moon.jpg has an XMP packet that is incorrectly null-terminated.

y-guyon commented 1 year ago

Error writing PNG: XMP metadata contains an invalid null character

comes from

https://github.com/AOMediaCodec/libavif/blob/d48836b9e6d34a6a090e49eb3490fdf69eee90a6/apps/shared/avifpng.c#L481-L485

I assume the issue is that avifenc can generate a file that avifdec refuses. Solutions I can think of:

Opinions?

kmilos commented 1 year ago

IMHO, it makes sense to try and sanitize the XMP chunk at the point of import, from any format (this is not tied to PNG spec only, this is in the XMP spec), so you don't even produce an "invalid" AVIF in the first place. I think this is the approach we took in exiv2 as well as libexpat was throwing a similar error and refused to parse it from a sample JPEG...

y-guyon commented 1 year ago

@kmilos Thanks for the info. Do you have any pointer to exiv2 or libexpat code implementing what you described?

kmilos commented 1 year ago

Fond it - take a look at https://github.com/Exiv2/exiv2/pull/2133 and the related issue report.

libexpat just assumes its input is not terminated and simply barfs if it is, it has no workarounds for it. I remember even looking into to the libexpat code to convince myself that the nul character was explicitly in the list of invalid XML characters they check for...

wantehchang commented 1 year ago

We need to first make sure the terminating null byte comes from moon.jpg rather than our code in avifJPEGRead(). That is, we should verify that libjpeg does not null-terminate the data in marker->data and include the appended null byte in marker->data_length. I believe this is the case, but it would be good for another person to doublecheck it.

Once we know avifJPEGRead() is correct, then we change it to handle this kind of XMP metadata error. We can either report this error or fix up this error. If we fix up this error, I suggest we only strip the null byte at the end because it is very likely a programming error. If a null byte occurs in the middle of an XMP packet, that is more likely corrupted data and we should not try to fix that up.

We can leave the check during PNG export unchanged.

kmilos commented 1 year ago

This has nothing to do w/ libjpeg per se, who knows how that original file was created....

jzern commented 1 year ago

As noted, exiv2 (0.27.5) fails to parse the file:

$ exiv2 -p x print moon.jpg 
Error: XMP Toolkit error 201: Error in XMLValidator
Warning: Failed to decode XMP metadata.

exempi (version 2.6.3) will dump the packet, however.

$ exempi -x moon.jpg 
dump_xmp for file moon.jpg
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Exempi + XMP Core 6.0.0">
 <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about=""
...
kmilos commented 1 year ago

As noted, exiv2 (0.27.5) fails to parse the file:

Yep, the fix is in exiv2 0.27.6

Btw, exiftool -v5 can also provide insight into all of the chunks for most formats, and for moon.jpg I do see the culprit in the file:

JPEG APP1 (3629 bytes):
    011c: 68 74 74 70 3a 2f 2f 6e 73 2e 61 64 6f 62 65 2e [http://ns.adobe.]
    012c: 63 6f 6d 2f 78 61 70 2f 31 2e 30 2f 00 3c 3f 78 [com/xap/1.0/.<?x]
    013c: 70 61 63 6b 65 74 20 62 65 67 69 6e 3d 22 ef bb [packet begin="..]
    [snip]
    0f2c: 20 20 20 20 20 20 20 20 20 3c 3f 78 70 61 63 6b [         <?xpack]
    0f3c: 65 74 20 65 6e 64 3d 22 77 22 3f 3e 00          [et end="w"?>.]
y-guyon commented 1 year ago

We need to first make sure the terminating null byte comes from moon.jpg rather than our code in avifJPEGRead(). That is, we should verify that libjpeg does not null-terminate the data in marker->data and include the appended null byte in marker->data_length. I believe this is the case, but it would be good for another person to doublecheck it.

I confirm that at the following point:

https://github.com/AOMediaCodec/libavif/blob/d48836b9e6d34a6a090e49eb3490fdf69eee90a6/apps/shared/avifjpeg.c#L532

standardXMPData[standardXMPSize-1] is 0 when reading moon.jpg and 62 when reading paris_exif_xmp_icc.jpg.

wantehchang commented 1 year ago

Miloš, Yannis: Thank you very much for confirming our code is correct and the error is in the moon.jpg file.

Should we report an error if an XMP packet is not valid UTF-8? Or just check if the XMP packet contains a zero byte (similar to the check in avifpng.c)?

Our error message can point to tools that can be used to inspect and correct the errors in the XMP packet.

y-guyon commented 1 year ago

The XMP specification says "When XMP is encoded as UTF-8, there are no zero bytes in the XMP Packet."

I found that sentence only in XMP spec Part 3 section 1.1.2 GIF (Graphic Interchange Format). Does it apply to any XMP chunk in any format (including JPEG)? The PNG spec forbids it too, I did not check other formats.

Should we report an error if an XMP packet is not valid UTF-8? Or just check if the XMP packet contains a zero byte (similar to the check in avifpng.c)?

I started writing a change involving a new avifFixImageXMP() in avifutil.h, that is called from avifJPEGRead() and avifPNGRead(). It just trims trailing \0s. It could also be called during AVIF decoding. \ If \0 is forbidden in XMP chunks in JPEG and in AVIF, we can add a check at AVIF encoding. Otherwise, I suggest we just leave it like that: carry any non-trailing \0 over to AVIF and JPEG exports.

kmilos commented 1 year ago

I started writing a change involving a new avifFixImageXMP() in avifutil.h, that is called from avifJPEGRead() and avifPNGRead().

I feel you should do it at a single/common entry point (e.g. avifImageSetMetadataXMP() or some such), not for every possible image format (easy to forget when new ones are added)...

If \0 is forbidden in XMP chunks in JPEG and in AVIF,

It is forbidden in UTF-8 XML period, nothing to do w/ images. As I said, libexpat just flatly refuses to parse null-terminated XML, no matter where it comes from. From https://en.wikipedia.org/wiki/Valid_characters_in_XML:

Note that the code point U+0000, assigned to the null control character, is the only character encoded in Unicode and ISO/IEC 10646 that is always invalid in any XML 1.0 and 1.1 document.

y-guyon commented 1 year ago

I feel you should do it at a single/common entry point (e.g. avifImageSetMetadataXMP() or some such), not for every possible image format (easy to forget when new ones are added)...

I considered that, but wondered whether XMP chunks provided by users through the API should be altered. For example, someone setting 100 bytes of XMP, then accessing avifImage.xmp.data[99]. If the behavior is described in avif.h, I guess that is fine. \ avifImageSetMetadataXMP() could also check for non-trailing \0 and return an avifBool. \ The main issue I had with this approach is that there is currently no single entry point. Extended XMP in avifJPEGRead() does not use avifImageSetMetadataXMP().

not for every possible image format (easy to forget when new ones are added)...

I agree. Although it is just a few formats in libavif (for now...).

It is forbidden in UTF-8 XML period, nothing to do w/ images.

Great, this simplifies the fix. Thanks for the reference.

wantehchang commented 1 year ago

Yannis: Since I filed this issue, I wanted to clarify that after confirming our code is correct and the error is in the moon.jpg file, I think our current policy is fine (importing metadata faithfully, and only checking for errors that break the output format, such as zero bytes for a PNG iTXt chunk). As the bug reporter, I think it is okay to close this issue.

If you'd like to remove trailing null characters on import, I suggest we remove only one (to handle the off-by-one bug of including the terminating null character of a null-terminated string). Fixing up errors will hide them and make it harder to track down the source of the errors, so if an extra terminating null character is not a common error, we don't really need to fix it up.

kmilos commented 1 year ago

Well, then the bug is that you only check/warn for PNG, while it should be for all output formats.

y-guyon commented 1 year ago

I think an avifenc success followed by an avifdec error is a concern and should be fixed. Either both should succeed or both should fail. avifenc should not generate files that avifdec cannot convert to a widespread format such as PNG. I suggest avifenc to choke on \0, because we already check Exif. The alternative would be to make avifdec succeed with a warning such as The XMP chunk was discarded because it contains an invalid character.

In parallel, removing a single trailing null character on any import seems acceptable.

kmilos commented 1 year ago

As an alternative, you could just not fail at all on either avifdec or avifenc, and just copy the XMP chunk around as is, and defer the problem to other 3rd party apps, with preferably just a "YMMV" style warning given to the user if this condition was detected.

Or add an avifenc option to chomp invalid XML if keeping the original is imperative (or the other way around - default to valid output, but give an option to keep original)...

StefanOltmann commented 1 year ago

I found this topic very interesting!

I encountered the same issue when I discovered that the pre-installed files IMG_0001.jpg and IMG_0002.jpg in the iOS simulator also have the illegal NUL terminator.

Initially, I considered whether the issue should be addressed in https://github.com/Ashampoo/kim or not. However, I realized that even the command "exiftool -xmp -b IMG_0001.JPG > out.xmp" exports the NUL character at the end. Therefore, it seems correct that the XMP string is exported as is - with all errors.

In the end, I made the decision that the proper approach would be to handle this issue within the XMP parser. I fixed it in my case in https://github.com/Ashampoo/xmpcore/pull/4/files.

I want to express my gratitude for all the insights shared during this discussion. Thank you!

kmilos commented 1 year ago

However, I realized that even the command "exiftool -xmp -b IMG_0001.JPG > out.xmp" exports the NUL character at the end. Therefore, it seems correct that the XMP string is exported as is - with all errors.

Not quite. Note that the -b option just dumps the raw byte stream without any parsing, it doesn't care/know if it's XML or not. If exiftool were to parse & copy instead, it might drop it...

StefanOltmann commented 1 year ago

Not quite. Note that the -b option just dumps the raw byte stream without any parsing, it doesn't care/know if it's XML or not. If exiftool were to parse & copy instead, it might drop it...

Yes, that's correct. If ExifTool parses the actual XMP and writes new it looks very different.

In my case I have two libraries: Kim can read EXIF and extract the raw XMP string and XMP Core is for actual parsing of XMP.

At first I was unsure where I should fix the problem and wanted Kim to output a trimmed XMP string. But after seeing that ExifTool extracts the untouched original I decided to do the same. The XMP parser must be the one resilient to this problems.

ExifTool ignores the invalid char if parsing the data, but it exports the RAW data exactly as it is. I think this is the correct behavior.

StefanOltmann commented 1 year ago

I just noticed that also the XMP of the embedded preview JPG of my RAF files produced by my X-T4 have NUL characters at the end. In that case two.