drewnoakes / metadata-extractor

Extracts Exif, IPTC, XMP, ICC and other metadata from image, video and audio files
Apache License 2.0

Improved metadata parsing for HEIC files #594

Open davidekholm opened 1 year ago

davidekholm commented 1 year ago

Metadata-extractor can't read the EXIF and GPS tags from the HEIC file linked below. It was taken with an iPhone 13 Pro running the latest iOS version and has been edited with the iPhone Photo app.

File: https://jalbum.net/forum/servlet/JiveServlet/download/2-58642-359253-23524/IMG_8780.HEIC

(We have been testing with v2.15 of Metadata-extractor.)

acwolff commented 1 year ago

The problem is caused by the edit action with the iPhone photo app. In the attached zip file is an HEIC image which is not edited; this image gives no problem. IMG_8789.zip

StefanOltmann commented 8 months ago

@davidekholm I wanted to check your file, but it's gone by now.

Note that you can upload files to GitHub issues if you zip them first like @acwolff did.

StefanOltmann commented 8 months ago

@acwolff

this image gives no problem

The file is a bit corrupted in the sense that it contains two entries for XMP.

I dropped it into https://stefan-oltmann.de/exif-viewer/ and saw that it fails because of this. :/

Would you allow me to use this file as test data for my library https://github.com/ashampoo/kim?

acwolff commented 8 months ago

@StefanOltmann of course you may use my image in your library.

StefanOltmann commented 8 months ago

@acwolff Thank you! I will add it to my repo with the next update. :)

You may consider adding it to https://github.com/drewnoakes/metadata-extractor-images/

It's actually a pretty good test file as it includes some oddities like two MDAT boxes, each with its own XMP. This even broke my library, but has now been fixed. I think Drew is always happy about special cases. ;)

You can now inspect the file using https://stefan-oltmann.de/exif-viewer:

(two screenshots)

It looks like some software just appended a new XMP box and ignored the existing one.

I haven't had the chance to explore how metadata-extractor manages multiple XMP boxes, as they may potentially contain conflicting information. Currently, my approach is to consider the information from the XMP box located at the end of the file.

How did you create this file?

acwolff commented 8 months ago

@StefanOltmann I took this picture on an iPhone 13 Pro and then probably edited it on the iPhone.

davidekholm commented 8 months ago

@StefanOltmann, here's a zipped version of that file: IMG_8780.HEIC.zip

StefanOltmann commented 8 months ago

Interesting.

Both files originate from an iPhone 13 Pro and exhibit a striking similarity in their file structures. Notably, using the iPhone Photo app for editing results in an unconventional file arrangement.

While my Exif viewer successfully identifies Exif and XMP in both files, metadata-extractor, unfortunately, does not. I can indeed confirm the existence of this issue.

Upon a swift debugging attempt, I was unable to pinpoint the exact problem. What I observed is that the correct skipping of the first mdat box does not occur for some reason. Consequently, it fails to detect the subsequent boxes.

The analysis never progresses to the fourth box in the file, which happens to be the meta box.
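
For readers unfamiliar with the box layout, here is a minimal Python sketch of walking the top-level boxes of an ISO BMFF file such as HEIC. It is not metadata-extractor's code, and `iter_top_level_boxes` is a hypothetical name. It assumes a seekable stream and handles the two special size values in the box header: 1 means a 64-bit size follows the type, and 0 means the box runs to the end of the file. Mishandling either case while skipping a large mdat box would leave the reader at the wrong offset, which is one plausible (unconfirmed) way to miss a later meta box.

```python
import io
import struct


def iter_top_level_boxes(stream):
    """Yield (box_type, offset, size) for each top-level ISO BMFF box.

    Illustrative sketch only; assumes `stream` is seekable.
    """
    stream.seek(0, io.SEEK_END)
    file_size = stream.tell()
    offset = 0
    while offset + 8 <= file_size:
        stream.seek(offset)
        size, box_type = struct.unpack(">I4s", stream.read(8))
        header_size = 8
        if size == 1:
            # A 64-bit "largesize" field follows the 4-byte type.
            large = stream.read(8)
            if len(large) < 8:
                raise ValueError(f"truncated largesize at offset {offset}")
            (size,) = struct.unpack(">Q", large)
            header_size = 16
        elif size == 0:
            # The box extends to the end of the file.
            size = file_size - offset
        if size < header_size:
            raise ValueError(f"invalid box size {size} at offset {offset}")
        yield box_type.decode("latin-1"), offset, size
        # Skip the payload (e.g. mdat pixel data) without reading it.
        offset += size
```

Printing the yielded types for a file shows its box order, for example whether meta comes before or after the mdat boxes.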

(screenshot)

StefanOltmann commented 8 months ago

Normally the meta box comes first for HEIC. This is the case for unmodified HEIC files from my iPhone 12 Mini. This is the more common case that’s handled correctly here. Some files have mdat first or even multiple mdat boxes. There is special logic for this case, and I guess that is what fails here.

It is harder to handle and before yesterday my library failed on that, too.

I must say that I don’t like HEIF at all. Not only is it patent-encumbered, but there also seem to be no clear rules on the box order. It seems to be random. This is exactly what leads to problems like this one.

WebP even specifies that the metadata comes last, which is the absolute worst.

JPG has metadata coming first, which is great for reading from a cloud source. Metadata-extractor can stop reading the file as soon as we hit pixel data.
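
To illustrate the early-stop idea, here is a minimal Python sketch that collects JPEG segments and stops at the Start of Scan marker (0xFFDA), where the compressed pixel data begins. It is not how metadata-extractor is implemented; `read_jpeg_metadata_segments` is a hypothetical name, and the sketch assumes every marker before SOS carries a two-byte length and that there are no fill bytes between segments.

```python
import io
import struct


def read_jpeg_metadata_segments(stream):
    """Return [(marker, payload)] for all segments before Start of Scan.

    Simplified sketch: ignores fill bytes and standalone markers.
    """
    if stream.read(2) != b"\xff\xd8":
        raise ValueError("not a JPEG: missing SOI marker")
    segments = []
    while True:
        header = stream.read(4)
        if len(header) < 4 or header[0] != 0xFF:
            break  # truncated or malformed; stop with what we have
        marker = header[1]
        if marker == 0xDA:
            break  # Start of Scan: pixel data follows, nothing more to read
        (length,) = struct.unpack(">H", header[2:4])
        if length < 2:
            raise ValueError(f"invalid segment length {length}")
        # The length field counts itself (2 bytes) plus the payload.
        segments.append((marker, stream.read(length - 2)))
    return segments
```

Because the loop returns at SOS, only the header portion of the file is ever read, which is what makes this layout friendly to remote or cloud sources.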

While not specified for PNG, browser devs decided to ignore metadata after the pixels.

So we have a good and clear situation with JPG & PNG, but HEIF & WebP are ugly for efficient metadata parsing.

Unfortunately the box order for JPEG XL is not defined either. I opened an issue and hope they may specify or at least recommend metadata to come first.

Nadahar commented 8 months ago

MPEG-4 video aka "MP4" has the same idiotic issue too - where they put the metadata last by default. To make MP4 files "streamable", you have to specify extra options to move the "moov atom" to the front. If you think it's bad for images to have to read the whole file to get to the metadata, consider videos of several GBs.

Apparently this stems from the need to make encoding easier:

And the reason the moov atom is conventionally placed at the end is the inverse of the streamed playback problem: where playback requires knowing the index metadata before the video, generating the index metadata requires having the video encoded/available. You can't really pre-generate the moov atom so making a video streamable requires moving the atom after the video is already encoded, whether by editing the output or by buffering the entire video during the encode. You can't write a streamable video directly to an output stream in one go
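
As a rough illustration of what "streamable" means here, the following Python sketch checks whether the moov box appears before the first mdat box in an MP4 file. The name `is_fast_start` is hypothetical, and the sketch assumes a seekable stream and looks at top-level boxes only.

```python
import io
import struct


def is_fast_start(stream):
    """Return True if 'moov' precedes the first 'mdat' at the top level."""
    stream.seek(0, io.SEEK_END)
    end = stream.tell()
    offset = 0
    while offset + 8 <= end:
        stream.seek(offset)
        size, box_type = struct.unpack(">I4s", stream.read(8))
        if box_type == b"moov":
            return True
        if box_type == b"mdat":
            return False
        if size == 1:
            # 64-bit size stored after the type field.
            large = stream.read(8)
            if len(large) < 8:
                raise ValueError("truncated box header")
            (size,) = struct.unpack(">Q", large)
        elif size == 0:
            break  # box runs to end of file; nothing can follow it
        if size < 8:
            raise ValueError(f"invalid box size {size} at offset {offset}")
        offset += size
    raise ValueError("neither moov nor mdat found")
```

A file for which this returns False is one where a reader has to get past the entire media payload before reaching the index.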

MPEG-2 PS and TS don't have this problem (there is no MPEG-3 because of the potential confusion with MP3, so MPEG-2 is the previous version). MPEG-2 also supports a much broader range of stream types, and honestly I don't quite understand why "the world" has moved to the MPEG-4 format, which I consider inferior. The MPEG-2 TS format is much more complicated though: metadata is repeated throughout the stream, and you can have a complicated "hierarchy" of components. So the driver might be a need to simplify. A lot of stuff, like TV broadcast, still uses MPEG-2 though, because MPEG-4 simply can't do what it does.

That was borderline off-topic, but I assume that it's this "issue" with MPEG-4 that has "polluted" the HEIF format as well - despite there probably not being any encoder reason to put the metadata last for images (you don't need to make an index).