drewnoakes / metadata-extractor

Extracts Exif, IPTC, XMP, ICC and other metadata from image, video and audio files
Apache License 2.0
2.55k stars 479 forks

Extract iTunes metadata from mp4 #542

Open tballison opened 3 years ago

tballison commented 3 years ago

Over on Apache Tika (https://issues.apache.org/jira/browse/TIKA-3412), we'd like to migrate our mp4 parsing to metadata-extractor. With the sannies parser (https://github.com/sannies/mp4parser), which apparently is no longer supported, we're able to extract useful data from Apple boxes with this code:

https://github.com/apache/tika/blob/b284e7cfbaaced599fa56ce61e5baf65ba08f842/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-audiovideo-module/src/main/java/org/apache/tika/parser/mp4/MP4Parser.java#L272

It looks like the data lives in a meta box inside the user data (udta) container?

Here's our unit test:

    // Check the textual contents
    assertContains("Test Title", content);
    assertContains("Test Artist", content);
    assertContains("Test Album", content);
    assertContains("2008", content);
    assertContains("Test Comment", content);
    assertContains("Test Genre", content);

    // Check XMPDM-typed audio properties
    assertEquals("Test Album", metadata.get(XMPDM.ALBUM));
    assertEquals("Test Artist", metadata.get(XMPDM.ARTIST));
    assertEquals("Test Composer", metadata.get(XMPDM.COMPOSER));
    assertEquals("2008", metadata.get(XMPDM.RELEASE_DATE));
    assertEquals("Test Genre", metadata.get(XMPDM.GENRE));
    assertEquals("Test Comments", metadata.get(XMPDM.LOG_COMMENT.getName()));
    assertEquals("1", metadata.get(XMPDM.TRACK_NUMBER));
    assertEquals("Test Album Artist", metadata.get(XMPDM.ALBUM_ARTIST));
    assertEquals("6", metadata.get(XMPDM.DISC_NUMBER));
    assertEquals("0", metadata.get(XMPDM.COMPILATION));

Any recs on how to implement this? Any chance you'd be willing to add these features?

tballison commented 3 years ago

It looks like this data is in the udta box: �meta "hdlr mdirappl � �ilst "©nam data  Test Title cpil data  pgap data  tmpo data  '©too data  iTunes 10.5.3.3 �---- mean com.apple.iTunes name iTunSMPB �data  00000000 00000840 00000000 00000000000003C0 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 w---- mean com.apple.iTunes name Encoding Params 8data vers acbf brat � srcq cdcv  �---- mean com.apple.iTunes name iTunNORM jdata  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 #©ART data  Test Artist )aART !data  Test Album Artist %©wrt data  Test Composer "©alb data  Test Album "©gen data  Test Genre trkn data  * disk data  ©day data  2008 %©cmt data  Test Comments 'free"

tballison commented 3 years ago

Raw bytes for the userdata box: apple-payload.bin.zip

Our unit test file for the metadata above: https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-audiovideo-module/src/test/resources/test-documents/testMP4.m4a

tballison commented 3 years ago

I did some ugly hackery: https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-audiovideo-module/src/main/java/org/apache/tika/parser/mp4/TikaMp4BoxHandler.java

and: https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-audiovideo-module/src/main/java/org/apache/tika/parser/mp4/boxes/TikaUserDataBox.java

This now works for Tika. If there's a clean way to add these changes back to metadata-extractor, please let me know. Many thanks, again, for such a great library!

payton commented 3 years ago

Hey @tballison - This is great! (fyi I will be using 'atom' and 'box' interchangeably in this message)

There is a fairly clean way to get it all added.

Under the udta container, the atom hierarchy looks like the following:

Right now, metadata-extractor handles udta as a box instead of a container. We can treat it as a container instead, and the reader will then parse the nested atoms as expected. Your code can (relatively) easily be added as a new box, the ItemListBox. The hdlr box that precedes ilst declares a handler type (mdir) that is not supported yet, so we will also need to add a new handler for this mdir type.
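To make the box-versus-container distinction concrete, here is a minimal standalone sketch of a walker that descends into udta, meta and ilst and lists the nested atom types. It is not metadata-extractor's reader or handler API: the class `AtomTreeSketch`, the methods `walk` and `box`, and the `CONTAINERS` set are hypothetical names made up for illustration. It also assumes that `meta` carries 4 bytes of version/flags before its children (as in iTunes-written files) and it ignores 64-bit box sizes.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; not metadata-extractor's reader or handler API.
class AtomTreeSketch {
    // Box types whose payload is treated as a sequence of child boxes (an assumption).
    private static final Set<String> CONTAINERS =
            new HashSet<>(Arrays.asList("udta", "meta", "ilst"));

    /** Appends one indented line per box found between buf's position and limit. */
    static void walk(ByteBuffer buf, int depth, List<String> out) {
        while (buf.remaining() >= 8) {
            int start = buf.position();
            long size = buf.getInt() & 0xFFFFFFFFL; // 32-bit big-endian box size
            byte[] typeBytes = new byte[4];
            buf.get(typeBytes);
            // ISO-8859-1 keeps the 0xA9 byte of iTunes atom types readable as '©'.
            String type = new String(typeBytes, StandardCharsets.ISO_8859_1);
            if (size < 8 || size > buf.limit() - start) {
                return; // malformed, or a special size (0 or 1) that this sketch ignores
            }
            int end = start + (int) size;
            StringBuilder line = new StringBuilder();
            for (int i = 0; i < depth; i++) {
                line.append("  ");
            }
            out.add(line.append(type).toString());
            if (CONTAINERS.contains(type)) {
                // Assumption: 'meta' is a full box with 4 bytes of version/flags.
                if ("meta".equals(type) && end - buf.position() >= 4) {
                    buf.position(buf.position() + 4);
                }
                ByteBuffer children = buf.duplicate();
                children.limit(end);
                walk(children, depth + 1, out);
            }
            buf.position(end); // a plain box is skipped without descending into it
        }
    }

    /** Builds a box from a type and payload parts; only used to create demo input. */
    static byte[] box(String type, byte[]... parts) {
        int size = 8;
        for (byte[] part : parts) {
            size += part.length;
        }
        ByteBuffer b = ByteBuffer.allocate(size);
        b.putInt(size);
        b.put(type.getBytes(StandardCharsets.ISO_8859_1));
        for (byte[] part : parts) {
            b.put(part);
        }
        return b.array();
    }

    public static void main(String[] args) {
        // Synthetic input shaped like the udta payload discussed in this thread.
        byte[] udta = box("udta",
                box("meta", new byte[4],
                        box("hdlr", new byte[24]),
                        box("ilst", box("\u00A9nam", box("data", new byte[8])))));
        List<String> lines = new ArrayList<>();
        walk(ByteBuffer.wrap(udta), 0, lines);
        lines.forEach(System.out::println);
    }
}
```

If `udta` were removed from `CONTAINERS`, the walker would list it and stop there, which is roughly the behaviour described above for the current code.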

It has been quite a while since I have revisited this code, but I believe this is the cleanest approach... I created a PR on my fork and added some comments to explain what's going on (along with caveats): https://github.com/payton/metadata-extractor/pull/3/files#

TLDR of what I did in that PR:

  1. Create a new handler to support mdir or 'metadata' type handlers
  2. Create a new ItemListBox to support ilst boxes
  3. Move over processIList logic to the new ItemListBox https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-audiovideo-module/src/main/java/org/apache/tika/parser/mp4/boxes/TikaUserDataBox.java#L136
  4. Make a few modifications to processIList to support our directory structure

Let me know if you have any questions/comments/concerns about that example! Feel free to use as much or as little (or none) of it to add your work :)

With that example, I extract the Album and Album Artist from your sample file. I have not tested this on other files, though.

[MP4] Major Brand = Apple iTunes AAC-LC (.M4A) Audio
[MP4] Minor Version = 0
[MP4] Compatible Brands = [Apple iTunes AAC-LC (.M4A) Audio, MP4 v2 [ISO 14496-14], MP4  Base Media v1 [IS0 14496-12:2003], Unknown]
[MP4] Creation Time = Sat Jan 28 13:39:18 EST 2012
[MP4] Modification Time = Sat Jan 28 13:40:25 EST 2012
[MP4] Duration = 3072
[MP4] Media Time Scale = 44100
[MP4] Duration in Seconds = 00:00:01
[MP4] Transformation Matrix = 65536 0 0 0 65536 0 0 0 1073741824
[MP4] Preferred Rate = 1
[MP4] Preferred Volume = 1
[MP4] Next Track ID = 2
[MP4 Sound] Creation Time = Sat Jan 28 13:39:18 -05:00 2012
ERROR: End of data reached.
[MP4 Sound] Modification Time = Sat Jan 28 13:40:25 -05:00 2012
[MP4 Sound] ISO 639-2 Language Code = und
[MP4 Sound] Balance = 0
[MP4 Sound] Format = MPEG-4, Advanced Audio Coding (AAC)
[MP4 Sound] Number of Channels = 2
[MP4 Sound] Sample Size = 16
[MP4 Sound] Sample Rate = 44100
[QuickTime Metadata] Album = Test Album
[QuickTime Metadata] Album Artist = Test Album Artist
[File Type] Detected File Type Name = MP4
[File Type] Detected File Type Long Name = MPEG-4 Part 14
[File Type] Detected MIME Type = video/mp4
[File Type] Expected File Name Extension = mp4
[File] File Name = tika.m4a
[File] File Size = 4770 bytes
[File] File Modified Date = Sun Jul 18 14:16:53 -04:00 2021