Open tballison opened 3 years ago
It looks like this data is in the udta box: �meta "hdlr mdirappl � �ilst "©nam data Test Title cpil data pgap data tmpo data '©too data iTunes 10.5.3.3 �---- mean com.apple.iTunes name iTunSMPB �data 00000000 00000840 00000000 00000000000003C0 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 w---- mean com.apple.iTunes name Encoding Params 8data vers acbf brat � srcq cdcv �---- mean com.apple.iTunes name iTunNORM jdata 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 #©ART data Test Artist )aART !data Test Album Artist %©wrt data Test Composer "©alb data Test Album "©gen data Test Genre trkn data * disk data ©day data 2008 %©cmt data Test Comments 'free
Raw bytes for the userdata box: apple-payload.bin.zip
Our unit test file for the metadata above: https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-audiovideo-module/src/test/resources/test-documents/testMP4.m4a
I did some ugly hackery: https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-audiovideo-module/src/main/java/org/apache/tika/parser/mp4/TikaMp4BoxHandler.java
This now works for Tika. If there's a clean way to add these changes back to metadata-extractor, please let me know. Many thanks, again, for such a great library!
Hey @tballison - This is great! (fyi I will be using 'atom' and 'box' interchangeably in this message)
There is a fairly clean way to get it all added.
Under the udta container, the atom hierarchy looks like the following:
udta
  meta
    hdlr (handler type: mdir)
    ilst
    free
Right now, metadata-extractor handles udta as a box instead of a container. We can actually treat it as a container, and the reader will then parse the nested atoms as expected. Your code can (relatively) easily be added as a new box, the ItemListBox. The hdlr atom that precedes ilst declares a handler type that is not supported yet, so we will also need to add a new handler for this mdir type.
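For anyone following along, here is a rough stand-alone sketch of the "treat it as a container" idea. It does not use metadata-extractor's classes: AtomWalker, walk and atom are hypothetical names made up for this example. It assumes plain 32-bit atom sizes and that meta carries 4 bytes of version/flags before its children, as in the ISO form. The sample bytes in main are synthetic, and putting free under meta there is just an assumption for the demo.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-alone walker; not part of metadata-extractor or Tika. */
final class AtomWalker {
    // Atoms whose payload is itself a sequence of child atoms (assumed set for this sketch).
    private static final List<String> CONTAINERS = Arrays.asList("udta", "meta", "ilst");

    /** Appends the slash-separated path of every atom found in buf[start, end) to out. */
    static List<String> walk(byte[] buf, int start, int end, String parent, List<String> out) {
        int pos = start;
        while (pos + 8 <= end) {
            long size = ByteBuffer.wrap(buf, pos, 4).getInt() & 0xFFFFFFFFL;
            // ISO-8859-1 keeps byte 0xA9 as '©', which prefixes many iTunes atom names.
            String type = new String(buf, pos + 4, 4, StandardCharsets.ISO_8859_1);
            // Sizes 0 and 1 (to-end-of-file and 64-bit sizes) are not handled here.
            if (size < 8 || pos + size > end) {
                break; // malformed or truncated atom: stop instead of reading past the end
            }
            String path = parent + "/" + type;
            out.add(path);
            if (CONTAINERS.contains(type)) {
                // 'meta' is assumed to be a full box: skip 4 bytes of version/flags.
                int header = type.equals("meta") ? 12 : 8;
                walk(buf, pos + header, (int) (pos + size), path, out);
            }
            pos += (int) size;
        }
        return out;
    }

    /** Builds an atom: 4-byte big-endian size, 4-byte type, then the concatenated parts. */
    static byte[] atom(String type, byte[]... parts) {
        int size = 8;
        for (byte[] p : parts) {
            size += p.length;
        }
        ByteBuffer b = ByteBuffer.allocate(size);
        b.putInt(size).put(type.getBytes(StandardCharsets.ISO_8859_1));
        for (byte[] p : parts) {
            b.put(p);
        }
        return b.array();
    }

    public static void main(String[] args) {
        // Synthetic payloads: zero-filled bodies, only the nesting matters here.
        byte[] hdlr = atom("hdlr", new byte[26]);
        byte[] ilst = atom("ilst", atom("\u00A9alb", new byte[16]));
        byte[] udta = atom("udta", atom("meta", new byte[4], hdlr, ilst, atom("free")));
        for (String path : walk(udta, 0, udta.length, "", new ArrayList<>())) {
            System.out.println(path);
        }
    }
}
```

The point is only that once udta is descended into rather than read as an opaque blob, meta, hdlr and ilst show up as ordinary child atoms that can each get their own handling.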
It has been quite a while since I have revisited this code, but I believe this is the cleanest approach... I created a PR on my fork and added some comments to explain what's going on (along with caveats): https://github.com/payton/metadata-extractor/pull/3/files#
TLDR of what I did in that PR:
- mdir or 'metadata' type handlers
- ilst boxes
- processIList logic to the new ItemListBox: https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-audiovideo-module/src/main/java/org/apache/tika/parser/mp4/boxes/TikaUserDataBox.java#L136
- processIList to support our directory structure

Let me know if you have any questions/comments/concerns about that example! Feel free to use as much or as little (or none) of it to add your work :)
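To make the ilst part a bit more concrete, below is a minimal sketch of reading text items out of an ilst payload. It is not the processIList code from Tika or the PR: ItemListSketch, readTextItems and textItem are hypothetical names. It assumes the usual iTunes layout, where each item atom wraps a data atom made of an 8-byte header, a 4-byte type indicator (1 meaning UTF-8 text), a 4-byte locale, and then the value. Non-text items (trkn, disk, cpil, ...) are skipped, and all freeform "----" items would share one key here.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of an ilst reader; not the Tika or metadata-extractor implementation. */
final class ItemListSketch {

    /** Maps each item's four-character code to its UTF-8 text value, skipping non-text items. */
    static Map<String, String> readTextItems(byte[] ilstPayload) {
        Map<String, String> tags = new LinkedHashMap<>();
        ByteBuffer in = ByteBuffer.wrap(ilstPayload);
        while (in.remaining() >= 8) {
            int itemStart = in.position();
            int itemSize = in.getInt();
            String key = fourCc(in);
            if (itemSize < 8 || (long) itemStart + itemSize > ilstPayload.length) {
                break; // malformed or truncated item
            }
            int itemEnd = itemStart + itemSize;
            // Scan the item's children for the first 'data' atom (skips 'mean'/'name').
            while (in.position() + 8 <= itemEnd) {
                int childStart = in.position();
                int childSize = in.getInt();
                String childType = fourCc(in);
                if (childSize < 8 || (long) childStart + childSize > itemEnd) {
                    break;
                }
                if (childType.equals("data") && childSize >= 16) {
                    int typeIndicator = in.getInt() & 0x00FFFFFF; // low 3 bytes hold the type
                    in.getInt(); // locale, ignored in this sketch
                    if (typeIndicator == 1) { // 1 = UTF-8 text
                        byte[] value = new byte[childSize - 16];
                        in.get(value);
                        tags.put(key, new String(value, StandardCharsets.UTF_8));
                    }
                    break;
                }
                in.position(childStart + childSize);
            }
            in.position(itemEnd);
        }
        return tags;
    }

    private static String fourCc(ByteBuffer in) {
        byte[] b = new byte[4];
        in.get(b);
        // ISO-8859-1 keeps byte 0xA9 as '©' in names such as ©alb.
        return new String(b, StandardCharsets.ISO_8859_1);
    }

    /** Builds an atom: 4-byte big-endian size, 4-byte type, then the payload. */
    static byte[] atom(String type, byte[] payload) {
        return ByteBuffer.allocate(8 + payload.length)
                .putInt(8 + payload.length)
                .put(type.getBytes(StandardCharsets.ISO_8859_1))
                .put(payload)
                .array();
    }

    /** Builds a synthetic text item: key atom wrapping a 'data' atom of type 1 (UTF-8). */
    static byte[] textItem(String key, String value) {
        byte[] text = value.getBytes(StandardCharsets.UTF_8);
        byte[] body = ByteBuffer.allocate(8 + text.length).putInt(1).putInt(0).put(text).array();
        return atom(key, atom("data", body));
    }

    public static void main(String[] args) {
        byte[] album = textItem("\u00A9alb", "Test Album");
        byte[] artist = textItem("aART", "Test Album Artist");
        byte[] payload = ByteBuffer.allocate(album.length + artist.length)
                .put(album).put(artist).array();
        System.out.println(readTextItems(payload));
    }
}
```

A real implementation would map these four-character codes onto directory tags and handle the binary item types as well; this only shows the byte layout being walked.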
With that example, I extract the Album and Album Artist from your sample file. I have not tested this on other files, though.
[MP4] Major Brand = Apple iTunes AAC-LC (.M4A) Audio
[MP4] Minor Version = 0
[MP4] Compatible Brands = [Apple iTunes AAC-LC (.M4A) Audio, MP4 v2 [ISO 14496-14], MP4 Base Media v1 [IS0 14496-12:2003], Unknown]
[MP4] Creation Time = Sat Jan 28 13:39:18 EST 2012
[MP4] Modification Time = Sat Jan 28 13:40:25 EST 2012
[MP4] Duration = 3072
[MP4] Media Time Scale = 44100
[MP4] Duration in Seconds = 00:00:01
[MP4] Transformation Matrix = 65536 0 0 0 65536 0 0 0 1073741824
[MP4] Preferred Rate = 1
[MP4] Preferred Volume = 1
[MP4] Next Track ID = 2
[MP4 Sound] Creation Time = Sat Jan 28 13:39:18 -05:00 2012
ERROR: End of data reached.
[MP4 Sound] Modification Time = Sat Jan 28 13:40:25 -05:00 2012
[MP4 Sound] ISO 639-2 Language Code = und
[MP4 Sound] Balance = 0
[MP4 Sound] Format = MPEG-4, Advanced Audio Coding (AAC)
[MP4 Sound] Number of Channels = 2
[MP4 Sound] Sample Size = 16
[MP4 Sound] Sample Rate = 44100
[QuickTime Metadata] Album = Test Album
[QuickTime Metadata] Album Artist = Test Album Artist
[File Type] Detected File Type Name = MP4
[File Type] Detected File Type Long Name = MPEG-4 Part 14
[File Type] Detected MIME Type = video/mp4
[File Type] Expected File Name Extension = mp4
[File] File Name = tika.m4a
[File] File Size = 4770 bytes
[File] File Modified Date = Sun Jul 18 14:16:53 -04:00 2021
Over on Apache Tika (https://issues.apache.org/jira/browse/TIKA-3412), we'd like to migrate our mp4 parsing to metadata-extractor. With the sannies parser (https://github.com/sannies/mp4parser), which apparently is no longer supported, we're able to extract useful data from Apple boxes with this code:
https://github.com/apache/tika/blob/b284e7cfbaaced599fa56ce61e5baf65ba08f842/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-audiovideo-module/src/main/java/org/apache/tika/parser/mp4/MP4Parser.java#L272
It looks like a meta box inside of the userdata container?
Here's our unit test:
Any recs on how to implement this? Any chance you'd be willing to add these features?