JamesHeinrich / getID3

http://www.getid3.org/
Other
1.15k stars 246 forks source link

WAV file metadata is garbled #263

Closed roundwheel closed 4 years ago

roundwheel commented 4 years ago

While reading the RIFF/WAVE metadata on a WAV file, some of the data is coming through garbled.

image

As you can see in the screenshot, this is a WAV file that was created/exported using Logic Pro X. Is this a known issue with that software, or possibly a text encoding issue? Or maybe just corrupted metadata? The audio plays fine.

JamesHeinrich commented 4 years ago

I might guess that it's storing the data in its own proprietary format. If you can create another small (preferably audio-blank) sample file in "Logic Pro X" that shows the same kind of metadata I can take a look, but no promises.

paulijar commented 4 years ago

One of my users reported a similar issue with WAV files containing RIFF ID3v2.3 tags here: https://github.com/owncloud/music/issues/788. Here it seems that some similar WAV files have the problem and others don't. The problematic files can be parsed correctly e.g. by Windows 10 Explorer, foobar2000, and Mp3tag, so the format shouldn't be too exotic.

I was able to reproduce the issue with a sample file by using getID3 demo.basic.php, so clearly the issue is within the library. I have pasted the output of the demo below. It can be seen, that getID3 has extracted both the correct track title and some garbage value. For the author field, the correct value has been extracted, but then some garbage characters have been included after the correct content.

I'll send a link to this sample file to info@getid3.org shortly.

Array
(
    [author] => Array
        (
            [0] => Logic Prohÿ¿hÿ¿¸ÿ¿v·¢tÿ¿P
        )

    [title] => Array
        (
            [0] => Ðõ›Ðõ›8j¬HÅî”´L¬0L¬õ› ‘ÿ¿ÈŽÿ¿3ÿ¿Ðõ›¿Þ¿ÞUJªªXÿ¿!(‘¿Þ¤îÄÂÓÿ¿ œÓ€ÀŸXÿ¿õ_µõ”ô  ×õ›õ›8j¬€Þ«•ª˜ÿ¿‰Í‘€Þ0œ`ôù¬ÓœÁܑнܑž?đ¯Ì‘`
            [1] => Athletic Theme (Super Mario World)
        )

    [product] => Array
        (
            [0] => Consouls 1-1
        )

    [album] => Array
        (
            [0] => Consouls 1-1
        )

    [band] => Array
        (
            [0] => The Consouls
        )

    [track_number] => Array
        (
            [0] => 01
        )

    [text] => Array
        (
            [replaygain_album_gain] => -4.37 dB
            [replaygain_album_peak] => 1.266338
            [replaygain_track_gain] => +0.07 dB
            [replaygain_track_peak] => 1.000000
        )

)
JamesHeinrich commented 4 years ago

@paulijar Thanks for the sample files. Both files contain "id3", "INFO", and "bext" RIFF chunks. The ID3v2 data inside the "id3" chunk seems sane and valid, and most likely the only thing that other programs would look at. The RIFF.INFO chunks also contain valid (if limited) data -- title, product, tracknumber. It's the bext chunk that's malformed and causing problems. Part of the data that can be stored in the bext chunk is a Title field, and since the chunk appears to be malformed a bunch of garbage data is being interpreted as being in the Title section of the chunk. By happenchance the "working" sample file just encountered nulls data when reading the Title section so doesn't generate a bad Title entry in the comments output.

It would appear that it's a common issue with Logic Pro producing invalid data. If I had to hazard a guess, I might suggest that Logic Pro is writing bext tag data elements to the correct offsets, but may not be clearing the unused part of the tag prior to writing. If writing a brand new file then no problem, but if overwriting/modifying an existing file it looks like whatever random data was at that offset in the file before is not being erased. At least that's my guess based on the random nature of the data in those locations but with some strings (e.g. "Logic Pro" and timestamps) appearing perfectly correct in the places they're supposed to be. Perhaps you should open a bug report ticket with Logic Pro if these files were generated with a current version of that software (I'm not familiar with it). If you wish to examine your "broken" sample file with a hex editor, you can see the bext chunk starts at offset 40172186 with bext, then 4 bytes 5B 02 00 00 indicating 603 bytes of data (in reality taking 604 bytes due to WORD-padded nature of RIFF files), and then following that should immediately be 256 bytes for the Title field, and so forth, but it's filled with apparently-random data, except where "Logic Pro" and "2014-08-2821:16:22" have been inserted in appropriate places.

roundwheel commented 4 years ago

Here is another sample file that you can reference.

getID3-263.wav.zip

paulijar commented 4 years ago

@JamesHeinrich Thanks for the quick response!

Looking at the mentioned bext chunk of the "broken" file with hex editor, it looks like this: image

Clearly, the CHAR Description[256] field (highlighted in the picture) starts with a null terminator. Shouldn't it then be interpreted as 0-length, regardless of the following bytes? Similarly, there is a null terminator after the string "Logic Pro", so the garbage following it should be ignored.

paulijar commented 4 years ago

@roundwheel's example file seems to have similar bext description field where the first byte is 0, and the rest of the field is filled with random bytes.

JamesHeinrich commented 4 years ago

@paulijar I had noticed the null-terminators in both the empty and non-empty strings, but it wasn't till I after your comment I re-read the spec document and indeed that is a valid interpretation. I still consider it poor programming to write null-terminated data into a fixed data space without wiping the previous contents, but I guess it's not invalid.

I have fixed the issue in https://github.com/JamesHeinrich/getID3/commit/97f8c303eb9cfc808bd455c3cba2dec6de43a3d4 and the sample files should read correctly now.

Thank you (both) for the sample files and pointing me in the right direction.

paulijar commented 4 years ago

Great, thanks! Seems to work fine now. I agree that it is probably not the best design choice to include the contents of some random uninitialized memory in a written file. In theory, it could even cause a breach of confidentiality when some random memory contents gets shared more widely than intended. It's unlikely, though, that there would be anything interesting.