ID3v2: Handle null terminators in UTF-8 encoded frames.

Serial-ATA / lofty-rs

Audio metadata library

Apache License 2.0

182 stars 34 forks source link

ID3v2: Handle null terminators in UTF-8 encoded frames. #273

Closed sublipri closed 8 months ago

sublipri commented 9 months ago

iTunes and applications that use Mutagen will add null terminators to UTF-8 encoded frames due to ambiguities in the spec. See this issue for some background:

https://github.com/quodlibet/mutagen/issues/379

This causes lofty to create an extra empty value when splitting multiple values.

Given how widespread a practise this is, it seems necessary to handle it when reading the content of a frame. I'm not sure exactly which frames can have null terminators, but I've noticed it in Text, ExtendedText, and Comment frames.

Here's an example file:

https://archive.org/download/The_Four_Seasons_Vivaldi-10361/John_Harrison_with_the_Wichita_State_University_Chamber_Players_-_05_-_Summer_Mvt_2_Adagio.mp3

Serial-ATA commented 9 months ago

Thanks!

This definitely isn't the only place the ID3v2 spec falls short, so I'm not surprised something like this came up.

The fix should just be a simple:

if let Some(stripped) = text.strip_suffix(encoding.null_terminator()) {
    text.truncate(stripped.len())
}

anytime we parse a text frame.

sublipri commented 9 months ago

Looked into this some more and discovered a lot of Latin1 encoded frames with null terminators as well, including quite a few that have multiple trailing nulls. The fields with multiple nulls were mostly iTunes-specific, but it would be nice to handle them too.

Would it be appropriate to add something like this to line 130 of the decode_text function (and change read_string to be mutable)?

if read_string.ends_with('\0') {
    let new_len = read_string.trim_end_matches('\0').len();
    read_string.truncate(new_len);

https://github.com/Serial-ATA/lofty-rs/blob/d51a151407772edf30b0ccc062f48be7d34a6378/src/util/text.rs#L127-L135

That seems to fix the issue for me. I could try putting together a pull request including a test if you like. I couldn't find reference to encoding.null_terminator anywhere - would just using the NUL char be acceptable?

Serial-ATA commented 9 months ago

Would it be appropriate to add something like this to line 130 of the decode_text function

I think the check should only be done in some helper function in id3::v2, and used in TextInformationFrame::parse/LanguageFrame::parse/etc, since this is only an ID3v2 issue (I'd imagine).

I could try putting together a pull request including a test if you like

That'd be great, otherwise I'll need a few days to get to it at least.

I couldn't find reference to encoding.null_terminator anywhere

Oops, that was something I added in a local branch awhile ago. It was never added here 😄. Since there can be multiple terminators it doesn't matter anyway. trim_end_matches('\0') works better in this case.

sublipri commented 9 months ago

I think the check should only be done in some helper function in id3::v2, and used in TextInformationFrame::parse/LanguageFrame::parse/etc, since this is only an ID3v2 issue (I'd imagine).

I found a few FLAC files with VorbisComments containing trailing nulls too (most seem to be written by old versions of JRiver). Not sure if that really matters, but I think it'd make sense to put the helper function in utils::text in case it is useful elsewhere. That would also fit with how things are currently structured - thedecode_text function is only used by id3::v2.

I'll try and get it done in the next couple of days.

Serial-ATA commented 9 months ago

Oh yeah, if this is an issue in other formats then it should be in utils::text. I've never seen terminators in Vorbis Comments, but it wouldnt hurt to skip them if they exist in the wild.

Thanks a lot for working on this as well!