Serial-ATA / lofty-rs

Audio metadata library
Apache License 2.0
187 stars 35 forks source link

Probe::guess_file_type() fails for some MP3 files #260

Closed uklotzde closed 1 year ago

uklotzde commented 1 year ago

Reproducer

Fails: Probe::new(BufReader::new(File::open("/tmp/test.mp3").unwrap())).guess_file_type() Succeeds: Probe::open("/tmp/test.mp3")

Summary

Probe::guess_file_type() fails for some MP3 files.

Expected behavior

If opening (and reading) a file path directly works then Probe::guess_file_type() must not fail.

Assets

No response

uklotzde commented 1 year ago

Workaround: .options(ParseOptions::new().max_junk_bytes(usize::MAX))

Serial-ATA commented 1 year ago

So the default junk search window is 1024 bytes, which in most cases should be enough.

Is the data at the start of your files junk? And if so, does it really occupy so much space that 1024 bytes isn't big enough?

Probe::open is going to simply look at the .mp3 extension to guess the file type and pass it directly to MpegFile::read_from, which is more lenient in that it has no maximum for junk bytes allowed. Probe::guess_file_type will only be able to search what you allow it to.

If the data is junk, there's not anything that can really be done here.

uklotzde commented 1 year ago

I have found a bunch of files that are affected. The inconsistent behavior may affect and confuse users. Not unlikely that those files appear in the wild more often.

Unfortunately, I don't have a solution other than using a custom, unlimited look-ahead as a workaround. I need file type specific pre/post-actions during the import and have no choice other than using guess_file_type() for the dispatch.

uklotzde commented 1 year ago

Possible solution: Unconditionally use max_junk_bytes = usize::MAX for known and supported file extensions to reduce the chance for false negatives. Does not work, because guess_file_type() does not have access to a file path, but only to the file content.

Serial-ATA commented 1 year ago

The problem with increasing the junk window is that it becomes much more expensive for users who are simply reading all files in a directory, for example. Now, more time will be spent trying to determine the formats of non-audio/unsupported formats in the directory, as it will exhaust the window, copying all of the bytes, until it ultimately determines it isn't valid.

Having multiple files that are so heavily packed with junk seems odd. Do you know if the data is actually junk or is it something like a tag that just isnt supported?

uklotzde commented 1 year ago

Do you know if the data is actually junk or is it something like a tag that just isnt supported?

I could provide affected files for further analysis privately.

Serial-ATA commented 1 year ago

Sure, you can send it to serial AT [domain on my profile]. Note there's a 25MB attachment limit.

Serial-ATA commented 1 year ago

So, the files both have an ID3v2 tag with a lot of unaccounted for padding. We actually get really close to reaching the data, which is only 19 bytes outside of the junk window for both files.

There always seems to be an issue with ID3v2 padding, with few applications actually writing it correctly.

In this case, all you'd really need is an extra handful of bytes in the window, but you could also just double it: ParseOptions::new().max_junk_bytes(2048).

We may be able to correct the file in the future, I made an issue but I probably won't get around to it for awhile: #262.