Serial-ATA / lofty-rs

Audio metadata library
Apache License 2.0
185 stars 34 forks source link

ParseOptions: ignore `Picture` setting #186

Closed hinto-janai closed 2 months ago

hinto-janai commented 1 year ago

Consider the following situation:

You have 10,000 audio files you want to probe, totaling to around 500 albums:

for file in files {
    let probe  = lofty::Probe::new(file)
    let tagged = probe.guess_file_type()?.read()?;

    // Do stuff with TaggedFile.
}

Currently, this fully scans every file, including the Picture bytes, meaning: every single file allocates a Vec for the picture bytes, even if we have already scanned them, or don't want them in the first place (e.g, we only want the core tag metadata).

I'd like to avoid allocating bytes that I already have, so my proposal:

for file in files {
    let options = lofty::ParseOptions::new().read_picture(false); // This doesn't exist.
    let probe   = lofty::Probe::new(file).options(options);
    let tagged  = probe.guess_file_type()?.read()?;

    // Do stuff with TaggedFile (that doesn't have a Picture).
}

After testing around, this cuts probing time in half for files with non-trivial picture data (1200x1200+).

I'm willing to submit a PR for this, but I was wondering if this ParseOptions::read_picture() API was the correct way forward, or if this should be done another way. It would add another branch each file parse since every version of read_from() would have this added:

if parse_options.read_picture {
    // Push picture bytes.
}
Serial-ATA commented 1 year ago

An addition to ParseOptions would be the way to go about this. :)

I'm curious though, how did you go about testing this? When items are encountered, they get read into memory before being interpreted, so there would be allocations regardless.

hinto-janai commented 1 year ago

This was commented out in the tests (it was a match before though).

https://github.com/Serial-ATA/lofty-rs/blob/46127a09122c71671ae196e530f0f804cd09ce51/src/flac/read.rs#L93-L97

I'm now seeing that the allocation is not even here. In my case, I'm iterating in a hot loop over 1000s of files so I guess the encoding + push time was adding up.

If this read were prevented in Block::read, the bytes wouldn't even be allocated, right?

https://github.com/Serial-ATA/lofty-rs/blob/46127a09122c71671ae196e530f0f804cd09ce51/src/flac/block.rs#L32-L37

Something like a check before continuing after line 32 works but maybe is too invasive? Each Block::read would have to take in a ParseOptions and branch. It would have to return a signal to the calling read_from, letting it know to continue as well.

What is the right way to go about this? I'd like to make this easy to merge.

Serial-ATA commented 1 year ago

Yes, the allocation would be avoided completely if you just seek over the content in Block::read.

A nice way you could go about this would be to make Block::read take a closure:

let block = Block::read(reader, |block_type| {
    block_type == BLOCK_ID_VORBIS_COMMENTS
        || (block_type == BLOCK_ID_PICTURE && parse_options.read_pictures)
});

Then just change Block::read:

if predicate(block.ty) {
    // Read block content
} else {
    // Seek over content and return empty Vec
}
hinto-janai commented 1 year ago

I will work on this and eventually submit an initial draft PR (probably in the next few weeks).

hinto-janai commented 1 year ago

Sorry. I won't be working on this.

Serial-ATA commented 1 year ago

This should be a pretty easy feature to implement. I'll keep this open so I remember to eventually work on it.