craffel / mir_eval

Evaluation functions for music/audio information retrieval/signal processing algorithms.

Add suite of alignment metrics #343

Closed: f90 closed this 2 years ago

f90 commented 2 years ago

Hey,

I coded up a set of alignment metrics I would like to see added in mir_eval.

To clarify the scope: this module evaluates alignment models that are given a sequence of events along with a piece of audio and return a sequence of timestamps, one per event, indicating the position of that event in the audio. The events are listed in order of occurrence in the audio, so the output timestamps have to be monotonically increasing.
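To make the expected input/output format concrete, here is a tiny hypothetical example (the event names and timestamp values are made up for illustration):

```python
import numpy as np

# Hypothetical example: the model receives a list of events plus the audio,
# and returns one timestamp (in seconds) per event.
events = ["word1", "word2", "word3", "word4"]      # events given as input
est_timestamps = np.array([0.8, 1.1, 1.5, 1.9])    # one timestamp per event

# The timestamps must line up with the events and be monotonically increasing.
assert len(est_timestamps) == len(events)
assert np.all(np.diff(est_timestamps) >= 0)
```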

Why a new module in mir_eval for this instead of reusing an existing one? This type of alignment differs from other event detection setups: in onset, beat, or segment detection, the model can predict too many or too few events compared to the ground truth, and additional labelling is sometimes required. In contrast, for alignment the events to be detected are part of the input to the model, and evaluation is focused on a pairwise comparison between the output and reference timestamps for each event.

In particular, I implemented all of the metrics currently used for lyrics alignment.

If there are some other metrics missing for related tasks that have the same set of assumptions (maybe audio to score alignment), we could also add them here.

craffel commented 2 years ago

Thanks Dan. Are these metrics published/used anywhere? We generally restrict ourselves to metrics that are somehow "standard", which typically means implemented in MIREX, but more broadly means that they've been published somewhere and used in at least one paper.

f90 commented 2 years ago

@craffel In the MIREX lyrics alignment challenge all the above mentioned metrics are used, see here for the 2020 round: https://www.music-ir.org/mirex/wiki/2020:Automatic_Lyrics-to-Audio_Alignment_Results

The exception is the perceptual metric, which was proposed in the following ISMIR paper: N. Lizé-Masclef, A. Vaglio, M. Moussallam, ISMIR 2021.

Maybe we should add a disclaimer for this metric noting that the perceptual quality it measures is tailored to a karaoke context in particular, and that it should therefore probably not be used for other alignment problems such as audio-to-score?

Unfortunately there is no clear reference implementation for any of these metrics that is both stable/bug-free and commonly used, which is why I wanted to contribute this. The closest contender would be the MIREX alignment code in this repo (which does not have the perceptual metric):

https://github.com/georgid/AlignmentEvaluation

However, it is not straightforward to run properly and relies on input files in specific formats. It also models timestamps as (start, end, token) triplets, which makes things more complicated than they need to be: the token IDs of the events don't matter for the metrics, and if you want to evaluate only start timestamps, the format is inflexible since you then have to generate end timestamps from the start timestamp of the next token. This distinction is not actually needed to compute the metrics properly. So in this implementation, you can feed in either only start timestamps or both start and end timestamps (as a 1D numpy array [start1, end1, start2, end2, ...]) and calculate the metrics for each case if you want.
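To illustrate the flat timestamp layout and the kind of pairwise comparison it enables, here is a minimal numpy sketch. The values, variable names, and the 0.3 s tolerance are only illustrative and are not the final mir_eval API:

```python
import numpy as np

# Hypothetical reference and estimated timestamps in the flat
# [start1, end1, start2, end2, ...] layout described above (in seconds).
ref = np.array([0.50, 1.20, 1.80, 2.40, 3.10, 4.00])
est = np.array([0.55, 1.10, 1.95, 2.35, 3.30, 3.90])

# With exactly one estimate per reference timestamp, the comparison is a
# direct pairwise one; no matching or labelling step is needed.
abs_err = np.abs(est - ref)
print("mean absolute error:   ", abs_err.mean())
print("median absolute error: ", np.median(abs_err))
print("fraction within 0.3 s: ", np.mean(abs_err <= 0.3))
```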

craffel commented 2 years ago

Ok, got it. Can you test against the MIREX metrics so that we can confirm the implementations match? And can you add references as appropriate (as seen e.g. here)?

f90 commented 2 years ago

OK, can do. Should I create further fixture files, beyond the ones I already added, containing the outputs from the MIREX implementation? Or should I just test locally on my end whether I get the same results for the current fixture files, without modifying the PR at all?

craffel commented 2 years ago

Well, it sort of depends - if your implementation matches MIREX exactly, it certainly doesn't hurt to add those as test cases too, as long as your metric is not too expensive to evaluate. If it doesn't match MIREX, and you're confident this is because you're doing things correctly and MIREX has a bug (as we found for some metrics in the original paper), then they shouldn't be added as test cases.

f90 commented 2 years ago

I added the only test case included in the MIREX eval code as a regression test in this PR, with the caveat that I had to transform the (start, end) inputs into a 1D list of timestamps first, and I deleted an entry that is not counted during evaluation since it is an additional timing for a previous segment for which no ground truth exists. Only after that do I have an equal number of reference and estimate timestamps.
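For reference, the (start, end) to 1D conversion mentioned above could look roughly like this; the values are made up, and the token labels carried by the MIREX-style files are simply dropped:

```python
import numpy as np

# Hypothetical (start, end) pairs as parsed from a MIREX-style annotation file
segments = np.array([[0.50, 1.20],
                     [1.80, 2.40],
                     [3.10, 4.00]])

# Flatten into the 1D [start1, end1, start2, end2, ...] layout used in this PR
timestamps = segments.flatten()   # -> [0.5, 1.2, 1.8, 2.4, 3.1, 4.0]
```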

I get identical results for the metrics from both this PR and the MIREX repo, except for one detail in the PCS metric: in the MIREX code, the silence before the first detected event is not counted as a correct segment. That is not the correct behaviour, since the metric is supposed to give the percentage of time covered by overlapping segments, and the initial segment before the first detected event is also a segment that should be counted. So for this metric I added the values calculated by the new pcs method to the fixture files instead.
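For clarity, here is a rough sketch of the PCS behaviour described above (a hypothetical pcs_sketch helper of my own, not the code in this PR), in which the initial segment before the first event and the final segment up to the track duration are both counted:

```python
import numpy as np

def pcs_sketch(ref_ts, est_ts, duration):
    """Percentage of correct segments: fraction of the total duration in
    which corresponding reference/estimated segments overlap."""
    # Segment boundaries: prepend 0 and append the total track duration,
    # so the leading silence and the trailing segment are both counted.
    ref_bounds = np.concatenate(([0.0], ref_ts, [duration]))
    est_bounds = np.concatenate(([0.0], est_ts, [duration]))

    overlap = 0.0
    for rs, re, es, ee in zip(ref_bounds[:-1], ref_bounds[1:],
                              est_bounds[:-1], est_bounds[1:]):
        overlap += max(0.0, min(re, ee) - max(rs, es))
    return overlap / duration
```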

I also added the relevant references.

Please have a look and see if it looks ready now :) Thank you

f90 commented 2 years ago

Thanks for reviewing! I changed the metric names to something more self-explanatory, addressed all the smaller comments, and made it 80 characters per line everywhere. We only need to resolve the issue around obtaining the total song duration for the percentage_correct_segments (PCS) metric; see the discussion above for details.

f90 commented 2 years ago

OK, the changes to the percentage of correct segments are implemented as you suggested, please have another look :)

f90 commented 2 years ago

Ah sorry, I overlooked that there were still comments in the change request. I have addressed them all now!

craffel commented 2 years ago

Thanks! Bug me if you want me to get a new release on PyPI...

f90 commented 2 years ago

Thanks for working with me on this! It would be great if you could push out a new version in the next few days, since I am working on multiple projects that need these metrics and could then stop duplicating code across them.

craffel commented 2 years ago

Done! Let me know if there are any issues... it has been a few years since the last release. https://pypi.org/project/mir_eval/