craffel / mir_eval

Evaluation functions for music/audio information retrieval/signal processing algorithms.

Framewise transcription evaluation #231

Open stefan-balke opened 7 years ago

stefan-balke commented 7 years ago

TL;DR: Basically all I'm asking for is taking frames as inputs to @rabitt's mir_eval.multipitch module.

Hi everyone,

Frame-wise evaluation metrics show up in recent transcription papers:

This is basically http://scikit-learn.org/stable/modules/model_evaluation.html#multiclass-and-multilabel-classification using the macro/samples averaging parameter, but scaled by the number of frame labels.

As people seem to use it, would it be useful to have this in mir_eval? If we go with the scikit-learn implementation, which I would strongly suggest, this adds it back as a dependency. Opinions?
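For concreteness, here is a minimal sketch of what I mean (the toy piano-roll matrices below are made up for illustration, not taken from any of the papers):

    import numpy as np
    from sklearn.metrics import precision_recall_fscore_support

    # Toy binary "piano rolls": rows = frames, columns = pitches,
    # 1 where a pitch is active in that frame (values are made up).
    ref_roll = np.array([[1, 0, 0],
                         [1, 1, 0],
                         [0, 1, 0],
                         [0, 1, 1]])
    est_roll = np.array([[1, 0, 0],
                         [1, 0, 0],
                         [0, 1, 1],
                         [0, 1, 1]])

    # 'micro' pools counts over all (frame, pitch) cells;
    # 'samples' scores each frame separately and then averages over frames.
    for avg in ('micro', 'samples'):
        p, r, f, _ = precision_recall_fscore_support(ref_roll, est_roll, average=avg)
        print(avg, p, r, f)

Which of the two averaging settings to use is exactly the part I am unsure about.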

craffel commented 7 years ago

Thanks for bringing this up. A few questions -

  1. Is there a reference implementation for these, e.g. something we can compare the scores generated by a hypothetical mir_eval implementation to?
  2. Is this metric widely accepted (e.g. community consensus), i.e. will it be used in MIREX?
  3. Is scikit-learn in a better place in terms of easy installability (e.g. thanks to wheels, or whatever)? The reason we made an effort not to include it as a dependency was that installation was non-trivial; e.g. IIRC it was not straightforward to get it installed on the Heroku instance which runs mir_eval as a service. It also made it impossible to create binaries via pyinstaller, but we gave up on that (#65).

justinsalamon commented 7 years ago

fwiw, in sound event detection (SED) there seems to be a growing preference for frame-based (or "segment"-based, i.e. over some fixed time duration) evaluation over event-based evaluation (which is equivalent to note-based), because the latter is very penalizing: consider the case where an algorithm returns two consecutive notes for a single reference note - the first note would only be a match if you ignore offsets, and the second would always be treated as wrong, even though both match the reference in pitch and time if you ignore the split. So regardless of what the trend in MIREX is (I'm abroad and can't seem to load the MIREX website right now), I expect we'll see frame-level metrics used more and more in transcription papers.

In this context I should mention that Molina et al. made an interesting attempt at introducing more note-based transcription metrics, precisely because of this issue, in order to provide greater insight into system performance; it was focused on singing transcription, though, and I'm not sure whether it has been adopted by the community.

With regards to @craffel's questions:

  1. Wouldn't sklearn itself be a reference implementation, given that frame-level metrics are kinda domain agnostic (every pitch in every frame is either right/wrong, and from that you compute the F-score as you would for any IR problem)? Hmm, as I write this, I guess you do have to define what a "hit" means, especially when comparing to annotations with a pitch resolution finer than semitones. I wonder whether there's a consensus there? (e.g. in melody extraction the pitch distance must be within 50 cents for a hit - see the small sketch after this list)
  2. Bad internet = can't check the mirex site. But my hunch is that these metrics will become increasingly popular.
  3. Donno.
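To make the 50-cent hit condition in 1. concrete, this is the kind of check I mean (purely illustrative, not an existing mir_eval function):

    import numpy as np

    def within_cents(ref_hz, est_hz, tol_cents=50.0):
        # True if the estimated pitch is within `tol_cents` of the reference pitch.
        return np.abs(1200.0 * np.log2(est_hz / ref_hz)) <= tol_cents

    print(within_cents(440.0, 447.0))    # ~27 cents apart -> True
    print(within_cents(440.0, 466.16))   # ~100 cents (a semitone) apart -> False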
stefan-balke commented 7 years ago

Bad internet = can't check the mirex site. But my hunch is that these metrics will become increasingly popular.

No, it seems to be down...

One point of reference is: M. Bay, A. F. Ehmann, and J. S. Downie, “Evaluation of multiple-F0 estimation and tracking systems,” in Proceedings of the 10th International Society for Music Information Retrieval Conference (ISMIR), 2009, pp. 315–320.

Relating to sklearn's function, there is still confusion about which aggregation function to use (micro or samples). Pinging @sidsig, @fdlm, and @emmanouilb: maybe you can enlighten us here a little bit?

justinsalamon commented 7 years ago

One point of reference is: M. Bay, A. F. Ehmann, and J. S. Downie, “Evaluation of multiple-F0 estimation and tracking systems,” in Proceedings of the 10th International Society for Music Information Retrieval Conference (ISMIR), 2009, pp. 315–320.

That's a multi-f0 tracking paper though, not transcription. It's important not to confound these, as transcription is a more involved task requiring segmentation/quantization into discrete note events.

Perhaps I should add a qualification to my SED analogy - for SED I think it can be less important to focus on discrete events (depending on the source!) and rather consider presence/absence over time. However, in music discrete notes are very much a thing, and music notation is a well established paradigm (as is piano-roll), so I'd be reluctant to abandon note-based eval for transcription altogether.

I think the most complete option is to compute both frame and note-level metrics, as done by Sigtia et al., so it would be nice to support that.

fdlm commented 7 years ago

That's a multi-f0 tracking paper though, not transcription. It's important not to confound these, as transcription is a more involved task requiring segmentation/quantization into discrete note events

Right! But I think frame-wise evaluation would work the same for both multi-f0 tracking and 'real' transcription. I don't know if it makes sense for note transcription, though.

Anyways, following formulas 1, 2, and 3 in Bay et al., assuming we sampled predictions and targets at a specified frame rate, we get two bit vectors pred and targ. Then, computing true positives, false positives, false negatives, precision, recall, and F-measure is just

    # pred and targ: boolean arrays marking which pitches are active in which frames
    tp = float((pred & targ).sum())   # active in both estimate and reference
    fp = float((pred & ~targ).sum())  # predicted active, but not in the reference
    fn = float((targ & ~pred).sum())  # in the reference, but not predicted

    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f1 = 2 * p * r / (p + r)

This corresponds to the micro setting in sklearn.
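As a quick sanity check (a sketch with made-up random pred/targ matrices, not mir_eval code), the manual computation above should match sklearn's average='micro':

    import numpy as np
    from sklearn.metrics import precision_recall_fscore_support

    # Made-up random frame-by-pitch activations standing in for targ and pred.
    rng = np.random.RandomState(0)
    targ = rng.rand(500, 88) > 0.9
    pred = rng.rand(500, 88) > 0.9

    tp = float((pred & targ).sum())
    fp = float((pred & ~targ).sum())
    fn = float((targ & ~pred).sum())
    p_manual = tp / (tp + fp)
    r_manual = tp / (tp + fn)

    # sklearn's 'micro' average pools counts over all (frame, pitch) cells.
    p_sk, r_sk, f_sk, _ = precision_recall_fscore_support(
        targ.astype(int), pred.astype(int), average='micro')
    assert np.isclose(p_manual, p_sk) and np.isclose(r_manual, r_sk)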

I think the most complete option is to compute both frame and note-level metrics, as done by Sigtia et al., so it would be nice to support that.

Definitely.

stefan-balke commented 7 years ago

That's a multi-f0 tracking paper though, not transcription. It's important not to confound these, as transcription is a more involved task requiring segmentation/quantization into discrete note events.

I just tracked down the literature given by Sigtia et al. and I agree, defining the task is very important here. For me, music transcription involves the step of aggregating frame-wise candidates into a sequence of notes (which then matches the MIDI-like list of note events used by the transcription metrics).

So what do you call the step that only produces the frame-wise candidates? Multi-f0 tracking?

rainerkelz commented 7 years ago

actually, the formulas in the two papers linked by stefan are very likely the wrong ones, ... i wrote up the whole ugly mess here: evaluation_shenanigans.pdf

TL;DR: when i finished the paper, i just copied the formulas over at the last minute, w/out double-checking -- in sigtia's paper they actually reference the paper that defines the measures as in the 'micro' setting in sklearn (bay et al. 2009), but write the (unnormalized, kind of nonsensical in this form) formulas for the 'samples' setting. the actual evaluation used in our paper is equivalent to the 'micro' setting in sklearn.
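in formulas (my paraphrase, with TP(t)/FP(t) counted per frame t over T frames), the difference between the two settings is roughly:

    P_{\mathrm{micro}} = \frac{\sum_t \mathrm{TP}(t)}{\sum_t \mathrm{TP}(t) + \sum_t \mathrm{FP}(t)}
    \qquad
    P_{\mathrm{samples}} = \frac{1}{T} \sum_t \frac{\mathrm{TP}(t)}{\mathrm{TP}(t) + \mathrm{FP}(t)}

(analogously for recall; the "unnormalized" formulas drop the 1/T.)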

justinsalamon commented 7 years ago

Btw, note-level eval is already implemented in mir_eval, as is multi-f0. So (assuming the implementations are correct) to get frame-level metrics for transcription you'd just have to sample the note events onto a fixed time grid (which I think is also already implemented somewhere) and then feed that into the multi-f0 metrics, as @stefan-balke noted in the first comment. Pinging @rabitt
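Something along these lines, for example - just a rough sketch: notes_to_freq_lists is a little helper written inline here (as far as I know mir_eval doesn't ship one under that name), the 10 ms hop is an arbitrary choice, and 'ref.txt'/'est.txt' are placeholder file names:

    import numpy as np
    import mir_eval

    def notes_to_freq_lists(intervals, pitches, times):
        # For each grid time, collect the frequencies (Hz) of all notes sounding then.
        return [pitches[(intervals[:, 0] <= t) & (t < intervals[:, 1])] for t in times]

    # Note events in mir_eval.transcription format: (n, 2) onset/offset intervals
    # in seconds and n pitches in Hz.
    ref_int, ref_pitch = mir_eval.io.load_valued_intervals('ref.txt')
    est_int, est_pitch = mir_eval.io.load_valued_intervals('est.txt')

    hop = 0.01  # 10 ms frames
    times = np.arange(0.0, max(ref_int.max(), est_int.max()), hop)

    frame_scores = mir_eval.multipitch.evaluate(
        times, notes_to_freq_lists(ref_int, ref_pitch, times),
        times, notes_to_freq_lists(est_int, est_pitch, times))
    note_scores = mir_eval.transcription.evaluate(ref_int, ref_pitch, est_int, est_pitch)

That would give you frame-level scores from the multipitch module and note-level scores from the transcription module in one go.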

craffel commented 7 years ago

to get frame-level metrics for transcription you'd just have to sample the note events onto a fixed time grid (which I think is also already implemented somewhere) and then feed that into the multi-f0 metrics

Based on my understanding of the metrics being discussed here, this seems correct. What functionality is currently missing?

craffel commented 7 years ago

Ping. If there is any functionality missing, please make it clear; otherwise, I will close.

justinsalamon commented 7 years ago

Last I heard we'd reached agreement on how this should be implemented, but I assumed @stefan-balke was the one who was actually going to do it?

stefan-balke commented 7 years ago

Yep, on my list. @justinsalamon, see you at ICASSP then :)