craffel / mir_eval

Evaluation functions for music/audio information retrieval/signal processing algorithms.
MIT License

Guidance on the right metric? #363

Closed davies-w closed 1 year ago

davies-w commented 1 year ago

Hi,

I'm trying to evaluate the use of the McFee segment-finding algorithm (from the Librosa examples) for determining skip points in a song. As such, I really don't care about the nature of the sections. I have human-provided reference points, which are typically over-provided. What I'd like is a continuous metric that tells me how close our estimates are to the human-provided candidates.

I don't think any of the existing metrics really do this - or am I mistaken? The problem, as I see it, is that we don't want to use tolerance windows to produce binary precision/recall-style metrics. We also don't want to use pairwise metrics, since we really don't care about the labels.

In theory it'd be closer to comparing beat detections, but those seem to have a much more regular pattern to them. Anyone got thoughts?

The closest I've seen so far seems to be the Hanna Lukashevich metric, but that still seems to require labels on both sides.

I've also tried to frame the metric as N targets on a timeline, with the metric measuring the overall accuracy of M shots against them, but this framing falls short. For example, what do you do if two shots hit the same target? And what about the case where the reference offers many potential targets, but it would be fine to land a few shots near just a couple of them, ideally roughly evenly spaced?

davies-w commented 1 year ago

Quick update: the deviation metric looked like what I wanted, until I discovered a weirdness with using the median as the aggregating function. I'm guessing there's a reason for using the median, but is it an overwhelming one? I seem to have stumbled on a pathological example rather easily:

https://i.imgur.com/lBmll0D.png

I actually think I'll just use this, but replace the median with a sum (since what I really care about is the total error).
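
For concreteness, here is a minimal sketch of that sum-based variant, assuming flat boundary annotations; the interval values below are made up, and the sum is computed by hand since mir_eval.segment.deviation itself reports medians:

```python
import numpy as np
import mir_eval.util

# Hypothetical reference and estimated segment intervals, in seconds.
ref_intervals = np.array([[0.0, 10.0], [10.0, 25.0], [25.0, 40.0]])
est_intervals = np.array([[0.0, 11.0], [11.0, 30.0], [30.0, 40.0]])

# Collapse intervals to unique boundary times.
ref_bounds = mir_eval.util.intervals_to_boundaries(ref_intervals)
est_bounds = mir_eval.util.intervals_to_boundaries(est_intervals)

# Distance from each estimated boundary to its nearest reference boundary.
dists = np.min(np.abs(est_bounds[:, None] - ref_bounds[None, :]), axis=1)

total_error = np.sum(dists)      # sum aggregation, as proposed above
median_error = np.median(dists)  # what mir_eval.segment.deviation would report
```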

bmcfee commented 1 year ago

> I have human-provided reference points, which are typically over-provided. What I'd like is a continuous metric that tells me how close our estimates are to the human-provided candidates.
>
> I don't think any of the existing metrics really do this - or am I mistaken?

Not exactly, but a few of them do come close to what you describe. Before getting too far into the weeds, I'd recommend that you take a look at a survey paper on structure analysis that we put out a few years ago: https://transactions.ismir.net/articles/10.5334/tismir.54 . Section 2.4 specifically addresses the various evaluation criteria that are out there, both for boundary estimation and structural labeling.

One metric that doesn't get used very often, but that might actually fit the bill, is the "t-measure" from our 2015 paper. This is a predecessor of the L-measure for hierarchical segmentation, and works on basically the same principle. It's a bit subtle, but it does capture (approximately) a continuous notion of boundary agreement. The basic idea is to divide the track up into frames, look at the rank ordering of pairwise similarity between frames induced by the estimated segmentation, and see how well that rank ordering agrees with the one induced by the reference segmentation. This does not require segment labels. It also works with flat annotations (see figure 2 in the paper), but it sounds like you actually do have hierarchies to work with. It does require a time window to be set, but this is really intended as a way to focus the evaluation on times in the vicinity of boundaries, not as a tolerance window for strict matching.
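
If it helps to see what calling that looks like, here is a minimal sketch using mir_eval.hierarchy.tmeasure; the interval values are made up, and the flat annotations are wrapped as one-level hierarchies (lists of interval arrays), which the function accepts:

```python
import numpy as np
import mir_eval.hierarchy

# Hypothetical flat segmentations, wrapped as one-level hierarchies
# (a list of interval arrays per annotation).
ref_hier = [np.array([[0.0, 10.0], [10.0, 25.0], [25.0, 40.0]])]
est_hier = [np.array([[0.0, 11.0], [11.0, 30.0], [30.0, 40.0]])]

# `window` focuses the comparison on frames near boundaries; it is not a
# strict matching tolerance.
t_precision, t_recall, t_measure = mir_eval.hierarchy.tmeasure(
    ref_hier, est_hier, window=15.0, frame_size=0.1)
```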

If that sounds too complicated - and I wouldn't blame you if it does! - you could also just ignore the segmentation metrics, and use the onset detection metrics instead.
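
As a rough illustration of that route, boundary times can be treated as point events and scored with mir_eval.onset.f_measure; the times and the 0.5-second window below are arbitrary choices, not recommendations:

```python
import numpy as np
import mir_eval.onset

# Hypothetical boundary times, treated as point events.
ref_boundaries = np.array([0.0, 10.0, 25.0, 40.0])
est_boundaries = np.array([0.0, 11.0, 30.0, 40.0])

# Events are matched one-to-one within +/- `window` seconds.
f, precision, recall = mir_eval.onset.f_measure(
    ref_boundaries, est_boundaries, window=0.5)
```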

> I've also tried to frame the metric as N targets on a timeline, with the metric measuring the overall accuracy of M shots against them, but this framing falls short. For example, what do you do if two shots hit the same target? And what about the case where the reference offers many potential targets, but it would be fine to land a few shots near just a couple of them, ideally roughly evenly spaced?

The onset detection metrics (and boundary detection metrics) in mir_eval are designed to avoid this problem: events can be matched at most once, and the evaluation finds a largest possible (maximum) matching between estimated and reference events.
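
That matching behavior can be seen directly through mir_eval.util.match_events; the event times below are made up to show two estimates competing for one reference:

```python
import numpy as np
import mir_eval.util

ref_events = np.array([10.0, 25.0])        # hypothetical reference boundaries
est_events = np.array([10.2, 10.3, 30.0])  # two estimates near the same reference

# Each event is used at most once; match_events returns a maximum set of
# (reference_index, estimate_index) pairs whose times fall within the window.
matching = mir_eval.util.match_events(ref_events, est_events, window=0.5)
print(matching)  # only one of the two nearby estimates is matched to 10.0
```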

> I actually think I'll just use this, but replace the median with a sum (since what I really care about is the total error).

The deviation metrics are kind of tricky in practice. The purpose of using median aggregation here is to discard outliers, e.g. in the event that a reference boundary is completely missed by the estimator (while the other boundaries are detected well). A sum or mean aggregation would be completely thrown off by this, and would give a misleading score.
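
A tiny numeric illustration of that point, with made-up per-boundary deviations:

```python
import numpy as np

# Hypothetical deviations (seconds): four accurate detections and one
# reference boundary missed by roughly 30 seconds.
deviations = np.array([0.1, 0.2, 0.1, 0.3, 30.0])

print(np.median(deviations))  # 0.2   -- robust to the single miss
print(np.sum(deviations))     # 30.7  -- dominated by the outlier
print(np.mean(deviations))    # 6.14  -- also thrown off
```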

davies-w commented 1 year ago

Thank you SO much for the long reply Professor McFee!