craffel / mir_eval

Evaluation functions for music/audio information retrieval/signal processing algorithms.
MIT License

Generalized Melody Metrics #324

Closed · rabitt closed this 4 years ago

rabitt commented 4 years ago

Implements #317, cc @juanjobosch

The metrics behave identically to before, but users can now optionally specify a voicing array, which overrides any voicing implied by negative frequency values.

Note that I (mostly) removed the conversion to cents, since it cancels out in the metrics. The only place where cents values remain is resample_melody_series. Right now, when the frequency values are resampled, they're resampled in cents, not in Hz. If we were to resample in Hz instead, all of the metrics would go up (slightly) - but I left it as is for now.
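
To make the resampling point concrete, here is a minimal numpy sketch (not code from this PR): linear interpolation in cents lands on the geometric mean of the neighboring frequencies, while linear interpolation in Hz lands on their arithmetic mean, so the two resampled series (and the metrics computed from them) differ slightly.

import numpy as np

# Two voiced frames an octave apart, resampled at the midpoint in time.
times = np.array([0.0, 0.1])
freqs_hz = np.array([220.0, 440.0])
t_new = np.array([0.05])

# Resampling in Hz: linear interpolation of the raw frequency values.
hz_resampled = np.interp(t_new, times, freqs_hz)        # 330.0 Hz (arithmetic mean)

# Resampling in cents: interpolate 1200 * log2(f) and convert back.
cents = 1200.0 * np.log2(freqs_hz)
cents_resampled = np.interp(t_new, times, cents)
hz_from_cents = 2.0 ** (cents_resampled / 1200.0)       # ~311.1 Hz (geometric mean)

print(hz_resampled, hz_from_cents)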

This PR is missing:

justinsalamon commented 4 years ago

@craffel per MIREX convention, melody extraction algorithms typically output values in Hz (even if they work internally in cents, they'll typically convert to Hz for the final output). The instructions for MIREX 2019 seem unchanged with respect to previous years (algorithms need to output frequency values in Hz, with 0/negative values indicating unvoiced frames), so I guess the convention is still going strong.
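
For readers unfamiliar with the convention, a MIREX-style output can be split into a frequency array and a voicing array with a couple of lines of numpy (a sketch of the convention only; if I remember correctly, mir_eval.melody.freq_to_voicing does essentially this split):

import numpy as np

# MIREX-style melody output: one value per frame, in Hz; zero or negative
# values mark unvoiced frames (a negative value can still carry a pitch
# guess for an unvoiced frame).
est_freqs = np.array([0.0, 220.0, 221.5, -225.0, 440.0])

voicing = est_freqs > 0        # per-frame voicing decision
freqs_hz = np.abs(est_freqs)   # pitch values, usable for RPA/RCA even on unvoiced frames

print(voicing)    # [False  True  True False  True]
print(freqs_hz)   # [  0.   220.   221.5  225.   440. ]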

I think it'd be great to support taking in a voicing vector, but agree we should remain backwards compatible.

craffel commented 4 years ago

Ok, in that case why did we design our metrics to take in cents in the first place?

justinsalamon commented 4 years ago

My memory might fail me, but IIRC they did originally take Hz (at least in the very first implementation pass); since they involve an internal conversion to cents to compute the metrics, at some point we decided to factor the conversion out and provide utility functions. Our high-level functions such as evaluate still expect input in Hz.

rabitt commented 4 years ago

The metrics used to (and still do in the PR) take frequency values in Hz. Internally they are converted to cents, but the conversion isn't necessary to compute the metrics. In the PR I removed the conversion to simplify the code, and the metrics are mathematically equivalent.
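
A quick numerical check of the "mathematically equivalent" claim (my own sketch, assuming the usual 50-cent tolerance and a 10 Hz base frequency as in hz2cents): the base frequency cancels when two cent values are subtracted, so thresholding the cent difference is the same as thresholding the log-frequency ratio computed directly from Hz.

import numpy as np

f_ref, f_est = 220.0, 226.0   # reference / estimated frequencies in Hz
tol_cents = 50.0              # standard half-semitone tolerance

# Cents-domain comparison (what the metrics effectively compute).
def hz_to_cents(f, base=10.0):
    return 1200.0 * np.log2(f / base)

hit_cents = abs(hz_to_cents(f_est) - hz_to_cents(f_ref)) <= tol_cents

# Equivalent Hz-domain comparison: the base frequency has cancelled out.
hit_hz = abs(1200.0 * np.log2(f_est / f_ref)) <= tol_cents

assert hit_cents == hit_hz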

craffel commented 4 years ago

The metrics used to (and still do in the PR) take frequency values in Hz.

Not sure what you mean - I think they currently take cents, e.g.:

https://github.com/craffel/mir_eval/blob/master/mir_eval/melody.py#L429
https://github.com/craffel/mir_eval/blob/master/mir_eval/melody.py#L494
https://github.com/craffel/mir_eval/blob/master/mir_eval/melody.py#L576

The evaluate function takes in Hz and converts to cents, but user code might call the metric functions directly instead of evaluate.

rabitt commented 4 years ago

@craffel mmm yep you're right - I was thinking just about melody.evaluate. You and @lostanlen are right that while melody.evaluate is backwards compatible (if I change back cent_tolerance), the other functions are far from it.

Side note: the way the individual metrics are written in the current release makes them hard to use individually - if you want to call them with "standard" melody outputs (frequencies in Hz, negative frequencies indicating voicing, different time scales), you first have to manually call melody.to_cent_voicing before the individual metrics will work properly.
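
For concreteness, the released workflow looks roughly like this (a usage sketch from memory of the current API, so the exact signatures of to_cent_voicing and raw_pitch_accuracy should be checked against the docs):

import numpy as np
import mir_eval

# "Standard" melody outputs: times in seconds, frequencies in Hz,
# non-positive frequencies marking unvoiced frames.
ref_time = np.array([0.00, 0.01, 0.02, 0.03])
ref_freq = np.array([0.0, 220.0, 220.0, 0.0])
est_time = np.array([0.00, 0.01, 0.02, 0.03])
est_freq = np.array([0.0, 221.0, -220.0, 0.0])

# Step 1: resample to a common time base and convert to cent/voicing arrays.
(ref_voicing, ref_cent,
 est_voicing, est_cent) = mir_eval.melody.to_cent_voicing(
    ref_time, ref_freq, est_time, est_freq)

# Step 2: only now can the individual metric functions be called.
rpa = mir_eval.melody.raw_pitch_accuracy(ref_voicing, ref_cent,
                                         est_voicing, est_cent)
print(rpa)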

How should I proceed? I can rewrite the PR to be fully backwards compatible, but it will complicate the code a bunch. Or, can we do an in between PR with a deprecation warning on the individual metrics before this one is released? Or something else?

craffel commented 4 years ago

The way the individual metrics are written in the current release makes them hard to use individually - if you want to call them with "standard" melody outputs (frequencies in Hz, negative frequencies indicating voicing, different time scales), you first have to manually call melody.to_cent_voicing before the individual metrics will work properly.

I agree that it's cumbersome, but it's not unheard of in mir_eval - all the beat metrics, for example, require that you call trim_beats first: https://github.com/craffel/mir_eval/blob/master/mir_eval/beat.py#L193
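
For comparison, the beat workflow has the same shape (a rough usage sketch; by default trim_beats drops beats in the first few seconds of each sequence):

import numpy as np
import mir_eval

ref_beats = np.array([1.0, 1.5, 2.0, 5.0, 5.5, 6.0, 6.5, 7.0])
est_beats = np.array([1.02, 1.5, 2.01, 5.0, 5.52, 6.0, 6.51, 7.0])

# Callers are expected to trim the opening of each sequence themselves...
ref_beats = mir_eval.beat.trim_beats(ref_beats)
est_beats = mir_eval.beat.trim_beats(est_beats)

# ...before calling the individual metric functions.
print(mir_eval.beat.f_measure(ref_beats, est_beats))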

How should I proceed? I can rewrite the PR to be fully backwards compatible, but it will complicate the code a bunch. Or, can we do an in between PR with a deprecation warning on the individual metrics before this one is released? Or something else?

We haven't handled deprecation before in mir_eval. It's especially hard since they are being used as positional arguments. @bmcfee @lostanlen do you all have any experience/tips on how to do this from librosa?

lostanlen commented 4 years ago

We haven't handled deprecation before in mir_eval. It's especially hard since they are being used as positional arguments. @bmcfee @lostanlen do you all have any experience/tips on how to do this from librosa?

I don't remember librosa having changed the unit of one of its arguments. As far as I can tell, all we've done is remove kwargs and remove function names. Here, the only option I see is (sigh):

1) Give a new name to the proposed validate, e.g. validate_cents.
2) Copy-paste validate into a new name, e.g. validate_hertz.
3) Raise a Warning every time the old validate is called, saying that validate is deprecated and will be in cents in the future. Users should rename their call site to validate_hertz( if they want to retain the old behavior, and validate_cents( if they want to adopt the new behavior.
4) Make a new release.
5) Remove the validate function.
6) Make a new release.
7) Wait a bit.
8) People complain about NameError: name 'validate' is not defined. We point them in the right direction.
9) Make a new release.
10) Add a DeprecationWarning to validate_hertz.
11) Copy-paste validate_cents into a new name: validate.
12) Add a DeprecationWarning to validate_cents (renamed to validate).
13) Make a new release.
14) Wait a long time.
15) Remove validate_hertz and validate_cents.
16) Make a new release.
17) 🤞
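
For what it's worth, the warning step itself is simple; a generic sketch of the shim, using the hypothetical validate_cents / validate_hertz names from the list above (nothing here exists in mir_eval):

import warnings


def validate_cents(ref_voicing, ref_cent, est_voicing, est_cent):
    """New-style validation operating on cent values (hypothetical)."""
    ...


def validate_hertz(ref_voicing, ref_freq, est_voicing, est_freq):
    """Old-style validation operating on Hz values (hypothetical)."""
    ...


def validate(*args, **kwargs):
    """Deprecated shim: keeps the old behavior while warning callers."""
    warnings.warn(
        "validate is deprecated and will expect cents in a future release; "
        "call validate_hertz to keep the old behavior or validate_cents to "
        "adopt the new one.",
        DeprecationWarning,
        stacklevel=2,
    )
    return validate_hertz(*args, **kwargs)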

craffel commented 4 years ago

Yes, that sounds about right, except we'd need to do that for all of the metric functions (not just validate).

rabitt commented 4 years ago

@craffel @lostanlen Thanks for all your comments. I finally found some time to get back to this - I've redone this PR and it should now be fully backwards compatible, and ready for a re-review.

Note this is still missing tests for non-binary voicing, but wanted to first check that the change is otherwise OK.

craffel commented 4 years ago

Any melody people want to chime in? @juanjobosch @justinsalamon

lostanlen commented 4 years ago

"melody people" 😍

supporting information attached: https://www.youtube.com/watch?v=6Sl-ad304Aw

craffel commented 4 years ago

a wild funky appears

juanjobosch commented 4 years ago

Hey @craffel, sure! I can have a look in a couple of days.

rabitt commented 4 years ago

Just added a bunch of tests (which caught a baddddd bug in the new code), so pending review, this is done on my side!

justinsalamon commented 4 years ago

I'm tracking this but totally swamped right now, so @juanjobosch will probably get to it sooner than I do; if not, I'll give it a pass as soon as I can.

juanjobosch commented 4 years ago

A first remark, which I think we discussed briefly offline some time ago, but it may be worth mentioning here too.

In the generalised metrics article, the voicing-related accuracies only consider non-binary voicing for the estimates (not for the annotations), which is why this implementation uses ref_indicator as a binary indicator of ref_voicing. The formulas for the pitch-related accuracies (RPA, RCA, OA) include a "reward" for correct pitch and chroma estimates, which can be made equivalent to ref_voicing, as is done in this implementation.

I think it may be OK to keep the current terminology, as it probably makes it easier to preserve the conceptual relation to the classic metrics (and therefore keep using e.g. "ref_voicing" in the RPA denominator instead of e.g. "pitch_estimation_reward"). But I think it would be worth at least adding a comment in the functions where the terms in the paper's formulas differ from the implementation (RPA, RCA, OA), saying something like: the pitch estimation reward from [ref] is here made equivalent to ref_voicing.
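
To illustrate where the continuous estimate voicing enters (my own rough reading of the discussion, not code from the article or from this PR, so the exact formulas should be checked against both): with a binary ref_indicator and a continuous est_voicing in [0, 1], the voicing recall and false alarm become weighted averages of est_voicing over the reference-voiced and reference-unvoiced frames, reducing to the classic definitions when est_voicing is binary.

import numpy as np

# Binary reference voicing indicator, continuous estimate voicing in [0, 1].
ref_indicator = np.array([1.0, 1.0, 0.0, 0.0, 1.0])
est_voicing = np.array([0.9, 0.4, 0.2, 0.0, 1.0])

# Generalized voicing recall: average estimate voicing on reference-voiced frames.
voicing_recall = np.sum(est_voicing * ref_indicator) / np.sum(ref_indicator)

# Generalized voicing false alarm: average estimate voicing on reference-unvoiced frames.
voicing_false_alarm = (np.sum(est_voicing * (1.0 - ref_indicator))
                       / np.sum(1.0 - ref_indicator))

print(voicing_recall, voicing_false_alarm)   # ~0.767 and 0.1 for this toy example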

rabitt commented 4 years ago

@craffel I think this should be good to go now. Just noticing - what happened to this repo's tests? I'm seeing tests run on my fork, but not on this PR.

craffel commented 4 years ago

It looks like it's running on your PR but Travis isn't auto-commenting or anything. It also looks broken by other stuff :( https://travis-ci.org/github/craffel/mir_eval

craffel commented 4 years ago

#327 and #328 should solve two of the Travis errors; please rebase on top of those. There is an additional error in Travis on top of this PR (but not at HEAD) that comes from the output of display.piano_roll changing:

======================================================================
FAIL: test_display.test_pianoroll.test
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/miniconda/envs/test-environment/lib/python3.5/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/home/travis/build/craffel/mir_eval/tests/mpl_ic.py", line 56, in failer
    result = f(*args, **kwargs)
  File "/home/travis/build/craffel/mir_eval/tests/mpl_ic.py", line 219, in do_test
    ' (RMS %(rms).3f)' % err)
matplotlib.testing.exceptions.ImageComparisonFailure: images not close: /home/travis/build/craffel/mir_eval/tests/result_images/test_display/piano_roll.png vs. /home/travis/build/craffel/mir_eval/tests/result_images/test_display/piano_roll-expected.png (RMS 26.578)

Mind checking if this PR would cause that?

rabitt commented 4 years ago

@craffel I rebased on master and found the culprit of the error you mentioned. I updated the internals of some of the mir_eval.display functions to fix this. Everything is happily passing now :)

craffel commented 4 years ago

Looks good, thanks so much for your hard work on this @rabitt !