craffel opened this issue 10 years ago
Out of curiosity, I tried it with the original greedy strategy, and the results are identical. Worth figuring out why.
I tried the following, which as best I can tell is the same way it's done in MIREX (see lines 248-343 of https://code.google.com/p/nemadiy/source/browse/analytics/trunk/src/main/java/org/imirsel/nema/analytics/evaluation/onset/OnsetEvaluator.java ):
correct = 0.0
count = 0
for onset in reference_onsets:
    for n in range(count, estimated_onsets.shape[0]):
        if np.abs(estimated_onsets[n] - onset) < window:
            correct += 1
            count = n + 1
            break
        elif estimated_onsets[n] > (onset + window):
            count = n
            break
precision = correct / estimated_onsets.shape[0]
recall = correct / reference_onsets.shape[0]
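As a sanity check, here's that loop wrapped into a self-contained function (the function name and toy data are mine, not from the experiment; both arrays must be sorted ascending for the greedy pass to work):

```python
import numpy as np

def greedy_onset_scores(reference_onsets, estimated_onsets, window=0.05):
    """Greedy one-pass matching; both onset arrays must be sorted ascending."""
    correct = 0.0
    count = 0
    for onset in reference_onsets:
        for n in range(count, estimated_onsets.shape[0]):
            if np.abs(estimated_onsets[n] - onset) < window:
                # First estimate within the window is consumed by this onset
                correct += 1
                count = n + 1
                break
            elif estimated_onsets[n] > (onset + window):
                # Past the window; no match for this onset
                count = n
                break
    precision = correct / estimated_onsets.shape[0]
    recall = correct / reference_onsets.shape[0]
    return precision, recall

reference = np.array([0.10, 0.50, 1.00])
estimated = np.array([0.11, 0.52, 0.90, 1.30])
precision, recall = greedy_onset_scores(reference, estimated)
# precision = 0.5 (2 of 4 estimates matched), recall = 2/3 (2 of 3 onsets matched)
```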
The results are almost identical.
# F-measure, precision, recall
[ 0.00731579 0.00636987 0.00765499]
[ 0.0054446 0.00502406 0.00583369]
Removing the validation code that ensures onsets are non-negative brings the metrics very close:
# F-measure, precision, recall
[ 0.00106689 0.00110393 0.00111615]
[ 0.00079401 0.0008707 0.00085059]
Because the code is written greedily, the onsets should definitely be sorted, and negative onsets don't really make sense... I think. Weirdly, if the comparison
elif estimated_onsets[n] > (onset + window):
is changed to >= (which doesn't match MIREX), the results are even closer:
[ 0.00062226 0.00060437 0.0007295 ]
[ 0.0004631 0.00047668 0.00055594]
I'm not sure I will be able to make it exact, but greedy vs. non-greedy clearly makes relatively little difference compared to pruning negative onsets -- using mir_eval.util.match_events
without pruning negative onsets gives:
[ 0.00070087 0.00104603 0.00078101]
[ 0.0005216 0.00082503 0.00059519]
After playing around a bit, I'm pretty ready to conclude that most of the remaining difference comes from Java and Python handling floating-point precision differently. E.g.,
a < b
might evaluate differently in Java and Python when $a \approx b$. Either way, if we stick with the non-greedy implementation, the error is acceptably low.
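To illustrate the boundary sensitivity (the numbers here are made up, not from the experiment): a difference that "should" land exactly on the tolerance window often doesn't, because of binary rounding:

```python
# 0.1 + 0.2 is not exactly 0.3 in IEEE 754 doubles, so a strict
# comparison right at the tolerance boundary can go either way.
window = 0.3
onset, estimate = 0.0, 0.1 + 0.2  # difference "should" equal the window exactly
inside = abs(estimate - onset) < window
print(inside)  # False: 0.1 + 0.2 == 0.30000000000000004 > 0.3
```

For what it's worth, Java's double and Python's float are both IEEE 754 binary64, so any divergence between the two presumably comes from how the values are parsed or accumulated earlier in the pipeline rather than the comparison operator itself.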
However, we still need to decide whether mir_eval should allow events < 0.
OK, as of https://github.com/craffel/mir_eval/commit/6404010e66d6eb508c4600aee78925f7357fd3ac negative events no longer need to be trimmed. The onset experiment has been updated so that onsets are no longer trimmed; it uses match_events,
and the error is as reported above:
# F-measure, precision, recall
[ 0.00070087 0.00104603 0.00078101]
[ 0.0005216 0.00082503 0.00059519]
The onset experiment is pretty much done - it's a simple one. The code currently just grabs the results from MIREX 2013, parses them into my preferred format, and compares.
The MIREX onset reference data is annotated by between one and three people (it seems like most files have three). The three metrics (f-measure, precision, recall) are simply averaged across all reference annotations, so
mir_eval
evaluates against all reference annotations and computes the mean. Currently, it's close.
The metrics are reported on a 0-to-1 scale, so the absolute and relative errors are all below 1%. We don't expect them to match perfectly, because we're no longer using a greedy matching strategy. There may be other differences too, but we can't really tell right now, since we don't expect our code to do exactly the same thing. It would be interesting to re-run the experiment against mir_eval before commit 8e9d8b3ec37c4d4f1094295feef0bc61ad716f0b, where we switched to the non-greedy strategy, to see whether it actually behaves the same.
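For comparison, here's a minimal sketch of what the non-greedy strategy does: a maximum bipartite matching between reference and estimated onsets, found via augmenting paths. The function name and toy data are mine -- mir_eval's actual entry point is mir_eval.util.match_events, which also returns the matched pairs rather than just a count:

```python
import numpy as np

def maximum_onset_matching(reference_onsets, estimated_onsets, window=0.05):
    """Count matched onsets via maximum bipartite matching (augmenting paths).

    Unlike the greedy pass, this finds the largest possible set of
    one-to-one (reference, estimate) pairs with |est - ref| <= window,
    re-routing earlier matches when that frees up a better assignment.
    """
    # Candidate estimates for each reference onset.
    edges = [
        [j for j, est in enumerate(estimated_onsets) if abs(est - ref) <= window]
        for ref in reference_onsets
    ]
    match_of_est = {}  # estimated index -> reference index

    def try_assign(i, visited):
        # Try to match reference i, displacing earlier matches if they
        # can themselves be re-matched to a different estimate.
        for j in edges[i]:
            if j in visited:
                continue
            visited.add(j)
            if j not in match_of_est or try_assign(match_of_est[j], visited):
                match_of_est[j] = i
                return True
        return False

    return sum(try_assign(i, set()) for i in range(len(reference_onsets)))

reference = np.array([0.10, 0.50, 1.00])
estimated = np.array([0.11, 0.52, 0.90, 1.30])
matched = maximum_onset_matching(reference, estimated)
# matched == 2: no estimate lies within the window of 1.00
```

On sorted, well-separated data the greedy and maximum counts agree (as here); the strategies can only diverge in crowded regions where several onsets compete for the same estimates.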