craffel / mir_eval-ismir

Repository for running mir_eval experiments and writing the ISMIR 2014 paper

Onset experiment #3

Open craffel opened 10 years ago

craffel commented 10 years ago

The onset experiment is pretty much done - it's a simple one. The code currently just grabs the results from MIREX 2013, parses them into my preferred format, and compares.

Each track in the MIREX onset reference data is annotated by between one and three people (it seems like most have three). The three metrics (F-measure, precision, recall) are just averaged across all reference annotations. So, mir_eval evaluates against each reference annotation separately and takes the mean of each metric.
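For reference, the averaging is just a per-annotator evaluation followed by a mean. A minimal sketch, assuming mir_eval.onset.f_measure (which returns an (F-measure, precision, recall) tuple) and made-up onset times:

    import numpy as np
    import mir_eval

    # Hypothetical data: one onset array per annotator for a single track
    reference_annotations = [np.array([0.10, 0.52, 1.37]),
                             np.array([0.11, 0.50, 1.40]),
                             np.array([0.09, 0.53, 1.36])]
    estimated_onsets = np.array([0.09, 0.51, 1.38, 2.00])

    # Evaluate against each reference annotation, then average each metric
    scores = np.array([mir_eval.onset.f_measure(reference_onsets, estimated_onsets)
                       for reference_onsets in reference_annotations])
    f_measure, precision, recall = scores.mean(axis=0)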

Currently, it's close:

import numpy as np  # mirex_scores, mir_eval_scores: (n_files, 3) arrays of [F, P, R]
diff = np.abs(mirex_scores - np.round(mir_eval_scores, 3))
# MIREX reports 3 decimal places, so differences within rounding tolerance count as zero
diff[np.less_equal(diff, .0010001)] = 0
# Relative error and mean absolute error for each metric
print np.sum(diff, axis=0)/np.sum(mirex_scores, axis=0)
print np.sum(diff, axis=0)/mirex_scores.shape[0]
# F-measure, precision, recall
[ 0.00700207  0.00633325  0.00737894]
[ 0.00521112  0.00499519  0.00562332]

The metrics are reported on a 0->1 scale, so the absolute and relative errors above are all below 1%. We're not expecting them to be perfect because we're not using a greedy matching strategy. There may be other differences too, but we can't really isolate them right now because we don't expect our code to do exactly the same thing. It would be interesting to re-run the experiment with mir_eval from before commit 8e9d8b3ec37c4d4f1094295feef0bc61ad716f0b, where we switched to the non-greedy strategy, to see whether the greedy version actually matches MIREX.
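For context, the non-greedy scoring boils down to a maximum matching between reference and estimated onsets within the tolerance window. A minimal sketch using mir_eval.util.match_events (the onset arrays here are made up, and I'm assuming MIREX's +/- 50 ms window):

    import numpy as np
    import mir_eval

    window = 0.05  # assuming MIREX's +/- 50 ms onset tolerance
    reference_onsets = np.array([0.10, 0.52, 1.37])
    estimated_onsets = np.array([0.09, 0.51, 1.38, 2.00])

    # match_events returns the largest set of (reference, estimate) index pairs
    # such that each pair is within the window and no onset is used twice
    matching = mir_eval.util.match_events(reference_onsets, estimated_onsets, window)
    precision = float(len(matching))/len(estimated_onsets)
    recall = float(len(matching))/len(reference_onsets)
    f_measure = 2*precision*recall/(precision + recall)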

craffel commented 10 years ago

Out of curiosity, I tried it with the original greedy strategy, and the results are identical to the non-greedy ones. Worth figuring out why.

craffel commented 10 years ago

I tried the following, which as best as I can tell is the same as what MIREX does (see lines 248-343 of https://code.google.com/p/nemadiy/source/browse/analytics/trunk/src/main/java/org/imirsel/nema/analytics/evaluation/onset/OnsetEvaluator.java ):

    # Greedy matching, mirroring the MIREX OnsetEvaluator: walk the (sorted)
    # estimated onsets once, pairing each reference onset with the first
    # remaining estimate that falls within the tolerance window
    correct = 0.0
    count = 0  # index of the first estimated onset that hasn't been consumed yet
    for onset in reference_onsets:
        for n in xrange(count, estimated_onsets.shape[0]):
            if np.abs(estimated_onsets[n] - onset) < window:
                # Hit: count it and don't let this estimate match again
                correct += 1
                count = n + 1
                break
            elif estimated_onsets[n] > (onset + window):
                # This estimate is already past the window, so no match is possible
                count = n
                break
    precision = correct/estimated_onsets.shape[0]
    recall = correct/reference_onsets.shape[0]
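The snippet only produces precision and recall; the F-measure reported below is presumably just the usual harmonic mean of the two:

    # Standard F1 (the harmonic mean of precision and recall)
    f_measure = 2*precision*recall/(precision + recall)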

The results are almost identical.

# F-measure, precision, recall
[ 0.00731579  0.00636987  0.00765499]
[ 0.0054446   0.00502406  0.00583369]
craffel commented 10 years ago

Removing the validation code which ensures that onsets are non-negative makes the metrics very close:

# F-measure, precision, recall
[ 0.00106689  0.00110393  0.00111615]
[ 0.00079401  0.0008707   0.00085059]
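For concreteness, the validation being removed essentially amounts to dropping events before time zero, something like this sketch (not the actual mir_eval code):

    # Sketch of the negative-onset pruning under discussion (not the actual mir_eval validation)
    reference_onsets = reference_onsets[reference_onsets >= 0]
    estimated_onsets = estimated_onsets[estimated_onsets >= 0]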

Because of the way the code is written (greedy), the onsets should definitely be sorted, and negative onsets don't really make any sense... I think. Weirdly, if the comparison

elif estimated_onsets[n] > (onset + window):

is changed to >= (which doesn't match MIREX), the results are even closer:

[ 0.00062226  0.00060437  0.0007295 ]
[ 0.0004631   0.00047668  0.00055594]

I'm not sure I will be able to make it exact. But it certainly seems like greedy vs. non-greedy matching makes relatively little difference compared to how negative onsets are handled -- using mir_eval.util.match_events without pruning negative onsets gives:

[ 0.00070087  0.00104603  0.00078101]
[ 0.0005216   0.00082503  0.00059519]
craffel commented 10 years ago

After playing around a bit, I'm pretty ready to conclude that most of the remaining difference comes from Java and Python handling floating point comparisons slightly differently, e.g. a < b might evaluate differently in the two languages when a is approximately equal to b. Either way, if we stick to the non-greedy implementation the error is acceptably low.
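As a toy illustration (not from the MIREX data) of why comparisons right at the tolerance boundary are fragile:

    window = 0.05
    # In exact decimal arithmetic 1.15 - 1.10 == 0.05, so "< window" would be False,
    # but as IEEE doubles the difference is ~0.04999999999999982, so this is True
    hit = abs(1.15 - 1.10) < window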

However, still need to decide if mir_eval should allow events < 0.

craffel commented 10 years ago

OK, as of https://github.com/craffel/mir_eval/commit/6404010e66d6eb508c4600aee78925f7357fd3ac negative events no longer need to be trimmed, and the onset experiment has been updated so that onsets are no longer trimmed. It uses match_events, and the error is as reported above:

# F-measure, precision, recall
[ 0.00070087  0.00104603  0.00078101]
[ 0.0005216   0.00082503  0.00059519]