JamesOwers / midi_degradation_toolkit

A toolkit for generating datasets of midi files which have been degraded to be 'un-musical'.
MIT License

"Rule-based" baselines #71

Closed JamesOwers closed 4 years ago

JamesOwers commented 5 years ago

Currently we don't know how good our baselines are against super dumb baselines:

We should probably evaluate these for ourselves at least before releasing our baselines (which will look a bit silly if they don't win!). Any other dumb ones to propose?

apmcleod commented 5 years ago

This comment covers the theory; the next comment covers the practice.

I'll keep updating this comment with results. Note that, for the moment, these are calculated directly from the probability distributions below. The test set should be similar, but small variations may be present.

Task 1:

Task 2:

Task 3 (Predict avg nr of 1s everywhere):

p(0) = 0.93841816592
p(1) = 0.06158183408
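For reference, marginals like these are just normalized counts over the binary frame labels. A minimal sketch, assuming the labels are available as a flat 0/1 numpy array (the array below is synthetic, not the toolkit's data):

```python
import numpy as np

# Synthetic binary frame labels standing in for the task's ground truth.
labels = np.random.default_rng(0).integers(0, 2, size=10_000)

# Marginal probabilities p(0) and p(1) are normalized counts.
counts = np.bincount(labels, minlength=2)
p = counts / counts.sum()
print(f"p(0) = {p[0]:.11f}")
print(f"p(1) = {p[1]:.11f}")
```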

Task 4 (Use p(1|0), p(1|1), etc.):

p(0) = 0.99153220454
p(0|0) = 0.99999999197
p(1|0) = 0.00000000803
p(1) = 0.00846779546
p(0|1) = 0.04216815317
p(1|1) = 0.95783184683
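The conditionals p(·|·) above are one-step transition probabilities. A sketch of how they could be estimated from a binary sequence by counting bigrams and normalizing each row (again with a synthetic sequence, not the toolkit's data):

```python
import numpy as np

# Synthetic binary sequence standing in for the per-frame labels.
seq = np.random.default_rng(1).integers(0, 2, size=10_000)

# Count transitions seq[t] -> seq[t+1] into a 2x2 matrix.
trans = np.zeros((2, 2), dtype=np.int64)
np.add.at(trans, (seq[:-1], seq[1:]), 1)

# Normalize each row to get p(next | prev).
cond = trans / trans.sum(axis=1, keepdims=True)
print(f"p(0|0) = {cond[0, 0]:.11f}")
print(f"p(1|0) = {cond[0, 1]:.11f}")
print(f"p(0|1) = {cond[1, 0]:.11f}")
print(f"p(1|1) = {cond[1, 1]:.11f}")
```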
apmcleod commented 5 years ago

The above numbers are in theory. The below are in practice:

Task 1:

Task 2:

Task 3 (Predict avg nr of 1s everywhere):

p(0) = 0.93841816592
p(1) = 0.06158183408

Task 4 (Use p(1|0), p(1|1), etc.):

p(0|0) = 0.99999999197
p(1|0) = 0.00000000803
p(0|1) = 0.04216815317
p(1|1) = 0.95783184683
JamesOwers commented 5 years ago

Either sort out the maths of the Task 4 dumb baseline, or rerun the empirical numbers on ACME 1.0.

apmcleod commented 5 years ago

Switched milestone because I have run on the current ACME and those numbers are in the submitted paper, but we should redo the "in practice" numbers with the official ACME 1.0 when we have it.

JamesOwers commented 4 years ago

Maybe I'll make the maths work. @apmcleod - please rerun the baselines on the real ACME dataset.

apmcleod commented 4 years ago

ACME v1.0 numbers:

Task 1-3 numbers were obtained by adding print(counts / np.sum(counts)) to get_inverse_weights and running each task with --weight. Should we make this possible without editing the code? Maybe not important.

Task 1: [p(0) p(1)] = [0.11100006 0.88899994] Loss = 0.4685 Rev-F = 0

Task 2: [p(0) p(1) ... p(8)] = [0.11100006 0.11105558 0.11127769 0.11138875 0.11111111 0.11094453 0.10938975 0.11133322 0.11249931] Loss = 2.1972 Acc = 0.11639271434917814

Task 3: [p(0) p(1)] = [0.93126526 0.06873474] Loss = ??? F-1 = ???

Task 4: TODO
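As a sanity check, the Task 2 loss looks consistent with predicting the uniform distribution over the 9 classes: the cross-entropy of a uniform guess is -ln(1/9), which agrees with the reported loss to four decimal places:

```python
import math

# Cross-entropy of always predicting the uniform distribution over
# 9 classes is -ln(1/9), regardless of the true label distribution.
uniform_loss = -math.log(1 / 9)
print(uniform_loss)  # 2.1972245773362196
```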

apmcleod commented 4 years ago

I wrote a script for this. Will push to a "rule" branch. Output:

Task 1 loss = 0.4685294032096863
Task 1 rev F-measure = 0.0
Task 2 loss = 2.1972384452819824
Task 2 acc = 0.11639271434917814
Task 3 loss = 0.4128188192844391
Task 3 F-measure = 0.0
Task 4 loss = 0.6863518953323364
Task 4 Helpfulness = 0.558196357174589
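The zero F-measures are what a constant-probability predictor gives: if the predicted probability of the positive class never crosses the 0.5 threshold, no positives are ever predicted, so precision, recall, and F-measure are all 0. A minimal illustration with made-up labels:

```python
import numpy as np

# Made-up binary targets and a constant low-probability prediction,
# e.g. something like p(1) from the Task 3 marginals.
targets = np.array([0, 1, 0, 0, 1, 0, 0, 0])
probs = np.full_like(targets, 0.069, dtype=float)

preds = (probs > 0.5).astype(int)  # all zeros: never predicts positive
tp = int(np.sum((preds == 1) & (targets == 1)))
fp = int(np.sum((preds == 1) & (targets == 0)))
fn = int(np.sum((preds == 0) & (targets == 1)))

precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
print(f1)  # 0.0
```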