JamesOwers / midi_degradation_toolkit

A toolkit for generating datasets of midi files which have been degraded to be 'un-musical'.
MIT License

"Rule-based" baselines #71

Closed JamesOwers closed 4 years ago

JamesOwers commented 5 years ago

Currently we don't know how good our baselines are against super dumb baselines:

We should probably evaluate these for ourselves at least before releasing our baselines (which will look a bit silly if they don't win!). Any other dumb ones to propose?

apmcleod commented 5 years ago

This comment covers the theory; the next comment covers the practice.

I'll keep updating this comment with results. Note that, for the moment, these are calculated directly from the probability distributions below. The test set should be similar, but small variations may be present.

Task 1:

Task 2:

Task 3 (Predict avg nr of 1s everywhere):

p(0) = 0.93841816592
p(1) = 0.06158183408
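For reference, marginals like these are just normalized counts over the binary frame labels. A minimal sketch, assuming the labels are available as a flat 0/1 numpy array (the array below is synthetic, not the toolkit's data):

```python
import numpy as np

# Synthetic binary frame labels standing in for the task's ground truth.
labels = np.random.default_rng(0).integers(0, 2, size=10_000)

# Marginal probabilities p(0) and p(1) are normalized counts.
counts = np.bincount(labels, minlength=2)
p = counts / counts.sum()
print(f"p(0) = {p[0]:.11f}")
print(f"p(1) = {p[1]:.11f}")
```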

Task 4 (Use p(1|0), p(1|1), etc.):

p(0) = 0.99153220454
p(0|0) = 0.99999999197
p(1|0) = 0.00000000803
p(1) = 0.00846779546
p(0|1) = 0.04216815317
p(1|1) = 0.95783184683
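The conditionals p(·|·) above are one-step transition probabilities. A sketch of how they could be estimated from a binary sequence by counting bigrams and normalizing each row (again with a synthetic sequence, not the toolkit's data):

```python
import numpy as np

# Synthetic binary sequence standing in for the per-frame labels.
seq = np.random.default_rng(1).integers(0, 2, size=10_000)

# Count transitions seq[t] -> seq[t+1] into a 2x2 matrix.
trans = np.zeros((2, 2), dtype=np.int64)
np.add.at(trans, (seq[:-1], seq[1:]), 1)

# Normalize each row to get p(next | prev).
cond = trans / trans.sum(axis=1, keepdims=True)
print(f"p(0|0) = {cond[0, 0]:.11f}")
print(f"p(1|0) = {cond[0, 1]:.11f}")
print(f"p(0|1) = {cond[1, 0]:.11f}")
print(f"p(1|1) = {cond[1, 1]:.11f}")
```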
apmcleod commented 5 years ago

The above numbers are in theory. The below are in practice:

Task 1:

Task 2:

Task 3 (Predict avg nr of 1s everywhere):

p(0) = 0.93841816592
p(1) = 0.06158183408

Task 4 (Use p(1|0), p(1|1), etc.):

p(0|0) = 0.99999999197
p(1|0) = 0.00000000803
p(0|1) = 0.04216815317
p(1|1) = 0.95783184683
JamesOwers commented 5 years ago

Either sort out the maths of the Task 4 dumb baseline, or rerun the empirical numbers on ACME 1.0.

apmcleod commented 5 years ago

Switched milestone because I have run on the current ACME and those numbers are in the submitted paper, but we should redo the "in practice" numbers with the official ACME 1.0 when we have it.

JamesOwers commented 4 years ago

Maybe I'll make the maths work. @apmcleod - please rerun the baselines on the real ACME dataset.

apmcleod commented 4 years ago

ACME v1.0 numbers:

Task 1-3 numbers were obtained by adding print(counts / np.sum(counts)) to get_inverse_weights and running each task with --weight. Should we make this possible without editing the code? Maybe not important.

Task 1: [p(0) p(1)] = [0.11100006 0.88899994] Loss = 0.4685 Rev-F = 0

Task 2: [p(0) p(1) ... p(8)] = [0.11100006 0.11105558 0.11127769 0.11138875 0.11111111 0.11094453 0.10938975 0.11133322 0.11249931] Loss = 2.1972 Acc = 0.11639271434917814

Task 3: [p(0) p(1)] = [0.93126526 0.06873474] Loss = ??? F-1 = ???

Task 4: TODO
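As a sanity check, the Task 2 loss looks consistent with predicting the uniform distribution over the 9 classes: the cross-entropy of a uniform guess is -ln(1/9), which agrees with the reported loss to four decimal places:

```python
import math

# Cross-entropy of always predicting the uniform distribution over
# 9 classes is -ln(1/9), regardless of the true label distribution.
uniform_loss = -math.log(1 / 9)
print(uniform_loss)  # 2.1972245773362196
```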

apmcleod commented 4 years ago

I wrote a script for this. Will push to a "rule" branch. Output:

Task 1 loss = 0.4685294032096863
Task 1 rev F-measure = 0.0
Task 2 loss = 2.1972384452819824
Task 2 acc = 0.11639271434917814
Task 3 loss = 0.4128188192844391
Task 3 F-measure = 0.0
Task 4 loss = 0.6863518953323364
Task 4 Helpfulness = 0.558196357174589
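The zero F-measures are what a constant-probability predictor gives: if the predicted probability of the positive class never crosses the 0.5 threshold, no positives are ever predicted, so precision, recall, and F-measure are all 0. A minimal illustration with made-up labels:

```python
import numpy as np

# Made-up binary targets and a constant low-probability prediction,
# e.g. something like p(1) from the Task 3 marginals.
targets = np.array([0, 1, 0, 0, 1, 0, 0, 0])
probs = np.full_like(targets, 0.069, dtype=float)

preds = (probs > 0.5).astype(int)  # all zeros: never predicts positive
tp = int(np.sum((preds == 1) & (targets == 1)))
fp = int(np.sum((preds == 1) & (targets == 0)))
fn = int(np.sum((preds == 0) & (targets == 1)))

precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
print(f1)  # 0.0
```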