dpaperno / DiscriminAtt

BSD 2-Clause "Simplified" License

Definitions in evaluation script #2

Closed · ftyers closed this issue 6 years ago

ftyers commented 6 years ago

We have been trying a random baseline and are confused by the numbers we are getting for precision/recall/F-score.

The evaluation script returns an F-score of 0.66 for the random baseline, which seems a bit odd. Suppose we have a truth file:

a,b,c,1
a,b,c,0
a,b,c,1
a,b,c,0
a,b,c,1
a,b,c,0
a,b,c,1
a,b,c,0
a,b,c,1
a,b,c,0

And a random selection from the classifier:

a,b,c,0
a,b,c,1
a,b,c,1
a,b,c,1
a,b,c,1
a,b,c,0
a,b,c,1
a,b,c,0
a,b,c,0
a,b,c,1

In principle the F-score should be around 0.5, but we get 0.66. We suspect this is because of how the true/false positives/negatives are calculated.

Given how the false positives/negatives are calculated, the true positives/negatives should be calculated the same way. Right now the script counts both true positives and true negatives as true positives, whereas false positives and false negatives are kept separate.
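If that is indeed what happens, it would explain the 0.66. Tallying the example above by hand (with 1 as the positive class) gives 3 true positives, 2 true negatives, 3 false positives and 2 false negatives. A quick illustration of the suspected miscount, based on our reading of the script rather than its actual code:

# Our reading of the bug, not the actual script: true negatives are lumped in
# with true positives, while false positives and false negatives stay separate.
tp, tn, fp, fn = 3, 2, 3, 2       # hand-counted confusion cells for the example files
tp_lumped = tp + tn               # 5: TN counted as TP
p = tp_lumped / (tp_lumped + fp)  # 5/8 = 0.625
r = tp_lumped / (tp_lumped + fn)  # 5/7 ~ 0.714
print(2 * p * r / (p + r))        # ~0.667, i.e. the 0.66 we are seeing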

Perhaps the evaluation could calculate the numbers for both classes and average them? Alternatively, the Evaluation page on CodaLab could be more specific about how these figures are calculated (i.e. that the evaluation isn't necessarily conducted in the way the names might suggest).
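For concreteness, here is a minimal sketch of the per-class averaging suggested above, run on the label columns of the toy example. It skips the file handling of the real evaluation script, and the function names are our own:

def prf(gold, pred, positive):
    # Confusion counts with `positive` treated as the positive class.
    tp = sum(1 for g, p in zip(gold, pred) if g == positive and p == positive)
    fp = sum(1 for g, p in zip(gold, pred) if g != positive and p == positive)
    fn = sum(1 for g, p in zip(gold, pred) if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def macro_f1(gold, pred):
    # F1 computed once per class (1 and 0), then averaged.
    return sum(prf(gold, pred, c)[2] for c in (1, 0)) / 2

gold = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]  # last column of the truth file above
pred = [0, 1, 1, 1, 1, 0, 1, 0, 0, 1]  # last column of the random predictions
print(macro_f1(gold, pred))            # ~0.4949, i.e. chance level

This prints roughly 0.4949, matching the averaged F-score in the output of the quick hack below.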

ftyers commented 6 years ago

This quick hack calculates the average of the F-scores for the two classes: https://paste2.org/BUAKkfIF

Run on the example above:


$ python3 evaluation.py . .
tp :3
tn :2
fp :2
fn :3
p  :0.6
r  :0.5
f1 :0.5454545454545454
--
tp :2
tn :3
fp :3
fn :2
p  :0.4
r  :0.5
f1 :0.4444444444444445
--
f1 :0.4949494949494949
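
For reference, the same macro-averaged figure can be reproduced with scikit-learn (not used by the evaluation script, just a cross-check, assuming it is installed):

from sklearn.metrics import f1_score

gold = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
pred = [0, 1, 1, 1, 1, 0, 1, 0, 0, 1]
print(f1_score(gold, pred, average="macro"))  # 0.4949...
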
dpaperno commented 6 years ago

Thank you for pointing out this issue. We have fixed the evaluation script. The current version should work correctly, as in your "quick hack" example.