Biased evaluation metric?

lugiavn commented 5 years ago

Hi, I ran some experiments on MIT-States. As I understand it, this is essentially a form of zero-shot learning, which the Red Wine paper frames as an "unseen combination" recognition task. I have a concern about the paper/task (not the code): the "open world" recognition accuracy is a biased measurement because the test set consists only of unseen combinations. Shouldn't the correct test setup be "generalized zero-shot learning" (e.g. a test set that contains both seen and unseen concepts)?

Tushar-N commented 5 years ago

Hi Nam, that's an interesting point. We wanted to focus specifically on unseen compositions (à la Red Wine), so we chose the open world setting in addition to the closed world setting. We thought a reasonable metric to account for both settings (each of which has its own bias) would be the harmonic mean of the two, which we report as well.

It does make sense to have a split like the one you mentioned, as a direct analog to generalized ZSL. It becomes a little tricky because the focus shifts away from just unseen compositions, and many seen composition classes have very few examples to begin with, but it definitely makes sense. Thanks for bringing this up. I'll include a link to this issue in the readme.
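
For readers following the thread, the harmonic mean of the closed world and open world accuracies can be sketched as below. This is a minimal illustration, not the repository's code: the function name `harmonic_mean` and its argument names are hypothetical, and accuracies are assumed to be fractions in [0, 1].

```python
def harmonic_mean(closed_acc: float, open_acc: float) -> float:
    """Harmonic mean of two accuracies (hypothetical helper, not from the repo).

    Both inputs are assumed to be fractions in [0, 1]. The result is
    dominated by the smaller of the two values, so a model cannot score
    well by doing well in only one setting.
    """
    if closed_acc < 0 or open_acc < 0:
        raise ValueError("accuracies must be non-negative")
    if closed_acc + open_acc == 0:
        # Avoid division by zero when both accuracies are 0.
        return 0.0
    return 2 * closed_acc * open_acc / (closed_acc + open_acc)
```

Because the harmonic mean is pulled toward the lower value, a large gap between the two settings shows up as a low combined score.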

lugiavn commented 5 years ago

Thanks for the response. I've never worked on generalized ZSL, but it seems the challenge is that training absorbs a prior distribution over classes (some classes occur more often than others). Hence a naive approach might end up biased (either toward seen classes or, sometimes, against them). A test set containing both seen and unseen classes would reveal those biases, so it might be worth exploring in future work.
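
A generalized evaluation of the kind discussed above could be sketched as follows: score a mixed test set, but report accuracy separately for seen and unseen classes so that a bias toward either group becomes visible. This is only an illustrative sketch under stated assumptions. The function `gzsl_accuracies`, its arguments, and the toy labels in the usage note are hypothetical and do not come from the repository or from MIT-States.

```python
from typing import Dict, Hashable, Iterable, Sequence


def gzsl_accuracies(
    y_true: Sequence[Hashable],
    y_pred: Sequence[Hashable],
    seen_classes: Iterable[Hashable],
) -> Dict[str, float]:
    """Per-group accuracy on a test set mixing seen and unseen classes.

    Hypothetical helper: a sample belongs to the "seen" group if its true
    label appeared in training, otherwise to the "unseen" group.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    seen = set(seen_classes)
    hits = {"seen": 0, "unseen": 0}
    totals = {"seen": 0, "unseen": 0}

    for truth, pred in zip(y_true, y_pred):
        group = "seen" if truth in seen else "unseen"
        totals[group] += 1
        hits[group] += int(truth == pred)

    # A group with no samples gets accuracy 0.0 rather than raising.
    acc = {g: (hits[g] / totals[g] if totals[g] else 0.0) for g in totals}

    # Harmonic mean of the two group accuracies as a single summary number.
    s, u = acc["seen"], acc["unseen"]
    acc["harmonic"] = 2 * s * u / (s + u) if (s + u) > 0 else 0.0
    return acc
```

For example, with made-up (attribute, object) labels:

```python
seen = {("red", "wine"), ("old", "car")}
y_true = [("red", "wine"), ("old", "car"), ("ripe", "tomato"), ("wet", "dog")]
y_pred = [("red", "wine"), ("old", "car"), ("red", "tomato"), ("wet", "dog")]
result = gzsl_accuracies(y_true, y_pred, seen)
# result["seen"] is 1.0 and result["unseen"] is 0.5 on this toy input
```

A large gap between the two group accuracies would indicate the kind of class-prior bias described above.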

Tushar-N / attributes-as-operators

Biased evaluation metric? #6