HeimingX opened this issue 3 years ago
@HeimingX thanks for the interest in our work and the kind words. Regarding metrics, AUC is the most stable one, since it measures the relative change between seen and unseen accuracy. The best harmonic mean depends on the bias each run ends up with between seen and unseen accuracy, and the discretization of the bias values used to trade the two off can make the results vary, which in turn affects the numbers. We found this to be most prevalent on UT-Zappos. We reported the best number we got across multiple runs, consistent with older works. If you plan on working on this topic, I would recommend sticking with either MIT-States or C-GQA and extending to UT-Zappos afterward. Some of the UT-Zappos states, like Leather vs. Synthetic Leather, are material differences that are not always visible as visual transformations; we discussed this in our paper.
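To make the discretization point concrete, here is a minimal sketch of the standard CZSL calibration sweep (my own illustration with hypothetical names, not the repo's actual evaluation code): a scalar bias added to unseen-pair scores is swept over a range, seen and unseen accuracies are measured at each bias value, and AUC and best HM are read off the resulting curve.

```python
import numpy as np

def sweep_metrics(seen_acc, unseen_acc):
    """seen_acc, unseen_acc: accuracies measured at each calibration bias,
    ordered so that unseen_acc increases (and seen_acc decreases)."""
    seen_acc, unseen_acc = np.asarray(seen_acc), np.asarray(unseen_acc)
    # AUC integrates over the whole seen-vs-unseen curve (trapezoidal rule),
    # so it is comparatively insensitive to how the sweep is discretized.
    auc = np.trapz(seen_acc, unseen_acc)
    # Best HM is a max over the *sampled* bias points only; a coarser sweep
    # can miss the true peak, which is one source of run-to-run variation.
    hm = 2 * seen_acc * unseen_acc / np.maximum(seen_acc + unseen_acc, 1e-12)
    return auc, hm.max()
```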
Hi Ferjad,
Thanks for the timely response and detailed explanation.
However, I still have some concerns:
Actually, I have run CGE multiple times on UT-Zappos (since it reaches its performance peak quickly and seems to overfit after that), but none of these runs achieves such a high HM number (most of the results are around 50). I wonder if it would be possible for you to publish the model checkpoint that achieved the reported number. And since the results tend to have a large variance, mean and standard deviation across runs seem to be a necessary metric.
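Concretely, the aggregation I have in mind is just the mean and standard deviation of the final test HM over several seeds; a trivial sketch (the per-seed values below are made-up placeholders, not real results):

```python
import statistics

# Hypothetical per-seed HM values, for illustration only (not real results).
hm_per_seed = [50.1, 49.3, 50.8, 48.9, 50.4]
print(f"HM: {statistics.mean(hm_per_seed):.1f} ± {statistics.stdev(hm_per_seed):.1f}")
```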
Regarding the suggestions about datasets: MIT-States is argued to have label noise (both in the CGE paper and in [1]), and the newly proposed C-GQA dataset seems to have an incomplete training set (as raised in this issue):
1. In the training data, 1371 out of 6963 train pairs have no data; 40 out of 453 attributes and 196 out of 870 objects have no data.
2. In the validation set, 133 pairs come from the training pairs without training data.
3. In the test set, 134 pairs come from the training pairs without training data.
From my point of view, it would be hard for a model to generalize to a new composition without having seen the corresponding attribute or object before, and this seems to be beyond the scope of current research on CZSL (a sketch of the check behind these counts is included after the reference below). I am not sure if I have understood this correctly; could you please give more explanation? Thanks a lot.
[1]: A causal view of compositional zero-shot recognition
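Here is a sketch of the check behind the counts above, assuming the metadata can be read as a list of declared seen (attr, obj) pairs plus a list of (image, attr, obj) training samples; the names below are hypothetical stand-ins, not the repo's actual API:

```python
from collections import Counter

def find_empty_entries(train_pairs, train_samples):
    """train_pairs: declared seen (attr, obj) pairs.
    train_samples: (image, attr, obj) training examples."""
    pair_counts = Counter((a, o) for _, a, o in train_samples)
    # Pairs declared seen but never occurring in any training image.
    empty_pairs = [p for p in train_pairs if pair_counts[p] == 0]
    # Attributes / objects that never occur in any training image at all.
    empty_attrs = {a for a, _ in train_pairs} - {a for _, a, _ in train_samples}
    empty_objs = {o for _, o in train_pairs} - {o for _, _, o in train_samples}
    return empty_pairs, empty_attrs, empty_objs
```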
Hi Heiming,
If you don't mind me asking, were you able to replicate the HM results on Zappos? I tried several times and also only got ~50.
Thanks, Zhaoheng
Hi,
Not yet.
I just found that the reported test-set HM number is somehow close to the validation-set results, while the reported and reproduced AUC numbers on the test set are comparable. It is quite weird; I am not sure whether an editing error happened in the paper.
| Zappos | AUC | HM | Seen | Unseen |
|---|---|---|---|---|
| test set (reported) | 33.5 | 60.5 | 64.5 | 71.5 |
| val set (reported) | 43.2 | - | - | - |
| test set (reproduced) | 33.4 | 48.3 | 61.9 | 67.9 |
| val set (reproduced) | 41.4 | 56.9 | 63.6 | 71.2 |
Looking forward to the author's feedback!
Hi,
Thanks for the impressive paper and the high-quality open-sourced codebase.
I ran an experiment on Zappos with this codebase (without any modification) and found that there is a big gap between the test-set HM number of CGE (47.11) and the reported one in the paper (60.5). The following is the eval log on the test set:
Furthermore, from this log, although the test AUC number is close to the reported one (33.5), the best unseen accuracy (66.05) also has a big gap to the reported one (71.5).
I am a little bit confused about these number gaps; could you please give some explanation? Thanks a lot.