simpleshinobu closed this issue 4 years ago.
The train and valid sets are both collected from Visual Genome, so they cannot be a faithful indicator of out-of-domain performance (i.e., on the GQA test set). The GQA challenge provides another validation set, `testdev`, for this purpose; it is collected from the MS COCO test set, which is not shared with VG. Please use `testdev` to tune the model as much as possible.
The result on the valid set should be around 70%. You are seeing 80% because the valid set is used as part of the training data in LXMERT, since we use `testdev` for validation instead.
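To see why a split that was folded into training gives an inflated score, one can measure how many evaluation questions already appear in the training data. The sketch below is a minimal illustration, assuming each split is loaded as a list of dicts with a `"question_id"` key; that format and the helper name `leaked_fraction` are hypothetical, not part of the LXMERT code.

```python
from typing import Iterable, Mapping


def leaked_fraction(train_entries: Iterable[Mapping], eval_entries: Iterable[Mapping]) -> float:
    """Fraction of evaluation questions whose IDs also occur in the training data.

    Assumption (hypothetical format): every entry is a dict with a "question_id" key.
    """
    train_ids = {e["question_id"] for e in train_entries}
    eval_ids = [e["question_id"] for e in eval_entries]
    if not eval_ids:
        return 0.0
    seen = sum(1 for qid in eval_ids if qid in train_ids)
    return seen / len(eval_ids)


if __name__ == "__main__":
    # Toy data: the "valid" questions were folded into the training data,
    # while the "testdev" questions were held out.
    valid = [{"question_id": "v1"}, {"question_id": "v2"}]
    testdev = [{"question_id": "t1"}, {"question_id": "t2"}]
    training_data = [{"question_id": "a"}, *valid]
    print(leaked_fraction(training_data, valid))    # 1.0: every valid question was seen
    print(leaked_fraction(training_data, testdev))  # 0.0: testdev is fully held out
```

A fraction near 1.0 means accuracy on that split measures memorization rather than generalization, which is why a held-out split such as `testdev` is the better tuning target.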
More details here: https://cs.stanford.edu/people/dorarad/gqa/evaluate.html
When I run `python src/tasks/gqa.py` with `--train train --valid valid`, the accuracy on the valid set is 80%. What is wrong with this setting, and how should I evaluate results on the valid set? Thank you!