e-bug / volta

[TACL 2021] Code and data for the framework in "Multimodal Pretraining Unmasked: A Meta-Analysis and a Unified Framework of Vision-and-Language BERTs"
https://aclanthology.org/2021.tacl-1.58/
MIT License

Miscalculated GQA score? #13

Closed: eladsegal closed this issue 2 years ago

eladsegal commented 2 years ago

Hi, first of all, thank you for this great work and repo - it is extremely helpful!

I trained a model on GQA, and it looks like there is a mistake in the calculation of the GQA score: --truth_file is testdev_balanced_questions.json (as used in test.sh), where each entry has "answer" (a single string) as the truth label, but the accuracy check tests whether the prediction is contained in the truth label as a substring. https://github.com/e-bug/volta/blob/0d194f1ce4bfc1bf0a48a3da5f5cf7cd5391f917/scripts/GQA_score.py#L12

This means that for a truth label of "woman", a prediction of either "man" or "woman" would get a full score.

According to the official GQA evaluation script, the accuracy check should be

if pred == label:

After making this change, the score I got was 2.69 points lower.
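For reference, a minimal sketch of the corrected exact-match scoring. The prediction-file layout assumed here (a JSON dict mapping question id to predicted answer string) may differ from what GQA_score.py actually reads; the truth file is the official testdev_balanced_questions.json.

```python
# Sketch of exact-match GQA scoring. Assumed (not verified against GQA_score.py):
# --pred_file is a JSON dict {question_id: predicted_answer},
# --truth_file is the official testdev_balanced_questions.json
# ({question_id: {"answer": ..., ...}}).
import json
import sys

pred_file, truth_file = sys.argv[1], sys.argv[2]

with open(pred_file) as f:
    preds = json.load(f)
with open(truth_file) as f:
    truths = json.load(f)

correct = 0
for qid, entry in truths.items():
    pred = preds.get(qid, "")
    label = entry["answer"]  # a single answer string in the balanced split
    if pred == label:        # exact match, not `pred in label`
        correct += 1

print(f"Accuracy: {100.0 * correct / len(truths):.2f}")
```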

e-bug commented 2 years ago

Hi Elad,

Yes, it's definitely as you say. Initially, label was a list of answers, and I didn't update the check in the cleaned-up code. I'll fix it right away, thanks a lot!

eladsegal commented 2 years ago

Thanks! Just to make sure, were the GQA numbers in Figure 5 of the paper calculated correctly?

Edit: It seems they were calculated without the fix. I ran https://github.com/e-bug/volta/blob/main/examples/vilbert/gqa/train.sh and got 58.28 before the fix and 55.56 after, while the score in Figure 5 is 58.39 (estimated by measuring distances in the figure).

e-bug commented 2 years ago

Thanks for the feedback! I will try to re-evaluate our checkpoints with the fix.

Also, if you are working on GQA: in my experience it is extremely slow to train. I've just implemented a faster dataloader, similar to the one for Conceptual Captions, and I'll push it in the next couple of weeks :)

eladsegal commented 2 years ago

Thanks!

Regarding the GQA dataloader, that sounds interesting. What was the bottleneck in the current one? I actually never used it as is: I changed the image features reader to load the .tsv directly into memory, without LMDB.
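For illustration, a rough sketch of reading a bottom-up-attention style features .tsv straight into a dict in RAM. The column names and base64 encoding follow the common Faster R-CNN feature dumps and are assumptions here, not the exact layout of volta's files; the gist shared later in the thread has the actual implementation.

```python
# Sketch: load pre-extracted image features from a .tsv into RAM, skipping LMDB.
# Assumes the common bottom-up-attention column layout; adjust FIELDNAMES to
# match the actual feature files.
import base64
import csv
import sys

import numpy as np

csv.field_size_limit(sys.maxsize)
FIELDNAMES = ["img_id", "img_h", "img_w", "num_boxes", "boxes", "features"]

def load_tsv(path):
    data = {}
    with open(path) as f:
        reader = csv.DictReader(f, delimiter="\t", fieldnames=FIELDNAMES)
        for item in reader:
            num_boxes = int(item["num_boxes"])
            data[item["img_id"]] = {
                "num_boxes": num_boxes,
                "boxes": np.frombuffer(
                    base64.b64decode(item["boxes"]), dtype=np.float32
                ).reshape(num_boxes, 4),
                "features": np.frombuffer(
                    base64.b64decode(item["features"]), dtype=np.float32
                ).reshape(num_boxes, -1),
            }
    return data

# features = load_tsv("gqa_features.tsv")  # hypothetical filename; needs a lot of RAM
```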

e-bug commented 2 years ago

Oh, interesting! Would you mind sharing it (either here or by email)?

In my case, LMDB is slow when there are many images: Flickr30K is totally fine, NLVR2 is only minimally affected, but GQA is extremely slow.

eladsegal commented 2 years ago

Sure, here it is: https://gist.github.com/eladsegal/5b3974ed0ddade5a75eb3db9ebc7d2b7 It does require a machine with a lot of memory, almost 50GB per GPU used.

I also use LMDB for GQA in another codebase, and the speed was good when the data was on a local disk but slow when it was on another machine mounted via SSHFS. I'll try to share that one too, but it will need some cleanup first, as it has dependencies on other code of mine.

e-bug commented 2 years ago

Thanks!

Oh yeah, we only have an NFS, which is probably why it's not fast enough. Storing the dataset in RAM is also not an option because of shared resources and SLURM 🙃

The new dataloader with prefetching is much faster for GQA on NFS though.
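As a rough illustration of the prefetching idea (not volta's actual GQA dataloader), a standard PyTorch DataLoader already overlaps data loading with training when given worker processes and a prefetch factor; the dataset class and parameter values below are placeholders.

```python
# Sketch of background prefetching with a plain PyTorch DataLoader.
# ToyGQADataset and all parameter values are placeholders, not volta's
# actual GQA dataloader.
import torch
from torch.utils.data import DataLoader, Dataset

class ToyGQADataset(Dataset):
    """Stand-in dataset returning random region features and a dummy label."""

    def __len__(self):
        return 1024

    def __getitem__(self, idx):
        return torch.randn(36, 2048), idx % 100  # 36 regions, dummy label

if __name__ == "__main__":
    loader = DataLoader(
        ToyGQADataset(),
        batch_size=64,
        shuffle=True,
        num_workers=4,            # read/decode features in background processes
        pin_memory=True,          # faster host-to-GPU copies
        prefetch_factor=2,        # batches each worker keeps ready in advance
        persistent_workers=True,  # keep workers alive across epochs
    )
    for features, labels in loader:
        pass  # training step would go here
```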