specialized_vocab_train.csv vs large_vocab_train.csv

Hey, thanks!

We use the large_vocab_train.csv file to train the large-scale pretrained models (from Sec. 5.2 of our paper) and specialized_vocab_train.csv for the specialized models (Sec. 5.4).

We detail the process of generating these vocabularies in our appendices (Appx. B and D). In particular:

We produce the vocabulary for the [large-scale pretrained model] experiments in Sec. 5.2 from the training set by selecting all correct choices, as well as all choices and direct answers that appear in at least three questions. This results in a vocabulary with 10,424 answers.
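As a rough illustration of the rule quoted above, here is a minimal sketch, not the script we actually ran. It assumes each training example is a dict with `choices`, `correct_choice_idx`, and `direct_answers` fields, and it assumes "appear in at least three questions" means an answer is counted at most once per question. The function name `build_large_vocab` and the toy data are made up for this example.

```python
from collections import Counter


def build_large_vocab(train_set, min_questions=3):
    """Union of all correct choices and answers seen in >= min_questions questions."""
    correct = set()
    question_counts = Counter()
    for q in train_set:
        # Every correct choice is kept regardless of frequency.
        correct.add(q["choices"][q["correct_choice_idx"]])
        # Assumption: deduplicate within a question so it contributes at most 1.
        question_counts.update(set(q["choices"]) | set(q["direct_answers"]))
    frequent = {a for a, n in question_counts.items() if n >= min_questions}
    return sorted(correct | frequent)


# Toy data (hypothetical), only to show the selection rule.
toy_train = [
    {"choices": ["dog", "cat", "bird", "fish"], "correct_choice_idx": 0,
     "direct_answers": ["dog", "puppy"]},
    {"choices": ["red", "cat", "blue", "green"], "correct_choice_idx": 2,
     "direct_answers": ["blue", "puppy"]},
    {"choices": ["cat", "car", "bus", "bike"], "correct_choice_idx": 1,
     "direct_answers": ["car", "puppy"]},
]

large_vocab = build_large_vocab(toy_train)
print(large_vocab)  # ['blue', 'car', 'cat', 'dog', 'puppy']
```

Here "cat" and "puppy" are never correct choices but still enter the vocabulary because each shows up in three questions, while "dog", "blue", and "car" enter as correct choices.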

For all of the [specialized] methods in the paper we use a fixed vocabulary constructed from direct answers that appeared two or more times in the training set. This includes 2,133 bi-grams or unigrams, with 1,937 words.
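The specialized vocabulary rule can be sketched in the same spirit. This is an illustration under assumptions, not our exact script: it assumes each training example has a `direct_answers` list, and it counts every occurrence of a direct answer, including repeats within one question. The name `build_specialized_vocab` and the toy data are hypothetical.

```python
from collections import Counter


def build_specialized_vocab(train_set, min_count=2):
    """Direct answers that occur at least min_count times in the training set."""
    # Assumption: every occurrence counts, even repeats inside one question.
    counts = Counter(a for q in train_set for a in q["direct_answers"])
    return sorted(a for a, n in counts.items() if n >= min_count)


# Toy data (hypothetical), only to show the selection rule.
toy_direct = [
    {"direct_answers": ["dog", "dog", "puppy"]},
    {"direct_answers": ["hot dog", "puppy"]},
    {"direct_answers": ["hot dog", "cat"]},
]

specialized_vocab = build_specialized_vocab(toy_direct)
# Distinct words across all unigram and bigram answers.
vocab_words = {w for answer in specialized_vocab for w in answer.split()}

print(specialized_vocab)    # ['dog', 'hot dog', 'puppy']
print(sorted(vocab_words))  # ['dog', 'hot', 'puppy']
```

The second print shows why the number of answers and the number of words differ: multi-word answers such as "hot dog" share words with single-word answers.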
