alontalmor / MultiQA


training data size #18

Closed: danyaljj closed this issue 4 years ago

danyaljj commented 4 years ago

Somewhere in the paper you say:

> we will fix the size of the large datasets to control for size effects, and always train on exactly 75K examples per dataset.

Looking at the command line, it doesn't look like there is any way to specify the size of the training data. I also couldn't find anywhere in the code where you take a subset of the training data. I just wanted to confirm that the code uses all of the training data.

alontalmor commented 4 years ago

Please see the README in the models directory; the last command there is for training with AllenNLP:

python -m allennlp.run train models/MultiQA_BERTBase.jsonnet -s ../Models/MultiTrain -o "{'dataset_reader': {'sample_size': 75000}, ...

Passing {'sample_size': 75000} to the DatasetReader trains only on a sub-sample of the training set of 75,000 examples.
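For anyone wondering how a `sample_size` argument reaches the reader: here is a minimal sketch, not the repository's actual DatasetReader, of how such a constructor parameter can cap the number of examples yielded by `_read`. The class name, registration name, and JSON-lines file format are assumptions for illustration only.

```python
import json
from typing import Iterable

from allennlp.data.dataset_readers import DatasetReader
from allennlp.data.fields import MetadataField
from allennlp.data.instance import Instance


# Illustrative only: not the MultiQA repository's actual reader.
@DatasetReader.register("toy_sampling_reader")
class ToySamplingReader(DatasetReader):
    def __init__(self, sample_size: int = -1, **kwargs) -> None:
        super().__init__(**kwargs)
        # A negative sample_size means "use the full training set".
        self._sample_size = sample_size

    def _read(self, file_path: str) -> Iterable[Instance]:
        with open(file_path) as data_file:
            for i, line in enumerate(data_file):
                # Stop once sample_size examples have been yielded.
                if 0 <= self._sample_size <= i:
                    break
                example = json.loads(line)
                yield Instance({"metadata": MetadataField(example)})
```

With a reader written this way, the `-o "{'dataset_reader': {'sample_size': 75000}, ...` override in the command above sets the cap at training time without editing the jsonnet config.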