Closed: JrJessyLuo closed this issue 5 years ago
Hi JrJessyLuo,
The batch size and evaluation interval are in line with other work done for neural ranking. Since we're training with pairwise loss, as is common for training neural rankers, it would be impractical and pretty redundant to train on every pair between evaluations (in Robust04, it comes out to ~22m pairs over the whole collection). We welcome additional experimentation with training and validation strategies, though!
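To make the pairwise-loss setup above concrete, here is a minimal sketch. The `score` function is a toy word-overlap stand-in for the actual neural ranker (an assumption for illustration, not the repository's model); the hinge form of the pairwise loss is one common choice among several.

```python
from collections import Counter

def score(query, doc):
    """Toy relevance score: word-overlap count between query and document.
    (Stands in for a neural ranker that maps a query/doc pair to a score.)"""
    q, d = Counter(query.split()), Counter(doc.split())
    return sum(min(q[w], d[w]) for w in q)

def pairwise_hinge_loss(query, pos_doc, neg_doc, margin=1.0):
    """Pairwise loss: penalize the model unless the relevant document
    outscores the non-relevant one by at least `margin`."""
    return max(0.0, margin - (score(query, pos_doc) - score(query, neg_doc)))

# A pair where the relevant doc clearly wins incurs zero loss:
loss = pairwise_hinge_loss("neural ranking",
                           "neural ranking models for search",
                           "cooking recipes")
# loss == 0.0
```

Training then iterates over sampled (query, relevant doc, non-relevant doc) triples, which is why enumerating every possible pair (~22m in Robust04) is unnecessary in practice.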
Can you elaborate more on your second question?
Hi seanmacavaney, thank you for your reply. In fact, I want to know how to tune parameters (like the learning rate and number of epochs) given your training and validation datasets. From your code, I found that each training epoch uses 16*32 sampled pairs to minimize the pairwise loss, and then all queries and documents in the validation set are used to evaluate and save the best model.
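The model-selection procedure described here (evaluate on the full validation set after each sampled epoch, keep the best checkpoint) can be sketched as follows. `train_one_epoch` and `validate` are hypothetical stand-ins for the repository's actual training and evaluation routines.

```python
def select_best_model(train_one_epoch, validate, n_epochs=100):
    """After each training epoch, score the model on the validation set
    and remember the best-performing epoch."""
    best_epoch, best_score = -1, float("-inf")
    for epoch in range(n_epochs):
        train_one_epoch(epoch)
        score = validate(epoch)
        if score > best_score:  # in practice, save a checkpoint here
            best_epoch, best_score = epoch, score
    return best_epoch, best_score

# Toy run: the validation score rises, peaks, then falls, so the peak is kept.
scores = [0.2, 0.3, 0.35, 0.33, 0.31]
best = select_best_model(lambda e: None, lambda e: scores[e], n_epochs=5)
# best == (2, 0.35)
```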
My question is:
I have several questions about the training and validation strategies.
This is a typical strategy when training neural ranking models. You are, of course, free to experiment with different hyper-parameters such as the number of samples per training epoch, number of samples in the validation set, etc. You can even use all 22m pairs for training, although this would take a long time to run and is probably unnecessary because we've found that the model can be trained effectively with far fewer samples than that. You can also try sampling the same training pairs each epoch as it seems you are suggesting, but I suspect that this would lead to overfitting.
I think part of the confusion here is over the definition of epoch. Here "epoch" means "iteration," not the technically correct definition of "an iteration consisting of every training instance." This is in line with how many (most?) papers use the term "epoch", but it is confusing.
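Under this "epoch = fixed number of sampled iterations" convention, a training epoch might look like the sketch below. The numbers (16 batches of 32 pairs) follow the thread; the sampler and the pair pool are hypothetical stand-ins, not the repository's code.

```python
import random

def run_epoch(pair_pool, batches_per_epoch=16, batch_size=32, seed=None):
    """One 'epoch': draw batches_per_epoch * batch_size pairs at random
    from the full pool, rather than iterating over every pair."""
    rng = random.Random(seed)
    sampled = [rng.choice(pair_pool)
               for _ in range(batches_per_epoch * batch_size)]
    # Group the samples into batches for the (omitted) gradient updates.
    return [sampled[i:i + batch_size]
            for i in range(0, len(sampled), batch_size)]

# A large pool of (query, relevant doc, non-relevant doc) triples:
pool = [("q%d" % i, "pos%d" % i, "neg%d" % i) for i in range(10000)]
batches = run_epoch(pool, seed=0)
# 16 batches of 32 pairs each: far fewer than the full pool per "epoch".
```

Because each epoch re-samples from the pool, successive epochs see different pairs, which is exactly the behavior the question above asks about.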
I've experimented with training vanilla BERT with iterations larger than 512 samples. Increasing the iteration size to 8192 already harms performance on the dev set, so I don't think training for a full, "real epoch" would lead to good performance.
I found that in every training epoch, 16*32 training pairs are randomly selected from the training data. I wonder whether this random selection affects model learning, since the training data change every epoch. I also wonder whether such a small amount of training data per epoch is effective on the validation and test sets.