Georgetown-IR-Lab / cedr

Code for CEDR: Contextualized Embeddings for Document Ranking, accepted at SIGIR 2019.
MIT License

train data random choice #10

Closed JrJessyLuo closed 5 years ago

JrJessyLuo commented 5 years ago

I found that in every training epoch, 16*32 training pairs are randomly selected from the training data. I wonder whether this random selection affects model learning, since the training data change every epoch. I also wonder whether such a small amount of training data is effective on the evaluation and test sets.

seanmacavaney commented 5 years ago

Hi JrJessyLuo,

The batch size and evaluation interval are in line with other work done for neural ranking. Since we're training with pairwise loss, as is common for training neural rankers, it would be impractical and pretty redundant to train on every pair between evaluations (in Robust04, it comes out to ~22m pairs over the whole collection). We welcome additional experimentation with training and validation strategies, though!
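For a concrete picture, here is a minimal sketch of this kind of pairwise training with per-epoch sampling. It is not the repository's actual `train.py`: the constants, the `model.score` interface, and the hinge loss are illustrative assumptions.

```python
import random
import torch

# Assumed constants: 16 pairs per batch * 32 batches = 512 pairs per "epoch".
BATCH_SIZE = 16
BATCHES_PER_EPOCH = 32

def train_one_epoch(model, optimizer, all_pairs):
    """One training 'epoch' over a fresh random sample of (query, pos_doc, neg_doc) triples.

    `model.score` is a hypothetical interface returning a 1-D tensor of
    relevance scores for a list of (query, doc) inputs.
    """
    model.train()
    sampled = random.sample(all_pairs, BATCH_SIZE * BATCHES_PER_EPOCH)
    for i in range(0, len(sampled), BATCH_SIZE):
        batch = sampled[i:i + BATCH_SIZE]
        pos_scores = model.score([(q, pos) for q, pos, neg in batch])
        neg_scores = model.score([(q, neg) for q, pos, neg in batch])
        # Pairwise hinge loss: push the relevant document above the
        # non-relevant one by a margin of 1.
        loss = torch.relu(1.0 - pos_scores + neg_scores).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

Because the pairs are re-sampled on every call, the model sees a different subset each epoch rather than cycling over all ~22m possible pairs.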

Can you elaborate more on your second question?

JrJessyLuo commented 5 years ago

Hi seanmacavaney, thank you for your reply. In fact, I want to know how to tune parameters (like the learning rate and the number of epochs) using your training and evaluation datasets. From your code, I see that each training epoch uses 16*32 pairs to minimize the pairwise loss, and then all queries and documents in the evaluation dataset are used to select and save the best model.

My questions are:

  1. Each epoch is formed from different training pairs; do you mean to avoid overfitting given the small number of training pairs? In general, the training dataset is fixed, so each epoch should see the same training pairs.
  2. Have you considered tuning the parameters based on the pairwise loss on the training and evaluation sets? If so, how should we choose the training and evaluation set sizes?

I have a lot of questions about the training and validation strategies.

seanmacavaney commented 5 years ago

This is a typical strategy when training neural ranking models. You are, of course, free to experiment with different hyper-parameters, such as the number of samples per training epoch, the number of samples in the validation set, etc. You can even use all 22m pairs for training, although this would take a long time to run and is probably unnecessary, because we've found that the model can be trained effectively with far fewer samples than that. You can also try sampling the same training pairs each epoch, as you seem to be suggesting, but I suspect that this would lead to overfitting.
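As a sketch of the outer loop being described (again an illustration under assumed names, not this repo's script): draw a fresh sample each epoch, then run a full validation pass and keep the best checkpoint.

```python
import torch

def train_and_validate(model, optimizer, all_pairs, valid_run, valid_qrels,
                       max_epochs=100):
    """Assumed helpers: train_one_epoch (see the sketch above) and evaluate,
    which scores the full validation run and returns a metric such as P@20."""
    best_metric = float('-inf')
    for epoch in range(max_epochs):
        # Fresh random sample of training pairs on every call.
        train_one_epoch(model, optimizer, all_pairs)
        metric = evaluate(model, valid_run, valid_qrels)
        if metric > best_metric:
            best_metric = metric
            torch.save(model.state_dict(), 'best_model.pt')
    return best_metric
```

Sampling the pairs once before the loop and reusing them every epoch would give the fixed training set described in the question, but that small fixed subset is exactly what I would expect to overfit.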

andrewyates commented 5 years ago

I think part of the confusion here is over the definition of epoch. Here "epoch" means "iteration," not the technically correct definition of "an iteration consisting of every training instance." This is in line with how many (most?) papers use the term "epoch", but it is confusing.

I've experimented with training vanilla BERT with iterations longer than 512 samples. Increasing the iteration size to 8192 samples already harms performance on the dev set, so I don't think training for a full "real" epoch would lead to good performance.