facebookresearch / dpr-scale

Scalable training for dense retrieval models.

Clarification about training of dragon #16

Closed Dundalia closed 1 year ago

Dundalia commented 1 year ago

The DRAGON paper states that the training samples are triplets: given a source of supervision, a query, one document sampled from the top-10 retrieved documents, and one sampled from the documents ranked 41-50.

In the code, specifically in dpr_scale.task.dpr_task, it seems that the CrossEntropyLoss is computed over the scores of 2*batch_size documents, so the effective training samples are much larger than simple triplets, since the model also sees in-batch negatives.
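To make sure I am reading the code correctly, here is a minimal sketch of how I understand the loss; the tensor names and the interleaved positive/negative layout are my own assumptions, not the actual dpr_scale variables:

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes, just to illustrate my understanding:
#   q: [B, d]   query embeddings for a batch of B queries
#   p: [2B, d]  passage embeddings, here assumed interleaved as
#               (pos_0, neg_0, pos_1, neg_1, ...)
B, d = 4, 768
q = torch.randn(B, d)
p = torch.randn(2 * B, d)

# Each query is scored against all 2*B passages in the batch, so besides
# its own sampled hard negative it also sees the positives and negatives
# of the other B-1 queries as in-batch negatives.
scores = q @ p.t()                # [B, 2B]

# Under the interleaved layout, the gold passage of query i is column 2*i.
targets = torch.arange(B) * 2     # [B]

loss = F.cross_entropy(scores, targets)
```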

I am confused at this point. Is the code not reproducing the paper or is the paper not clear enough?

ccsasuke commented 1 year ago

Hi @Freddavide, in-batch negatives are indeed used in DRAGON. We'll update the paper in the next revision to make this clearer.

For a more detailed discussion on the DRAGON sampling and training approach at each iteration, you can also refer to our earlier paper, SPAR (https://arxiv.org/abs/2110.06918).

Dundalia commented 1 year ago

Thanks for the clarification. Sorry to bother you, but I still have some doubts. In SPAR it is stated that you observed that a larger number of positive passages during training leads to better performance. I have some questions now:

Thanks in advance

ccsasuke commented 1 year ago

Hi @Freddavide, great questions, and I can totally see where the confusion came from -- it shows how we could improve the writing of our paper to make it clearer in the next revision.

  1. SPAR training is similar to the original DPR training when multiple positives and/or hard negatives are given for each sample. In particular, only one positive and one hard negative are randomly selected for each sample at each epoch, and training uses standard in-batch negatives, so the loss is computed not only over the two passages from the sample itself but also over all the other passages in the batch (see the sketch after this list).

  2. Yes, this is correct.

  3. Sorry, not sure I understand. What we meant by "a greater amount of positives leads to better performance" is that when you provide multiple positive passages for each sample, all of them will eventually be used during training over multiple epochs. (If there is only one positive, the query gets the same positive passage in every epoch; but if there are 10, a different passage can be picked in each of the first 10 epochs. This effectively increases the training data size, especially if you also consider in-batch negatives.)

  4. I believe dragon_aws.yaml has the hyperparameters we used in training, and IIRC we did use a per-gpu batch size of 64.
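To make the per-epoch sampling in 1. and 3. concrete, here is a rough sketch of the idea; the function and field names are made up for illustration and are not the actual dpr-scale code:

```python
import random

def sample_training_pair(example, rng=random):
    """Pick one positive and one hard negative for this example this epoch.

    `example` is assumed (hypothetically) to look like:
        {"question": ...,
         "positive_ctxs": [...],       # e.g. docs drawn from the top-10
         "hard_negative_ctxs": [...]}  # e.g. docs drawn from ranks 41-50
    Over multiple epochs, different entries get picked, which is why
    providing more positives effectively enlarges the training data.
    """
    positive = rng.choice(example["positive_ctxs"])
    hard_negative = rng.choice(example["hard_negative_ctxs"])
    return example["question"], positive, hard_negative

# The selected (question, positive, hard_negative) then goes into a batch,
# and the contrastive loss is computed against all 2*batch_size passages
# in that batch (in-batch negatives), as in the loss sketch above.
```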

Dundalia commented 1 year ago

Thanks for the exhaustive answer!! Crystal clear now!!