facebookresearch / dpr-scale

Scalable training for dense retrieval models.

Clarification about training of dragon #16

Closed Dundalia closed 1 year ago

Dundalia commented 1 year ago

The DRAGON paper states that the training samples are triplets: given a source of supervision, a query, one document sampled from the top-10 retrieved documents, and one sampled from the documents ranked 41-50.

In the code, specifically in dpr_scale.task.dpr_task, it seems that the CrossEntropyLoss is computed over the scores of 2*batch_size documents, so the effective training samples are much larger than simple triplets, since the model also sees in-batch negatives.
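To make sure I am reading the code correctly, here is a minimal sketch of how I understand the loss; the tensor names and the interleaved positive/negative layout are my own assumptions, not the actual dpr_scale variables:

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes, just to illustrate my understanding:
#   q: [B, d]   query embeddings for a batch of B queries
#   p: [2B, d]  passage embeddings, here assumed interleaved as
#               (pos_0, neg_0, pos_1, neg_1, ...)
B, d = 4, 768
q = torch.randn(B, d)
p = torch.randn(2 * B, d)

# Each query is scored against all 2*B passages in the batch, so besides
# its own sampled hard negative it also sees the positives and negatives
# of the other B-1 queries as in-batch negatives.
scores = q @ p.t()                # [B, 2B]

# Under the interleaved layout, the gold passage of query i is column 2*i.
targets = torch.arange(B) * 2     # [B]

loss = F.cross_entropy(scores, targets)
```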

I am confused at this point. Is the code not reproducing the paper or is the paper not clear enough?

ccsasuke commented 1 year ago

Hi @Freddavide, in-batch negatives are indeed used in DRAGON. We'll update the paper in the next revision to make this clearer.

For a more detailed discussion on the DRAGON sampling and training approach at each iteration, you can also refer to our earlier paper, SPAR (https://arxiv.org/abs/2110.06918).

Dundalia commented 1 year ago

Thanks for the clarification. Sorry to bother you, but I still have some doubts. In SPAR it is stated that you observed that a larger number of positive passages during training leads to better performance. I have some questions now:

Thanks in advance

ccsasuke commented 1 year ago

Hi @Freddavide, great questions, and I can totally see where the confusion came from -- it shows how we could improve the writing of our paper to make it clearer in the next revision.

  1. SPAR training is similar to the original DPR training when multiple positives and/or hard negatives are given for each sample. In particular, only one positive and one hard negative are randomly selected for each sample at each epoch, and training uses standard in-batch negatives, so the loss is computed not only over the two passages from the sample itself but also over all the other passages in the batch (see the sketch after this list).

  2. Yes, this is correct.

  3. Sorry, not sure I understand. What we meant by "a greater amount of positives leads to better performance" is that when you provide multiple positive passages for each sample, all of them will eventually be used during training over multiple epochs. (If there is only one positive, the query gets the same positive passage in every epoch; but if there are 10, a different passage can be picked in each of the first 10 epochs. This effectively increases the training data size, especially if you also consider in-batch negatives.)

  4. I believe dragon_aws.yaml has the hyperparameters we used in training, and IIRC we did use a per-gpu batch size of 64.
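To make the per-epoch sampling in 1. and 3. concrete, here is a rough sketch of the idea; the function and field names are made up for illustration and are not the actual dpr-scale code:

```python
import random

def sample_training_pair(example, rng=random):
    """Pick one positive and one hard negative for this example this epoch.

    `example` is assumed (hypothetically) to look like:
        {"question": ...,
         "positive_ctxs": [...],       # e.g. docs drawn from the top-10
         "hard_negative_ctxs": [...]}  # e.g. docs drawn from ranks 41-50
    Over multiple epochs, different entries get picked, which is why
    providing more positives effectively enlarges the training data.
    """
    positive = rng.choice(example["positive_ctxs"])
    hard_negative = rng.choice(example["hard_negative_ctxs"])
    return example["question"], positive, hard_negative

# The selected (question, positive, hard_negative) then goes into a batch,
# and the contrastive loss is computed against all 2*batch_size passages
# in that batch (in-batch negatives), as in the loss sketch above.
```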

Dundalia commented 1 year ago

Thanks for the exhaustive answer!! Crystal clear now!!