facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
MIT License

Training details of the RoBERTa model on the ReCoRD dataset #1815

Closed ikuyamada closed 4 years ago

ikuyamada commented 4 years ago

I read issue #1598 and the updated RoBERTa paper, and would like to ask a few questions about the training details on the ReCoRD dataset.

The paper states the following:

> during training we adopt a pairwise ranking formulation with one negative and positive entity for each (passage, query) pair. At evaluation time, we pick the entity with the highest score for each question.
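
For concreteness, a minimal sketch of that evaluation rule is below; the helper `score_candidate` is hypothetical (standing in for encoding the passage, query, and entity and applying the classification head), not fairseq API:

    # Hypothetical sketch of the evaluation rule quoted above: score every
    # candidate entity independently and return the argmax per (passage, query).
    def predict_entity(model, passage, query, candidates, score_candidate):
        # score_candidate(model, passage, query, entity) -> float stands in for
        # running the encoder plus classification head on one candidate.
        scores = [score_candidate(model, passage, query, e) for e in candidates]
        best = max(range(len(scores)), key=lambda i: scores[i])
        return candidates[best]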

ngoyal2707 commented 4 years ago

1) We are still using `nll_loss` from PyTorch, something like the following, where `net_input1` is the context concatenated with answer candidate 1 and `net_input2` is the context concatenated with answer candidate 2:

        import torch
        import torch.nn.functional as F

        # Encode each (context + candidate) input separately with the shared model.
        features1, extra1 = model(**sample['net_input1'], features_only=True)
        features2, extra2 = model(**sample['net_input2'], features_only=True)

        # Score both candidates with the same classification head (one logit each).
        logits1 = model.sentence_classification_head(features1)
        logits2 = model.sentence_classification_head(features2)

        # Treat the two scores as a binary classification: the target indicates
        # which of the two candidates is the positive entity.
        logits = torch.cat([logits1, logits2], dim=1)
        targets = model.get_targets(sample, [logits]).view(-1)
        sample_size = targets.numel()

        # Cross-entropy over the two candidate scores, summed over the batch.
        loss = F.nll_loss(
            F.log_softmax(logits, dim=-1, dtype=torch.float32),
            targets,
            reduction='sum',
        )

2) All possible negative and positive pairs are selected; there is no sampling happening. I think this results in ~1.4M samples per epoch.

3) We use lr=1e-5 and bsz=32; we also use dropout=0.2 in the classification head.
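
As a rough illustration of where that dropout sits, here is a simplified classification-head sketch, loosely modeled on fairseq's RoBERTa head (the actual fairseq module also applies a dense layer with a tanh activation before the output projection, so treat this as an assumption-laden simplification):

    import torch.nn as nn

    class ClassificationHead(nn.Module):
        """Simplified sentence classification head with dropout=0.2."""

        def __init__(self, input_dim=1024, num_classes=1, dropout=0.2):
            super().__init__()
            self.dropout = nn.Dropout(p=dropout)  # the dropout=0.2 noted above
            self.out_proj = nn.Linear(input_dim, num_classes)

        def forward(self, features):
            # features: [batch, seq_len, dim]; pool by taking the first (<s>)
            # token, as RoBERTa-style heads typically do.
            x = features[:, 0, :]
            x = self.dropout(x)
            return self.out_proj(x)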

Let me know if any details are unclear.

ikuyamada commented 4 years ago

@ngoyal2707 Thank you for your prompt reply!

> All possible negative and positive pairs are selected; there is no sampling happening. I think this results in ~1.4M samples per epoch.

Does this mean that, given an example, all combinations of positive and negative answers are selected? For example, if an example contains two positive answers A and B, and two negative answers C and D, are all four combinations, namely (A, C), (A, D), (B, C), and (B, D), generated from it? If so, are all generated positive-negative pairs treated as independent training instances that are shuffled during training?
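
For illustration, the interpretation described in this question corresponds to the cartesian product of positives and negatives (whether this matches the actual fairseq preprocessing is exactly what is being asked):

    from itertools import product

    # Hypothetical pair generation for the example above: every
    # (positive, negative) combination becomes one training instance.
    positives = ['A', 'B']
    negatives = ['C', 'D']
    pairs = list(product(positives, negatives))
    print(pairs)  # [('A', 'C'), ('A', 'D'), ('B', 'C'), ('B', 'D')]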