facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
MIT License

Training details of the RoBERTa model on the ReCoRD dataset #1815

Closed ikuyamada closed 4 years ago

ikuyamada commented 4 years ago

I read issue #1598 and the updated RoBERTa paper, and would like to ask a few questions about the training details on the ReCoRD dataset.

The paper states the following:

> during training we adopt a pairwise ranking formulation with one negative and positive entity for each (passage, query) pair. At evaluation time, we pick the entity with the highest score for each question.
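
For concreteness, a minimal sketch of that evaluation rule is below; the helper `score_candidate` is hypothetical (standing in for encoding the passage, query, and entity and applying the classification head), not fairseq API:

    # Hypothetical sketch of the evaluation rule quoted above: score every
    # candidate entity independently and return the argmax per (passage, query).
    def predict_entity(model, passage, query, candidates, score_candidate):
        # score_candidate(model, passage, query, entity) -> float stands in for
        # running the encoder plus classification head on one candidate.
        scores = [score_candidate(model, passage, query, e) for e in candidates]
        best = max(range(len(scores)), key=lambda i: scores[i])
        return candidates[best]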

ngoyal2707 commented 4 years ago

1) We are still using `nll_loss` from PyTorch, something like the following, where `net_input1` is the context concatenated with answer candidate 1 and `net_input2` is the context concatenated with answer candidate 2:

        import torch
        import torch.nn.functional as F

        # Encode each (context + candidate) input separately with the shared model.
        features1, extra1 = model(**sample['net_input1'], features_only=True)
        features2, extra2 = model(**sample['net_input2'], features_only=True)

        # Score both candidates with the same classification head (one logit each).
        logits1 = model.sentence_classification_head(features1)
        logits2 = model.sentence_classification_head(features2)

        # Treat the two scores as a binary classification: the target indicates
        # which of the two candidates is the positive entity.
        logits = torch.cat([logits1, logits2], dim=1)
        targets = model.get_targets(sample, [logits]).view(-1)
        sample_size = targets.numel()

        # Cross-entropy over the two candidate scores, summed over the batch.
        loss = F.nll_loss(
            F.log_softmax(logits, dim=-1, dtype=torch.float32),
            targets,
            reduction='sum',
        )

2) All possible negative and positive pairs are selected; there is no sampling happening. I think this results in ~1.4M samples per epoch.

3) We use lr=1e-5 and bsz=32; we also use dropout=0.2 in the classification head.
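
As a rough illustration of where that dropout sits, here is a simplified classification-head sketch, loosely modeled on fairseq's RoBERTa head (the actual fairseq module also applies a dense layer with a tanh activation before the output projection, so treat this as an assumption-laden simplification):

    import torch.nn as nn

    class ClassificationHead(nn.Module):
        """Simplified sentence classification head with dropout=0.2."""

        def __init__(self, input_dim=1024, num_classes=1, dropout=0.2):
            super().__init__()
            self.dropout = nn.Dropout(p=dropout)  # the dropout=0.2 noted above
            self.out_proj = nn.Linear(input_dim, num_classes)

        def forward(self, features):
            # features: [batch, seq_len, dim]; pool by taking the first (<s>)
            # token, as RoBERTa-style heads typically do.
            x = features[:, 0, :]
            x = self.dropout(x)
            return self.out_proj(x)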

Let me know if any details are unclear.

ikuyamada commented 4 years ago

@ngoyal2707 Thank you for your prompt reply!

> All possible negative and positive pairs are selected; there is no sampling happening. I think this results in ~1.4M samples per epoch.

Does this mean that, given an example, all combinations of positive and negative answers are selected? For example, if an example contains two positive answers A and B, and two negative answers C and D, are all four combinations, namely (A, C), (A, D), (B, C), and (B, D), generated from it? If so, are all generated positive-negative pairs treated as independent training instances that are shuffled during training?
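
For illustration, the interpretation described in this question corresponds to the cartesian product of positives and negatives (whether this matches the actual fairseq preprocessing is exactly what is being asked):

    from itertools import product

    # Hypothetical pair generation for the example above: every
    # (positive, negative) combination becomes one training instance.
    positives = ['A', 'B']
    negatives = ['C', 'D']
    pairs = list(product(positives, negatives))
    print(pairs)  # [('A', 'C'), ('A', 'D'), ('B', 'C'), ('B', 'D')]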