ikuyamada closed this issue 4 years ago
1) We are still using nll_loss from PyTorch, something like the following, where net_input1 is the context concatenated with answer candidate 1 and net_input2 is the context concatenated with answer candidate 2 (see also the self-contained sketch at the end of this comment):
import torch
import torch.nn.functional as F

# Encode each (context, candidate) sequence and score it with the shared head.
features1, extra1 = model(**sample['net_input1'], features_only=True)
features2, extra2 = model(**sample['net_input2'], features_only=True)
logits1 = model.sentence_classification_head(features1)
logits2 = model.sentence_classification_head(features2)
# Each head output has shape (bsz, 1); concatenating gives (bsz, 2).
logits = torch.cat([logits1, logits2], dim=1)
# Target is the index (0 or 1) of the positive candidate in each pair.
targets = model.get_targets(sample, [logits]).view(-1)
sample_size = targets.numel()
loss = F.nll_loss(
    F.log_softmax(logits, dim=-1, dtype=torch.float32),
    targets,
    reduction='sum',
)
2) All possible negative and positive pairs are selected; there is no sampling. I think this results in ~1.4M samples per epoch.
3) We use lr=1e-5, bsz=32, and dropout=0.2 in the classification_head.
Let me know if any details are unclear.
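For reference, here is a minimal, self-contained sketch of the same pairwise loss on dummy tensors (this is not fairseq code; the batch size and random inputs are made up purely to make the shapes of logits and targets explicit):

import torch
import torch.nn.functional as F

bsz = 4
# Stand-ins for the per-candidate scores produced by the classification head.
logits1 = torch.randn(bsz, 1)                    # score for candidate 1
logits2 = torch.randn(bsz, 1)                    # score for candidate 2
logits = torch.cat([logits1, logits2], dim=1)    # shape (bsz, 2)
# 0 if candidate 1 is the positive answer, 1 if candidate 2 is.
targets = torch.randint(0, 2, (bsz,))
loss = F.nll_loss(
    F.log_softmax(logits, dim=-1, dtype=torch.float32),
    targets,
    reduction='sum',
)
print(loss.item() / targets.numel())             # average loss per pair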
@ngoyal2707 Thank you for your prompt reply!
All possible negative and positive pairs are selected; there is no sampling. I think this results in ~1.4M samples per epoch.
Does this mean that, given an example, all combinations of positive and negative answers are selected? For example, if the example contains two positive answers A and B and two negative answers C and D, are all combinations, namely (A, C), (A, D), (B, C), and (B, D), generated from the example? If so, are all generated positive-negative pairs treated as independent training instances that are shuffled during training?
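To make the question concrete, a sketch of the enumeration I have in mind looks like the following (the answer lists and variable names are purely illustrative, not taken from the actual pipeline):

from itertools import product

positives = ['A', 'B']   # positive answer candidates for one example
negatives = ['C', 'D']   # negative answer candidates for the same example

# Enumerate every positive-negative combination: (A, C), (A, D), (B, C), (B, D).
pairs = list(product(positives, negatives))
print(pairs)

# If each pair is an independent training instance, all of them would be added
# to the training set and shuffled together with pairs from other examples.
training_instances = [{'positive': p, 'negative': n} for p, n in pairs]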
I read issue #1598 and the updated RoBERTa paper, and would like to ask some questions about the training details on the ReCoRD dataset.
The paper mentions the following: