facebookresearch / XLM

PyTorch original implementation of Cross-lingual Language Model Pretraining.

confusion about `lm_head`'s size? #354

Open tnq177 opened 1 year ago

tnq177 commented 1 year ago
In [58]: xlmr = torch.hub.load('pytorch/fairseq', 'xlmr.large')
In [59]: xlmr.model.encoder.lm_head
Out[59]:
RobertaLMHead(
  (dense): Linear(in_features=1024, out_features=1024, bias=True)
  (layer_norm): FusedLayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
)
In [60]: xlmr.model.encoder.lm_head.weight.size()
Out[60]: torch.Size([250002, 1024])

In [61]: xlmr.model.encoder.lm_head.bias.size()
Out[61]: torch.Size([250002])

If I understand correctly, the lm_head is simply the word embedding in the tied-embedding case. What I don't understand is why the module printout shows a dense layer of size [1024, 1024], but inspecting the weight and bias gives [250002, 1024] and [250002]. I would assume [250002, 1024] is the correct size for the output projection.
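
(By "tied" I mean the output projection reuses the embedding matrix as its weight, as in this toy example; the names and shapes below are just for illustration, not the fairseq code.)

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, embed_dim = 250002, 1024           # XLM-R large sizes, for illustration
embedding = nn.Embedding(vocab_size, embed_dim)

hidden = torch.randn(8, embed_dim)             # stand-in for encoder output
# tied output projection: the embedding matrix doubles as the LM head weight
logits = F.linear(hidden, embedding.weight)    # shape: [8, 250002]
```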

tnq177 commented 1 year ago

Oh, I see the source code for lm_head now, no worries.
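
For anyone else who hits this: the head is roughly the following (my simplified paraphrase of the linked fairseq source, not the exact code). The tied `weight` and the `bias` are bare parameters rather than submodules, which is why `print(lm_head)` only lists `dense` and `layer_norm`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LMHeadSketch(nn.Module):
    """Simplified paraphrase of fairseq's RobertaLMHead (not the exact source)."""

    def __init__(self, embed_dim, vocab_size, tied_weight):
        super().__init__()
        self.dense = nn.Linear(embed_dim, embed_dim)   # the [1024, 1024] layer in the repr
        self.layer_norm = nn.LayerNorm(embed_dim)
        # tied output projection: a bare Parameter (the embedding matrix),
        # so it is not listed by print(module), which only shows submodules
        self.weight = tied_weight                      # shape [vocab_size, embed_dim]
        self.bias = nn.Parameter(torch.zeros(vocab_size))

    def forward(self, features):
        x = F.gelu(self.dense(features))
        x = self.layer_norm(x)
        return F.linear(x, self.weight) + self.bias    # [..., vocab_size]

emb = nn.Embedding(250002, 1024)
head = LMHeadSketch(1024, 250002, emb.weight)
print(head)                  # only dense and layer_norm appear
print(head.weight.size())    # torch.Size([250002, 1024])
```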

tnq177 commented 1 year ago

Actually, I'm still confused. The README of this repo instructs you to train your own XLM model with MLM or MLM+TLM using train.py, and following the train.py code, it seems to use the Transformer implementation in this repo. However, in https://github.com/facebookresearch/XLM/blob/cd281d32612d145c6742b4d3f048f80df8669c30/xlm/model/transformer.py#L239, I can only see the final pred_layer using the word embedding alone, with no embed_dim -> embed_dim dense layer as in https://github.com/facebookresearch/fairseq/blob/main/fairseq/models/roberta/model.py#L475. Which one is correct, please?
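
To make the difference concrete, this is roughly what XLM's pred_layer boils down to as I read transformer.py (a simplified sketch with the adaptive-softmax branch omitted, not the exact source): a single vocab-sized projection whose weight can be tied to the embedding, with no intermediate embed_dim -> embed_dim dense layer or layer norm before it.

```python
import torch.nn as nn

class PredLayerSketch(nn.Module):
    """Simplified reading of XLM's PredLayer (adaptive-softmax branch omitted)."""

    def __init__(self, emb_dim, n_words, embedding=None, share_inout_emb=True):
        super().__init__()
        # single projection straight from emb_dim to the vocabulary
        self.proj = nn.Linear(emb_dim, n_words, bias=True)
        if share_inout_emb and embedding is not None:
            # weight tying: reuse the embedding matrix as the projection weight
            self.proj.weight = embedding.weight

    def forward(self, x):
        # no dense layer, activation, or layer norm before the projection
        return self.proj(x)
```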