facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
MIT License

Layer Norm in XLM-R XL and XXL #3600

Closed stefan-it closed 2 years ago

stefan-it commented 3 years ago

Hi :)

I'm currently trying to convert the recently released XLM-R XL and XXL models into Transformers-compatible weights.

I'm using the latest fairseq master (commit 2fd9d8a972794ba919174baf0d1828a5a4c626f3), and there's something strange with the layer norm parameters.

For debugging, here are the parameter names (shortened) for the XLM-R Base model:

encoder.sentence_encoder.layernorm_embedding.weight        
encoder.sentence_encoder.layernorm_embedding.bias

Here the parameter name is layernorm_embedding. However, for the new XL model the output is:

encoder.sentence_encoder.layer_norm.weight
encoder.sentence_encoder.layer_norm.bias
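
(For reference, these names can be dumped straight from the raw checkpoint; a minimal sketch, assuming the usual fairseq checkpoint layout with the weights stored under the "model" key, and with "model.pt" as a placeholder for the downloaded file:)

import torch

# Load the raw fairseq checkpoint and print the layer-norm-related parameter names.
state = torch.load("model.pt", map_location="cpu")  # placeholder path
for name in state["model"]:
    if "layernorm" in name or "layer_norm" in name:
        print(name)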

So the parameter name is "layer_norm". When loading the model with the fairseq library, like:

from fairseq.models.roberta import RobertaModel as FairseqRobertaModel

xlmr = FairseqRobertaModel.from_pretrained(roberta_checkpoint_path)
xlmr.eval()  # disable dropout

The (shortened) module list for XLM-R Base shows:

RobertaHubInterface(                                                                                 
  (model): RobertaModel(                                                                                  
    (encoder): RobertaEncoder(                                                
      (sentence_encoder): TransformerEncoder(                               
        (dropout_module): FairseqDropout()                                                               
        (embed_tokens): Embedding(250002, 768, padding_idx=1)               
        (embed_positions): LearnedPositionalEmbedding(514, 768, padding_idx=1)                           
        (layernorm_embedding): FusedLayerNorm(torch.Size([768]), eps=1e-05, elementwise_affine=True)

whereas the module list for the XL model shows:

RobertaHubInterface(                                                                                      
  (model): RobertaModel(                                                                              
    (encoder): RobertaEncoder(                                                                            
      (sentence_encoder): TransformerEncoder(                                                             
        (dropout_module): FairseqDropout()
        (embed_tokens): Embedding(250880, 2560, padding_idx=1)
        (embed_positions): LearnedPositionalEmbedding(514, 2560, padding_idx=1)

So the embedding layer norm is missing in the XL model :thinking:
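
A quick way to confirm this on the loaded hub interface (attribute names taken from the module listings above; a module that doesn't exist simply prints as None):

# Check which norm modules are present on the loaded model.
enc = xlmr.model.encoder.sentence_encoder
print("layernorm_embedding:", getattr(enc, "layernorm_embedding", None))
print("layer_norm:", getattr(enc, "layer_norm", None))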

Side note: I've updated the conversion script in the Transformers library to be compatible with the latest fairseq master. At the end, the script compares a forward pass of the original fairseq model against the converted model to see the differences. For the old XLM-R Base model the outputs are identical, whereas for XLM-R XL the difference is very large. The script can be found here.
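
Roughly, that comparison boils down to feeding the same token ids through both models and diffing the outputs; a minimal sketch (not the actual script — the paths are placeholders and the converted model is assumed to load as a plain XLMRobertaModel):

import torch
from fairseq.models.roberta import RobertaModel as FairseqRobertaModel
from transformers import XLMRobertaModel

fairseq_model = FairseqRobertaModel.from_pretrained("/path/to/xlmr")  # placeholder
fairseq_model.eval()  # disable dropout
hf_model = XLMRobertaModel.from_pretrained("/path/to/converted")  # placeholder
hf_model.eval()

tokens = fairseq_model.encode("Hello world!")  # 1-D tensor of token ids
with torch.no_grad():
    fairseq_out = fairseq_model.extract_features(tokens)  # (1, seq_len, hidden)
    hf_out = hf_model(tokens.unsqueeze(0)).last_hidden_state  # (1, seq_len, hidden)

print("max abs diff:", (fairseq_out - hf_out).abs().max().item())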

Thanks for your help!

ngoyal2707 commented 3 years ago

@stefan-it XLM-R Base and Large use the post-layernorm setting of the transformer, while XL and XXL use the pre-layernorm setting.

In the pre-LN setting the embeddings are usually not normalized and there's a layer norm at the start of each transformer block, though there's an extra layer norm at the end of the transformer.
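
Schematically (a rough sketch of the two settings, not fairseq's actual code):

import torch
import torch.nn as nn

hidden = 2560
norm = nn.LayerNorm(hidden)
sublayer = nn.Linear(hidden, hidden)  # stand-in for an attention/FFN sublayer

def post_ln(x):
    # Post-LN (XLM-R Base/Large): residual first, then normalize;
    # the input embeddings also get their own layernorm_embedding.
    return norm(x + sublayer(x))

def pre_ln(x):
    # Pre-LN (XLM-R XL/XXL): normalize before the sublayer; embeddings stay
    # unnormalized, and one final layer_norm is applied after the last block.
    return x + sublayer(norm(x))

x = torch.randn(1, 4, hidden)
print(post_ln(x).shape, pre_ln(x).shape)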

ngoyal2707 commented 3 years ago

You will need to create the HF transformer in the same way to get the same output.

ricardorei commented 3 years ago

@ngoyal2707 independently of those changes between Base/Large and XL/XXL, I can't load the new XL and XXL models with any fairseq version (without making changes to the state_dict).

If I use version 0.9.0 I get a bunch of unexpected keys, because "decoder" was renamed to "encoder". If I use version >= 0.10 I get a key mismatch around emb_layer_norm, which I assume was renamed to layer_norm:

RuntimeError: Error(s) in loading state_dict for RobertaModel:
        Missing key(s) in state_dict: "encoder.sentence_encoder.emb_layer_norm.weight", "encoder.sentence_encoder.emb_layer_norm.bias".
        Unexpected key(s) in state_dict: "encoder.sentence_encoder.layer_norm.weight", "encoder.sentence_encoder.layer_norm.bias", "encoder.sentence_encoder.version".

In any case, those checkpoints seem impossible to load without hacking around.

stefan-it commented 3 years ago

@ricardorei I installed fairseq via pip3 install git+https://github.com/pytorch/fairseq.git, as I've also seen different error messages for various fairseq versions. But with the latest master I could load the new, larger models :hugs:

stefan-it commented 3 years ago

@ngoyal2707 Thanks for your explanation :+1: I can see the changes in 54423d3b22a3e7f536e02e9e5445cef9becbd60d, so we're currently adjusting the RoBERTa model in Transformers to support the new models :)

Soonhwan-Kwon commented 3 years ago

I encountered the same error, and it seems that layer_norm needs to be added to TransformerSentenceEncoder (https://github.com/pytorch/fairseq/blob/master/fairseq/modules/transformer_sentence_encoder.py).

stale[bot] commented 2 years ago

This issue has been automatically marked as stale. If this issue is still affecting you, please leave any comment (for example, "bump"), and we'll keep it open. We are sorry that we haven't been able to prioritize it yet. If you have any new additional information, please include it with your comment!

stale[bot] commented 2 years ago

Closing this issue after a prolonged period of inactivity. If this issue is still present in the latest release, please create a new issue with up-to-date information. Thank you!