allenai / longformer

Longformer: The Long-Document Transformer
https://arxiv.org/abs/2004.05150
Apache License 2.0

GPU OOM when training XLM-RoBERTa with LongSelfAttention #91

Open · KasparPeterson opened this issue 4 years ago

KasparPeterson commented 4 years ago

Hi, thanks for the great example on training RoBERTa with long attention.

I followed this example: https://github.com/allenai/longformer/blob/master/scripts/convert_model_to_long.ipynb and was able to train for one epoch with the example notebook on Colab. After changing RoBERTa to XLM-RoBERTa, training no longer fits into 16GB of GPU memory.

In short, this is what I did: changed `from transformers import RobertaForMaskedLM, RobertaTokenizerFast` to `from transformers import XLMRobertaForMaskedLM, XLMRobertaTokenizer`.
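For reference, the swap looks roughly like this (a minimal sketch assuming the `xlm-roberta-base` checkpoint; the rest of the convert_model_to_long.ipynb flow is unchanged):

```python
# Replaces the RobertaForMaskedLM / RobertaTokenizerFast imports from the notebook
from transformers import XLMRobertaForMaskedLM, XLMRobertaTokenizer

model = XLMRobertaForMaskedLM.from_pretrained("xlm-roberta-base")
tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")
```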

After some experimentation I tried installing apex and training with the fp16 option, but I still hit the CUDA out-of-memory error. The experiment I ran is available in Colab: https://colab.research.google.com/drive/1lje_QTh6F3f9w0LoD0yB0mEUffGWkHG-?usp=sharing

Does anyone have ideas on how to train XLM-RoBERTa with LongSelfAttention, and why its memory usage differs so much from RoBERTa?

Thanks!

ibeltagy commented 4 years ago

The XLMRoberta model is larger because it has a much bigger vocabulary. That might explain why Roberta fits in 16GB but XLMRoberta doesn't. If fp16 is not enough, you can try gradient checkpointing, which should save a lot of memory.
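For a rough sense of the size gap and a sketch of the suggested mitigations, the snippet below first does a back-of-envelope comparison of the embedding matrices (the vocabulary sizes are approximate) and then shows one way to enable gradient checkpointing plus fp16, assuming a recent transformers release that exposes `gradient_checkpointing_enable()` and the Trainer `fp16` flag; the exact knobs may differ in older versions:

```python
# Back-of-envelope: the embedding matrix alone is roughly 5x larger for XLM-R.
# Approximate vocab sizes: roberta-base ~50k, xlm-roberta-base ~250k; hidden size 768.
roberta_embed_params = 50_265 * 768    # ~38.6M parameters
xlmr_embed_params = 250_002 * 768      # ~192M parameters
print(f"roberta-base embeddings:     {roberta_embed_params / 1e6:.1f}M params")
print(f"xlm-roberta-base embeddings: {xlmr_embed_params / 1e6:.1f}M params")

# Sketch of enabling gradient checkpointing and mixed precision with the
# Hugging Face Trainer setup from the notebook (illustrative settings only):
from transformers import XLMRobertaForMaskedLM, TrainingArguments

model = XLMRobertaForMaskedLM.from_pretrained("xlm-roberta-base")
model.gradient_checkpointing_enable()  # trade extra compute for activation memory

training_args = TrainingArguments(
    output_dir="tmp",
    per_device_train_batch_size=1,     # small micro-batch to fit in 16GB
    gradient_accumulation_steps=8,     # keep the effective batch size up
    fp16=True,                         # native mixed precision instead of apex
)
```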