KasparPeterson opened this issue 4 years ago
The XLMRoberta model is larger because it has a much bigger vocabulary. That might explain why Roberta fits in 16GB but XLMRoberta doesn't. If fp16 is not enough, you can try gradient checkpointing, which should be able to save a lot of memory.
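For what it's worth, on recent transformers releases gradient checkpointing can be switched on with a single call. A minimal sketch (not from the notebook, and the exact API depends on the installed version):

from transformers import XLMRobertaForMaskedLM

model = XLMRobertaForMaskedLM.from_pretrained("xlm-roberta-base")

# Recompute activations in the backward pass instead of storing them,
# trading extra compute for a large cut in activation memory.
model.gradient_checkpointing_enable()
# (Older releases enable this through the model config instead, e.g.
#  model.config.gradient_checkpointing = True, depending on the version.)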
Hi, thanks for the great example on training RoBERTa with long attention.
I followed this example: https://github.com/allenai/longformer/blob/master/scripts/convert_model_to_long.ipynb and was able to successfully train for one epoch with the example notebook on Colab. After changing Roberta to XLMRoberta, the model training no longer fits into 16GB of GPU memory.
In short, this is what I did:
from transformers import RobertaForMaskedLM, RobertaTokenizerFast

changed to

from transformers import XLMRobertaForMaskedLM, XLMRobertaTokenizer
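Concretely, the loading step then ends up looking roughly like this (a sketch, assuming the xlm-roberta-base checkpoint; the rest of the notebook's conversion code is unchanged):

from transformers import XLMRobertaForMaskedLM, XLMRobertaTokenizer

# Load the multilingual checkpoint in place of roberta-base.
model = XLMRobertaForMaskedLM.from_pretrained("xlm-roberta-base")
tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base", model_max_length=4096)

# XLM-R is considerably larger than roberta-base, mostly because of its embeddings.
print(f"parameters: {model.num_parameters():,}")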
After some experimentation I tried installing apex and training with the fp16 option, but I am still facing the CUDA out-of-memory error. The experiment I ran is available in Colab: https://colab.research.google.com/drive/1lje_QTh6F3f9w0LoD0yB0mEUffGWkHG-?usp=sharing
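For anyone reproducing this, the Trainer setup I mean is roughly the following (a sketch; on recent transformers/PyTorch versions fp16=True uses native mixed precision so apex is not required, and train_dataset/data_collator are placeholders for the notebook's MLM dataset and collator):

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="xlm-roberta-base-4096",
    fp16=True,                        # mixed precision; no apex needed on recent versions
    per_device_train_batch_size=1,    # keep the per-step memory footprint small
    gradient_accumulation_steps=32,   # recover a reasonable effective batch size
    max_steps=3000,
    logging_steps=500,
    save_steps=500,
)

trainer = Trainer(
    model=model,                  # the XLM-R model from the snippet above
    args=training_args,
    train_dataset=train_dataset,  # placeholder: tokenized MLM dataset
    data_collator=data_collator,  # placeholder: DataCollatorForLanguageModeling
)
trainer.train()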
Does anyone have any ideas on how to train XLM-RoBERTa with LongSelfAttention, and why it differs that much from RoBERTa?
Thanks!
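As a rough back-of-the-envelope answer to the size question (my own estimate, not from the thread): the gap is dominated by the embedding matrix, since XLM-R's vocabulary is about five times larger, and with Adam every extra fp32 parameter costs roughly 16 bytes (weight, gradient, and two optimizer moments):

from transformers import AutoConfig

for name in ("roberta-base", "xlm-roberta-base"):
    cfg = AutoConfig.from_pretrained(name)
    emb_params = cfg.vocab_size * cfg.hidden_size
    # weight + gradient + two Adam moments in fp32 ~= 16 bytes per parameter
    print(f"{name}: vocab={cfg.vocab_size}, "
          f"embedding params ~ {emb_params / 1e6:.0f}M, "
          f"training footprint for embeddings ~ {emb_params * 16 / 1e9:.1f} GB")

# roberta-base:     vocab ~ 50k  -> ~ 39M embedding parameters
# xlm-roberta-base: vocab ~ 250k -> ~ 192M embedding parameters (shared with the LM head)

That difference alone adds a couple of gigabytes before any activations are counted, which is consistent with roberta-base fitting in 16GB while xlm-roberta-base does not.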