KasparPeterson opened this issue 4 years ago
The XLMRoberta model is larger because it has a much bigger vocabulary. That might explain why Roberta fits in 16GB but XLMRoberta doesn't. If fp16 is not enough, you can try gradient checkpointing, which should be able to save a lot of memory.
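For what it's worth, on recent transformers releases gradient checkpointing can be switched on with a single call. A minimal sketch (not from the notebook, and the exact API depends on the installed version):

from transformers import XLMRobertaForMaskedLM

model = XLMRobertaForMaskedLM.from_pretrained("xlm-roberta-base")

# Recompute activations in the backward pass instead of storing them,
# trading extra compute for a large cut in activation memory.
model.gradient_checkpointing_enable()
# (Older releases enable this through the model config instead, e.g.
#  model.config.gradient_checkpointing = True, depending on the version.)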
Hi, thanks for the great example on training RoBERTa with long attention.
I followed this example: https://github.com/allenai/longformer/blob/master/scripts/convert_model_to_long.ipynb and was able to successfully train for one epoch with the example notebook on Colab. After changing Roberta to XLMRoberta, the model training no longer fits into 16GB of GPU memory.
In short, this is what I did:
from transformers import RobertaForMaskedLM, RobertaTokenizerFast

changed to

from transformers import XLMRobertaForMaskedLM, XLMRobertaTokenizer
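Concretely, the loading step then ends up looking roughly like this (a sketch, assuming the xlm-roberta-base checkpoint; the rest of the notebook's conversion code is unchanged):

from transformers import XLMRobertaForMaskedLM, XLMRobertaTokenizer

# Load the multilingual checkpoint in place of roberta-base.
model = XLMRobertaForMaskedLM.from_pretrained("xlm-roberta-base")
tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base", model_max_length=4096)

# XLM-R is considerably larger than roberta-base, mostly because of its embeddings.
print(f"parameters: {model.num_parameters():,}")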
After some experimentation I tried installing apex and training with the fp16 option, but I am still facing the CUDA out-of-memory error. The experiment I ran is available in Colab: https://colab.research.google.com/drive/1lje_QTh6F3f9w0LoD0yB0mEUffGWkHG-?usp=sharing
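For anyone reproducing this, the Trainer setup I mean is roughly the following (a sketch; on recent transformers/PyTorch versions fp16=True uses native mixed precision so apex is not required, and train_dataset/data_collator are placeholders for the notebook's MLM dataset and collator):

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="xlm-roberta-base-4096",
    fp16=True,                        # mixed precision; no apex needed on recent versions
    per_device_train_batch_size=1,    # keep the per-step memory footprint small
    gradient_accumulation_steps=32,   # recover a reasonable effective batch size
    max_steps=3000,
    logging_steps=500,
    save_steps=500,
)

trainer = Trainer(
    model=model,                  # the XLM-R model from the snippet above
    args=training_args,
    train_dataset=train_dataset,  # placeholder: tokenized MLM dataset
    data_collator=data_collator,  # placeholder: DataCollatorForLanguageModeling
)
trainer.train()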
Does anyone have any ideas on how to train XLM-RoBERTa with LongSelfAttention, and why it differs that much from RoBERTa?
Thanks!
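As a rough back-of-the-envelope answer to the size question (my own estimate, not from the thread): the gap is dominated by the embedding matrix, since XLM-R's vocabulary is about five times larger, and with Adam every extra fp32 parameter costs roughly 16 bytes (weight, gradient, and two optimizer moments):

from transformers import AutoConfig

for name in ("roberta-base", "xlm-roberta-base"):
    cfg = AutoConfig.from_pretrained(name)
    emb_params = cfg.vocab_size * cfg.hidden_size
    # weight + gradient + two Adam moments in fp32 ~= 16 bytes per parameter
    print(f"{name}: vocab={cfg.vocab_size}, "
          f"embedding params ~ {emb_params / 1e6:.0f}M, "
          f"training footprint for embeddings ~ {emb_params * 16 / 1e9:.1f} GB")

# roberta-base:     vocab ~ 50k  -> ~ 39M embedding parameters
# xlm-roberta-base: vocab ~ 250k -> ~ 192M embedding parameters (shared with the LM head)

That difference alone adds a couple of gigabytes before any activations are counted, which is consistent with roberta-base fitting in 16GB while xlm-roberta-base does not.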