Open rguan1 opened 2 years ago
From a cursory examination, it looks like this might be failing because we can't fit the model activations on the first GPU. I'll need to investigate a bit more.
This issue has not had activity in 30 days. Please feel free to reopen if you have more issues. You may apply the "never-stale" tag to prevent this from happening.
I am trying to train a T5 model on empathetic dialogues. I am running into CUDA OOM errors when training the model with the command below. When training the BlenderBot 3B model, I ran into this issue until I parallelized training across two GPUs; however, parallelizing T5 3B doesn't seem to resolve it. I've also reduced the batch size to 1 and the truncate length to 128 (truncating at 64 doesn't work either). Any suggestions to resolve the issue?
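For context, a back-of-the-envelope estimate (my own rough numbers, not taken from any traceback) shows why batch size 1 may not help: for a ~2.8B-parameter model trained with Adam, the optimizer state alone dwarfs the weights. A minimal sketch, assuming fp16 weights and gradients plus fp32 Adam moment buffers and an fp32 master copy of the weights:

```python
def training_memory_gb(n_params: float,
                       fp16: bool = True,
                       optimizer: str = "adam") -> float:
    """Rough resident memory (GB) for weights + gradients + optimizer
    state, ignoring activations entirely."""
    bytes_per_param = 2 if fp16 else 4
    weights = n_params * bytes_per_param
    grads = n_params * bytes_per_param
    # Adam keeps two fp32 moment buffers per parameter; mixed-precision
    # training typically also keeps an fp32 master copy of the weights.
    if optimizer == "adam":
        opt_state = n_params * (4 + 4 + (4 if fp16 else 0))
    else:
        opt_state = 0
    return (weights + grads + opt_state) / 1024**3

print(round(training_memory_gb(2.8e9), 1))  # → 41.7 (GB), before any activations
```

Under these assumptions, a ~3B model wants roughly 40+ GB just for training state, so it cannot fit on a single GPU regardless of batch size; activations only add to that.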
Command
parlai train_model -t empathetic_dialogues -m hugging_face/t5 --t5-model-arch t5-3b --t5-model-parallel True --fp16 True --optimizer adam --batchsize 1 --skip-generation True -vmt ppl -tr 64 --model-file ./chatbot_models/3B/testdebugT5/model --tstep 100
Error message
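On the parallelization point above: one thing worth checking is how the layers end up distributed when `--t5-model-parallel` is enabled. Hugging Face's (now-deprecated) T5 model parallelism accepts an explicit `device_map` via `model.parallelize(device_map)`; a hypothetical sketch of building an even split of t5-3b's 24 transformer blocks across two GPUs (the layer counts and the helper itself are assumptions for illustration, not ParlAI code):

```python
def make_device_map(num_layers: int, num_gpus: int) -> dict:
    """Assign consecutive transformer blocks to GPUs as evenly as possible,
    in the {gpu_index: [layer_indices]} format used by HF's parallelize()."""
    base, extra = divmod(num_layers, num_gpus)
    device_map, start = {}, 0
    for gpu in range(num_gpus):
        count = base + (1 if gpu < extra else 0)
        device_map[gpu] = list(range(start, start + count))
        start += count
    return device_map

# t5-3b has 24 blocks per stack (assumption based on the published config)
print(make_device_map(24, 2))
```

If the default split puts too many blocks (or the embeddings plus the first blocks) on GPU 0, that would be consistent with the activations-don't-fit-on-the-first-GPU hypothesis above.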