Open xuan97916 opened 2 years ago
Hi,
What dataset.max_tokens did you set? Also, what is the maximum utterance length in your dataset? You can try removing the very long utterances from training by setting task.max_sample_size, if there aren't too many of them in your data.
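For example, both can be passed as hydra overrides on the command line (the 1000/500 values below are placeholders only; pick limits that fit your data and GPU memory):

fairseq-hydra-train --config-dir /my/config --config-name myconfig.yaml dataset.max_tokens=1000 task.max_sample_size=500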
I run into an OOM error halfway through the fifth epoch, and resuming from the previous checkpoint also hits an OOM error right away. Have you resolved this?
What is your question?
Dear authors, thanks a lot for this great work! I'm getting OOM while finetuning avhubert on my own dataset using multiple GPUs, and this error usually happens on a non-initial epoch:
fairseq-hydra-train --config-dir /my/config --config-name myconfig.yaml hydra.run.dir=../saved_model/20220311_1 common.user_dir=`pwd` distributed_training.ddp_backend=c10d distributed_training.distributed_world_size=4 distributed_training.nprocs_per_node=4
The OOM happens randomly on one GPU:
I have tried to use no_c10d and pytorch_ddp as the ddp_backend, tried downgrading PyTorch to 1.9.1 or 1.8.0 according to this issue, and also checked my dataset (using max_tokens instead of batch_size to guard against long sentences), but none of these worked for me (the exact overrides I tried are shown below). What's your environment?
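The two ddp_backend variants I tried were passed as overrides, with everything else unchanged from the command above:

distributed_training.ddp_backend=no_c10d
distributed_training.ddp_backend=pytorch_ddp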
Thanks in advance for your comment!
All the best, An Hsu