facebookresearch / mmf

A modular framework for vision & language multimodal research from Facebook AI Research (FAIR)
https://mmf.sh/

Error torch.distributed when running #1309

Open tinaboya2023 opened 1 year ago

tinaboya2023 commented 1 year ago

Hi, I am working on one of the projects that extends mmf. When I run it with the command below, I get the following error. I should note that I have also encountered this error in other frameworks that extend pythia. Command:
python -m torch.distributed.launch --nproc_per_node 1 tools/run.py --pretrain --tasks vqa --datasets m4c_textvqa --model m4c_split --seed 13 --config configs/vqa/m4c_textvqa/tap_base_pretrain.yml --save_dir save/m4c_split_pretrain_test training_parameters.distributed True

Error: 3

I set up my environment as follows:
python = 3.8
PyTorch and CUDA installed with: conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
GPU = 1 GeForce RTX 3090 (24 GB GPU RAM)

Could you help me solve this problem? Is this error caused by using only 1 GPU? Do I need to change the initial value of some parameters (like local_rank)? Could the error be due to a lack of GPU memory? Solving this problem is very important to me, and I would be very grateful for any guidance.

pbontrager commented 1 year ago

Hello, it is hard to find the root cause from these logs, as anything that causes the child process to crash would produce this error. Often this is caused by running out of RAM or GPU RAM. One quick check would be to lower the batch size and see if that stops the issue. Otherwise, please try to get a traceback and share it here.
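One way to capture such a traceback is to wrap the training entry point so that any exception in the child process is written to a file before the launcher reports the failure. The following is only a minimal sketch, not part of mmf: the helper name `run_with_traceback` and the log file name are hypothetical, and it assumes the launcher exports a `LOCAL_RANK` environment variable (it falls back to "0" if not).

```python
import os
import traceback


def run_with_traceback(fn, log_dir="."):
    """Call fn(); on failure, save the full traceback to a per-rank file and re-raise.

    Hypothetical helper for illustration only.
    """
    # Assumption: the launcher sets LOCAL_RANK for each child; default to "0".
    rank = os.environ.get("LOCAL_RANK", "0")
    try:
        return fn()
    except Exception:
        path = os.path.join(log_dir, f"traceback_rank{rank}.log")
        with open(path, "w") as fh:
            # format_exc() returns the traceback of the exception being handled.
            fh.write(traceback.format_exc())
        # Re-raise so the launcher still sees a non-zero exit for this child.
        raise
```

To use it, the call that starts training in the script would be passed as `fn`, for example `run_with_traceback(main)`, where `main` stands for whatever function the script actually invokes. The resulting `traceback_rank0.log` can then be shared in the issue.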