TACJu / TransFG

This is the official PyTorch implementation of the paper "TransFG: A Transformer Architecture for Fine-grained Recognition" (Ju He, Jie-Neng Chen, Shuai Liu, Adam Kortylewski, Cheng Yang, Yutong Bai, Changhu Wang, Alan Yuille).

Failed to run on multiple GPUs #5

Open KleinXin opened 3 years ago

KleinXin commented 3 years ago

CUDA_VISIBLE_DEVICES=0,1,2,3 python3 -m torch.distributed.launch --nproc_per_node=4 train.py xxx

The command above does not run on multiple GPUs.

Strangely, all of the training runs on the first GPU, and as soon as the batch size is increased an OOM error is reported.

Does anyone know what is wrong?
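
As a sanity check (not part of this repository), a tiny hypothetical script such as the check_gpus.py sketched below, launched the same way as train.py, shows whether each spawned process actually receives its own local rank:

# check_gpus.py -- hypothetical diagnostic; launch it exactly like train.py:
#   CUDA_VISIBLE_DEVICES=0,1,2,3 python3 -m torch.distributed.launch --nproc_per_node=4 check_gpus.py
import os
import argparse
import torch

parser = argparse.ArgumentParser()
# torch.distributed.launch passes --local_rank on the command line; torchrun only sets the LOCAL_RANK env var.
parser.add_argument("--local_rank", type=int, default=int(os.getenv("LOCAL_RANK", 0)))
args = parser.parse_args()

# Each of the four processes should print a different local_rank (0-3). If they all print 0,
# the rank never reaches the script and every process ends up training on GPU 0.
print(f"pid={os.getpid()} local_rank={args.local_rank} visible_gpus={torch.cuda.device_count()}")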

cwq63 commented 3 years ago

I have the same problem. I changed num_workers=4 to num_workers=0 and it runs, but only with a batch size of 1; I cannot increase it.

KleinXin commented 3 years ago

> I have the same problem. I changed num_workers=4 to num_workers=0 and it runs, but only with a batch size of 1; I cannot increase it.

Are you also using Docker? I suspect the problem is that Docker is not fully compatible with PyTorch distributed training.

cwq63 commented 3 years ago

> I have the same problem. I changed num_workers=4 to num_workers=0 and it runs, but only with a batch size of 1; I cannot increase it.
>
> Are you also using Docker? I suspect the problem is that Docker is not fully compatible with PyTorch distributed training.

I did not use Docker. Maybe you are right.

mayasahraoui commented 3 years ago

I'm having the same problem: launching on a single GPU needs too much memory, and the multi-GPU launch does not work. Were you able to solve this?

karry-11110 commented 3 years ago

> I'm having the same problem: launching on a single GPU needs too much memory, and the multi-GPU launch does not work. Were you able to solve this?

Has anyone solved this problem?

karry-11110 commented 3 years ago

> I have the same problem. I changed num_workers=4 to num_workers=0 and it runs, but only with a batch size of 1; I cannot increase it.
>
> Are you also using Docker? I suspect the problem is that Docker is not fully compatible with PyTorch distributed training.
>
> I did not use Docker. Maybe you are right.

Have you solved this problem?

haoweiz23 commented 2 years ago

I solved this problem. You need to modify how train.py parses --local_rank.

Use the default setting below:

parser.add_argument("--local_rank", type=int, default=os.getenv('LOCAL_RANK', 0), help="local_rank for distributed training on gpus")

Reference: https://github.com/NVIDIA/apex/blob/master/examples/imagenet/main_amp.py
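
For context, a minimal sketch (not the actual train.py code, which may be wired differently) of how that --local_rank value is typically consumed in a DDP training script; the Linear layer is only a stand-in for the real TransFG model:

import os
import argparse
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

parser = argparse.ArgumentParser()
# Accept the rank either from the command line (torch.distributed.launch) or from the
# LOCAL_RANK environment variable (torchrun); int() keeps the default an int in both cases.
parser.add_argument("--local_rank", type=int, default=int(os.getenv("LOCAL_RANK", 0)))
args = parser.parse_args()

# Bind this process to its own GPU. If local_rank stays 0 in every process,
# all of them allocate on GPU 0, which matches the single-GPU/OOM symptom above.
torch.cuda.set_device(args.local_rank)
dist.init_process_group(backend="nccl", init_method="env://")

model = torch.nn.Linear(768, 200).cuda(args.local_rank)  # stand-in for the actual model
model = DDP(model, device_ids=[args.local_rank])

Note that newer launchers (torchrun, or torch.distributed.launch with --use_env) do not pass --local_rank at all and only set the LOCAL_RANK environment variable, which is why reading the default from os.getenv keeps the script working with either launcher.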