Open · KleinXin opened this issue 3 years ago
I have the same problem, and it runs after I change "num_workers=4" to "num_workers=0", but then the batch size can only be 1 and I cannot change it.
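For context, those settings map onto the standard torch.utils.data.DataLoader arguments. A minimal sketch of the relevant part (the dummy dataset is an assumption, standing in for whatever train.py actually builds):

import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy dataset standing in for the real one from train.py (an assumption).
train_dataset = TensorDataset(torch.randn(64, 3, 32, 32), torch.zeros(64, dtype=torch.long))

# num_workers=0 loads batches in the main process, which avoids the
# worker-subprocess failures discussed in this thread.
train_loader = DataLoader(train_dataset, batch_size=8, num_workers=0, shuffle=True)

for images, labels in train_loader:
    pass  # training step goes here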
Are you also using Docker? I think the problem is that Docker is not fully compatible with PyTorch distributed training.
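If Docker is involved, one well-known interaction (not specific to this repo) is that DataLoader workers with num_workers > 0 exchange batches through shared memory, and Docker caps /dev/shm at 64 MB by default, which makes the workers crash. The usual workaround is to start the container with more shared memory, e.g.:

docker run --ipc=host ...
docker run --shm-size=8g ...

Either flag should let num_workers stay above 0 inside the container.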
I did not use Docker. Maybe you are right.
I'm having the same problem: launching on only one GPU needs too much memory, and the multi-GPU launch is not working. Were you able to solve this?
Has anyone solved this problem?
Did you solve this problem?
I solved this problem. You should modify the --local_rank argument in train.py to use the default setting below (train.py also needs import os for this):

parser.add_argument("--local_rank", type=int, default=os.getenv('LOCAL_RANK', 0), help="local_rank for distributed training on gpus")

Reference: https://github.com/NVIDIA/apex/blob/master/examples/imagenet/main_amp.py

Then launch with:
CUDA_VISIBLE_DEVICES=0,1,2,3 python3 -m torch.distributed.launch --nproc_per_node=4 train.py xxx
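For completeness: the launcher starts one process per GPU and hands each one its --local_rank (or, depending on the version, the LOCAL_RANK environment variable), and train.py has to act on it. A minimal sketch of the usual pattern (the model and dataset here are placeholders, not this repo's code):

import os
import argparse
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=os.getenv('LOCAL_RANK', 0))
args = parser.parse_args()

# Bind this process to its own GPU before any CUDA work; skipping this
# is a classic way to pile all the processes onto GPU 0.
torch.cuda.set_device(args.local_rank)
dist.init_process_group(backend="nccl")

model = torch.nn.Linear(10, 10).cuda()            # placeholder model
model = DDP(model, device_ids=[args.local_rank])

dataset = TensorDataset(torch.randn(64, 10))      # placeholder dataset
sampler = DistributedSampler(dataset)             # each rank gets its own shard
loader = DataLoader(dataset, batch_size=8, sampler=sampler)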
The code above still cannot run on multiple GPUs for me. It is weird that all the training runs on the first GPU, and then an OOM error is reported as soon as the batch size is increased. Does anyone know what's wrong?
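One thing worth checking for that symptom: any bare .cuda() or torch.device("cuda") resolves to cuda:0 unless torch.cuda.set_device(local_rank) has already been called in that process, so a single hard-coded device in train.py would put every rank's tensors on the first GPU. A small diagnostic to run under the same launcher (a sketch, not code from this repo) that prints which device each rank actually gets:

import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")       # the launcher supplies the rendezvous env vars
local_rank = int(os.getenv("LOCAL_RANK", 0))  # assumes the launcher exports LOCAL_RANK
torch.cuda.set_device(local_rank)
print(f"rank {dist.get_rank()} of {dist.get_world_size()} -> cuda:{torch.cuda.current_device()}")

If every rank prints cuda:0, the per-process device binding is what's missing.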