Thank you very much for your work. I trained the ASL model using one GPU successfully, but when I trained it using multiple GPUs, an error occurred. Is it possible to train ASL using multiple GPUs?
ASL fully supports multi-GPU; we trained on multi-GPU ourselves. Your problem is not related to ASL.
To validate this, switch to regular cross entropy and you will see that your problem remains.
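For instance, a minimal way to run that check (illustrative only; the variable names below are placeholders, not the ASL training script's actual code) is to swap the criterion for a standard PyTorch loss and see whether the crash persists:

```python
import torch

# Baseline criterion for the sanity check. For multi-label classification,
# plain binary cross entropy is the usual "regular" loss to fall back to.
# criterion = AsymmetricLoss(...)         # the ASL criterion used normally
criterion = torch.nn.BCEWithLogitsLoss()  # standard loss for the comparison run

# The training step stays the same, e.g.:
# loss = criterion(model(images), targets.float())
# If the multi-GPU error still occurs with this loss, ASL is not the cause.
```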
Thank you very much for your quick reply.
If I change to other models (loaded from pytorch-image-models), multi-GPU training works. I think the problem comes from anti_aliasing.py.
File "/data/xuekai/ASL/src/models/tresnet/layers/anti_aliasing.py", line 40, in call return F.conv2d(inputpad, self.filt, stride=2, padding=0, groups=input.shape[1]) RuntimeError: Assertion `THCTensor(checkGPU)(state, 3, input, output, weight)' failed. Some of weight/gradient/input tensors are located on different GPUs. Please move them to a single one. at /pytorch/aten/src/THCUNN/generic/SpatialDepthwiseConvolution.cu:19
Have you seen such an error before? Thank you in advance.
You are not initializing your multi-GPU environment properly.
At the start of your run you need to have something similar to:
```python
import torch

def setup_distrib(model, args):
    # num_distrib() is a small helper that returns the number of distributed
    # processes (e.g. read from the WORLD_SIZE environment variable)
    if num_distrib() > 1:
        torch.cuda.set_device(args.local_rank)
        if not torch.distributed.is_initialized():
            torch.distributed.init_process_group(backend='nccl', init_method='env://')
        model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank])
    return model
```
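For context, here is a minimal sketch of how such a setup is typically wired into a training entry point. This is a generic illustration, not the ASL training code: the dataset and model are dummy stand-ins, and it assumes the `setup_distrib`/`num_distrib` helpers above. The script is launched with one process per GPU, e.g. `python -m torch.distributed.launch --nproc_per_node=4 train.py`, which passes `--local_rank` to each process:

```python
import argparse
import torch
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=0)  # filled in by torch.distributed.launch
args = parser.parse_args()

# Dummy data/model just to show the wiring; replace with the real multi-label
# dataset and the TResNet/ASL model in practice.
train_dataset = TensorDataset(torch.randn(256, 3, 224, 224),
                              torch.randint(0, 2, (256, 80)).float())
model = torch.nn.Conv2d(3, 80, kernel_size=224).cuda(args.local_rank)

model = setup_distrib(model, args)  # DDP wrapping from the snippet above

# Each process must also see its own shard of the data.
sampler = DistributedSampler(train_dataset) if torch.distributed.is_initialized() else None
loader = DataLoader(train_dataset, batch_size=16, sampler=sampler,
                    shuffle=(sampler is None), num_workers=4)
```

The `DistributedSampler` is what gives each process its own shard of the data; without it every GPU would train on the full dataset each epoch.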
If your problem persists, use a different model than TResNet.
@xk5663279 Issue #39 has code for multi-GPU: https://github.com/Alibaba-MIIL/ASL/issues/39. But I don't recommend using DistributedDataParallel to train ASL, because you would need to write an all-reduce operator for the ema_model part. Just using DataParallel is enough, although DataParallel is less efficient in some ways.
@LOOKCC That issue (using DataParallel) is utterly wrong, from many perspectives.
I gave my recommendation, but you are welcome to choose your path alone. good luck :-)
@mrT23 I am very sorry for my mistake about DDP; I was totally wrong. Because the EMA update is done after the model parameters are updated, there is no need for all-reduce in the EMA. But I found that, when using DDP, after one epoch of training the mAP of the EMA model is much lower than when training without DDP (for example, in the log you provided in https://github.com/Alibaba-MIIL/ASL/issues/39, after one epoch the EMA mAP is 13.30, but with DDP this number is 4.x). In short, when using DDP the growth of the EMA mAP is slower. Do you know why? Something related to batch size? lr? lr_scheduler?
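For reference, here is a minimal sketch of the EMA-after-step pattern being discussed. This is a generic illustration, not the ema implementation in the ASL repo: because DDP all-reduces gradients before `optimizer.step()`, all ranks hold identical weights after the step, so each rank can update its own EMA copy with no extra all-reduce:

```python
import copy
import torch

class ModelEma:
    """Keep an exponential moving average of the model weights (generic sketch)."""
    def __init__(self, model, decay=0.9997):
        # 'model' may be DDP-wrapped; unwrap it and keep a frozen copy.
        self.module = copy.deepcopy(getattr(model, 'module', model)).eval()
        self.decay = decay
        for p in self.module.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model):
        msd = getattr(model, 'module', model).state_dict()
        for k, v in self.module.state_dict().items():
            if v.dtype.is_floating_point:
                v.mul_(self.decay).add_(msd[k].detach(), alpha=1.0 - self.decay)
            else:
                v.copy_(msd[k])

# Training loop (sketch): the EMA is updated after the optimizer step,
# i.e. after DDP has already synchronized gradients across GPUs.
# loss.backward()
# optimizer.step()
# ema.update(model)
```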
Next week I am releasing a new article, which will be accompanied by full, proper code that uses DDP. You can look there at the implementation and look for differences from your implementation.
Stay tuned
@mrT23 Thank you very much. I look forward to your new research progress.
@LOOKCC Thank you very much for your kind reply. @mrT23 Thank you very much. I am looking forward to your new work.