Alibaba-MIIL / ASL

Official PyTorch implementation of the paper "Asymmetric Loss For Multi-Label Classification" (ICCV 2021)
MIT License

Is it possible to train ASL using multiple GPUs? #50

Closed xk5663279 closed 3 years ago

xk5663279 commented 3 years ago

Thank you very much for your work. I trained the ASL model using one GPU successfully, but when I trained it using multiple GPUs an error occurred. Is it possible to train ASL using multiple GPUs?

mrT23 commented 3 years ago

ASL fully supports multi-GPU training; we trained on multiple GPUs ourselves. Your problem is not related to ASL.

To validate this, switch to regular cross-entropy and you will see that the problem remains.
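
A minimal sketch of that sanity check, with toy stand-ins for the model and batch (AsymmetricLoss with gamma_neg=4, gamma_pos=1, clip=0.05 is the loss from this repo's src/loss_functions; everything else here is hypothetical):

import torch
import torch.nn as nn

# toy stand-ins for the real backbone and batch, just to exercise the loss
model = nn.Linear(16, 80)                      # 80 classes, e.g. MS-COCO
images = torch.randn(4, 16)
targets = torch.randint(0, 2, (4, 80)).float()

# criterion = AsymmetricLoss(gamma_neg=4, gamma_pos=1, clip=0.05)  # the ASL loss
criterion = nn.BCEWithLogitsLoss()             # plain binary cross-entropy for the check

loss = criterion(model(images), targets)
loss.backward()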

xk5663279 commented 3 years ago

Thank you very much for your quick reply.

If I switch to other models (loaded from pytorch-image-models), multi-GPU training works. I think the problem is caused by anti_aliasing.py:

File "/data/xuekai/ASL/src/models/tresnet/layers/anti_aliasing.py", line 40, in call return F.conv2d(inputpad, self.filt, stride=2, padding=0, groups=input.shape[1]) RuntimeError: Assertion `THCTensor(checkGPU)(state, 3, input, output, weight)' failed. Some of weight/gradient/input tensors are located on different GPUs. Please move them to a single one. at /pytorch/aten/src/THCUNN/generic/SpatialDepthwiseConvolution.cu:19

Have you seen such an error before? Thank you in advance.
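
For context, this kind of error usually means a tensor used inside the forward pass (here the blur filter self.filt) was created once on a fixed device instead of being stored on the module, so model replicas on other GPUs still reference the GPU-0 copy. A hedged sketch of the buffer-based pattern that avoids this (AntiAliasFilter is an illustrative class, not the repo's exact implementation):

import torch
import torch.nn as nn
import torch.nn.functional as F

class AntiAliasFilter(nn.Module):
    def __init__(self, channels):
        super().__init__()
        a = torch.tensor([1., 2., 1.])  # binomial blur kernel
        filt = a[:, None] * a[None, :]
        filt = filt / filt.sum()
        # register_buffer lets DataParallel/DDP move the filter together with the module
        self.register_buffer('filt', filt[None, None].repeat(channels, 1, 1, 1))

    def forward(self, x):
        x = F.pad(x, (1, 1, 1, 1), mode='reflect')
        return F.conv2d(x, self.filt, stride=2, padding=0, groups=x.shape[1])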

mrT23 commented 3 years ago

You are not initializing your multi-GPU environment properly.

At the start of your run you need code similar to:

import torch

def setup_distrib(model, args):
    # num_distrib() is assumed to return the world size,
    # e.g. int(os.environ.get('WORLD_SIZE', 1))
    if num_distrib() > 1:
        torch.cuda.set_device(args.local_rank)
        if not torch.distributed.is_initialized():
            torch.distributed.init_process_group(backend='nccl', init_method='env://')
        model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank])
    return model

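A hedged usage sketch (the launch command follows the standard torch.distributed.launch convention; create_model stands for this repo's model factory and needs the usual model arguments):

# launched e.g. with: python -m torch.distributed.launch --nproc_per_node=4 train.py
import argparse
import torch
from src.models import create_model  # this repo's model factory

parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=0)  # filled in by the launcher
args = parser.parse_args()             # plus the usual model/dataset arguments

model = create_model(args).cuda()
model = setup_distrib(model, args)     # wrap with DDP before the training loop starts
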
If the problem persists, use a model other than TResNet.

LOOKCC commented 3 years ago

@xk5663279 Issue #39 has code for multi-GPU training: https://github.com/Alibaba-MIIL/ASL/issues/39. But I don't recommend using DistributedDataParallel to train ASL, because you would need to write an all-reduce operation in the EMA-model part. Just using DataParallel is enough, although DataParallel is less efficient in some ways.
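
For reference, a minimal DataParallel sketch (single process; the model is replicated across all visible GPUs on every forward pass, which is where its inefficiency comes from):

import torch
import torch.nn as nn

model = nn.Linear(16, 80).cuda()        # stand-in for the real backbone
model = torch.nn.DataParallel(model)    # uses all visible GPUs by default
out = model(torch.randn(4, 16).cuda())  # outputs are gathered back on GPU 0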

mrT23 commented 3 years ago

@LOOKCC The approach in that issue (using DataParallel) is utterly wrong, from many perspectives.

I gave my recommendation, but you are welcome to choose your own path. Good luck :-)

LOOKCC commented 3 years ago

@mrT23 I am very sorry for my mistake about DDP; I was totally wrong. Since the EMA update happens after the model parameters are updated, there is no need for an all-reduce in the EMA. But I found that when using DDP, after one epoch of training the mAP of the EMA model is much lower than when training without DDP (e.g., in the log you provided in https://github.com/Alibaba-MIIL/ASL/issues/39 the EMA mAP after one epoch is 13.30, but with DDP this number is 4.x). In general, the EMA mAP grows more slowly with DDP. Do you know why? Is it something related to batch size, learning rate, or the LR scheduler?
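
A minimal sketch of that point (update_ema and the 0.9997 decay are illustrative): under DDP the gradients are already all-reduced before optimizer.step(), so every rank holds identical weights and computes the same EMA locally.

import copy
import torch
import torch.nn as nn

model = nn.Linear(16, 80)         # stand-in for the (DDP-wrapped) backbone
ema_model = copy.deepcopy(model)  # created once, at the start of training

@torch.no_grad()
def update_ema(ema_model, model, decay=0.9997):
    # called right after optimizer.step(); every rank already holds the same
    # synchronized weights, so no extra all-reduce is needed for the EMA
    for ema_p, p in zip(ema_model.parameters(), model.parameters()):
        ema_p.mul_(decay).add_(p, alpha=1 - decay)

update_ema(ema_model, model)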

mrT23 commented 3 years ago

Next week I am releasing a new article, accompanied by full, proper code that uses DDP. You can look at the implementation there and check for differences from your own.

Stay tuned

LOOKCC commented 3 years ago

@mrT23 Thank you very much. I look forward to your new research progress.

xk5663279 commented 3 years ago

@LOOKCC Thank you very much for your kind reply. @mrT23 Thank you; I am looking forward to your new work.