Alibaba-MIIL / TResNet

Official Pytorch Implementation of "TResNet: High-Performance GPU-Dedicated Architecture" (WACV 2021)
Apache License 2.0

Multi GPU training error #7

Closed: ghost closed this issue 4 years ago

ghost commented 4 years ago

Hi, while using multiple GPUs for training I get this error:

File "/workspace/TResNet/src/models/tresnet/layers/anti_aliasing.py", line 40, in __call__    
    return F.conv2d(input_pad, self.filt, stride=2, padding=0, groups=input.shape[1])
RuntimeError: Assertion `THCTensor_(checkGPU)(state, 3, input, output, weight)' failed. Some of weight/gradient/input tensors are located on different GPUs. Please move them to a single one. at /tmp/pip-req-build-cms73_uj/aten/src/THCUNN/generic/SpatialDepthwiseConvolution.cu:19

However, single GPU training (setting CUDA_VISIBLE_DEVICES=0 before my training script) works fine; I can see the losses going down over iterations.

Can you help with this?

hussam789 commented 4 years ago

Hi, are you training on multiple GPUs using DistributedDataParallel or DataParallel? Can you share your training script and show where you are calling .cuda()?

ghost commented 4 years ago

I am using DistributedDataParallel from NVIDIA Apex.

I am using this repository as a backbone for https://github.com/NVIDIA/retinanet-examples/blob/master/retinanet/train.py

mrT23 commented 4 years ago

Hi @yashnv, all our ImageNet training runs of TResNet were distributed.

Usually the only adjustments we need are some minor tweaks for Inplace-ABN conversions, but in your case I see another problem.

I assume you call train.py from some distributed launch command, for example:

python -m torch.distributed.launch --nproc_per_node=8 main.py

I think the problem is that you create the model before you set:

torch.cuda.set_device(rank)
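
In other words, something along these lines (a rough sketch; create_model is just a placeholder for however your script builds the backbone):

    import os
    import torch
    import torch.distributed as dist

    # Sketch of the intended ordering; create_model() stands in for however
    # your script constructs the TResNet backbone.
    local_rank = int(os.environ.get('LOCAL_RANK', os.environ.get('RANK', 0)))
    torch.cuda.set_device(local_rank)        # pick this process's GPU first
    dist.init_process_group(backend='nccl')  # env:// rendezvous from the launcher

    model = create_model().cuda()            # now CUDA constants like self.filt
                                             # are created on the right device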

As a quick fix, try: self.filt = filt[None, None, :, :].repeat((self.channels, 1, 1, 1)).cuda(device=int(os.environ.get('RANK', 0))).half()

Two other problems that I see in the script:

(1) I recommend working with apex opt level O1, not O2:

    model, optimizer = amp.initialize(model, optimizer,
                                      opt_level='O1' if mixed_precision else 'O0',
                                      keep_batchnorm_fp32=True,
                                      loss_scale=128.0,
                                      verbosity=is_master)

O1 is more stable and less problematic; O2 needs an adjustment for Inplace-ABN.

(2) You need to change model = convert_fixedbn_model(model) so that it also supports inplace_abn.
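
If it helps, here is a minimal sketch of what I mean by supporting it. This is not the retinanet implementation itself, just one possible direction, and it assumes the inplace_abn package exposes InPlaceABN:

    import torch.nn as nn
    from inplace_abn import InPlaceABN  # assumption: the inplace_abn package is installed

    def convert_fixedbn_model(module):
        # Sketch only: freeze both regular BatchNorm2d layers and InPlaceABN
        # layers so their statistics stay fixed; a helper that only recognises
        # nn.BatchNorm2d lets TResNet's InPlaceABN layers slip through.
        # Call this after model.train(), or override train() to keep these
        # layers in eval mode.
        for child in module.children():
            if isinstance(child, (nn.BatchNorm2d, InPlaceABN)):
                child.eval()
                for p in child.parameters():
                    p.requires_grad = False
            else:
                convert_fixedbn_model(child)
        return module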

ghost commented 4 years ago

@mrT23 I just call main.py. The model is automatically wrapped in Apex DistributedDataParallel if the number of GPUs is greater than 1.

The framework works well with other backbone networks such as ResNet/ResNeXt/EfficientNet; it fails only when a TResNet backbone is used.

I am guessing it's because DownsampleJIT's self.filt doesn't get the device rank.

I have already tried O1 optimization and, to handle the Inplace-ABN issue, replaced InplaceABN with regular conv2d and BN. It still fails on DownsampleJIT's self.filt.

Do you have any suggestion for how I can pass the rank into the DownsampleJIT call?

mrT23 commented 4 years ago

Why haven't you tried what I suggested? Set:

self.filt = filt[None, None, :, :].repeat((self.channels, 1, 1, 1)).cuda().half() -> self.filt = filt[None, None, :, :].repeat((self.channels, 1, 1, 1)).cuda(device=int(os.environ.get('RANK', 0))).half()

ghost commented 4 years ago

I got this:

RuntimeError:
attribute lookup is not defined on python value of type '_Environ':
  File "/workspace/TResNet/src/models/tresnet/layers/anti_aliasing.py", line 35
        filt = (a[:, None] * a[None, :]).clone().detach()
        filt = filt / torch.sum(filt)
        self.filt = filt[None, None, :, :].repeat((self.channels, 1, 1, 1)).cuda(device=int(os.environ.get('RANK', 0))).half()
                                                                                            ~~~~~~~~~~~~~~ <--- HERE

I also tried modifying the non-JIT Downsample to account for RANK, but that gave me the same original error: Some of weight/gradient/input tensors are located on different GPUs. Please move them to a single one. at /tmp/pip-req-build-cms73_uj/aten/src/THCUNN/generic/SpatialDepthwiseConvolution.cu:19

Do you have any suggestions for writing a custom grad function that accounts for multiple GPUs?

mrT23 commented 4 years ago

I added an option --remove_aa_jit. Run with it; it should work for you.

As I said before, TResNet fully supports multi-GPU training; I trained on ImageNet with 8x V100. Your script is not well designed in terms of distributed training: models should be defined after(!) you call 'torch.cuda.set_device(rank)', not before. If you insist on the opposite ordering, use the --remove_aa_jit flag.
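
For reference, the reason the non-JIT path behaves well in distributed training is that the blur filter is attached to the module (e.g. as a buffer), so .cuda()/.to(device) moves it to each rank's GPU. A generic sketch of that idea (not the repo's exact layer):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class BlurPool(nn.Module):
        # Generic sketch, not the exact TResNet layer: registering the filter
        # as a buffer lets .cuda(device)/.to(device) move it with the module,
        # so every DDP replica ends up with the filter on its own GPU.
        def __init__(self, channels):
            super().__init__()
            a = torch.tensor([1., 2., 1.])
            filt = a[:, None] * a[None, :]
            filt = filt / filt.sum()
            self.register_buffer('filt', filt[None, None].repeat(channels, 1, 1, 1))

        def forward(self, x):
            x = F.pad(x, (1, 1, 1, 1), 'reflect')
            return F.conv2d(x, self.filt, stride=2, groups=x.shape[1])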

I also added a general tips section for working with inplace-abn: https://github.com/mrT23/TResNet/blob/master/INPLACE_ABN_TIPS.md

all the best