Hi,
Are you training on multiple GPUs using DistributedDataParallel or DataParallel?
Can you share your training script, showing where you call .cuda()?
I am using DistributedDataParallel from NVIDIA Apex.
I am using this repository as the backbone in https://github.com/NVIDIA/retinanet-examples/blob/master/retinanet/train.py
Hi @yashnv, all our ImageNet training of TResNet was distributed.
Usually the only adjustments we need are some minor tweaks for the Inplace-ABN conversion, but in your case I see another problem.
I assume you call train.py from some distributed launch script, for example:
python -m torch.distributed.launch --nproc_per_node=8 main.py
I think the problem is that you create the model before you set:
torch.cuda.set_device(rank)
As a quick fix, try:
self.filt = filt[None, None, :, :].repeat((self.channels, 1, 1, 1)).cuda(device=int(os.environ.get('RANK', 0))).half()
Two other problems that I see in the script:
(1) I recommend working with Apex opt level O1, not O2:
model, optimizer = amp.initialize(model, optimizer,
opt_level='O1' if mixed_precision else 'O0',
keep_batchnorm_fp32=True,
loss_scale=128.0,
verbosity=is_master)
O1 is more stable and less problematic; O2 needs an adjustment for Inplace-ABN.
(2) You need to extend
model = convert_fixedbn_model(model)
so that it also supports Inplace-ABN.
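(For illustration, a rough sketch of what such a conversion could look like. It assumes the inplace_abn package's InPlaceABN module and its usual attributes; convert_inplace_abn is a hypothetical helper, not part of either repository, and depending on the inplace-abn version the gamma parameter may be applied as an absolute value internally, so exact equivalence should be verified.)

import torch.nn as nn
from inplace_abn import InPlaceABN  # assumed import path for the inplace-abn package

def convert_inplace_abn(module):
    # Hypothetical helper: recursively swap InPlaceABN for BatchNorm2d + activation,
    # so a convert_fixedbn_model-style pass can then treat it like ordinary BN.
    for name, child in module.named_children():
        if isinstance(child, InPlaceABN):
            bn = nn.BatchNorm2d(child.num_features, eps=child.eps, momentum=child.momentum)
            bn.weight.data.copy_(child.weight.data)  # may need .abs() depending on inplace-abn version
            bn.bias.data.copy_(child.bias.data)
            bn.running_mean.copy_(child.running_mean)
            bn.running_var.copy_(child.running_var)
            act = (nn.LeakyReLU(child.activation_param)
                   if child.activation == "leaky_relu" else nn.Identity())
            setattr(module, name, nn.Sequential(bn, act))
        else:
            convert_inplace_abn(child)
    return module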
@mrT23 I just call main.py. The model is automatically wrapped in Apex DistributedDataParallel if the number of GPUs is > 1.
The framework works well for other backbone networks like ResNet/ResNeXt/EfficientNet; it fails only when a TResNet backbone is used.
I am guessing it is because DownsampleJIT's self.filt doesn't get the device rank.
I have already tried O1 optimization, and to handle this I replaced InplaceABN with regular Conv2d and BN. It again fails on DownsampleJIT's self.filt.
Do you have a suggestion for how I can pass the rank into the DownsampleJIT call?
Why haven't you tried what I suggested? Set:
self.filt = filt[None, None, :, :].repeat((self.channels, 1, 1, 1)).cuda().half()
->
self.filt = filt[None, None, :, :].repeat((self.channels, 1, 1, 1)).cuda(device=int(os.environ.get('RANK', 0))).half()
I got this:
RuntimeError:
attribute lookup is not defined on python value of type '_Environ':
File "/workspace/TResNet/src/models/tresnet/layers/anti_aliasing.py", line 35
filt = (a[:, None] * a[None, :]).clone().detach()
filt = filt / torch.sum(filt)
self.filt = filt[None, None, :, :].repeat((self.channels, 1, 1, 1)).cuda(device=int(os.environ.get('RANK', 0))).half()
~~~~~~~~~~~~~~ <--- HERE
I also tried modifying the non-JIT Downsample to account for RANK, but that gave me the same original error:
Some of weight/gradient/input tensors are located on different GPUs. Please move them to a single one. at /tmp/pip-req-build-cms73_uj/aten/src/THCUNN/generic/SpatialDepthwiseConvolution.cu:19
Do you have any suggestions for writing a custom grad function that accounts for multiple GPUs?
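(Aside on the traceback above: DownsampleJIT is compiled with TorchScript, which supports only a restricted subset of Python, so an os.environ.get lookup cannot appear inside scripted code. A minimal sketch of one possible workaround under that assumption is to resolve the device in plain Python before anything is scripted; the names below are illustrative, not the repository's actual code.)

import os
import torch

# resolve the per-process device in ordinary Python, outside any TorchScript code
local_rank = int(os.environ.get("LOCAL_RANK", os.environ.get("RANK", "0")))
device = torch.device("cuda", local_rank)

def make_blur_filter(channels: int) -> torch.Tensor:
    # illustrative stand-in for the anti-aliasing filter built in anti_aliasing.py
    a = torch.tensor([1.0, 2.0, 1.0])
    filt = a[:, None] * a[None, :]
    filt = filt / filt.sum()
    # the move to the GPU and the .half() cast happen here, before anything is JIT-scripted,
    # so the tensor already lives on the right device when the scripted module receives it
    return filt[None, None, :, :].repeat(channels, 1, 1, 1).to(device).half()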
I added an option --remove_aa_jit. Run with it; it should work for you.
As I said before, TResNet fully supports multi-GPU training; I trained on ImageNet with 8x V100. Your script is not well designed in terms of distributed training: models should be defined after(!) you call 'torch.cuda.set_device(rank)', not before (see the sketch below). If you insist on the opposite order, use the --remove_aa_jit flag.
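(For illustration only, a minimal sketch of that ordering in a script launched per process by torch.distributed.launch; the toy model is a placeholder, and it assumes the launcher exposes the local rank via the LOCAL_RANK environment variable.)

import os
import torch
import torch.distributed as dist

local_rank = int(os.environ.get('LOCAL_RANK', '0'))  # assumption: set by the launcher

torch.cuda.set_device(local_rank)        # 1) pin this process to its GPU first
dist.init_process_group(backend='nccl')  # 2) then initialise the process group

model = torch.nn.Linear(8, 8)            # 3) placeholder model; build TResNet here instead,
model = model.cuda()                     #    so bare .cuda() calls land on the device set above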
I also added a general tips section for working with Inplace-ABN: https://github.com/mrT23/TResNet/blob/master/INPLACE_ABN_TIPS.md
all the best
Hi, while using multiple GPUs for training I get this:
However, single-GPU training with
CUDA_VISIBLE_DEVICES=0
set before my training script works fine; I can see the losses going down over iterations. Can you help with this?