dedoogong opened this issue 5 years ago
My environment is as below:

PyTorch version: 1.1.0
Is debug build: No
CUDA used to build PyTorch: 9.0.176
OS: Ubuntu 18.04.2 LTS
GCC version: (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
CMake version: version 3.13.3
Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: Could not collect
GPU models and configuration: GPU 0: GeForce GTX 1080, GPU 1: GeForce GTX 1080
Nvidia driver version: 410.48
cuDNN version: Probably one of the following: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.3.1, /usr/local/cuda-10.0/targets/x86_64-linux/lib/libcudnn.so.7.4.2
Versions of relevant libraries:
[pip3] msgpack-numpy==0.4.4.3
[pip3] numpy==1.16.4
[pip3] torch==1.1.0
[pip3] torchvision==0.2.1
[conda] Could not collect
Pillow (6.0.0)
When I changed the opt level to O2, it showed a different error:

O2: expected backend CUDA and dtype Float but got backend CUDA and dtype Half

And when I selected O3, surprisingly, I could run the first iteration, that is, I saw normal loss values, but after that all loss values became NaN (maybe because of gradient under/overflow)...
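For reference, this is roughly how I switch opt levels and scale the loss (a minimal sketch with a dummy model; in my real script, model and optimizer are the detector and the output of make_optimizer(cfg, model)):

```python
import torch
from apex import amp

# Dummy model/optimizer just to make the sketch self-contained.
model = torch.nn.Linear(8, 2).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# opt_level is the only thing I change between runs ("O1", "O2", "O3").
model, optimizer = amp.initialize(model, optimizer, opt_level="O2")

losses = model(torch.randn(4, 8, device="cuda")).sum()

# Backward through apex's loss scaling, which is meant to reduce FP16
# gradient under/overflow (the NaNs I see with O3).
with amp.scale_loss(losses, optimizer) as scaled_loss:
    scaled_loss.backward()
optimizer.step()
```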
I found a temporary workaround: by applying @amp.float_function to the sigmoid focal loss's forward() I can successfully train my MobileNetV2+FPN model in FP16. But I want to use another backbone similar to MobileNetV2, and after switching to my fine-tuned MobileNet-like model, if I don't pass the find_unused_parameters=True option to torch.nn.parallel.DistributedDataParallel, it shows another error:
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing its output (the return value of forward). You can enable unused parameter detection by passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel. If you already have this argument set, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable). (prepare_for_backward at /pytorch/torch/csrc/distributed/c10d/reducer.cpp:408)
So, if I do use the option, it shows another error:
Traceback (most recent call last):
File "tools/train_net.py", line 197, in
This error must be related to the find_unused_parameters=True option; the relevant code is in torch/nn/parallel/distributed.py:371:
```python
def forward(self, *inputs, **kwargs):
    self._sync_params()
    if self.device_ids:
        inputs, kwargs = self.scatter(inputs, kwargs, self.device_ids)
        if len(self.device_ids) == 1:
            output = self.module(*inputs[0], **kwargs[0])
        else:
            outputs = self.parallel_apply(self._module_copies[:len(inputs)], inputs, kwargs)
            output = self.gather(outputs, self.output_device)
    else:
        output = self.module(*inputs, **kwargs)
    if torch.is_grad_enabled():
        # We'll return the output object verbatim since it is a freeform
        # object. We need to find any tensors in this object, though,
        # because we need to figure out which parameters were used during
        # this forward pass, to ensure we short circuit reduction for any
        # unused parameters. Only if `find_unused_parameters` is set.
        if self.find_unused_parameters:
            self.reducer.prepare_for_backward(list(_find_tensors(output)))  # <-- this is the line that fails
        else:
            self.reducer.prepare_for_backward([])
```
The marked line raises the error, and the purpose of the option is described in the comment right above it.
So I still need to find a way to avoid the grad.type() == variable.type() assert error...
UPDATE: I saw a strange result: I can work around the problem above using 1 GPU, but it is still not possible when using multiple GPUs...
I have read many related issues, but I couldn't find a clever way to solve my problem.
Initially my model was trained in FP32, so the loss function was also implemented assuming float32 (the CUDA kernels template <typename T> __global__ void SigmoidFocalLossForward(...) and template <typename T> __global__ void SigmoidFocalLossBackward(...)), and to speed things up I'm now trying to train it again in FP16.
I didn't call model.half(), of course.
-----------------1st Trial-----------------

[train.py]
```python
model.to('cuda')
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
optimizer = make_optimizer(cfg, model)
scheduler = make_lr_scheduler(cfg, optimizer)
```
Then I got the first error: RuntimeError: "SigmoidFocalLoss_forward" not implemented for 'Half'
-----------------2nd Trial-----------------

So I suspected the issue was caused by the data type handled by the loss function.
[loss.py]
```python
# sigmoid focal loss for calculating classification loss!
# so "target"'s type must be "int", as target means the class id of the ground truth
class _SigmoidFocalLoss(Function):
```
But it triggers the error as below:

```
...
    losses.backward()
  File "/usr/local/lib/python3.6/dist-packages/torch/tensor.py", line 107, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/usr/local/lib/python3.6/dist-packages/torch/autograd/__init__.py", line 93, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: Function _SigmoidFocalLossBackward returned an invalid gradient at index 0 - expected type torch.cuda.HalfTensor but got torch.cuda.FloatTensor
```
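To see why autograd complains here, a tiny self-contained toy (not my actual loss) reproduces the same behaviour: a custom Function whose backward returns an FP32 gradient for an FP16 input fails with exactly this message unless the gradient is cast back to the input's dtype.

```python
import torch
from torch.autograd import Function

class ToyLoss(Function):
    """Toy stand-in for _SigmoidFocalLoss: the math runs internally in FP32."""

    @staticmethod
    def forward(ctx, logits):
        ctx.save_for_backward(logits)
        # Pretend the kernel only supports float32, like my CUDA op.
        return logits.float().sigmoid().sum()

    @staticmethod
    def backward(ctx, d_loss):
        (logits,) = ctx.saved_tensors
        p = logits.float().sigmoid()
        d_logits = d_loss * p * (1 - p)  # FP32, like my CUDA kernel's output
        # Returning the FP32 d_logits directly reproduces
        # "expected type torch.cuda.HalfTensor but got torch.cuda.FloatTensor";
        # casting back to the input dtype avoids it.
        return d_logits.to(logits.dtype)

logits = torch.randn(4, device="cuda", dtype=torch.float16, requires_grad=True)
ToyLoss.apply(logits).backward()
print(logits.grad.dtype)  # torch.float16
```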
-----------------3rd Trial-----------------

d_logits is float type and my model's parameters are half/float mixed type layer by layer because of apex, so I thought I needed to convert the dtype of d_logits to half() as below. But even after changing d_logits to half, it shows a similar error again:
```
...
    optimizer.step()
  File "/usr/local/lib/python3.6/dist-packages/apex/amp/_initialize.py", line 247, in new_step
    output = old_step(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/optim/sgd.py", line 93, in step
    d_p.add_(weight_decay, p.data)
  File "/usr/local/lib/python3.6/dist-packages/apex/amp/wrap.py", line 101, in wrapper
    return orig_fn(arg0, *args, **kwargs)
RuntimeError: expected backend CUDA and dtype Float but got backend CUDA and dtype Half
```
I can print the values of d_logits in the backward(...) function of _SigmoidFocalLoss, so it is obvious that the error occurs after the backward of the _SigmoidFocalLoss class.
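As a sanity check for the optimizer.step() error above, I also dump the dtypes the optimizer is actually holding (a small debugging snippet; under O2 apex is supposed to hand the optimizer FP32 master weights, so seeing Half here would explain the failure in sgd.py):

```python
# optimizer is the SGD instance returned by amp.initialize in my script.
for i, group in enumerate(optimizer.param_groups):
    dtypes = {p.dtype for p in group["params"]}
    print("param group", i, ":", dtypes)
```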
-----------------4th Trial-----------------

"return d_logits.half(), None, None, None, None" is replaced with
ERROR again!

```
...
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/distributed.py", line 390, in forward
    self.reducer.prepare_for_backward(list(_find_tensors(output)))
RuntimeError: grad.type() == variable.type() ASSERT FAILED at /pytorch/torch/csrc/distributed/c10d/reducer.cpp:214, please report a bug to PyTorch. (mark_variable_ready at /pytorch/torch/csrc/distributed/c10d/reducer.cpp:214)
```
As far as I understand, the model runs in mixed precision and only the loss part runs in fp32 during the forward pass; the backward pass then starts from the fp32 loss result, which is passed back through the model, so I should change it back to half() type again.
Uuuuuu... after many trials changing the types of the logits or target variables to float / int / half in both the forward and backward functions, I can't find a nice solution...
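For completeness, the temporary workaround from the beginning of this issue, forcing the custom loss to run in FP32 with amp.float_function, looks roughly like this in my code (a sketch; the wrapper name and the *args signature are illustrative, and _SigmoidFocalLoss is the Function from my loss.py):

```python
from apex import amp

# amp casts the wrapper's tensor inputs to float before the call, so the
# FP32-only SigmoidFocalLoss CUDA kernels are happy; the output stays float.
@amp.float_function
def sigmoid_focal_loss(*args):
    return _SigmoidFocalLoss.apply(*args)
```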
Please anybody help me~!
Thank you!