NVIDIA / apex

A PyTorch Extension: Tools for easy mixed precision and distributed training in Pytorch
BSD 3-Clause "New" or "Revised" License

How to avoid Half() and Float() type conflict of "loss" during backward pass? #431

Open dedoogong opened 5 years ago

dedoogong commented 5 years ago

I have read many related issues, but I couldn't find a clean way to solve my problem.

Initially my model was trained in FP32, so the loss function was also implemented assuming float32 (using `template <typename T> __global__ void SigmoidFocalLossForward...` and `template <typename T> __global__ void SigmoidFocalLossBackward...`), and to speed things up I'm now trying to train it again in FP16.

I didn't call model.half(), of course.

-----------------1st Trial-----------------

[train.py]

    model.to('cuda')
    model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
    optimizer = make_optimizer(cfg, model)
    scheduler = make_lr_scheduler(cfg, optimizer)

    use_mixed_precision = True  # cfg.DTYPE == "float16"
    amp_opt_level = 'O1' if use_mixed_precision else 'O0'
    model, optimizer = amp.initialize(model, optimizer, opt_level=amp_opt_level)

    model = torch.nn.parallel.DistributedDataParallel(
        model, device_ids=[local_rank], output_device=local_rank,
        broadcast_buffers=False, find_unused_parameters=True
    )

Then I got the first error:

    RuntimeError: "SigmoidFocalLoss_forward" not implemented for 'Half'

-----------------2nd Trial-----------------

So I suspected the issue was caused by the data types the loss function handles.

[loss.py]

    # Sigmoid focal loss for calculating the classification loss,
    # so "targets" must be int: it holds the ground-truth class ids.
    class _SigmoidFocalLoss(Function):

        @staticmethod
        def forward(ctx, logits, targets, gamma, alpha):
            ctx.save_for_backward(logits.float(), targets.int())
            # As apex converts the dtypes flowing through the model to half()
            # automatically, the dtype of logits is implicitly half.
            # Passing different dtypes here gives:
            #   'logits.half()'  -> "SigmoidFocalLoss_backward" not implemented for 'Half'
            #   'logits.float()' -> _SigmoidFocalLossBackward returned an invalid gradient at index 0
            #                       - expected type torch.cuda.HalfTensor but got torch.cuda.FloatTensor
            #   'logits.int()'   -> "SigmoidFocalLoss_backward" not implemented for 'Int'
            ...
            losses = _C.sigmoid_focalloss_forward(
                logits.float(), targets.int(), num_classes, gamma, alpha
            )
            return losses

        @staticmethod
        @once_differentiable
        def backward(ctx, d_loss):
            logits, targets = ctx.saved_tensors
            ...
            d_loss = d_loss.contiguous()
            d_logits = _C.sigmoid_focalloss_backward(
                logits, targets, d_loss, num_classes, gamma, alpha
            )
            return d_logits, None, None, None, None

But it triggers the error below:

    ...
    losses.backward()
      File "/usr/local/lib/python3.6/dist-packages/torch/tensor.py", line 107, in backward
        torch.autograd.backward(self, gradient, retain_graph, create_graph)
      File "/usr/local/lib/python3.6/dist-packages/torch/autograd/__init__.py", line 93, in backward
        allow_unreachable=True)  # allow_unreachable flag
    RuntimeError: Function _SigmoidFocalLossBackward returned an invalid gradient at index 0 - expected type torch.cuda.HalfTensor but got torch.cuda.FloatTensor

-----------------3rd Trial-----------------

"d_logits" is float type, while my model's parameters are a half/float mix layer by layer because of apex, so I thought I needed to convert the dtype of d_logits to half() as below. But even after changing d_logits to half, it shows a similar error again.

    ...
    optimizer.step()
      File "/usr/local/lib/python3.6/dist-packages/apex/amp/_initialize.py", line 247, in new_step
        output = old_step(*args, **kwargs)
      File "/usr/local/lib/python3.6/dist-packages/torch/optim/sgd.py", line 93, in step
        d_p.add_(weight_decay, p.data)
      File "/usr/local/lib/python3.6/dist-packages/apex/amp/wrap.py", line 101, in wrapper
        return orig_fn(arg0, *args, **kwargs)
    RuntimeError: expected backend CUDA and dtype Float but got backend CUDA and dtype Half

I can print the values of d_logits in the backward(..) function of _SigmoidFocalLoss, so it is obvious that the error occurs after the backward of the _SigmoidFocalLoss class.
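(Side note: a quick way to check which gradient dtypes flow backward across a cast is a tensor hook. This is a self-contained toy example, not code from my model:)

    import torch

    def dump_grad(name):
        # prints the dtype of the gradient flowing back into the tensor
        def hook(grad):
            print(name, grad.dtype)
            return grad
        return hook

    x = torch.randn(2, 3, device='cuda', dtype=torch.half, requires_grad=True)
    y = x.float().sigmoid()          # cast + FP32 op, similar to the focal loss path
    x.register_hook(dump_grad('x'))  # prints torch.float16 during backward
    y.register_hook(dump_grad('y'))  # prints torch.float32 during backward
    y.sum().backward()

Here the .float() cast is an ordinary traced op, so autograd converts the gradient back to half by itself; inside a custom Function's forward nothing is traced, which is why the backward has to return the gradient in the right dtype explicitly.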

-----------------4th Trial-----------------

"return d_logits.half(), None, None, None, None" is replaced with

    d_logits = d_logits.half()
    return d_logits, None, None, None, None

Error again!

    ...
      File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/distributed.py", line 390, in forward
        self.reducer.prepare_for_backward(list(_find_tensors(output)))
    RuntimeError: grad.type() == variable.type() ASSERT FAILED at /pytorch/torch/csrc/distributed/c10d/reducer.cpp:214, please report a bug to PyTorch. (mark_variable_ready at /pytorch/torch/csrc/distributed/c10d/reducer.cpp:214)

As far as I understand, the model runs in mixed precision and only the loss part runs in FP32 during the forward pass; the backward pass then starts from the FP32 loss result, which flows back into the model, so I should change it back to half() at that point:

    [FP16]                 [FP32/FP16]           [FP32]
    input image tensor  ->  model(parameters)  -> loss      (forward)
                            model(parameters) <-  loss      (backward)

During backpropagation, the weights are updated in FP16.
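A minimal, self-contained illustration of what I mean by casting back at that boundary, with a plain sigmoid standing in for the FP32-only CUDA kernel (toy code, not my actual loss):

    import torch
    from torch.autograd import Function
    from torch.autograd.function import once_differentiable

    class Fp32OnlyOp(Function):
        # Wraps an op that only supports float32 and hands autograd a
        # gradient in whatever dtype the forward input actually had.

        @staticmethod
        def forward(ctx, logits):
            ctx.logits_dtype = logits.dtype       # remember the incoming dtype
            logits_fp32 = logits.float()          # the "kernel" only accepts FP32
            ctx.save_for_backward(logits_fp32)
            return logits_fp32.sigmoid()

        @staticmethod
        @once_differentiable
        def backward(ctx, d_out):
            logits_fp32, = ctx.saved_tensors
            s = logits_fp32.sigmoid()
            d_logits = d_out.float() * s * (1.0 - s)   # gradient of sigmoid, in FP32
            # autograd expects the gradient in the same dtype as the forward input
            return d_logits.to(ctx.logits_dtype)

    x = torch.randn(4, 3, device='cuda', dtype=torch.half, requires_grad=True)
    Fp32OnlyOp.apply(x).sum().backward()
    print(x.grad.dtype)   # torch.float16

This is basically what the 4th trial does with the hard-coded .half(), just written so the returned gradient always follows whatever dtype the forward actually received.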

Ugh... after many trials of changing the types of the logits and target variables to float / int / half in both the forward and backward functions, I can't find a nice solution...

Please anybody help me~!

Thank you!

dedoogong commented 5 years ago

My environment is as below:

PyTorch version: 1.1.0
Is debug build: No
CUDA used to build PyTorch: 9.0.176

OS: Ubuntu 18.04.2 LTS
GCC version: (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
CMake version: version 3.13.3

Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: GeForce GTX 1080
GPU 1: GeForce GTX 1080

Nvidia driver version: 410.48
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.7.3.1
/usr/local/cuda-10.0/targets/x86_64-linux/lib/libcudnn.so.7.4.2

Versions of relevant libraries:
[pip3] msgpack-numpy==0.4.4.3
[pip3] numpy==1.16.4
[pip3] torch==1.1.0
[pip3] torchvision==0.2.1
[conda] Could not collect
Pillow (6.0.0)


dedoogong commented 5 years ago

When I changed the opt level to O2, it showed a different error:

O2: expected backend CUDA and dtype Float but got backend CUDA and dtype Half

And then if I select O3, surprisingly, I could run the first iteration, that is, I saw normal loss values, but after that all loss values turned out to be NaN (maybe because of gradient under/overflow)...
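If the NaNs really come from gradient under/overflow: as far as I understand from the apex docs, O3 uses loss_scale=1.0 by default, so loss scaling has to be requested explicitly. An untested sketch of what I would try (everything else as in the 1st trial; `losses` stands for the summed loss in the training loop):

    # O3 ("pure" FP16) does not enable loss scaling by default; asking for
    # dynamic loss scaling should help against gradient under/overflow
    model, optimizer = amp.initialize(
        model, optimizer, opt_level='O3', loss_scale='dynamic'
    )

    # backward goes through amp so the scale is applied and removed around the step
    with amp.scale_loss(losses, optimizer) as scaled_losses:
        scaled_losses.backward()
    optimizer.step()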

dedoogong commented 5 years ago

I found a temporary solution: applying @amp.float_function to the sigmoid focal loss's forward(), I can successfully train my model with MobileNetV2+FPN in FP16. But I want to use another backbone similar to MobileNetV2, and after changing the backbone to my fine-tuned MobileNet-like model, if I don't use the find_unused_parameters=True option for torch.nn.parallel.DistributedDataParallel, it shows this error:

RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing its output (the return value of forward). You can enable unused parameter detection by passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel. If you already have this argument set, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable). (prepare_for_backward at /pytorch/torch/csrc/distributed/c10d/reducer.cpp:408)

And if I do use the option, it shows another error:

    Traceback (most recent call last):
      File "tools/train_net.py", line 197, in <module>
        main()
      File "tools/train_net.py", line 190, in main
        model = train(cfg, args.local_rank, args.distributed)
      File "tools/train_net.py", line 86, in train
        arguments,
      File "/usr/local/lib/python3.6/dist-packages/fcos-0.1.9-py3.6-linux-x86_64.egg/fcos_core/engine/trainer.py", line 78, in do_train
        loss_dict = model(images, targets)
      File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 493, in __call__
        result = self.forward(*input, **kwargs)
      File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/distributed.py", line 390, in forward
        self.reducer.prepare_for_backward(list(_find_tensors(output)))
    RuntimeError: grad.type() == variable.type() ASSERT FAILED at /pytorch/torch/csrc/distributed/c10d/reducer.cpp:214, please report a bug to PyTorch. (mark_variable_ready at /pytorch/torch/csrc/distributed/c10d/reducer.cpp:214)
    frame #0: std::function<std::string ()>::operator()() const + 0x11 (0x7f3c5c158441 in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10.so)
    frame #1: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x2a (0x7f3c5c157d7a in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10.so)
    frame #2: c10d::Reducer::mark_variable_ready(unsigned long, unsigned long, bool) + 0x680 (0x7f3c5cc903c0 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
    frame #3: c10d::Reducer::prepare_for_backward(std::vector<torch::autograd::Variable, std::allocator<torch::autograd::Variable> > const&) + 0x49a (0x7f3c5cc916ea in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)

This error must be related to the "find_unused_parameters=True" option, as in torch/nn/parallel/distributed.py:371:

def forward(self, *inputs, **kwargs):
    self._sync_params()
    if self.device_ids:
        inputs, kwargs = self.scatter(inputs, kwargs, self.device_ids)
        if len(self.device_ids) == 1:
            output = self.module(*inputs[0], **kwargs[0])
        else:
            outputs = self.parallel_apply(self._module_copies[:len(inputs)], inputs, kwargs)
            output = self.gather(outputs, self.output_device)
    else:
        output = self.module(*inputs, **kwargs)

    if torch.is_grad_enabled():
        # We'll return the output object verbatim since it is a freeform
        # object. We need to find any tensors in this object, though,
        # because we need to figure out which parameters were used during
        # this forward pass, to ensure we short circuit reduction for any
        # unused parameters. Only if `find_unused_parameters` is set.
        if self.find_unused_parameters:
            self.reducer.prepare_for_backward(list(_find_tensors(output)))
        else:
            self.reducer.prepare_for_backward([])

That line causes the error, and the usage of the option is described in the comment above it.

So, I still need to find a way to avoid the assert grad.type == variable.type error...
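For reference, the temporary workaround I mentioned above (forcing the focal loss to run in FP32 under amp) looks roughly like this; the wrapper function name is illustrative, and the exact spot where I attach the decorator in my code may differ:

    from apex import amp

    # amp casts Half tensor arguments to Float before this function runs, so the
    # FP32-only CUDA kernel never sees half inputs
    @amp.float_function
    def sigmoid_focal_loss(logits, targets, gamma, alpha):
        return _SigmoidFocalLoss.apply(logits, targets, gamma, alpha)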

dedoogong commented 5 years ago

UPDATE: I saw a strange result: I can solve the problem above using 1 GPU, but it's still not possible using 2 or more GPUs...