Re-training SSD-Mobilenet: gt_locations consist of nan values which causing Regression Loss to NaN

KhemSon commented 2 years ago

Hi,

I'm training SSD-Mobilenet Model on Bosch Small Traffic Lights Dataset.

During training, the average loss decreases slowly and then suddenly becomes NaN. I tried the following, but the issue still persists:

Following https://forums.developer.nvidia.com/t/error-training-with-jetson-inference/210095, I verified the images' XML files and they look fine. Sometimes I don't get any NaN values during epoch 0.
Tuning the learning rate (0.01, 0.001, 0.0001, etc.)
Using the Adam optimizer

But after enabling PyTorch's anomaly detection, i.e. torch.autograd.set_detect_anomaly(True), I was able to find the instance and source of the NaN. On further debugging, I observed that one of the box locations in gt_locations contains NaN values (please refer to the following log):

image_id: 481834
predicted_locations: tensor([[  1.4837,   1.2564,  -6.5235,  -2.5821],
        [  0.6447,   0.8457, -16.9513, -11.4073],
        [  2.0294,   0.9745, -15.5438, -14.0698],
        [  1.8593,   1.0754, -15.8804, -14.4709],
        [  2.0474,   1.3663, -15.7238, -14.4092]], grad_fn=)
gt_locations: tensor([[ 25.0286,  15.6667,      nan,      nan],
        [  4.0797,   2.3779, -13.1398,  -8.8714],
        [  4.1841,   2.5611, -14.6530, -13.4025],
        [  2.0534,   0.6725, -13.3843, -12.9900],
        [  3.5518,   0.3255, -14.6399, -13.4983]])
regression_loss: nan | classification_loss: 3.4250411987304688 | loss: nan
/usr/local/lib/python3.7/dist-packages/torch/autograd/__init__.py:175: UserWarning: Error detected in SmoothL1LossBackward0. Traceback of forward call that caused the error:
  File "train_ssd.py", line 409, in <module>
    train(train_loader, net, criterion, optimizer, device=DEVICE, debug_steps=args.debug_steps, epoch=epoch)
  File "train_ssd.py", line 148, in train
    regression_loss, classification_loss = criterion(confidence, locations, labels, boxes)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/content/gdrive/MyDrive/Colab Notebooks/Amrita/jetson-inference/python/training/detection/ssd/vision/nn/multibox_loss.py", line 45, in forward
    smooth_l1_loss = F.smooth_l1_loss(predicted_locations, gt_locations, size_average=False)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py", line 3188, in smooth_l1_loss
    return torch._C._nn.smooth_l1_loss(expanded_input, expanded_target, _Reduction.get_enum(reduction), beta)
 (Triggered internally at ../torch/csrc/autograd/python_anomaly_mode.cpp:102.)
  allow_unreachable=True, accumulate_grad=True)  # Calls into the C++ engine to run the backward pass
Traceback (most recent call last):
  File "train_ssd.py", line 409, in <module>
    train(train_loader, net, criterion, optimizer, device=DEVICE, debug_steps=args.debug_steps, epoch=epoch)
  File "train_ssd.py", line 153, in train
    loss.backward()
  File "/usr/local/lib/python3.7/dist-packages/torch/_tensor.py", line 396, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/usr/local/lib/python3.7/dist-packages/torch/autograd/__init__.py", line 175, in backward
    allow_unreachable=True, accumulate_grad=True)  # Calls into the C++ engine to run the backward pass
RuntimeError: Function 'SmoothL1LossBackward0' returned nan values in its 0th output.

I think TrainAugmentation is causing this issue, but I'm not sure. To verify that, I want to disable image augmentation. @dusty-nv, could you please let me know how to do that?

Thank you in advance!

dusty-nv commented 2 years ago

> I think TrainAugmentation is causing this issue, but I'm not sure. To verify that, I want to disable image augmentation.

I would remove operators from https://github.com/dusty-nv/pytorch-ssd/blob/21383204c68846bfff95acbbd93d39914a77c707/vision/ssd/data_preprocessing.py#L13 to determine which one is causing the NaNs.

That's a nifty tip about torch.autograd.set_detect_anomaly(), I will have to remember that.
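For anyone who has not used it, here is a minimal sketch of how anomaly detection surfaces this kind of failure. The tensor values are made up for illustration (loosely modeled on the log above); only torch.autograd.set_detect_anomaly() and F.smooth_l1_loss() come from the thread.

```python
import torch
import torch.nn.functional as F

# Illustrative values only: one regression target row contains NaN.
predicted = torch.zeros(2, 4, requires_grad=True)
gt = torch.tensor([[25.0286, 15.6667, float("nan"), float("nan")],
                   [4.0797, 2.3779, -13.1398, -8.8714]])

torch.autograd.set_detect_anomaly(True)
try:
    loss = F.smooth_l1_loss(predicted, gt, reduction="sum")
    loss.backward()
    anomaly_message = None
except RuntimeError as err:
    # Anomaly mode raises as soon as a backward function returns NaN
    anomaly_message = str(err)
finally:
    # Anomaly mode slows training, so switch it off after debugging
    torch.autograd.set_detect_anomaly(False)

print(anomaly_message)
```

Because the check adds overhead to every backward pass, it is probably best enabled only while hunting for the NaN.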

KhemSon commented 2 years ago

Hi @dusty-nv, Thank you so much for your prompt response.

As you suggested, I tried removing each operator individually to find the source of the NaN, but they all produce non-NaN values. However, when I added an isnan check on the output of target_transform, I was able to locate the issue.

https://github.com/dusty-nv/pytorch-ssd/blob/21383204c68846bfff95acbbd93d39914a77c707/vision/utils/box_utils.py#L115

torch.log(center_form_boxes[..., 2:] / center_form_priors[..., 2:])

The log term above, from the convert_boxes_to_locations function, causes this issue. Please refer to the following log.

2022-09-14 17:11:18 - Epoch: 0, Step: 1383/2195, Avg Loss: 12.3409, Avg Regression Loss 7.9730, Avg Classification Loss: 4.3678
center_form_boxes[..., :2]: tensor([[0.3156, 0.3982],
        [0.7146, 0.3298],
        [0.7146, 0.3298],
        ...,
        [0.3233, 0.3999],
        [0.3233, 0.3999],
        [0.3233, 0.3999]])
center_form_priors[..., :2]: tensor([[0.0267, 0.0267],
        [0.0267, 0.0267],
        [0.0267, 0.0267],
        ...,
        [0.5000, 0.5000],
        [0.5000, 0.5000],
        [0.5000, 0.5000]])
torch.log term: tensor([[    nan,     nan],
        [-3.5644, -2.0298],
        [-3.6312, -1.4035],
        ...,
        [-3.8496, -2.7807],
        [-4.2475, -2.1801],
        [-3.6469, -2.7807]])
2022-09-14 17:11:23 - Epoch: 0, Step: 1384/2195, Avg Loss: 4.9618, Avg Regression Loss 1.6924, Avg Classification Loss: 3.2693
2022-09-14 17:11:28 - Epoch: 0, Step: 1385/2195, Avg Loss: 7.9004, Avg Regression Loss 3.6440, Avg Classification Loss: 4.2564
Traceback (most recent call last):
  File "train_ssd.py", line 412, in <module>
    train(train_loader, net, criterion, optimizer, device=DEVICE, debug_steps=args.debug_steps, epoch=epoch)
  File "train_ssd.py", line 139, in train
    for i, data in enumerate(loader):
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py", line 681, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py", line 1376, in _next_data
    return self._process_data(data)
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py", line 1402, in _process_data
    data.reraise()
  File "/usr/local/lib/python3.7/dist-packages/torch/_utils.py", line 461, in reraise
    raise exception
AssertionError: Caught AssertionError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
    data = fetcher.fetch(index)
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataset.py", line 235, in __getitem__
    return self.datasets[dataset_idx][sample_idx]
  File "/content/gdrive/MyDrive/Colab Notebooks/Amrita/jetson-inference/python/training/detection/ssd/vision/datasets/voc_dataset.py", line 93, in __getitem__
    boxes, labels = self.target_transform(boxes, labels)
  File "/content/gdrive/MyDrive/Colab Notebooks/Amrita/jetson-inference/python/training/detection/ssd/vision/ssd/ssd.py", line 174, in __call__
    locations = box_utils.convert_boxes_to_locations(boxes, self.center_form_priors, self.center_variance, self.size_variance)
  File "/content/gdrive/MyDrive/Colab Notebooks/Amrita/jetson-inference/python/training/detection/ssd/vision/utils/box_utils.py", line 119, in convert_boxes_to_locations
    assert not torch.isnan(torch.log(center_form_boxes[..., 2:] / center_form_priors[..., 2:])).any()
AssertionError

Could you please suggest how to resolve this?

KhemSon commented 2 years ago

I'm attaching the saved tensor files (saved with torch.save()), which contain center_form_boxes, center_form_priors, and the log term: tensors.zip

center_form_boxes contains negative values, and torch.log of a negative value results in NaN.

@dusty-nv please let me know what you think.
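A small sketch of the observation above, using made-up center-form tensors (cx, cy, w, h) rather than the attached ones. The first box is given a negative width and height on purpose, to show how the log term turns into NaN and how the offending rows can be located.

```python
import torch

# Hypothetical values: row 0 has a negative width and height.
center_form_boxes = torch.tensor([[0.3156, 0.3982, -0.0100, -0.0200],
                                  [0.7146, 0.3298, 0.0500, 0.0800]])
center_form_priors = torch.tensor([[0.0267, 0.0267, 0.1000, 0.1000],
                                   [0.5000, 0.5000, 0.2000, 0.2000]])

# Same expression as the log term in convert_boxes_to_locations
ratio = center_form_boxes[..., 2:] / center_form_priors[..., 2:]
log_term = torch.log(ratio)  # log of a negative ratio is NaN

# Indices of boxes whose width or height is not strictly positive
bad_rows = (center_form_boxes[..., 2:] <= 0).any(dim=-1).nonzero(as_tuple=True)[0]

print(log_term)
print(bad_rows.tolist())
```

Checking the sizes directly (rather than the log output) also catches zero-sized boxes, which would give -inf instead of NaN.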

dusty-nv commented 2 years ago

I'm not super familiar with all the details of the transforms, as I'm not the original author of the pytorch-ssd code. You could try logging an issue on the upstream github for it. Or if this condition only happens on a few items from your dataset, remove those from the dataset.
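If the bad items do come from the annotations, a scan along these lines might help find them. This is only a sketch: it assumes Pascal VOC style XML (the traceback goes through voc_dataset.py), the function name find_degenerate_boxes and the sample annotation are hypothetical, and it cannot detect boxes that only become degenerate after augmentation.

```python
import xml.etree.ElementTree as ET


def find_degenerate_boxes(xml_text):
    """Return (name, xmin, ymin, xmax, ymax) for boxes with non-positive width or height."""
    root = ET.fromstring(xml_text)
    bad = []
    for obj in root.iter("object"):
        bndbox = obj.find("bndbox")
        xmin, ymin, xmax, ymax = (float(bndbox.find(tag).text)
                                  for tag in ("xmin", "ymin", "xmax", "ymax"))
        if xmax <= xmin or ymax <= ymin:
            bad.append((obj.findtext("name"), xmin, ymin, xmax, ymax))
    return bad


# Made-up annotation: the second object has xmax < xmin.
SAMPLE_XML = """<annotation>
  <filename>example.jpg</filename>
  <object>
    <name>Green</name>
    <bndbox><xmin>10</xmin><ymin>20</ymin><xmax>18</xmax><ymax>40</ymax></bndbox>
  </object>
  <object>
    <name>Red</name>
    <bndbox><xmin>120</xmin><ymin>40</ymin><xmax>118</xmax><ymax>52</ymax></bndbox>
  </object>
</annotation>"""

print(find_degenerate_boxes(SAMPLE_XML))
```

Running it over every XML file in the dataset would give a list of candidates to fix or drop.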

melli0505 commented 2 years ago

Hello, I got the same error as you. I solved it by using the torch.nan_to_num() function, which converts NaN to 0 and can also replace -inf with a custom value. You can check the documentation here: https://pytorch.org/docs/stable/generated/torch.nan_to_num.html

I can't promise it won't affect your model's performance, because I am new to machine learning, but I hope it is helpful to you :D
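A minimal sketch of that workaround, with made-up target values. The replacement value -20.0 for -inf is an arbitrary choice for illustration, not a recommendation.

```python
import torch

# Made-up regression targets containing NaN and -inf
targets = torch.tensor([float("nan"), float("-inf"), -13.1398])

# nan and neginf are keyword arguments of torch.nan_to_num
cleaned = torch.nan_to_num(targets, nan=0.0, neginf=-20.0)

print(cleaned)
```

Note that this replaces the bad targets silently: a size target of 0 effectively tells the network that the box has the same size as its prior, so it may be worth counting how many values get replaced.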

leaf918 commented 1 year ago

Same issue. Please refer to my code and data, @dusty-nv.

import numpy as np
import torch
import torch.nn.functional as F

# d_gt = np.random.random([12, 12])
# d_pred = np.random.random([12, 12])
# Load the saved ground-truth and predicted locations (attached below)
d_gt = np.load('C:/Users/liqiang.li/Downloads/20230216_082727__gt_locations.txt.npy')
d_pred = np.load('C:/Users/liqiang.li/Downloads/20230216_082727__predicted_locations.txt.npy')

# reduction='sum' replaces the deprecated size_average=False
smooth_l1_loss = F.smooth_l1_loss(torch.tensor(d_pred),
                                  torch.tensor(d_gt),
                                  reduction='sum')
# smooth_l1_loss  >> nan
print(smooth_l1_loss)

20230216_082727__predicted_locations.txt.zip

leaf918 commented 1 year ago

Please refer to my code below, which fixes the bug 👍

@dusty-nv @KhemSon thanks again for your suggestions on locating the bug.


import torch
import torch.nn.functional as F


def convert_boxes_to_locations(center_form_boxes, center_form_priors, center_variance, size_variance):
    # priors can have one dimension less
    if center_form_priors.dim() + 1 == center_form_boxes.dim():
        center_form_priors = center_form_priors.unsqueeze(0)

    # Fix NaN bug: apply ReLU and add a small epsilon before the log,
    # so the log argument is always positive. leef, 20230223
    return torch.cat([
        (center_form_boxes[..., :2] - center_form_priors[..., :2]) / center_form_priors[..., 2:] / center_variance,
        torch.log(F.relu(center_form_boxes[..., 2:] / center_form_priors[..., 2:]) + 1e-7) / size_variance
    ], dim=center_form_boxes.dim() - 1)
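
A similar guard can be written with torch.clamp, which leaves valid boxes untouched instead of adding the epsilon to every ratio. This is a sketch of a variant, not the upstream code: the function name, the eps parameter, the sample tensors, and the variance values 0.1 and 0.2 are all assumptions made for illustration.

```python
import torch


def convert_boxes_to_locations_clamped(center_form_boxes, center_form_priors,
                                       center_variance, size_variance, eps=1e-7):
    # priors can have one dimension less
    if center_form_priors.dim() + 1 == center_form_boxes.dim():
        center_form_priors = center_form_priors.unsqueeze(0)
    size_ratio = center_form_boxes[..., 2:] / center_form_priors[..., 2:]
    return torch.cat([
        (center_form_boxes[..., :2] - center_form_priors[..., :2]) / center_form_priors[..., 2:] / center_variance,
        # Clamp keeps the log argument at or above eps, so it is never <= 0
        torch.log(torch.clamp(size_ratio, min=eps)) / size_variance,
    ], dim=center_form_boxes.dim() - 1)


# Made-up boxes in (cx, cy, w, h) form; row 0 has a negative width.
boxes = torch.tensor([[0.30, 0.40, -0.01, 0.05],
                      [0.70, 0.30, 0.05, 0.08]])
priors = torch.tensor([[0.25, 0.25, 0.10, 0.10],
                       [0.50, 0.50, 0.20, 0.20]])

locations = convert_boxes_to_locations_clamped(boxes, priors, 0.1, 0.2)
print(locations)
```

Either guard removes the NaN, but a degenerate box still ends up with a very large negative size target, so filtering such boxes out of the dataset may be the cleaner fix.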

dusty-nv / jetson-inference

Re-training SSD-Mobilenet: gt_locations consist of nan values which causing Regression Loss to NaN #1495