Open KhemSon opened 2 years ago
I think TrainAugmentation is causing this issue, but I'm not sure. To verify that, I want to disable image augmentation.
I would remove operators from https://github.com/dusty-nv/pytorch-ssd/blob/21383204c68846bfff95acbbd93d39914a77c707/vision/ssd/data_preprocessing.py#L13 to determine which one is causing the NaNs.
That's a nifty tip about torch.autograd.set_detect_anomaly(), I will have to remember that.
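For anyone who has not used anomaly detection before, here is a minimal sketch of what it does. The toy computation (a square root at zero, multiplied by zero) is only an illustration and is not taken from train_ssd.py; it simply produces a NaN gradient so the check has something to catch. The helper name nan_backward_error is hypothetical.

```python
import torch


def nan_backward_error() -> str:
    """Run a toy backward pass that yields a NaN gradient; return the error text."""
    x = torch.tensor([0.0], requires_grad=True)
    try:
        # detect_anomaly() is the context-manager form of set_detect_anomaly(True).
        with torch.autograd.detect_anomaly():
            # Forward is finite: sqrt(0) * 0 == 0.
            # Backward of sqrt computes 0 / (2 * sqrt(0)), which is NaN.
            y = (torch.sqrt(x) * 0.0).sum()
            y.backward()
    except RuntimeError as err:
        return str(err)
    return ""


print(nan_backward_error())
```

Anomaly detection slows training noticeably, so it is probably best enabled only while hunting for the NaN and switched off again afterwards.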
Hi @dusty-nv, Thank you so much for your prompt response.
As per your suggestion, I tried removing each operator individually to find the source of the NaN, but none of them produced NaN values. However, when I ran an isnan check on the output of target_transform, I was able to locate the issue.
torch.log(center_form_boxes[..., 2:] / center_form_priors[..., 2:])
The log term above, from the convert_boxes_to_locations function, causes this issue. Please refer to the following log.
2022-09-14 17:11:18 - Epoch: 0, Step: 1383/2195, Avg Loss: 12.3409, Avg Regression Loss 7.9730, Avg Classification Loss: 4.3678
center_form_boxes[..., :2]: tensor([[0.3156, 0.3982],
[0.7146, 0.3298],
[0.7146, 0.3298],
...,
[0.3233, 0.3999],
[0.3233, 0.3999],
[0.3233, 0.3999]])
center_form_priors[..., :2]: tensor([[0.0267, 0.0267],
[0.0267, 0.0267],
[0.0267, 0.0267],
...,
[0.5000, 0.5000],
[0.5000, 0.5000],
[0.5000, 0.5000]])
torch.log term: tensor([[ nan, nan],
[-3.5644, -2.0298],
[-3.6312, -1.4035],
...,
[-3.8496, -2.7807],
[-4.2475, -2.1801],
[-3.6469, -2.7807]])
2022-09-14 17:11:23 - Epoch: 0, Step: 1384/2195, Avg Loss: 4.9618, Avg Regression Loss 1.6924, Avg Classification Loss: 3.2693
2022-09-14 17:11:28 - Epoch: 0, Step: 1385/2195, Avg Loss: 7.9004, Avg Regression Loss 3.6440, Avg Classification Loss: 4.2564
Traceback (most recent call last):
File "train_ssd.py", line 412, in
Could you please suggest how to resolve this?
I'm attaching the saved tensor files (written with torch.save()), which contain the center_form_boxes, center_form_priors, and log-term tensors: tensors.zip
center_form_boxes contains negative values, and torch.log of a negative value results in NaN.
@dusty-nv please let me know what you think.
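To make that observation concrete, here is a small sketch with made-up boxes in (cx, cy, w, h) form. The values and the helper rows_with_nonpositive_size are hypothetical and not part of pytorch-ssd; the point is only that a non-positive width or height turns the log term into NaN, and that such rows can be located before the log is taken.

```python
import torch


def rows_with_nonpositive_size(center_form_boxes: torch.Tensor) -> torch.Tensor:
    """Return indices of boxes whose width or height (last two columns) is <= 0."""
    bad = (center_form_boxes[..., 2:] <= 0).any(dim=-1)
    return torch.nonzero(bad).flatten()


# Made-up boxes in (cx, cy, w, h) form; the second one has a negative width.
boxes = torch.tensor([[0.32, 0.40, 0.10, 0.20],
                      [0.71, 0.33, -0.05, 0.10]])
priors = torch.tensor([[0.50, 0.50, 0.30, 0.30],
                       [0.50, 0.50, 0.30, 0.30]])

# Same expression as the log term in convert_boxes_to_locations.
log_term = torch.log(boxes[..., 2:] / priors[..., 2:])

print(torch.isnan(log_term).any(dim=-1))  # which rows contain NaN
print(rows_with_nonpositive_size(boxes))  # indices of the offending boxes
```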
I'm not super familiar with all the details of the transforms, as I'm not the original author of the pytorch-ssd code. You could try logging an issue on the upstream GitHub for it. Or, if this condition only happens on a few items from your dataset, remove those from the dataset.
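If only a few annotations are affected, one way to find them is to scan the boxes before training. The sketch below assumes corner-form boxes (x_min, y_min, x_max, y_max); the Box alias, the helper names, and the sample values are hypothetical, so adapt them to however your dataset stores annotations.

```python
from typing import List, Tuple

# Assumed layout: (x_min, y_min, x_max, y_max)
Box = Tuple[float, float, float, float]


def is_valid_box(box: Box, min_size: float = 1e-6) -> bool:
    """A box is usable only if both its width and height are strictly positive."""
    x_min, y_min, x_max, y_max = box
    return (x_max - x_min) > min_size and (y_max - y_min) > min_size


def split_boxes(boxes: List[Box]) -> Tuple[List[Box], List[Box]]:
    """Separate usable boxes from degenerate ones so the latter can be inspected."""
    kept = [b for b in boxes if is_valid_box(b)]
    dropped = [b for b in boxes if not is_valid_box(b)]
    return kept, dropped


annotations = [
    (10.0, 20.0, 50.0, 80.0),  # normal box
    (30.0, 40.0, 30.0, 60.0),  # zero width
    (70.0, 90.0, 60.0, 95.0),  # x_max < x_min, negative width
]
kept, dropped = split_boxes(annotations)
print(f"kept {len(kept)}, dropped {len(dropped)}")
```

Printing the dropped boxes together with their image IDs may also show whether the bad boxes come from the annotation files or from an augmentation step.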
Hello, I got the same error as you.
I solved it by using the torch.nan_to_num() function. By default it converts NaN to 0, and it can also replace -inf with a custom value.
You can check the documentation here (https://pytorch.org/docs/stable/generated/torch.nan_to_num.html).
I can't promise it won't affect your model's performance, because I am new to machine learning, but I hope it is helpful to you :D
Same issue here. Please refer to my code and data, @dusty-nv.
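As a quick illustration of that function, here is a minimal sketch on a made-up tensor. The replacement value -10.0 is an arbitrary choice for the example, not a recommendation.

```python
import torch

# Made-up values: one NaN, one -inf, one ordinary number.
t = torch.tensor([float("nan"), float("-inf"), 1.5])

# nan= sets the replacement for NaN, neginf= the replacement for -inf.
cleaned = torch.nan_to_num(t, nan=0.0, neginf=-10.0)
print(cleaned)
```

Note that this replaces the symptom rather than the cause: a target of 0 for a box with an invalid size is still a made-up regression target, so it may be worth checking why the box is invalid in the first place.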
import numpy as np
import torch
import torch.nn.functional as F
# d_gt = np.random.random([12, 12])
# d_pred = np.random.random([12, 12])
d_gt = np.load('C:/Users/liqiang.li/Downloads/20230216_082727__gt_locations.txt.npy')
d_pred = np.load('C:/Users/liqiang.li/Downloads/20230216_082727__predicted_locations.txt.npy')
smooth_l1_loss = F.smooth_l1_loss(torch.tensor(d_pred),
                                  torch.tensor(d_gt),
                                  size_average=False)
# smooth_l1_loss >> nan
print(smooth_l1_loss)
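Since the .npy files above are local to that machine, here is a self-contained sketch of the same effect on made-up tensors. It uses reduction="sum", which is the current spelling of the deprecated size_average=False. A single NaN in the targets is enough to make the summed loss NaN, while masking out the non-finite rows gives a finite value.

```python
import torch
import torch.nn.functional as F

# Made-up predictions and targets; the first target row contains NaN,
# similar to the gt_locations shown in the original post.
pred = torch.tensor([[1.0, 2.0, -6.0, -2.0],
                     [0.5, 0.8, -16.0, -11.0]])
gt = torch.tensor([[25.0, 15.0, float("nan"), float("nan")],
                   [4.0, 2.0, -13.0, -9.0]])

# One NaN target poisons the whole sum.
loss_all = F.smooth_l1_loss(pred, gt, reduction="sum")

# Keep only rows whose targets are all finite.
finite_rows = torch.isfinite(gt).all(dim=-1)
loss_finite = F.smooth_l1_loss(pred[finite_rows], gt[finite_rows], reduction="sum")

print(loss_all, loss_finite)
```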
Please refer to my code below, which fixes the bug 👍
@dusty-nv @KhemSon thanks again for your suggestions on locating the bug.
import torch
import torch.nn.functional as F

def convert_boxes_to_locations(center_form_boxes, center_form_priors, center_variance, size_variance):
    # priors can have one dimension less
    if center_form_priors.dim() + 1 == center_form_boxes.dim():
        center_form_priors = center_form_priors.unsqueeze(0)
    # Fix NaN bug: apply relu (plus a small epsilon) before the log so its argument is always positive (leef, 20230223)
    return torch.cat([
        (center_form_boxes[..., :2] - center_form_priors[..., :2]) / center_form_priors[..., 2:] / center_variance,
        torch.log(F.relu(center_form_boxes[..., 2:] / center_form_priors[..., 2:]) + 1e-7) / size_variance
    ],
        dim=center_form_boxes.dim() - 1)
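To see what the patched log term returns for invalid sizes, here is a small sketch on made-up ratios. The value size_variance = 0.2 is an assumption for the example, so check it against your own config.

```python
import math

import torch
import torch.nn.functional as F

size_variance = 0.2  # assumed value for illustration; check your config

# Made-up ratios of box size to prior size: negative, zero, and valid.
ratio = torch.tensor([-0.5, 0.0, 0.5])

# Same expression as the patched log term above.
patched = torch.log(F.relu(ratio) + 1e-7) / size_variance
print(patched)
print(math.log(1e-7) / size_variance)  # value that invalid sizes collapse to
```

The NaN is gone, but negative and zero sizes now map to a large negative target (roughly log(1e-7) / size_variance), which may still pull the regression loss up for those boxes. Dropping or repairing the invalid boxes might therefore be a cleaner option where that is feasible.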
Hi,
I'm training SSD-Mobilenet Model on Bosch Small Traffic Lights Dataset.
While training, my Avg Loss decreases slowly, but then I suddenly get NaN. I tried several suggested methods, but the issue still persists.
However, after enabling PyTorch's anomaly detection, i.e. torch.autograd.set_detect_anomaly(True), I was able to find the instance and source of the NaN. On further debugging, I observed that one of the box locations in gt_locations contains NaN values (please refer to the following log).
image_id: 481834
predicted_locations: tensor([[ 1.4837, 1.2564, -6.5235, -2.5821],
[ 0.6447, 0.8457, -16.9513, -11.4073],
[ 2.0294, 0.9745, -15.5438, -14.0698],
[ 1.8593, 1.0754, -15.8804, -14.4709],
[ 2.0474, 1.3663, -15.7238, -14.4092]], grad_fn=)
gt_locations: tensor([[ 25.0286, 15.6667, nan, nan],
[ 4.0797, 2.3779, -13.1398, -8.8714],
[ 4.1841, 2.5611, -14.6530, -13.4025],
[ 2.0534, 0.6725, -13.3843, -12.9900],
[ 3.5518, 0.3255, -14.6399, -13.4983]])
regression_loss: nan | classification_loss: 3.4250411987304688 | loss: nan
/usr/local/lib/python3.7/dist-packages/torch/autograd/__init__.py:175: UserWarning: Error detected in SmoothL1LossBackward0. Traceback of forward call that caused the error:
File "train_ssd.py", line 409, in <module>
train(train_loader, net, criterion, optimizer, device=DEVICE, debug_steps=args.debug_steps, epoch=epoch)
File "train_ssd.py", line 148, in train
regression_loss, classification_loss = criterion(confidence, locations, labels, boxes)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/content/gdrive/MyDrive/Colab Notebooks/Amrita/jetson-inference/python/training/detection/ssd/vision/nn/multibox_loss.py", line 45, in forward
smooth_l1_loss = F.smooth_l1_loss(predicted_locations, gt_locations, size_average=False)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py", line 3188, in smooth_l1_loss
return torch._C._nn.smooth_l1_loss(expanded_input, expanded_target, _Reduction.get_enum(reduction), beta)
(Triggered internally at ../torch/csrc/autograd/python_anomaly_mode.cpp:102.)
allow_unreachable=True, accumulate_grad=True) # Calls into the C++ engine to run the backward pass
Traceback (most recent call last):
File "train_ssd.py", line 409, in <module>
train(train_loader, net, criterion, optimizer, device=DEVICE, debug_steps=args.debug_steps, epoch=epoch)
File "train_ssd.py", line 153, in train
loss.backward()
File "/usr/local/lib/python3.7/dist-packages/torch/_tensor.py", line 396, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/usr/local/lib/python3.7/dist-packages/torch/autograd/__init__.py", line 175, in backward
allow_unreachable=True, accumulate_grad=True) # Calls into the C++ engine to run the backward pass
RuntimeError: Function 'SmoothL1LossBackward0' returned nan values in its 0th output.
I think TrainAugmentation is causing this issue, but I'm not sure. To verify that, I want to disable image augmentation. @dusty-nv could you please let me know how to do that?
Thank you in advance!