NaN or Inf found in input tensor.

Raiden-cn commented 2 years ago

Hi, I ran into this problem during the 2nd epoch of training. Here are my training command and the error log:

python train.py --cfg_file cfgs/kitti_models/sfd.yaml --batch_size 2

INFO **Start training kitti_models/sfd(default)**
epochs: 2%|██▋ | 1/40 [31:23<19:34:55, 1807.57s/it, loss=nan, lr=0.00109]
NaN or Inf found in input tensor.
| 82/1856 [01:15<29:36, 1.00s/it, total_it=1938]
NaN or Inf found in input tensor.
NaN or Inf found in input tensor.
maxoverlaps:(min=nan, max=nan)
ERROR: FG=0, BG=0
epochs: 2%|██▋ | 1/40 [31:3D-DET/SFD/output/kitti_models/sfd/default/ckpt
Traceback (most recent call last):
  File "train.py", line 200, in <module>
    main()
  File "train.py", line 172, in main
    merge_all_iters_to_one_epoch=args.merge_all_iters_to_one_epoch
  File "/home/zhangyu/3D-DET/SFD/tools/train_utils/train_utils.py", line 93, in train_model
    dataloader_iter=dataloader_iter
  File "/home/zhangyu/3D-DET/SFD/tools/train_utils/train_utils.py", line 38, in train_one_epoch
    loss, tb_dict, disp_dict = model_func(model, batch)
  File "/home/zhangyu/3D-DET/SFD/pcdet/models/__init__.py", line 30, in model_func
    ret_dict, tb_dict, disp_dict = model(batch_dict)
  File "/home/zhangyu/miniconda3/envs/SFD-3d/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/zhangyu/3D-DET/SFD/pcdet/models/detectors/sfd.py", line 11, in forward
    batch_dict = cur_module(batch_dict)
  File "/home/zhangyu/miniconda3/envs/SFD-3d/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/zhangyu/3D-DET/SFD/pcdet/models/roi_heads/sfd_head.py", line 576, in forward
    targets_dict = self.assign_targets(batch_dict)
  File "/home/zhangyu/3D-DET/SFD/pcdet/models/roi_heads/roi_head_template.py", line 117, in assign_targets
    targets_dict = self.proposal_target_layer.forward(batch_dict)
  File "/home/zhangyu/3D-DET/SFD/pcdet/models/roi_heads/target_assigner/proposal_target_layer.py", line 33, in forward
    batch_dict=batch_dict
  File "/home/zhangyu/3D-DET/SFD/pcdet/models/roi_heads/target_assigner/proposal_target_layer.py", line 107, in sample_rois_for_rcnn
    sampled_inds = self.subsample_rois(max_overlaps=max_overlaps)
  File "/home/zhangyu/3D-DET/SFD/pcdet/models/roi_heads/target_assigner/proposal_target_layer.py", line 159, in subsample_rois
    raise NotImplementedError
NotImplementedError
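
The log already shows loss=nan before the crash, so the NotImplementedError at the end appears to be a downstream symptom rather than the root cause. A minimal sketch for finding the first iteration at which the loss stops being finite is below. It assumes the per-iteration loss values have been collected as plain floats (for example via loss.item()); the helper name first_non_finite and the sample values are hypothetical and not part of the repository.

```python
import math

def first_non_finite(values):
    """Return the index of the first NaN or Inf value, or None if all are finite."""
    for index, value in enumerate(values):
        if not math.isfinite(value):
            return index
    return None

# Hypothetical per-iteration loss values, e.g. collected with loss.item().
losses = [2.31, 1.87, 1.52, float("nan"), float("nan")]

bad_iteration = first_non_finite(losses)
if bad_iteration is not None:
    print(f"loss became non-finite at iteration {bad_iteration}")
```

Knowing the first bad iteration makes it possible to inspect the batch that was fed in at that step instead of the later crash site.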

LittlePey commented 2 years ago

Hi @Orbis36, Raiden-cn has run into a problem similar to yours. How did you finally solve it? Thank you~

Orbis36 commented 2 years ago

The core idea is to normalize the pseudo-feature inputs. Specifically, add the following code after points_features[:, 3:6] /= 255.0 in sfd_head_utils.py:

        from torch.nn.functional import normalize
        points_features[:, :3] = normalize(points_features[:, :3], dim=0)
        points_features[:, 6:] = normalize(points_features[:, 6:], dim=0)

Then everything is fine.
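
To make it clearer what normalize(..., dim=0) does to the features, here is a small pure-Python sketch of the same computation: every column is divided by its L2 norm taken over all rows, with the norm clamped from below by a small eps (1e-12 is the default in torch.nn.functional.normalize). The helper name normalize_columns and the sample points are made up for illustration; this is not the code used in the repository.

```python
import math

def normalize_columns(rows, eps=1e-12):
    """Divide every column by its L2 norm taken over all rows (like dim=0)."""
    n_cols = len(rows[0])
    norms = [
        max(math.sqrt(sum(row[c] ** 2 for row in rows)), eps)
        for c in range(n_cols)
    ]
    return [[row[c] / norms[c] for c in range(n_cols)] for row in rows]

# Three hypothetical points with (x, y, z) coordinates.
points_xyz = [
    [3.0, 0.0, 1.0],
    [4.0, 0.0, 2.0],
    [0.0, 0.0, 2.0],
]

normalized = normalize_columns(points_xyz)
for row in normalized:
    print(row)
```

Two things are worth noticing: the all-zero column stays zero instead of turning into NaN because of the eps clamp, and since the norm is taken over all rows, the scale of each value depends on how many points are in the tensor.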

The Chinese version is here: https://zhuanlan.zhihu.com/p/524097054?

Raiden-cn commented 2 years ago

Thanks for your reply. I added the code at the place you described. Training ran fine up to the 7th epoch, but then the bug happened again.

Raiden-cn commented 2 years ago

I finally got it to run successfully and obtained an excellent val result. I followed the core idea you mentioned and also changed the batch_size and LR so that it could train. Now everything is fine, as you said. Thanks for your reply. Orbis36, you are my superhero! I'll close this issue now.

RG2806 commented 1 year ago

@Raiden-cn what batch_size and LR did you use? I still hit the same bug after adding the normalization lines above.

Raiden-cn commented 1 year ago

Hi, @RG2806.

batch_size = 1
LR = 0.001

I also normalized the input features in the forward of VoxelBackBone8x:

        from torch.nn.functional import normalize
        # Normalize each voxel feature column over all voxels; coords are left unchanged.
        voxel_features = normalize(batch_dict['voxel_features'], dim=0)
        voxel_coords = batch_dict['voxel_coords']

CBY-9527 commented 1 year ago

Maybe you can try replacing giou with diou in the config file. When the predicted box and the gt box do not intersect, which can lead to minoverlaps and maxoverlaps being NaN, diou can still handle proposal regression.
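
For reference, here is a rough sketch of how IoU, GIoU and DIoU are defined, evaluated on two boxes that do not intersect. It is simplified to axis-aligned 2D boxes given as (x1, y1, x2, y2) with positive area, whereas SFD works with 3D boxes, so this is only an illustration of the definitions and not the repository's implementation. The function name iou_giou_diou and the sample boxes are made up.

```python
import math

def iou_giou_diou(a, b):
    """IoU, GIoU and DIoU for two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Intersection area (zero if the boxes do not overlap).
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    iou = inter / union

    # Smallest axis-aligned box enclosing both boxes.
    ex1, ey1 = min(a[0], b[0]), min(a[1], b[1])
    ex2, ey2 = max(a[2], b[2]), max(a[3], b[3])
    enclose_area = (ex2 - ex1) * (ey2 - ey1)
    giou = iou - (enclose_area - union) / enclose_area

    # Squared distance between centers over squared enclosing-box diagonal.
    center_dist_sq = (
        ((a[0] + a[2]) / 2 - (b[0] + b[2]) / 2) ** 2
        + ((a[1] + a[3]) / 2 - (b[1] + b[3]) / 2) ** 2
    )
    diag_sq = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2
    diou = iou - center_dist_sq / diag_sq
    return iou, giou, diou

# Two hypothetical boxes with a gap between them.
iou, giou, diou = iou_giou_diou((0.0, 0.0, 2.0, 2.0), (3.0, 0.0, 5.0, 2.0))
print(iou, giou, diou)
assert all(math.isfinite(v) for v in (iou, giou, diou))
```

In this toy case IoU is zero while GIoU and DIoU are negative but finite, so DIoU still carries a distance signal for disjoint boxes. The sketch does not show where the NaN arises in the repository.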

LittlePey / SFD

NaN or Inf found in input tensor. #5