LittlePey / SFD

Sparse Fuse Dense: Towards High Quality 3D Detection with Depth Completion (CVPR 2022, Oral)
Apache License 2.0

About the Nan of maxoverlap #3

Closed Orbis36 closed 2 years ago

Orbis36 commented 2 years ago

Hello everyone, I hit an error when training this network on another machine:

WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
epochs: 0%| | 0/40 [15:47<?, ?it/s, loss=nan, lr=0.00105]
  File "train.py", line 203, in <module>
    main()
  File "train.py", line 175, in main
    merge_all_iters_to_one_epoch=args.merge_all_iters_to_one_epoch
  File "/home/tianran/workdir/SFD/tools/train_utils/train_utils.py", line 93, in train_model
    dataloader_iter=dataloader_iter
  File "/home/tianran/workdir/SFD/tools/train_utils/train_utils.py", line 38, in train_one_epoch
    loss, tb_dict, disp_dict = model_func(model, batch)
  File "/home/tianran/workdir/SFD/pcdet/models/__init__.py", line 30, in model_func
    ret_dict, tb_dict, disp_dict = model(batch_dict)
  File "/home/tianran/anaconda3/envs/SFDNet/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/tianran/workdir/SFD/pcdet/models/detectors/sfd.py", line 11, in forward
    batch_dict = cur_module(batch_dict)
  File "/home/tianran/anaconda3/envs/SFDNet/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/tianran/workdir/SFD/pcdet/models/roi_heads/sfd_head.py", line 579, in forward
    targets_dict = self.assign_targets(batch_dict)
  File "/home/tianran/workdir/SFD/pcdet/models/roi_heads/roi_head_template.py", line 117, in assign_targets
    targets_dict = self.proposal_target_layer.forward(batch_dict)
  File "/home/tianran/workdir/SFD/pcdet/models/roi_heads/target_assigner/proposal_target_layer.py", line 33, in forward
    batch_dict=batch_dict
  File "/home/tianran/workdir/SFD/pcdet/models/roi_heads/target_assigner/proposal_target_layer.py", line 109, in sample_rois_for_rcnn
    sampled_inds = self.subsample_rois(max_overlaps=max_overlaps, frame_id=batch_dict['frame_id'])
  File "/home/tianran/workdir/SFD/pcdet/models/roi_heads/target_assigner/proposal_target_layer.py", line 162, in subsample_rois
    raise NotImplementedError
NotImplementedError
maxoverlaps:(min=nan, max=nan)
ERROR: FG=0, BG=0

The error seems to be caused by NaN values in the max_overlaps tensor. How can I fix this? It does not happen at a fixed iteration or epoch (i.e. not on a fixed sample). I have checked the files used in the failing iteration by printing the frame_id in batch_dict, and everything looks OK.
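A likely reason the NaN shows up as "FG=0, BG=0" is that every comparison against NaN is false, so a threshold-based split puts an ROI in neither the foreground nor the background set. The following is a minimal pure-Python sketch of that effect; the function name split_fg_bg and the threshold values are hypothetical and are not taken from the SFD code.

```python
def split_fg_bg(max_overlaps, fg_thresh=0.75, bg_thresh=0.25):
    """Split ROI indices by overlap thresholds (illustrative values only)."""
    fg = [i for i, v in enumerate(max_overlaps) if v >= fg_thresh]
    bg = [i for i, v in enumerate(max_overlaps) if v < bg_thresh]
    return fg, bg

# Healthy overlaps: ROIs land in the foreground or the background set.
fg_ok, bg_ok = split_fg_bg([0.9, 0.1, 0.5])

# NaN overlaps: `nan >= t` and `nan < t` are both False,
# so both index lists come back empty.
nan = float("nan")
fg_nan, bg_nan = split_fg_bg([nan, nan, nan])
print(len(fg_nan), len(bg_nan))
```

If this is what happens here, the NotImplementedError is only the symptom; the NaN is produced earlier, in the proposals or the features that feed them.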

Thanks for any help.

LittlePey commented 2 years ago

Hi, could you provide your environment, such as the GPU type? The environment that hailanyi used in #2 seems to work well. Alternatively, you can train a Voxel-R-CNN with our code to check whether the issue is caused by SFD, because we do not modify the model architecture before the program reaches Line577.

Another possibility is that the network produces NaN weights during backpropagation, so you should check whether any loss item contains Inf or NaN when the loss is calculated.
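One way to carry out that check is to test every loss term for finiteness before calling backward. This is only a sketch under assumptions: the helper find_bad_loss_items and the loss names are hypothetical, and the values are plain floats (in real training they would come from something like tensor.item()).

```python
import math

def find_bad_loss_items(loss_items):
    """Return the names of loss terms whose value is NaN or +/-Inf."""
    return [name for name, value in loss_items.items()
            if not math.isfinite(value)]

# Hypothetical loss terms for one iteration.
losses = {
    "rpn_loss": 0.82,
    "rcnn_loss_cls": float("nan"),
    "rcnn_loss_reg": float("inf"),
}

bad = find_bad_loss_items(losses)
if bad:
    # Stop here and inspect the batch instead of propagating NaN into the weights.
    print("non-finite loss items:", bad)
```

Logging the frame_id of the batch together with the offending term narrows down whether the problem is tied to specific samples or to the optimizer state.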

Orbis36 commented 2 years ago

My card is a 3090, with CUDA 11.3 and spconv 2.1. I have checked the code, and it runs successfully with Voxel-RCNN for the first 5 epochs. From your code, I think you have already taken this situation into consideration, which is why there is a NotImplementedError. Could you tell me what I can do to provide more detailed information? I also think it would not take you long to make the code compatible with spconv 2.1. I'm not sure whether I got something wrong when I modified your code, since you use the 'transform_points_to_voxels_valid' function instead of the one used in Voxel-RCNN.

Screenshot from 2022-06-05 01-34-27

LittlePey commented 2 years ago

The implementation of 'transform_points_to_voxels_valid' is the same as 'transform_points_to_voxels'. You can copy your modified code for 'transform_points_to_voxels' into 'transform_points_to_voxels_valid', or change the config to use 'transform_points_to_voxels'. Either way, remember to modify both 'transform_points_to_voxels' and 'transform_points_to_voxels_valid' to be compatible with spconv2. Hi, @hailanyi, could you give some advice?

hailanyi commented 2 years ago

In the past few days, I also had this problem. It may be caused by a relatively large learning rate, or by the point cloud not being normalized. In short: (1) reduce the learning rate appropriately, (2) normalize the initial features in VoxelBackBone8x. This will make training easier.

Orbis36 commented 2 years ago

@hailanyi Thanks! I have modified the first line of the forward function in VoxelBackBone8x to:

    from torch.nn.functional import normalize
    voxel_features, voxel_coords = normalize(batch_dict['voxel_features'], dim=0), batch_dict['voxel_coords']

After that I trained the network with LR=0.01, but the problem is still there. Since I use a batch size of 4, I think I may need a lower learning rate. I changed the LR to 0.005, which does make the training last longer, but after 10 epochs I hit this error as well. So now I am using 0.001 as the LR, and I hope it gives a good result.

So could you please share the modification you made to solve this problem? I appreciate any help.

-----------------------update----------------------------

The training with LR 0.001 hit the error after 10 epochs as well. Maybe you could just send me the code you use, if possible. My email is tianranliu20@gmail.com. I'll wait for your message.

hailanyi commented 2 years ago

Ok, I have sent an email to you @Orbis36 , including a debug description and my codes.

Orbis36 commented 2 years ago

> Ok, I have sent an email to you @Orbis36 , including a debug description and my codes.

Thanks! I've replied to your email and got a satisfactory result, so I have closed this issue.

Raiden-cn commented 2 years ago

Hi @hailanyi, I also have this problem. Could you send me the code you modified and the debug information you sent to Orbis36, or some suggestions? My email is zhangguoxin199805@163.com. Thanks.

Orbis36 commented 2 years ago

The core idea is to normalize the input pseudo features. Specifically, add the following code after points_features[:, 3:6] /= 255.0 in sfd_head_utils.py:

    from torch.nn.functional import normalize
    points_features[:, :3] = normalize(points_features[:, :3], dim=0)
    points_features[:, 6:] = normalize(points_features[:, 6:], dim=0)

Then everything is fine.
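To make explicit what normalizing with dim=0 does to these columns, here is a pure-Python sketch of the formula documented for torch.nn.functional.normalize (v / max(||v||_2, eps)), applied per column across all points. The helper normalize_columns and the two toy 7-channel points are assumptions for illustration; the real points_features layout in SFD may differ.

```python
import math

def normalize_columns(rows, cols, eps=1e-12):
    """L2-normalize the selected columns across all rows (dim=0 semantics)."""
    out = [list(r) for r in rows]
    for c in cols:
        # Norm of the whole column, taken over every point in the tensor.
        norm = math.sqrt(sum(r[c] ** 2 for r in rows))
        denom = max(norm, eps)  # eps guards against an all-zero column
        for r in out:
            r[c] = r[c] / denom
    return out

# Toy points: columns 0-2 and 6 are normalized, columns 3-5 (colour
# already scaled to [0, 1]) are left untouched.
points = [
    [3.0, 0.0, 10.0, 0.5, 0.5, 0.5, 6.0],
    [4.0, 0.0, 0.0, 0.2, 0.2, 0.2, 8.0],
]
normed = normalize_columns(points, cols=[0, 1, 2, 6])
print(normed[0][0], normed[1][0])
```

Note that with dim=0 the scale of each column depends on which points happen to be in the tensor, so the same point can receive different values in different batches. That may be worth keeping in mind when comparing accuracy against the original code.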

LittlePey commented 2 years ago

Hi, @Orbis36, you do not seem to have reproduced our results. The moderate 3D mAP you report in your blog is about 85%, while our result is about 88%. We also call spconv in sfd_head.py at line 7, line 75 and line 399; did you modify those accordingly?

760440356 commented 2 years ago

I still have the same problem after adding the code; it happens at around epoch 5.

Gmonster-24 commented 1 year ago

Hi @hailanyi, I also have this problem. Could you send me the code you modified and the debug information you sent to Orbis36, or some suggestions? My email is getongao24@163.com. Thanks a lot.

Karkers commented 1 year ago

@hailanyi Could you send me the code you modified and the debug information you sent to Orbis36, or some suggestions? Email: karker991001@163.com

Camellia-hz commented 1 year ago

Hi @hailanyi, I also have this problem. Could you send me the code you modified and the debug information you sent to Orbis36, or some suggestions? My email is z_baishancha@126.com. Thanks a lot!

meggs98 commented 12 months ago

Hi @hailanyi, I also have this problem. Could you send me the code you modified and the debug information you sent to Orbis36, or some suggestions? My email is megprab03@gmail.com. Thanks a lot!

HuangLLL123 commented 4 months ago

Hi @hailanyi, I also have this problem. Could you send me the code you modified and the debug information you sent to Orbis36, or some suggestions? My email is 1345097947@qq.com. Thanks a lot!