Hi, could you provide your environment details, such as GPU type? It seems that the environment hailanyi used in #2 works well. Or you can simply train a Voxel-R-CNN in our codes to check whether the issue is caused by SFD, because we do not modify the model architecture before execution reaches Line 577.
Another possibility is that the network produces NaN weights during backpropagation, so you need to check whether there are Inf or NaN values in each loss item when calculating the loss.
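For example, a small helper like this could be called right after the loss is computed in the training loop (a minimal sketch; `tb_dict` is assumed to be the per-item loss dictionary that the model returns alongside the total loss):

```python
import math
import torch

# Sketch of the check suggested above: scan the total loss and every
# scalar loss item before backpropagation and stop as soon as one of
# them is non-finite.
def assert_losses_finite(loss: torch.Tensor, tb_dict: dict):
    if not torch.isfinite(loss).all():
        raise RuntimeError(f'total loss is non-finite: {loss.item()}')
    for name, value in tb_dict.items():
        v = value.item() if torch.is_tensor(value) else float(value)
        if not math.isfinite(v):
            raise RuntimeError(f'loss item "{name}" is non-finite: {v}')
```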
My card is a 3090, with CUDA 11.3 and spconv 2.1. I have checked the codes, and they run successfully on Voxel-RCNN for the first 5 epochs. From your codes, I think you have taken this situation into consideration, which is why a NotImplementedError is raised. Could you tell me what I can do to provide more detailed info? I also think it wouldn't take you long to make the codes compatible with spconv 2.1; I'm not sure whether anything went wrong when I modified them, since you used the 'transform_points_to_voxels_valid' function instead of the one used in Voxel-RCNN.
The implementation of 'transform_points_to_voxels_valid' is the same as 'transform_points_to_voxels'. You can copy your modified code from 'transform_points_to_voxels' into 'transform_points_to_voxels_valid', or you can change the config to use 'transform_points_to_voxels'. Either way, remember to modify both 'transform_points_to_voxels' and 'transform_points_to_voxels_valid' to be compatible with spconv2. Hi @hailanyi, could you give some advice?
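For the spconv2 part, note that the spconv 1.x voxel generator API was replaced in spconv 2.x. A minimal sketch of the new path uses `spconv.pytorch.utils.PointToVoxel`; the parameter values below are illustrative KITTI-style numbers, not necessarily this repo's config:

```python
import torch
from spconv.pytorch.utils import PointToVoxel

# Illustrative settings; the real values come from the dataset config
# used by transform_points_to_voxels(_valid).
voxel_generator = PointToVoxel(
    vsize_xyz=[0.05, 0.05, 0.1],                # voxel size
    coors_range_xyz=[0, -40, -3, 70.4, 40, 1],  # point cloud range
    num_point_features=4,                       # x, y, z, intensity
    max_num_voxels=16000,
    max_num_points_per_voxel=5,
)

points = torch.rand(1000, 4)  # stand-in for a real LiDAR frame
# Returns per-voxel point features, zyx voxel coordinates, and the
# number of points that fell into each voxel.
voxels, coordinates, num_points = voxel_generator(points)
```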
In the past few days, I also had this problem. It may be caused by a relatively large learning rate, or by the point cloud not being normalized. In short: (1) appropriately reduce the learning rate, and (2) normalize the initial features in VoxelBackBone8x. This will make the training easier.
@hailanyi Thanks! I have modified the first line of the forward function in VoxelBackBone8x to
from torch.nn.functional import normalize

# normalize each voxel feature channel (dim=0) across all voxels in the batch
voxel_features, voxel_coords = normalize(batch_dict['voxel_features'], dim=0), batch_dict['voxel_coords']
And after that I trained the network with LR=0.01, but the problem is still there.
Since I used a batch size of 4, I think maybe I need a lower learning rate.
I then changed the LR to 0.005; it does make the training last longer, but after 10 epochs I hit this error as well.
So now I am using 0.001 as the LR and hope it brings a good result.
So could you please share the modification you used to solve this problem? I'd appreciate any help.

-----------------------update----------------------------

The training with LR 0.001 hit the error after 10 epochs as well. I think maybe you can just send me the code you use, if possible. My email is tianranliu20@gmail.com. Waiting for your message.
Ok, I have sent an email to you @Orbis36 , including a debug description and my codes.
Thanks! I've replied to your email and successfully got a satisfactory outcome, so I'm closing this issue.
Hi @hailanyi, I also have this problem. Could you send me the modified codes and debug information you sent to Orbis36, or some suggestions? My email is zhangguoxin199805@163.com. Thanks.
The core idea here is to normalize the input pseudo features. Specifically, you need to add the following code after the line points_features[:, 3:6] /= 255.0 in sfd_head_utils.py:
from torch.nn.functional import normalize
points_features[:, :3] = normalize(points_features[:, :3], dim=0)
points_features[:, 6:] = normalize(points_features[:, 6:], dim=0)
Then everything is fine.
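In context, the block in sfd_head_utils.py would then look roughly like this (a sketch; `points_features` below is a random stand-in tensor, and the column layout xyz / RGB / remaining channels is inferred from the slices above):

```python
import torch
from torch.nn.functional import normalize

# Stand-in [N, C] pseudo-point tensor: columns 0:3 = xyz,
# 3:6 = RGB, 6: = remaining feature channels.
points_features = torch.rand(100, 9)

points_features[:, 3:6] /= 255.0                                   # existing line: scale RGB
points_features[:, :3] = normalize(points_features[:, :3], dim=0)  # added: unit-norm xyz channels
points_features[:, 6:] = normalize(points_features[:, 6:], dim=0)  # added: unit-norm extra channels
```

Note that dim=0 normalizes each feature channel to unit L2 norm across all points, rather than normalizing each point's feature vector individually.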
I have the same problem after adding the code; it happens around epoch 5.
Hi @hailanyi, I also have this problem. Could you send me the modified codes and debug information you sent to Orbis36, or some suggestions? My email is getongao24@163.com. Thanks a lot.
@hailanyi Could you send me the modified codes and debug information you sent to Orbis36, or some suggestions? Email: karker991001@163.com
Hi @hailanyi, I also have this problem. Could you send me the modified codes and debug information you sent to Orbis36, or some suggestions? My email is z_baishancha@126.com. Thanks a lot!
Hi @hailanyi, I also have this problem. Could you send me the modified codes and debug information you sent to Orbis36, or some suggestions? My email is megprab03@gmail.com. Thanks a lot!
Hi @hailanyi, I also have this problem. Could you send me the modified codes and debug information you sent to Orbis36, or some suggestions? My email is 1345097947@qq.com. Thanks a lot!
Hello everyone, I found an error when training this network on another machine:
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
epochs:   0%| | 0/40 [15:47<?, ?it/s, loss=nan, lr=0.00105]
Traceback (most recent call last):
  File "train.py", line 203, in <module>
    main()
  File "train.py", line 175, in main
    merge_all_iters_to_one_epoch=args.merge_all_iters_to_one_epoch
  File "/home/tianran/workdir/SFD/tools/train_utils/train_utils.py", line 93, in train_model
    dataloader_iter=dataloader_iter
  File "/home/tianran/workdir/SFD/tools/train_utils/train_utils.py", line 38, in train_one_epoch
    loss, tb_dict, disp_dict = model_func(model, batch)
  File "/home/tianran/workdir/SFD/pcdet/models/__init__.py", line 30, in model_func
    ret_dict, tb_dict, disp_dict = model(batch_dict)
  File "/home/tianran/anaconda3/envs/SFDNet/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/tianran/workdir/SFD/pcdet/models/detectors/sfd.py", line 11, in forward
    batch_dict = cur_module(batch_dict)
  File "/home/tianran/anaconda3/envs/SFDNet/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/tianran/workdir/SFD/pcdet/models/roi_heads/sfd_head.py", line 579, in forward
    targets_dict = self.assign_targets(batch_dict)
  File "/home/tianran/workdir/SFD/pcdet/models/roi_heads/roi_head_template.py", line 117, in assign_targets
    targets_dict = self.proposal_target_layer.forward(batch_dict)
  File "/home/tianran/workdir/SFD/pcdet/models/roi_heads/target_assigner/proposal_target_layer.py", line 33, in forward
    batch_dict=batch_dict
  File "/home/tianran/workdir/SFD/pcdet/models/roi_heads/target_assigner/proposal_target_layer.py", line 109, in sample_rois_for_rcnn
    sampled_inds = self.subsample_rois(max_overlaps=max_overlaps, frame_id=batch_dict['frame_id'])
  File "/home/tianran/workdir/SFD/pcdet/models/roi_heads/target_assigner/proposal_target_layer.py", line 162, in subsample_rois
    raise NotImplementedError
NotImplementedError
maxoverlaps: (min=nan, max=nan)
ERROR: FG=0, BG=0
It seems that the error is caused by NaN values in the max_overlaps tensor. How can I fix this bug? It doesn't happen at a fixed iteration or epoch (i.e., it is not tied to a fixed sample). I have checked the files used in the iteration by printing the frame_id in batch_dict, and everything seems OK.
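To localize it, I am thinking of calling a guard like the one below right before the self.subsample_rois(...) call in sample_rois_for_rcnn (a sketch; the names max_overlaps and frame_id are taken from the traceback above):

```python
import torch

# Sketch of a guard for sample_rois_for_rcnn (proposal_target_layer.py):
# fail fast with the offending roi indices and frame ids as soon as
# max_overlaps goes non-finite, instead of hitting the
# NotImplementedError later in subsample_rois.
def check_max_overlaps(max_overlaps: torch.Tensor, frame_id):
    finite = torch.isfinite(max_overlaps)
    if not finite.all():
        bad = torch.nonzero(~finite, as_tuple=False).flatten()
        raise RuntimeError(
            f'NaN/Inf in max_overlaps at roi indices {bad.tolist()}, '
            f'frame_id={frame_id}'
        )

# usage at the call site:
# check_max_overlaps(max_overlaps, batch_dict['frame_id'])
```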
Thanks for any help.