V2AI / Det3D

World's first general-purpose 3D object detection codebase.
https://arxiv.org/abs/1908.09492
Apache License 2.0

Trying to train CBGS, all INFO values are nan #6

Closed muzi2045 closed 4 years ago

muzi2045 commented 4 years ago

After modifying some configs and compiling the nms_gpu module successfully, I am trying to train the CBGS network on my local computer with the NuScenes dataset. I am not using train.sh, but running directly:

python3 train.py /home/muzi2045/Documents/Det3D/examples/cbgs/configs/nusc_all_vfev3_spmiddleresnetfhd_rpn2_mghead_syncbn.py --gpus=1

It runs, but the loss values in the log file are all nan:

2019-12-21 14:56:28,351 - INFO - Start running, host: muzi2045@muzi2045-MS-7B48, work_dir: /home/muzi2045/Documents/Det3D/trained_model
2019-12-21 14:56:28,351 - INFO - workflow: [('train', 1), ('val', 1)], max: 20 epochs
2019-12-21 14:56:57,005 - INFO - Epoch [1/20][50/64050] lr: 0.00010, eta: 8 days, 11:53:41, time: 0.573, data_time: 0.178, transfer_time: 0.012, forward_time: 0.112, loss_parse_time: 0.000 memory: 1689, 
2019-12-21 14:56:57,005 - INFO - task : ['car'], loss: nan, cls_pos_loss: nan, cls_neg_loss: nan, dir_loss_reduced: nan, cls_loss_reduced: nan, loc_loss_reduced: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_pos: 26.4600, num_neg: 31687.8400
2019-12-21 14:56:57,005 - INFO - task : ['truck', 'construction_vehicle'], loss: nan, cls_pos_loss: nan, cls_neg_loss: nan, dir_loss_reduced: nan, cls_loss_reduced: nan, loc_loss_reduced: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_pos: 36.3400, num_neg: 63408.7600
2019-12-21 14:56:57,005 - INFO - task : ['bus', 'trailer'], loss: nan, cls_pos_loss: nan, cls_neg_loss: nan, dir_loss_reduced: nan, cls_loss_reduced: nan, loc_loss_reduced: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_pos: 54.0200, num_neg: 63379.1400
2019-12-21 14:56:57,005 - INFO - task : ['barrier'], loss: nan, cls_pos_loss: nan, cls_neg_loss: nan, dir_loss_reduced: nan, cls_loss_reduced: nan, loc_loss_reduced: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_pos: 7.6200, num_neg: 31742.6000
2019-12-21 14:56:57,005 - INFO - task : ['motorcycle', 'bicycle'], loss: nan, cls_pos_loss: nan, cls_neg_loss: nan, dir_loss_reduced: nan, cls_loss_reduced: nan, loc_loss_reduced: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_pos: 11.4400, num_neg: 63487.4600
2019-12-21 14:56:57,005 - INFO - task : ['pedestrian', 'traffic_cone'], loss: nan, cls_pos_loss: nan, cls_neg_loss: nan, dir_loss_reduced: nan, cls_loss_reduced: nan, loc_loss_reduced: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_pos: 13.4600, num_neg: 63489.3400

Also, how can I turn off the log output that prints the gt_database file paths?

Any advice would be appreciated! @poodarchu
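
(Editor's note on the gt_database path output: assuming those lines go through Python's standard logging module rather than plain print() calls, something like the sketch below would silence them. The logger name used here is a hypothetical placeholder.)

import logging

# Assumption: the gt_database paths are emitted by a named logger in the dataset
# sampling code; "det3d.datasets" is a hypothetical name, check the actual log
# records for the real one. If they are plain print() calls, this has no effect.
logging.getLogger("det3d.datasets").setLevel(logging.WARNING)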

poodarchu commented 4 years ago

I haven't encountered this problem.

arke812 commented 4 years ago

@muzi2045 Hi, I've also encountered this issue. It seems spconv.SparseSequential doesn't work correctly. Could you let me know how to fix this?

muzi2045 commented 4 years ago

Yes, I figured out that spconv is the problem. Maybe you can try an older version of spconv, such as spconv 1.0.
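
(Editor's note: as a quick way to test that suspicion, here is a minimal sketch, assuming an spconv 1.x-style API and a CUDA build, that pushes random sparse input through a small spconv.SparseSequential block and checks the forward output for nan. The layer sizes and grid shape are arbitrary, not the ones used by CBGS.)

import torch
import spconv  # spconv 1.x-style API assumed; in spconv 2.x these classes live under spconv.pytorch

# A tiny sparse block, loosely in the spirit of the middle-encoder layers.
net = spconv.SparseSequential(
    spconv.SubMConv3d(4, 16, 3, indice_key="subm0"),
    torch.nn.BatchNorm1d(16),
    torch.nn.ReLU(),
).cuda()

# Random, unique voxel coordinates inside a 40x40x40 grid (batch index 0).
coords = torch.nonzero(torch.rand(40, 40, 40) > 0.95).int()
indices = torch.cat([torch.zeros(len(coords), 1, dtype=torch.int32), coords], dim=1).cuda()
features = torch.randn(len(coords), 4).cuda()

x = spconv.SparseConvTensor(features, indices, spatial_shape=[40, 40, 40], batch_size=1)
out = net(x)
print("nan in forward output:", torch.isnan(out.features).any().item())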

MeyLavie commented 4 years ago

Hi @muzi2045, I'm encountering the same problem. Did using spconv 1.0 help?

Thank you

muzi2045 commented 4 years ago

I tried to train PointPillars with this repo and found that the problem is not spconv; it is a problem with the loss backward pass. If you print the layer weights, you will find nan values appearing in the create_loss function.
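
(Editor's note: a minimal sketch of that check in plain PyTorch, not Det3D-specific; model, loss, and step are placeholders for whatever your training loop uses.)

import torch

def check_nan(model, loss, step):
    # Flag a non-finite loss as soon as it appears.
    if not torch.isfinite(loss):
        print(f"step {step}: non-finite loss {loss.item()}")
    # Scan every layer weight and gradient for nan values.
    for name, param in model.named_parameters():
        if torch.isnan(param).any():
            print(f"step {step}: nan in weight {name}")
        if param.grad is not None and torch.isnan(param.grad).any():
            print(f"step {step}: nan in gradient of {name}")

# Typical use inside the training loop, right after loss.backward():
#   loss.backward()
#   check_nan(model, loss, step)
#   optimizer.step()
# torch.autograd.set_detect_anomaly(True) can also help locate the exact op that
# first produces a nan during the backward pass, at the cost of training speed.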

MeyLavie commented 4 years ago

@muzi2045 Thank you for your response. I see that you first tried to run CBGS; do you think it's the same problem? Did you manage to fix it?

muzi2045 commented 4 years ago

Yes, I think it's the same problem.