WeijingShi / Point-GNN

Point-GNN: Graph Neural Network for 3D Object Detection in a Point Cloud, CVPR 2020.

Training loss #30

Closed wb-finalking closed 4 years ago

wb-finalking commented 4 years ago

Excuse me. I used "python3 train.py configs/car_auto_T3_train_train_config configs/car_auto_T3_train_config" to train on the KITTI dataset, but the loss is shown as follows:

cls: nan, loc: nan, reg: nan, loss: nan
Class_0: recall=0.957521, prec=0.969269, mAP=0.000000, loc=0.000000 x=0.0000 y=0.0000 z=0.0000 l=0.0000 h=0.0000 w=0.0000 y=0.0000
Class_1: recall=0.145631, prec=0.034169, mAP=0.000000, loc=1.847784 x=1.7500 y=2.4132 z=2.5065 l=0.2367 h=0.1836 w=0.1582 y=5.6863
Class_2: recall=0.003945, prec=0.019802, mAP=0.000000, loc=3.094378 x=0.8403 y=2.0149 z=12.8085 l=0.0637 h=0.0760 w=0.2339 y=5.6234
Class_3: recall=0.002632, prec=0.004808, mAP=0.000000, loc=0.000000 x=0.0000 y=0.0000 z=0.0000 l=0.0000 h=0.0000 w=0.0000 y=0.0000

I don't know why. If you have faced the same error, please point out how to fix it. Thank you very much!
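
For anyone hitting a similar NaN, here is a minimal sketch of how one could narrow down where it first appears. The helper name assert_finite and the loss tensor names are placeholders rather than Point-GNN's actual identifiers, and tf.debugging.check_numerics assumes the TensorFlow 1.x graph setup this repo uses:

```python
import numpy as np
import tensorflow as tf

def assert_finite(name, array):
    """Fail fast if a batch contains NaN/Inf so the offending sample can be inspected."""
    if not np.all(np.isfinite(array)):
        raise ValueError(f"{name} contains NaN or Inf values")

# Hypothetical usage on the input side, before feeding the graph:
# assert_finite("points", batch_points)
# assert_finite("box_labels", batch_box_labels)

# On the graph side, guard each loss term so the first non-finite one raises with its name:
# cls_loss = tf.debugging.check_numerics(cls_loss, "cls_loss is not finite")
# loc_loss = tf.debugging.check_numerics(loc_loss, "loc_loss is not finite")
# reg_loss = tf.debugging.check_numerics(reg_loss, "reg_loss is not finite")
```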

WeijingShi commented 4 years ago

Hi @wb-finalking, could you provide more info?

  1. Did you change the config or train_config, e.g. the number of GPUs or the batch size? (See the config sketch after this list.)
  2. On which epoch does the NaN occur? If it's the first epoch, can you try clearing the checkpoint and starting again?
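
As a quick way to double-check those settings, one could print the relevant fields from the train_config file. This is a minimal sketch that assumes the file is JSON (as train.py loads it) and that the key names NUM_GPU, batch_size, and initial_lr match the actual config; verify them against your file:

```python
import json

# Minimal sketch: print the fields that typically control multi-GPU training,
# batch size, and learning rate. The key names below are assumptions; match
# them against your actual train_config file.
with open("configs/car_auto_T3_train_train_config") as f:
    train_config = json.load(f)

for key in ("NUM_GPU", "batch_size", "initial_lr"):
    print(key, "=", train_config.get(key, "<not found>"))
```
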
wb-finalking commented 4 years ago

Hi @WeijingShi, I changed the batch size to 1 and the number of GPUs to 1. When I decrease the learning rate, the cls loss and loc loss become normal, but the reg loss is still NaN.

WeijingShi commented 4 years ago

@wb-finalking A batch size of 1 is pretty small. If you further reduce the learning rate, do you still get the NaN?
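
When shrinking the batch size this much, a common rule of thumb is to scale the learning rate down roughly in proportion. A tiny illustration; the default values below are placeholders, not Point-GNN's actual config values:

```python
# Linear-scaling heuristic: keep learning_rate / batch_size roughly constant.
# The defaults below are placeholders, not Point-GNN's actual config values.
default_batch_size = 4
default_initial_lr = 0.125e-3

my_batch_size = 1
suggested_lr = default_initial_lr * my_batch_size / default_batch_size
print(f"suggested initial_lr for batch size {my_batch_size}: {suggested_lr:g}")
```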

wb-finalking commented 4 years ago

I solved this problem by increasing the batch size to 2. Thanks for your patient answer.

WeijingShi commented 4 years ago

Good to know that it works out!