Open · coldgemini opened this issue 6 years ago
In the config file I changed NUM_GPUS to 1, and I did not pass a NUM_GPUS override on the command line:
NUM_GPUS: 1
Did you try lowering the learning rate? In my case, lowering BASE_LR from 0.01 to 0.001 solved the problem. I think a learning rate that is too large makes the loss diverge.
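For reference, a minimal sketch of what that change looks like in the YAML config (only BASE_LR is lowered; the surrounding SOLVER values are the usual 1x-schedule settings, matching the config posted later in this thread):

SOLVER:
  WEIGHT_DECAY: 0.0001
  LR_POLICY: steps_with_decay
  BASE_LR: 0.001   # lowered from 0.01, as suggested above
  GAMMA: 0.1
  MAX_ITER: 180000
  STEPS: [0, 120000, 160000]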
@nonstop1962 Hi, should I also change the other SOLVER settings after lowering the learning rate?
It also happened to me 👍 with NUM_GPUS: 1, BASE_LR: 0.0025, GAMMA: 0.1:
CRITICAL train.py: 84: Loss is NaN, exiting...
I hit the same problem.
I hit the same problem when training RetinaNet. I used 1 GPU and scaled the learning rate down to 1/8 of the default, but the problem still occurs.
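A note on that 1/8 scaling: the Detectron reference configs are written for 8 GPUs, and the usual linear scaling rule is to divide BASE_LR by the same factor you reduce the GPU count by while multiplying MAX_ITER and STEPS by that factor, so the schedule does not also get cut short. A rough, untested sketch for 1 GPU, assuming the standard 1x-schedule values shown elsewhere in this thread:

NUM_GPUS: 1
SOLVER:
  BASE_LR: 0.00125               # 0.01 / 8
  MAX_ITER: 1440000              # 180000 * 8
  STEPS: [0, 960000, 1280000]    # each step boundary * 8

If the loss still diverges at 1/8 of the base rate, lowering BASE_LR further (as in the 0.001 suggestion above) is the usual next step.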
Without modifying much of the config file, the loss suddenly goes to NaN during training; I have no clue what's going on. My config:
MODEL:
  TYPE: retinanet
  CONV_BODY: FPN.add_fpn_ResNet50_conv5_body
  NUM_CLASSES: 81
NUM_GPUS: 8
SOLVER:
  WEIGHT_DECAY: 0.0001
  LR_POLICY: steps_with_decay
  BASE_LR: 0.01
  GAMMA: 0.1
  MAX_ITER: 180000
  STEPS: [0, 120000, 160000]
FPN:
  FPN_ON: True
  MULTILEVEL_RPN: True
  RPN_MAX_LEVEL: 7
  RPN_MIN_LEVEL: 3
  COARSEST_STRIDE: 128
  EXTRA_CONV_LEVELS: True
RETINANET:
  RETINANET_ON: True
  NUM_CONVS: 4
  ASPECT_RATIOS: (1.0, 2.0, 0.5)
  SCALES_PER_OCTAVE: 3
  ANCHOR_SCALE: 4
  LOSS_GAMMA: 2.0
  LOSS_ALPHA: 0.25
TRAIN:
  WEIGHTS: /home/xiang/Tmp/root-tmp/detectron-download-cache/ImageNetPretrained/MSRA/R-50.pkl
  DATASETS: ('coco_2014_train', 'coco_2014_valminusminival')
  SCALES: (800,)
  MAX_SIZE: 1333
  RPN_STRADDLE_THRESH: -1  # default 0
TEST:
  DATASETS: ('coco_2014_minival',)
  SCALES: (800,)
  MAX_SIZE: 1333
  NMS: 0.5
  RPN_PRE_NMS_TOP_N: 10000  # Per FPN level
  RPN_POST_NMS_TOP_N: 2000
OUTPUT_DIR: .
517128, "lr": 0.009467, "mb_qsize": 64, "mem": 6809, "retnet_bg_num": 59379198.000000, "retnet_fg_num": 227.000000, "retnet_loss_bbox_fpn3": 0.042339, "retnet_loss_bbox_fpn4": 0.096430, "retnet_loss_bbox_fpn5": 0.089828, "retnet_loss_bbox_fpn6": 0.097975, "retnet_loss_bbox_fpn7": 0.044798, "time": 0.381559}
json_stats: {"eta": "19:02:46", "fl_fpn3": 0.170789, "fl_fpn4": 0.135453, "fl_fpn5": 0.130441, "fl_fpn6": 0.126242, "fl_fpn7": 0.073969, "iter": 480, "loss": 1.397539, "lr": 0.009733, "mb_qsize": 64, "mem": 6809, "retnet_bg_num": 59383742.000000, "retnet_fg_num": 223.500000, "retnet_loss_bbox_fpn3": 0.096644, "retnet_loss_bbox_fpn4": 0.075062, "retnet_loss_bbox_fpn5": 0.070927, "retnet_loss_bbox_fpn6": 0.065684, "retnet_loss_bbox_fpn7": 0.046369, "time": 0.381942}
json_stats: {"eta": "19:03:40", "fl_fpn3": 0.204816, "fl_fpn4": 0.181845, "fl_fpn5": 0.283243, "fl_fpn6": 0.217639, "fl_fpn7": 0.069032, "iter": 500, "loss": 1.493364, "lr": 0.010000, "mb_qsize": 64, "mem": 6809, "retnet_bg_num": 59377872.000000, "retnet_fg_num": 285.500000, "retnet_loss_bbox_fpn3": 0.080886, "retnet_loss_bbox_fpn4": 0.068787, "retnet_loss_bbox_fpn5": 0.124007, "retnet_loss_bbox_fpn6": 0.089983, "retnet_loss_bbox_fpn7": 0.024744, "time": 0.382287}
json_stats: {"eta": "19:04:18", "fl_fpn3": 0.146942, "fl_fpn4": 0.273595, "fl_fpn5": 0.228620, "fl_fpn6": 0.129255, "fl_fpn7": 0.079194, "iter": 520, "loss": 1.509807, "lr": 0.010000, "mb_qsize": 64, "mem": 6809, "retnet_bg_num": 59374364.000000, "retnet_fg_num": 274.500000, "retnet_loss_bbox_fpn3": 0.058743, "retnet_loss_bbox_fpn4": 0.116239, "retnet_loss_bbox_fpn5": 0.088597, "retnet_loss_bbox_fpn6": 0.054436, "retnet_loss_bbox_fpn7": 0.028831, "time": 0.382542}
json_stats: {"eta": "19:04:46", "fl_fpn3": 0.167519, "fl_fpn4": 0.214319, "fl_fpn5": 0.279508, "fl_fpn6": 0.114835, "fl_fpn7": 0.139848, "iter": 540, "loss": 1.424404, "lr": 0.010000, "mb_qsize": 64, "mem": 6809, "retnet_bg_num": 59371518.000000, "retnet_fg_num": 302.500000, "retnet_loss_bbox_fpn3": 0.063819, "retnet_loss_bbox_fpn4": 0.104069, "retnet_loss_bbox_fpn5": 0.110527, "retnet_loss_bbox_fpn6": 0.060638, "retnet_loss_bbox_fpn7": 0.048626, "time": 0.382741}
json_stats: {"eta": "19:05:03", "fl_fpn3": 0.080854, "fl_fpn4": 0.258065, "fl_fpn5": 0.147684, "fl_fpn6": 0.319110, "fl_fpn7": 0.061460, "iter": 560, "loss": 1.426891, "lr": 0.010000, "mb_qsize": 64, "mem": 6809, "retnet_bg_num": 59386144.000000, "retnet_fg_num": 200.000000, "retnet_loss_bbox_fpn3": 0.029722, "retnet_loss_bbox_fpn4": 0.093204, "retnet_loss_bbox_fpn5": 0.067769, "retnet_loss_bbox_fpn6": 0.143126, "retnet_loss_bbox_fpn7": 0.034299, "time": 0.382879}
json_stats: {"eta": "19:04:12", "fl_fpn3": 0.180389, "fl_fpn4": 0.257520, "fl_fpn5": 0.182866, "fl_fpn6": 0.140181, "fl_fpn7": 0.164458, "iter": 580, "loss": 1.476250, "lr": 0.010000, "mb_qsize": 64, "mem": 6809, "retnet_bg_num": 59385000.000000, "retnet_fg_num": 193.000000, "retnet_loss_bbox_fpn3": 0.032252, "retnet_loss_bbox_fpn4": 0.085684, "retnet_loss_bbox_fpn5": 0.083485, "retnet_loss_bbox_fpn6": 0.075452, "retnet_loss_bbox_fpn7": 0.086457, "time": 0.382634}
CRITICAL train_net.py: 236: Loss is NaN, exiting...
INFO loader.py: 126: Stopping enqueue thread
INFO loader.py: 113: Stopping mini-batch loading thread
INFO loader.py: 113: Stopping mini-batch loading thread
INFO loader.py: 113: Stopping mini-batch loading thread
INFO loader.py: 113: Stopping mini-batch loading thread
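One thing the log above shows: the lr field climbs from 0.009467 to the full 0.010000 right around iteration 500 (the end of the default linear warm-up), and the NaN exit follows shortly after, which fits the too-high-learning-rate explanation given earlier. If lowering BASE_LR alone does not help, a hedged sketch of solver tweaks to try (illustrative values, not a tested config; the warm-up keys are standard Detectron SOLVER options):

SOLVER:
  BASE_LR: 0.005       # half the default; 0.001 was also reported to work above
  WARM_UP_FACTOR: 0.1  # start warm-up at a smaller fraction of BASE_LR (default 1/3)
  WARM_UP_ITERS: 1000  # ramp up over more iterations (default 500)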