dvlab-research / PanopticFCN

Fully Convolutional Networks for Panoptic Segmentation (CVPR2021 Oral)
Apache License 2.0

Loss became infinite or NaN when the cfg.MODEL.POSITION_HEAD.DEFORM is set to be True #33

Closed ShengkaiWu closed 2 years ago

ShengkaiWu commented 3 years ago

Hi, the loss becomes infinite or NaN when cfg.MODEL.POSITION_HEAD.DEFORM is set to True, and the problem disappears when it is set to False. I have changed the batch size to 8 and adjusted the learning rate to BASE_LR: 0.005 according to the linear scaling rule, because only 4 GPUs are available to me.

Command:
python3 projects/PanopticFCN/train.py --config-file projects/PanopticFCN/configs/PanopticFCN-R50-1x.yaml --num-gpus 4

Error:
FloatingPointError: Loss became infinite or NaN at iteration=8! loss_dict = {'loss_pos_th': 109.46656090872628, 'loss_pos_st': 2.124839391027178, 'loss_seg_th': 2.777808734348842, 'loss_seg_st': nan}

Have you ever met this phenomenon?
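
For reference, a minimal sketch of the linear scaling arithmetic behind the values above, assuming the released PanopticFCN-R50-1x config uses 16 images per batch with BASE_LR 0.01 (an assumption, please check the yaml):

```python
# Linear scaling rule: the learning rate scales proportionally with batch size.
# The reference values below are assumptions about the released config.
reference_batch = 16      # assumed IMS_PER_BATCH in PanopticFCN-R50-1x.yaml
reference_lr = 0.01       # assumed BASE_LR for that batch size
new_batch = 8             # 4 GPUs x 2 images/GPU, as in the report above
new_lr = reference_lr * new_batch / reference_batch
print(new_lr)             # 0.005, matching the BASE_LR used in the report
```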

yanwei-li commented 3 years ago

Hi, I tried your config and found that everything runs well; the result is attached below, though there is a performance drop with the smaller batch size. Here is the metrics file. Have you changed any other configs, like the initial bias?

|        | PQ     | SQ     | RQ     | #categories |
|--------|--------|--------|--------|-------------|
| All    | 38.128 | 78.211 | 46.596 | 133         |
| Things | 43.386 | 80.340 | 52.538 | 80          |
| Stuff  | 30.190 | 74.997 | 37.627 | 53          |

ShengkaiWu commented 3 years ago

All the other settings are the defaults. PyTorch 1.7 and CUDA 11.2 are used. The phenomenon is really strange.

yanwei-li commented 3 years ago

It seems strange to me, and I have never met such an error. It seems something goes wrong when handling stuff. Maybe you can ignore the initialized bias of stuff by changing this line to `for layer in [self.out_inst]:`.
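
For readers unfamiliar with that line, here is a rough, self-contained sketch of what the suggested change amounts to; the layer names and channel counts are stand-ins, not copied from the repository:

```python
import math
import torch.nn as nn

# Hypothetical stand-ins for the position head's two 1x1 output convolutions.
out_inst = nn.Conv2d(256, 80, kernel_size=1)  # thing (instance) centers
out_sem = nn.Conv2d(256, 53, kernel_size=1)   # stuff regions

# Focal-loss-style prior bias commonly used for dense classification heads.
prior_prob = 0.01
bias_value = -math.log((1 - prior_prob) / prior_prob)

# Suggested change: apply the initialized bias only to the thing branch,
# leaving the stuff branch with PyTorch's default bias initialization.
for layer in [out_inst]:  # originally something like [out_inst, out_sem]
    nn.init.constant_(layer.bias, bias_value)
```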

DejaYang commented 3 years ago

Hi, I met a similar issue to @ShengkaiWu's, like this:

```
Traceback (most recent call last):
  File "/mnt/lj/detectron2/projects/PanopticFCN/train.py", line 142, in <module>
    launch(
  File "/mnt/lj/detectron2/detectron2/engine/launch.py", line 82, in launch
    main_func(*args)
  File "/mnt/lj/detectron2/projects/PanopticFCN/train.py", line 136, in main
    return trainer.train()
  File "/mnt/lj/detectron2/detectron2/engine/defaults.py", line 483, in train
    super().train(self.start_iter, self.max_iter)
  File "/mnt/lj/detectron2/detectron2/engine/train_loop.py", line 149, in train
    self.run_step()
  File "/mnt/lj/detectron2/detectron2/engine/defaults.py", line 493, in run_step
    self._trainer.run_step()
  File "/mnt/lj/detectron2/detectron2/engine/train_loop.py", line 287, in run_step
    self._write_metrics(loss_dict, data_time)
  File "/mnt/lj/detectron2/detectron2/engine/train_loop.py", line 302, in _write_metrics
    SimpleTrainer.write_metrics(loss_dict, data_time, prefix)
  File "/mnt/lj/detectron2/detectron2/engine/train_loop.py", line 338, in write_metrics
    raise FloatingPointError(
FloatingPointError: Loss became infinite or NaN at iteration=2642!
loss_dict = {'loss_pos_th': nan, 'loss_pos_st': nan, 'loss_seg_th': nan, 'loss_seg_st': nan}
```

training with the COCO dataset

yanwei-li commented 3 years ago

Hi, maybe you can share the training loss metrics file, which may help locate the error.
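
Not part of the original reply, but as a sketch of how the requested metrics file could be inspected: detectron2 writes a metrics.json file (one JSON object per line) to the output directory, so a short script can report the first iteration at which any loss stops being finite. The path below is an assumption:

```python
import json
import math

def first_non_finite_loss(path="output/metrics.json"):
    """Return (iteration, losses) for the first record containing a non-finite loss."""
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            losses = {k: v for k, v in record.items() if k.startswith("loss")}
            if any(not math.isfinite(v) for v in losses.values()):
                return record.get("iteration"), losses
    return None, {}

iteration, losses = first_non_finite_loss()
print(iteration, losses)
```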