dvlab-research / PanopticFCN

Fully Convolutional Networks for Panoptic Segmentation (CVPR2021 Oral)
Apache License 2.0

Loss became infinite or NaN when the cfg.MODEL.POSITION_HEAD.DEFORM is set to be True #33

Closed ShengkaiWu closed 2 years ago

ShengkaiWu commented 3 years ago

Hi, the loss becomes infinite or NaN when cfg.MODEL.POSITION_HEAD.DEFORM is set to True, and the problem disappears when it is set to False. I have changed the batch size to 8 and adjusted the learning rate to BASE_LR: 0.005 according to the linear scaling rule, because only 4 GPUs are available to me.

Command:
python3 projects/PanopticFCN/train.py --config-file projects/PanopticFCN/configs/PanopticFCN-R50-1x.yaml --num-gpus 4

Error:
FloatingPointError: Loss became infinite or NaN at iteration=8! loss_dict = {'loss_pos_th': 109.46656090872628, 'loss_pos_st': 2.124839391027178, 'loss_seg_th': 2.777808734348842, 'loss_seg_st': nan}

Have you ever met this phenomenon?
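
For reference, a minimal sketch of the linear scaling arithmetic behind the values above, assuming the released PanopticFCN-R50-1x config uses 16 images per batch with BASE_LR 0.01 (an assumption, please check the yaml):

```python
# Linear scaling rule: the learning rate scales proportionally with batch size.
# The reference values below are assumptions about the released config.
reference_batch = 16      # assumed IMS_PER_BATCH in PanopticFCN-R50-1x.yaml
reference_lr = 0.01       # assumed BASE_LR for that batch size
new_batch = 8             # 4 GPUs x 2 images/GPU, as in the report above
new_lr = reference_lr * new_batch / reference_batch
print(new_lr)             # 0.005, matching the BASE_LR used in the report
```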

yanwei-li commented 3 years ago

Hi, I tried your config and found that everything runs well; the result is attached below, though there is a performance drop with the smaller batch size. Here is the metrics file. Have you changed any other configs, like the initial bias?

|        | PQ     | SQ     | RQ     | #categories |
|--------|--------|--------|--------|-------------|
| All    | 38.128 | 78.211 | 46.596 | 133         |
| Things | 43.386 | 80.340 | 52.538 | 80          |
| Stuff  | 30.190 | 74.997 | 37.627 | 53          |

ShengkaiWu commented 3 years ago

All the other settings are the defaults. PyTorch 1.7 and CUDA 11.2 are used. The phenomenon is really strange.

yanwei-li commented 3 years ago

It seems strange to me, and I have never met such an error. It seems something goes wrong when handling stuff. Maybe you can ignore the initialized bias of stuff by changing this line to `for layer in [self.out_inst]:`.
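
For readers unfamiliar with that line, here is a rough, self-contained sketch of what the suggested change amounts to; the layer names and channel counts are stand-ins, not copied from the repository:

```python
import math
import torch.nn as nn

# Hypothetical stand-ins for the position head's two 1x1 output convolutions.
out_inst = nn.Conv2d(256, 80, kernel_size=1)  # thing (instance) centers
out_sem = nn.Conv2d(256, 53, kernel_size=1)   # stuff regions

# Focal-loss-style prior bias commonly used for dense classification heads.
prior_prob = 0.01
bias_value = -math.log((1 - prior_prob) / prior_prob)

# Suggested change: apply the initialized bias only to the thing branch,
# leaving the stuff branch with PyTorch's default bias initialization.
for layer in [out_inst]:  # originally something like [out_inst, out_sem]
    nn.init.constant_(layer.bias, bias_value)
```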

DejaYang commented 3 years ago

Hi, I met a similar issue to @ShengkaiWu's, like this:

```
Traceback (most recent call last):
  File "/mnt/lj/detectron2/projects/PanopticFCN/train.py", line 142, in <module>
    launch(
  File "/mnt/lj/detectron2/detectron2/engine/launch.py", line 82, in launch
    main_func(*args)
  File "/mnt/lj/detectron2/projects/PanopticFCN/train.py", line 136, in main
    return trainer.train()
  File "/mnt/lj/detectron2/detectron2/engine/defaults.py", line 483, in train
    super().train(self.start_iter, self.max_iter)
  File "/mnt/lj/detectron2/detectron2/engine/train_loop.py", line 149, in train
    self.run_step()
  File "/mnt/lj/detectron2/detectron2/engine/defaults.py", line 493, in run_step
    self._trainer.run_step()
  File "/mnt/lj/detectron2/detectron2/engine/train_loop.py", line 287, in run_step
    self._write_metrics(loss_dict, data_time)
  File "/mnt/lj/detectron2/detectron2/engine/train_loop.py", line 302, in _write_metrics
    SimpleTrainer.write_metrics(loss_dict, data_time, prefix)
  File "/mnt/lj/detectron2/detectron2/engine/train_loop.py", line 338, in write_metrics
    raise FloatingPointError(
FloatingPointError: Loss became infinite or NaN at iteration=2642!
loss_dict = {'loss_pos_th': nan, 'loss_pos_st': nan, 'loss_seg_th': nan, 'loss_seg_st': nan}
```

training with the COCO dataset

yanwei-li commented 3 years ago

Hi, maybe you can share the training loss metrics file, which may help locate the error.
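
Not part of the original reply, but as a sketch of how the requested metrics file could be inspected: detectron2 writes a metrics.json file (one JSON object per line) to the output directory, so a short script can report the first iteration at which any loss stops being finite. The path below is an assumption:

```python
import json
import math

def first_non_finite_loss(path="output/metrics.json"):
    """Return (iteration, losses) for the first record containing a non-finite loss."""
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            losses = {k: v for k, v in record.items() if k.startswith("loss")}
            if any(not math.isfinite(v) for v in losses.values()):
                return record.get("iteration"), losses
    return None, {}

iteration, losses = first_non_finite_loss()
print(iteration, losses)
```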