[Closed] ShengkaiWu closed this issue 2 years ago
Hi, I tried your config and found that everything runs well; the result is attached below, which shows a performance drop with the small batch size. Here is the metrics file. Have you changed any other configs, such as the initial bias?
| | PQ | SQ | RQ | #categories |
|---|---|---|---|---|
| All | 38.128 | 78.211 | 46.596 | 133 |
| Things | 43.386 | 80.340 | 52.538 | 80 |
| Stuff | 30.190 | 74.997 | 37.627 | 53 |
All other settings are the defaults. PyTorch 1.7 and CUDA 11.2 are used. The phenomenon is really strange.
It seems strange to me, and I have never encountered such an error. It looks like something goes wrong when handling stuff. Maybe you can skip the bias initialization for stuff by changing this line to `for layer in [self.out_inst]:`.
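The suggested change can be sketched as follows; the layer names (`out_inst`, `out_sem`) and the focal-loss-style prior probability are assumptions for illustration, not PanopticFCN's exact code:

```python
import math

# Focal-loss-style bias initialization: start predictions at a low prior
# probability so the classification loss is stable early in training.
prior_prob = 0.01
bias_value = -math.log((1 - prior_prob) / prior_prob)  # about -4.6

class Head:
    """Stand-in for a conv prediction layer with a learnable bias."""
    def __init__(self):
        self.bias = 0.0  # framework-default initialization

out_inst = Head()  # thing (instance) prediction layer
out_sem = Head()   # stuff (semantic) prediction layer

# The original loop initializes both heads; the workaround skips the stuff
# head so its bias keeps the default value:
for layer in [out_inst]:  # instead of [out_inst, out_sem]
    layer.bias = bias_value
```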
Hi, I met a similar issue to @ShengkaiWu's, like this:
```
Traceback (most recent call last):
  File "/mnt/lj/detectron2/projects/PanopticFCN/train.py", line 142, in <module>
    launch(
  File "/mnt/lj/detectron2/detectron2/engine/launch.py", line 82, in launch
    main_func(*args)
  File "/mnt/lj/detectron2/projects/PanopticFCN/train.py", line 136, in main
    return trainer.train()
  File "/mnt/lj/detectron2/detectron2/engine/defaults.py", line 483, in train
    super().train(self.start_iter, self.max_iter)
  File "/mnt/lj/detectron2/detectron2/engine/train_loop.py", line 149, in train
    self.run_step()
  File "/mnt/lj/detectron2/detectron2/engine/defaults.py", line 493, in run_step
    self._trainer.run_step()
  File "/mnt/lj/detectron2/detectron2/engine/train_loop.py", line 287, in run_step
    self._write_metrics(loss_dict, data_time)
  File "/mnt/lj/detectron2/detectron2/engine/train_loop.py", line 302, in _write_metrics
    SimpleTrainer.write_metrics(loss_dict, data_time, prefix)
  File "/mnt/lj/detectron2/detectron2/engine/train_loop.py", line 338, in write_metrics
    raise FloatingPointError(
FloatingPointError: Loss became infinite or NaN at iteration=2642!
loss_dict = {'loss_pos_th': nan, 'loss_pos_st': nan, 'loss_seg_th': nan, 'loss_seg_st': nan}
```
training with the COCO dataset
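The trainer stops because a reported loss term is non-finite; a minimal sketch of that check (mirroring the idea behind detectron2's `write_metrics` guard, with the all-NaN values from the traceback above) shows how to list which terms diverged:

```python
import math

# Loss values as reported in the FloatingPointError above.
loss_dict = {
    "loss_pos_th": float("nan"),
    "loss_pos_st": float("nan"),
    "loss_seg_th": float("nan"),
    "loss_seg_st": float("nan"),
}

# Same idea as detectron2's finite-loss check: flag any inf/NaN term.
bad_losses = [k for k, v in loss_dict.items() if not math.isfinite(v)]
if bad_losses:
    print("Non-finite losses:", bad_losses)  # here: all four terms
```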
Hi, maybe you can share the training loss metrics file, which may help locate the error.
Hi, the loss becomes infinite or NaN when `cfg.MODEL.POSITION_HEAD.DEFORM` is set to True, and the phenomenon disappears when it is set to False. I have changed the batch size to 8 and adjusted the learning rate to `BASE_LR: 0.005` according to the linear scaling rule, because only 4 GPUs are available to me.

Command:

```
python3 projects/PanopticFCN/train.py --config-file projects/PanopticFCN/configs/PanopticFCN-R50-1x.yaml --num-gpus 4
```

Error:

```
FloatingPointError: Loss became infinite or NaN at iteration=8!
loss_dict = {'loss_pos_th': 109.46656090872628, 'loss_pos_st': 2.124839391027178, 'loss_seg_th': 2.777808734348842, 'loss_seg_st': nan}
```
Have you ever met this phenomenon?
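For reference, the linear scaling rule mentioned above can be sketched like this; the default total batch size of 16 and `BASE_LR: 0.01` for the 1x config are my assumptions, not values taken from the repo:

```python
# Linear scaling rule: scale the base learning rate in proportion to the
# total batch size (Goyal et al., "Accurate, Large Minibatch SGD").
default_batch, default_lr = 16, 0.01  # assumed 1x-schedule defaults
new_batch = 8                         # batch size with 4 GPUs
scaled_lr = default_lr * new_batch / default_batch
print(scaled_lr)  # 0.005, matching the BASE_LR used above
```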