facebookresearch / adaptive_teacher

This repo provides the source code for "Cross-Domain Adaptive Teacher for Object Detection".

Predicted boxes or scores contain Inf/NaN. Training has diverged. #9

Open michaelku1 opened 2 years ago

michaelku1 commented 2 years ago

Just wondering if anyone has experienced the same issue while training clipart at batch size=16, lr=0.01 during mutual learning (as stated in the title)? When I tested with batch size=1 there seemed to be no problem. My initial thought was that the cause was a high learning rate (https://github.com/facebookresearch/detectron2/issues/1128), but all datasets were trained at lr=0.01 as in the paper. The error was raised by the following lines:

[screenshot: the lines that raise the error]
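
For context, the error in the title comes from a sanity check inside detectron2's find_top_rpn_proposals (the full traceback appears later in this thread). A minimal, paraphrased sketch of that check, not the library's exact code:

```python
import torch

def check_rpn_outputs_finite(boxes: torch.Tensor, scores: torch.Tensor, training: bool = True):
    """Paraphrased sketch of detectron2's check in find_top_rpn_proposals:
    any non-finite box coordinate or objectness score aborts training.
    boxes: (N, 4) proposal coordinates, scores: (N,) objectness logits."""
    valid = torch.isfinite(boxes).all(dim=1) & torch.isfinite(scores)
    if not valid.all():
        if training:
            raise FloatingPointError(
                "Predicted boxes or scores contain Inf/NaN. Training has diverged."
            )
        # at inference time the invalid proposals are simply dropped
        boxes, scores = boxes[valid], scores[valid]
    return boxes, scores
```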

yujheli commented 2 years ago

This happens to me when I set a large weight factor for the discriminator. Try 0.05 or less for the discriminator weight and see if that helps.

michaelku1 commented 2 years ago

Thanks for the quick reply, I will give it a try.

michaelku1 commented 2 years ago

I have tried lower discriminator loss weights but still hit the same error; I will look further into it.

yujheli commented 2 years ago

I see. Let me check whether this happens to me on a local machine. I did not have this problem when I used 0.05 as the discriminator weight, while it sometimes happens when the weight is 0.1 and the learning rate is 0.02.

michaelku1 commented 2 years ago

Just to double-check, these are my current config settings for clipart, which follow the paper's numbers:

[screenshot: clipart config parameters]

yujheli commented 2 years ago

@michaelku1 Where did you set the discriminator weight? I don't see this parameter in your config. Did you manually edit it in the code?

michaelku1 commented 2 years ago

> @michaelku1 Where did you set the discriminator weight? I don't see this parameter in your config. Did you manually edit it in the code?

I passed it as a command-line argument, like SEMISUPNET.DIS_LOSS_WEIGHT 0.05. Btw, this problem did not occur when I tried to debug with bs=1: I set a pdb breakpoint on a nan/inf condition at that line and it was never hit, so I could not reproduce the error. I will keep trying to increase the batch size on a single GPU to see if I can reproduce it.
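
For reference, the full command looks roughly like this (GPU count and paths mirror the train_net.py commands shown later in this thread and are only illustrative):

python train_net.py --num-gpus 8 --config configs/faster_rcnn_R101_cross_clipart.yaml OUTPUT_DIR output/exp_clipart SEMISUPNET.DIS_LOSS_WEIGHT 0.05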

yujheli commented 2 years ago

@michaelku1 I think that should work. When does the Inf/NaN happen for you, i.e. how many iterations had run before it appeared? For me it can happen after more than 50k iterations, while the best performance usually appears at 20k-30k iterations.

michaelku1 commented 2 years ago

The error occurred at iterations > 14000 for clipart and > 39000 for watercolor. I think my datasets are fine, since I was able to train them on yjliu's unbiased teacher (by including a weak-strong branch and a domain discriminator), but with suboptimal performance (obviously your code is much more optimized).

yujheli commented 2 years ago

The iteration counts make some sense, but I have to check whether the current code is consistent with the version I ran on the Meta cluster using d2go (https://github.com/facebookresearch/d2go). Please allow me a few days for this.

michaelku1 commented 2 years ago

> The iteration counts make some sense, but I have to check whether the current code is consistent with the version I ran on the Meta cluster using d2go (https://github.com/facebookresearch/d2go). Please allow me a few days for this.

Many Thanks, I will look into it also.

michaelku1 commented 2 years ago

Training goes well when I use 0.01 for the domain loss weight, which makes sense since the domain losses are high for both source and target during mutual learning. I will close this issue for now.

yujheli commented 2 years ago

@michaelku1 Did you get good performance using 0.01 for the domain loss weight? The best performance usually appears with weight > 0.05, even though that diverges more easily.

michaelku1 commented 2 years ago

Performance is poor so far (e.g. watercolor reaches around AP 29 at best, at iter=20k). Would it be possible to release the pre-trained weights?

yujheli commented 2 years ago

@michaelku1 You mean AP@0.5, right? Not AP@[0.5:0.95].

michaelku1 commented 2 years ago

I was reporting the leftmost of the three AP values in the log file, the one labeled "AP", not AP@0.5.

yujheli commented 2 years ago

@michaelku1 You should look at AP50, which is what we report for domain adaptation benchmarks. The "AP" in the log file is the COCO-style AP@[0.5:0.95].

michaelku1 commented 2 years ago

I see. In that case the numbers are indeed close to the paper's. I just realized that with the COCO evaluator, AP50 is the correct metric to look at for the Pascal VOC benchmarks. Thank you. Btw, for watercolor AP50 is 50.8 with dis loss weight = 0.01 at iter ≈ 29000.

yujheli commented 2 years ago

@michaelku1 I uploaded the internal prod code in prob_lib/, which is what I run on the internal server. You can check whether there is a big difference between them. Also, you are using batch size 16, right?

michaelku1 commented 2 years ago

> @michaelku1 I uploaded the internal prod code in prob_lib/, which is what I run on the internal server. You can check whether there is a big difference between them. Also, you are using batch size 16, right?

Thanks for the upload. Yes, I was using bs=16 but with a lower dis loss weight (0.01). For watercolor and clipart the AP50 curve peaks around iters=18k (AP50=41.35) and 20k (AP50=55.45), respectively. For cityscapes (AP50=38.41 at iters=60k) it looked like it could still improve with more iterations (>60k). Interestingly, when I trained the model on a different set of machines with gradient accumulation, the divergence problem did not occur. Now I will probably try 0.1 and see if performance improves.

yujheli commented 2 years ago

@michaelku1 I think dis loss weight >= 0.05 is necessary (at least I got my best results using either 0.05 or 0.1). Also, you are using bbox threshold = 0.8 instead of 0.7, right? 0.7 leads to bad performance, and the original config I uploaded used 0.7. I have updated the configs.

michaelku1 commented 2 years ago

> @michaelku1 I think dis loss weight >= 0.05 is necessary (at least I got my best results using either 0.05 or 0.1). Also, you are using bbox threshold = 0.8 instead of 0.7, right? 0.7 leads to bad performance, and the original config I uploaded used 0.7. I have updated the configs.

Yes, I am using 0.8. I followed the exact settings from the paper except for the dis loss weight.

michaelku1 commented 2 years ago

I am thinking about reopening this issue since I have not been able to train on the clipart and watercolor datasets without divergence, so I would like to keep this open for discussion. So far clipart and watercolor are at best 7 points and 4 points away from the numbers stated in the paper, and there may be room for better accuracy if the models did not diverge (they diverged at iters = 20000/25000). The current problem is that dis_loss_weight is what causes training to diverge. I am now trying a dis_loss_weight that increases over iterations instead, to see if that prevents the model from diverging at early iterations; a sketch of what I mean is below. (Currently training my models on 8x RTX A5000.)
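
Roughly, the ramp-up I have in mind looks like this (a sketch only; the function and its parameters are illustrative and not part of this repo, and the loss names follow the training log shown later in this thread):

```python
def dis_loss_weight_at(iteration: int,
                       base_weight: float = 0.1,
                       ramp_start: int = 20000,
                       ramp_iters: int = 10000) -> float:
    """Linearly ramp the discriminator loss weight from 0 up to base_weight,
    starting at the end of burn-in, instead of applying it at full strength."""
    if iteration < ramp_start:
        return 0.0
    progress = min(1.0, (iteration - ramp_start) / ramp_iters)
    return base_weight * progress

# e.g. scale the adversarial terms before summing the total loss:
#   losses["loss_D_img_s"] *= dis_loss_weight_at(self.iter)
#   losses["loss_D_img_t"] *= dis_loss_weight_at(self.iter)
```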

yujheli commented 2 years ago

I recently borrowed a local machine with 8 GPU cards and will try to test the code locally as well.

helq2612 commented 2 years ago

Hi, I have trained voc -> clipart with the default setting (8 GPUs, default configuration), and the loss becomes NaN after iter=21k (SEMISUPNET.DIS_LOSS_WEIGHT is set to 0.1 by default).

I also trained another version with SEMISUPNET.DIS_LOSS_WEIGHT=0.01. The loss converged, but performance degraded over time: the best result was at iteration=25k with AP50=36.7128, after which it gradually dropped to AP50=29.7400 at iteration=100k.

I will try SEMISUPNET.DIS_LOSS_WEIGHT=0.05, to see if I can get better performance.

helq2612 commented 2 years ago

Training diverged after 25k iterations (with SEMISUPNET.DIS_LOSS_WEIGHT=0.05). The best performance is AP50=36.3788. The error message is here:

File "/project/codes/adaptive_teacher/adapteacher/modeling/proposal_generator/rpn.py", line 53, in forward
--
anchors, pred_objectness_logits, pred_anchor_deltas, images.image_sizes
File "/project/codes/detectron2/detectron2/modeling/proposal_generator/rpn.py", line 511, in predict_proposals
self.training,
File "/project/codes/detectron2/detectron2/modeling/proposal_generator/proposal_utils.py", line 109, in find_top_rpn_proposals
"Predicted boxes or scores contain Inf/NaN. Training has diverged."
FloatingPointError: Predicted boxes or scores contain Inf/NaN. Training has diverged.

I will try SEMISUPNET.DIS_LOSS_WEIGHT=0.025 next.

helq2612 commented 2 years ago

Hi @michaelku1, are you using MAX_SIZE_TRAIN=1200, MIN_SIZE_TRAIN=(600,), or the default settings in the config file (see the discussion at https://github.com/facebookresearch/adaptive_teacher/issues/23#issue-1291276957)? Could you share your config file and the bash script? Thank you!

> @michaelku1 I think dis loss weight >= 0.05 is necessary (at least I got my best results using either 0.05 or 0.1). Also, you are using bbox threshold = 0.8 instead of 0.7, right? 0.7 leads to bad performance, and the original config I uploaded used 0.7. I have updated the configs.
>
> Yes, I am using 0.8. I followed the exact settings from the paper except for the dis loss weight.

michaelku1 commented 2 years ago

> Hi @michaelku1, are you using MAX_SIZE_TRAIN=1200, MIN_SIZE_TRAIN=(600,), or the default settings in the config file (see the discussion in #23 (comment))? Could you share your config file and the bash script? Thank you!
>
> @michaelku1 I think dis loss weight >= 0.05 is necessary (at least I got my best results using either 0.05 or 0.1). Also, you are using bbox threshold = 0.8 instead of 0.7, right? 0.7 leads to bad performance, and the original config I uploaded used 0.7. I have updated the configs.
>
> Yes, I am using 0.8. I followed the exact settings from the paper except for the dis loss weight.

I think this is something I missed (I trained with multi-scale and max_size=1333). I will update the config and try again. Thanks for the reminder.

yujheli commented 2 years ago

I found that there might be a bug from when I tried to fix the distributed error in https://github.com/facebookresearch/adaptive_teacher/issues/5.

The discriminator is supposed to be randomly initialized after burn-in, yet it is being trained during burn-in.
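
A minimal sketch of the intended behavior, assuming the image-level discriminator is a small conv/linear head; the attribute and config names below (D_img, SEMISUPNET.BURN_UP_STEP) are inferred from the loss names and settings in this thread and may differ from the actual code:

```python
import torch.nn as nn

def reset_discriminator(discriminator: nn.Module) -> None:
    """Re-initialize the discriminator's parameters so that mutual learning
    starts from a random discriminator, as intended, rather than one that
    was already trained during burn-in."""
    for m in discriminator.modules():
        if isinstance(m, (nn.Conv2d, nn.Linear)):
            nn.init.normal_(m.weight, std=0.01)
            if m.bias is not None:
                nn.init.zeros_(m.bias)

# called once at the burn-in / mutual-learning boundary, e.g.:
#   if self.iter == cfg.SEMISUPNET.BURN_UP_STEP:  # illustrative names
#       reset_discriminator(self.model.D_img)
```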

yujheli commented 2 years ago

@michaelku1 @helq2612 Can you both re-clone the updated code? With the updated code I can run 55k iterations (including 25k burn-in) using lr=0.04 and dis_weight=0.1 without getting NaN so far.


Yet I only use 3 GPUs with batch size 6 and also get sub-optimal performance:

CUDA_VISIBLE_DEVICES=1,2,3 python -W ignore train_net.py --num-gpus 3 --config configs/faster_rcnn_R101_cross_clipart.yaml OUTPUT_DIR output/exp_clipart SOLVER.IMG_PER_BATCH_LABEL 6 SOLVER.IMG_PER_BATCH_UNLABEL 6

michaelku1 commented 2 years ago

> @michaelku1 @helq2612 Can you both re-clone the updated code? With the updated code I can run 55k iterations (including 25k burn-in) using lr=0.04 and dis_weight=0.1 without getting NaN so far.
>
> Yet I only use 3 GPUs with batch size 6 and also get sub-optimal performance:
>
> CUDA_VISIBLE_DEVICES=1,2,3 python -W ignore train_net.py --num-gpus 3 --config configs/faster_rcnn_R101_cross_clipart.yaml OUTPUT_DIR output/exp_clipart SOLVER.IMG_PER_BATCH_LABEL 6 SOLVER.IMG_PER_BATCH_UNLABEL 6

Thanks for the quick fix. Will give it a try.

helq2612 commented 2 years ago

Thank you! @yujheli I am training on it now with the updated files.

CUDA_VISIBLE_DEVICES=1,2,3,4 python -W ignore train_net.py --num-gpus 4 --config configs/faster_rcnn_R101_cross_clipart.yaml OUTPUT_DIR output/exp_clipart SOLVER.IMG_PER_BATCH_LABEL 16 SOLVER.IMG_PER_BATCH_UNLABEL 16

But I am using the default 20k burn-in, lr=0.04, and dis_weight=0.1 as specified in the config file.

The model performance after the 20k burn-in on clipart is AP50=18.0567 (teacher model). After 21k and 22k iterations, the student model performance is AP50=26.0047 and AP50=30.8676, respectively.

I will update my results tomorrow.

helq2612 commented 2 years ago

Sorry to bother you again, @yujheli, but training still diverged.

[07/11 16:00:47 d2.utils.events]:  eta: 1 day, 20:50:28  iter: 27618  total_loss: 2.31  loss_cls: 0.09012  loss_box_reg: 0.147  loss_rpn_cls: 0.04217  loss_rpn_loc: 0.12  loss_cls_pseudo: 0.05643  loss_box_reg_pseudo: 0.146  loss_rpn_cls_pseudo: 0.02843  loss_rpn_loc_pseudo: 0.1136  loss_D_img_s: 0.6456  loss_D_img_t: 0.8727  time: 2.2315  data_time: 0.2156  lr: 0.04  max_mem: 20318M

...
File "projects/adaptive_teacher/adapteacher/engine/trainer.py", line 605, in run_step_full_semisup
    record_all_unlabel_data, _, _, _ = self.model(
...
File "projects/adaptive_teacher/adapteacher/modeling/proposal_generator/rpn.py", line 52, in forward
    proposals = self.predict_proposals(
  File "projects/detectron2/detectron2/modeling/proposal_generator/rpn.py", line 503, in predict_proposals
    return find_top_rpn_proposals(
File "projects/detectron2/detectron2/modeling/proposal_generator/proposal_utils.py", line 108, in find_top_rpn_proposals
    raise FloatingPointError(
FloatingPointError: Predicted boxes or scores contain Inf/NaN. Training has diverged.

Looks like the generated pseudo labels contain Inf/NaN.
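
If that is the case, a defensive filter on the teacher's pseudo labels might help localize the problem (a sketch only, not part of this repo; it assumes detectron2 Instances carrying gt_boxes and, optionally, scores):

```python
import torch
from detectron2.structures import Instances

def drop_nonfinite_pseudo_labels(pseudo: Instances) -> Instances:
    """Drop pseudo boxes whose coordinates or scores are Inf/NaN before they
    are handed to the student, so a single bad teacher prediction cannot
    poison the RPN/ROI losses."""
    boxes = pseudo.gt_boxes.tensor  # (N, 4)
    keep = torch.isfinite(boxes).all(dim=1)
    if pseudo.has("scores"):
        keep &= torch.isfinite(pseudo.scores)
    return pseudo[keep]
```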

yujheli commented 2 years ago

@michaelku1 @helq2612 Using the config at https://github.com/facebookresearch/adaptive_teacher/issues/26 I can get 45.6 AP@50 before getting NaN.

onkarkris commented 2 years ago

> @michaelku1 @helq2612 Using the config at #26 I can get 45.6 AP@50 before getting NaN.

I can get only 41.27 AP@50 at 90k iterations using the config file provided with the code ("faster_rcnn_R101_cross_clipart_b4.yaml"); the NaN issue appears at iter 154499. Now I am checking whether I can reproduce the results using config #26.

onkarkris commented 2 years ago

> @michaelku1 @helq2612 Using the config at #26 I can get 45.6 AP@50 before getting NaN.
>
> I can get only 41.27 AP@50 at 90k iterations using the config file provided with the code ("faster_rcnn_R101_cross_clipart_b4.yaml"); the NaN issue appears at iter 154499. Now I am checking whether I can reproduce the results using config #26.

I can get 45.9 AP@50 with config #26

Weijiang-Xiong commented 1 year ago

> The discriminator is supposed to be randomly initialized after burn-in, yet it is being trained during burn-in.

Hello, is this still the case in the latest version of the code? I see that "build_discriminator()" is commented out; does that mean it's no longer required?

https://github.com/facebookresearch/adaptive_teacher/blob/6f50dfd78857e8830afe6004add3df7eae0477ca/adapteacher/engine/trainer.py#L526

yp000925 commented 1 year ago

Hi, I also hit the "FloatingPointError: Predicted boxes or scores contain Inf/NaN" problem, but at the very first stage (i.e. iter 1). I want to train from scratch for VOC-to-clipart and used the default config you provided in faster_rcnn_R101_cross_clipart.yaml. I found that if I substitute MODEL.WEIGHTS with your checkpoint (VOC2Clip_lr001_best.pth) instead of "detectron2://ImageNetPretrained/MSRA/R-101.pkl", it works. I am wondering whether this is because of the pre-trained weights? Any suggestion will be really appreciated. Thanks.


Weijiang-Xiong commented 1 year ago

> Hi, I also hit the "FloatingPointError: Predicted boxes or scores contain Inf/NaN" problem, but at the very first stage (i.e. iter 1). I want to train from scratch for VOC-to-clipart and used the default config you provided in faster_rcnn_R101_cross_clipart.yaml. I found that if I substitute MODEL.WEIGHTS with your checkpoint (VOC2Clip_lr001_best.pth) instead of "detectron2://ImageNetPretrained/MSRA/R-101.pkl", it works. I am wondering whether this is because of the pre-trained weights? Any suggestion will be really appreciated. Thanks.

I used this idea in my own implementation, and in my case the Inf/NaN happens in the regression loss. The reason is that a bad teacher model can predict many zeros, for both position and scale, even if you filter the predictions with a score threshold. If your code does not filter out 0-sized boxes, the box loss (GIoU in my case) becomes invalid when it sees them. Normally people wouldn't expect a box to have zero size when implementing the loss, because ground truth never contains that kind of box, but pseudo labels might.
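
A minimal sketch of such a filter, assuming XYXY boxes (illustrative, not this repo's code):

```python
import torch

def keep_valid_pseudo_boxes(boxes: torch.Tensor, min_size: float = 1e-3) -> torch.Tensor:
    """Return a boolean mask over (N, 4) XYXY boxes keeping only boxes with
    positive width and height, so a GIoU/regression loss never sees a
    degenerate (zero-sized) pseudo-label target."""
    widths = boxes[:, 2] - boxes[:, 0]
    heights = boxes[:, 3] - boxes[:, 1]
    return (widths > min_size) & (heights > min_size)

# usage:
#   mask = keep_valid_pseudo_boxes(pseudo_boxes)
#   pseudo_boxes, pseudo_classes = pseudo_boxes[mask], pseudo_classes[mask]
```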

yp000925 commented 1 year ago

> Hi, I also hit the "FloatingPointError: Predicted boxes or scores contain Inf/NaN" problem, but at the very first stage (i.e. iter 1). I want to train from scratch for VOC-to-clipart and used the default config you provided in faster_rcnn_R101_cross_clipart.yaml. I found that if I substitute MODEL.WEIGHTS with your checkpoint (VOC2Clip_lr001_best.pth) instead of "detectron2://ImageNetPretrained/MSRA/R-101.pkl", it works. I am wondering whether this is because of the pre-trained weights? Any suggestion will be really appreciated. Thanks.
>
> I used this idea in my own implementation, and in my case the Inf/NaN happens in the regression loss. The reason is that a bad teacher model can predict many zeros, for both position and scale, even if you filter the predictions with a score threshold. If your code does not filter out 0-sized boxes, the box loss (GIoU in my case) becomes invalid when it sees them. Normally people wouldn't expect a box to have zero size when implementing the loss, because ground truth never contains that kind of box, but pseudo labels might.

Many thanks for your reply. Have you solved that issue in your implementation? Could you please give some instructions on that?

anranbixin commented 1 year ago

Hello, could you please share the configuration file that worked for you? I also hit the following problem during my experiments (at the first iteration): FloatingPointError: Predicted boxes or scores contain Inf/NaN. Training has diverged. After working through this issue thread, changing the weights, and updating my environment, the problem was still not resolved. What is your setup? I tried a single GPU and multiple GPUs (4). My environment is python 3.8, torch=1.9, cuda=11.6, detectron2=0.5.