GuangxingHan / Meta-Faster-R-CNN

Code for AAAI 2022 Oral paper: 'Meta Faster R-CNN: Towards Accurate Few-Shot Object Detection with Attentive Feature Alignment'

NaN: Dear author, thanks for your great work. I am trying to run your code, but it always reports a NaN error. The error traceback is below; could you have a look? Thanks in advance! #15

Open MrCrightH opened 2 years ago

MrCrightH commented 2 years ago
    proposals = self.predict_proposals(
  File "/home/sjk/anaconda3/envs/chpy/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/home/1Tm2/CH/Meta-Faster-R-CNN/meta_faster_rcnn/modeling/fsod/fsod_rpn.py", line 523, in predict_proposals
    return find_top_rpn_proposals(
  File "/home/sjk/anaconda3/envs/chpy/lib/python3.8/site-packages/detectron2/modeling/proposal_generator/proposal_utils.py", line 103, in find_top_rpn_proposals
    raise FloatingPointError(
FloatingPointError: Predicted boxes or scores contain Inf/NaN. Training has diverged.

GuangxingHan commented 2 years ago

Thanks for the feedback. Can you clarify which script you are using and which step you are on?

MrCrightH commented 2 years ago

meta_training_coco_resnet101_stage_2.yaml. When I run this step, it only gets through a few hundred iterations before failing with the error "Predicted boxes or scores contain Inf/NaN".

GuangxingHan commented 2 years ago

Are you trying to reproduce our experiments on the COCO dataset? This is weird. Can you show me the full running log so I can better understand what happened in your training?

MrCrightH commented 2 years ago

Yes! I have finished the meta_training_coco stage_1 of the meta_training_coco_multi... , but when I run meta_training_coco_resnet101_stage_2, it shows the following:

[10/17 15:21:08] d2.data.datasets.coco INFO: Loading datasets/coco/new_annotations/final_split_non_voc_instances_train2014.json takes 3.08 seconds.
[10/17 15:21:09] d2.data.datasets.coco INFO: Loaded 117264 images in COCO format from datasets/coco/new_annotations/final_split_non_voc_instances_train2014.json
[10/17 15:21:11] d2.data.build INFO: Removed 54575 images with no usable annotations. 62689 images left.
[10/17 15:21:17] d2.data.build INFO: Removed 0 images with no usable annotations. 94195 images left.
[10/17 15:21:21] d2.data.build INFO: Distribution of instances among all 80 categories:

| category | #instances | category | #instances | category | #instances |
|:--|:--|:--|:--|:--|:--|
| person | 0 | bicycle | 0 | car | 0 |
| motorcycle | 0 | airplane | 0 | bus | 0 |
| train | 0 | truck | 4479 | boat | 0 |
| traffic light | 1522 | fire hydrant | 1172 | stop sign | 1119 |
| parking meter | 514 | bench | 3997 | bird | 0 |
| cat | 0 | dog | 0 | horse | 0 |
| sheep | 0 | cow | 0 | elephant | 3983 |
| bear | 1252 | zebra | 4064 | giraffe | 4530 |
| backpack | 2628 | umbrella | 3678 | handbag | 3022 |
| tie | 2625 | suitcase | 3054 | frisbee | 1352 |
| skis | 2567 | snowboard | 1282 | sports ball | 846 |
| kite | 1346 | baseball bat | 1015 | baseball gl.. | 759 |
| skateboard | 2906 | surfboard | 3389 | tennis racket | 1713 |
| bottle | 0 | wine glass | 2068 | cup | 6238 |
| fork | 2498 | knife | 2765 | spoon | 2040 |
| bowl | 5265 | banana | 3755 | apple | 2271 |
| sandwich | 2962 | orange | 2850 | broccoli | 4319 |
| carrot | 2959 | hot dog | 1794 | pizza | 3848 |
| donut | 4166 | cake | 3448 | chair | 0 |
| couch | 0 | potted plant | 0 | bed | 2985 |
| dining table | 0 | toilet | 3477 | tv | 0 |
| laptop | 2307 | mouse | 927 | remote | 1492 |
| keyboard | 1362 | cell phone | 2625 | microwave | 654 |
| oven | 1332 | toaster | 71 | sink | 2869 |
| refrigerator | 1283 | book | 3851 | clock | 2682 |
| vase | 2874 | scissors | 953 | teddy bear | 3334 |
| hair drier | 116 | toothbrush | 776 | | |
| total | 148030 | | | | |

[10/17 15:21:21] d2.data.common INFO: Serializing 94195 elements to byte tensors and concatenating them all ...
[10/17 15:21:22] d2.data.common INFO: Serialized dataset takes 50.18 MiB
[10/17 15:21:22] meta_faster_rcnn.data.build INFO: Using training sampler TrainingSampler
[10/17 15:21:22] fvcore.common.checkpoint INFO: [Checkpointer] Loading from ./output/fsod/meta_training_coco_resnet101_stage_1/model_final.pth ...
[10/17 15:21:23] fvcore.common.checkpoint WARNING: Some model parameters or buffers are not found in the checkpoint:
proposal_generator.rpn_head.anchor_deltas_cat.{bias, weight}
proposal_generator.rpn_head.anchor_deltas_diff.{bias, weight}
proposal_generator.rpn_head.cat_fc.0.{bias, weight}
proposal_generator.rpn_head.diff_fc.0.{bias, weight}
proposal_generator.rpn_head.objectness_logits_cat.{bias, weight}
proposal_generator.rpn_head.objectness_logits_diff.{bias, weight}
roi_heads.box_predictor.bbox_pred_cor.{bias, weight}
roi_heads.box_predictor.bbox_pred_fc.{bias, weight}
roi_heads.box_predictor.bbox_pred_gd.{bias, weight}
roi_heads.box_predictor.cls_score_gd.{bias, weight}
roi_heads.box_predictor.conv_1_gd.weight
roi_heads.box_predictor.conv_2_gd.weight
roi_heads.box_predictor.conv_3_gd.weight
roi_heads.box_predictor.norm.{bias, weight}
[10/17 15:21:23] d2.engine.train_loop INFO: Starting training from iteration 0
[10/17 15:22:04] d2.utils.events INFO: eta: 10:42:19 iter: 19 total_loss: 2.142 loss_cls: 1.69 loss_box_reg: 0.3216 loss_rpn_cls: 0.02834 loss_rpn_loc: 0.01064 time: 1.9260 data_time: 0.1603 lr: 0.0001342 max_mem: 19052M
[10/17 15:22:43] d2.utils.events INFO: eta: 10:42:48 iter: 39 total_loss: 1.361 loss_cls: 1.111 loss_box_reg: 0.261 loss_rpn_cls: 0.02525 loss_rpn_loc: 0.009109 time: 1.9303 data_time: 0.0777 lr: 0.0001702 max_mem: 19055M
[10/17 15:23:22] d2.utils.events INFO: eta: 10:44:28 iter: 59 total_loss: 1.251 loss_cls: 0.8241 loss_box_reg: 0.3086 loss_rpn_cls: 0.02666 loss_rpn_loc: 0.01125 time: 1.9439 data_time: 0.0875 lr: 0.0002062 max_mem: 19055M
[10/17 15:24:02] d2.utils.events INFO: eta: 10:46:26 iter: 79 total_loss: 1.308 loss_cls: 0.8696 loss_box_reg: 0.2948 loss_rpn_cls: 0.03027 loss_rpn_loc: 0.01091 time: 1.9531 data_time: 0.0828 lr: 0.0002422 max_mem: 19056M
[10/17 15:24:42] d2.utils.events INFO: eta: 10:50:33 iter: 99 total_loss: 0.9825 loss_cls: 0.605 loss_box_reg: 0.3051 loss_rpn_cls: 0.02744 loss_rpn_loc: 0.009897 time: 1.9619 data_time: 0.0792 lr: 0.0002782 max_mem: 19056M
[10/17 15:25:22] d2.utils.events INFO: eta: 10:51:56 iter: 119 total_loss: 1.068 loss_cls: 0.637 loss_box_reg: 0.3279 loss_rpn_cls: 0.02772 loss_rpn_loc: 0.008074 time: 1.9676 data_time: 0.0862 lr: 0.0003142 max_mem: 19056M
[10/17 15:26:01] d2.utils.events INFO: eta: 10:52:43 iter: 139 total_loss: 0.9612 loss_cls: 0.5965 loss_box_reg: 0.3174 loss_rpn_cls: 0.02713 loss_rpn_loc: 0.009668 time: 1.9712 data_time: 0.0804 lr: 0.0003502 max_mem: 19056M
[10/17 15:26:41] d2.utils.events INFO: eta: 10:52:03 iter: 159 total_loss: 0.9253 loss_cls: 0.5459 loss_box_reg: 0.33 loss_rpn_cls: 0.02211 loss_rpn_loc: 0.009601 time: 1.9721 data_time: 0.0765 lr: 0.0003862 max_mem: 19056M
[10/17 15:27:21] d2.utils.events INFO: eta: 10:52:04 iter: 179 total_loss: 0.8336 loss_cls: 0.4791 loss_box_reg: 0.3217 loss_rpn_cls: 0.02815 loss_rpn_loc: 0.01103 time: 1.9759 data_time: 0.0812 lr: 0.0004222 max_mem: 19056M
[10/17 15:28:01] d2.utils.events INFO: eta: 10:51:21 iter: 199 total_loss: 0.9207 loss_cls: 0.5247 loss_box_reg: 0.3523 loss_rpn_cls: 0.02278 loss_rpn_loc: 0.008193 time: 1.9766 data_time: 0.0805 lr: 0.0004582 max_mem: 19056M
[10/17 15:28:41] d2.utils.events INFO: eta: 10:51:00 iter: 219 total_loss: 0.8944 loss_cls: 0.501 loss_box_reg: 0.274 loss_rpn_cls: 0.02453 loss_rpn_loc: 0.01002 time: 1.9773 data_time: 0.0839 lr: 0.0004942 max_mem: 19058M
[10/17 15:29:20] d2.utils.events INFO: eta: 10:50:59 iter: 239 total_loss: 0.953 loss_cls: 0.646 loss_box_reg: 0.3021 loss_rpn_cls: 0.02511 loss_rpn_loc: 0.00783 time: 1.9780 data_time: 0.0844 lr: 0.0005302 max_mem: 19058M
[10/17 15:30:00] d2.utils.events INFO: eta: 10:51:08 iter: 259 total_loss: 0.9736 loss_cls: 0.6806 loss_box_reg: 0.2666 loss_rpn_cls: 0.02316 loss_rpn_loc: 0.00909 time: 1.9785 data_time: 0.0843 lr: 0.0005662 max_mem: 19058M
[10/17 15:30:40] d2.utils.events INFO: eta: 10:50:52 iter: 279 total_loss: 1.277 loss_cls: 1.035 loss_box_reg: 0.2421 loss_rpn_cls: 0.04444 loss_rpn_loc: 0.007862 time: 1.9794 data_time: 0.0851 lr: 0.0006022 max_mem: 19058M
[10/17 15:31:20] d2.utils.events INFO: eta: 10:50:21 iter: 299 total_loss: 1.427 loss_cls: 1.128 loss_box_reg: 0.2823 loss_rpn_cls: 0.02622 loss_rpn_loc: 0.008311 time: 1.9800 data_time: 0.0825 lr: 0.0006382 max_mem: 19058M
[10/17 15:31:59] d2.utils.events INFO: eta: 10:49:41 iter: 319 total_loss: 2.897 loss_cls: 2.585 loss_box_reg: 0.2703 loss_rpn_cls: 0.03668 loss_rpn_loc: 0.008904 time: 1.9800 data_time: 0.0794 lr: 0.0006742 max_mem: 19059M
[10/17 15:32:21] d2.engine.train_loop ERROR: Exception during training:
Traceback (most recent call last):
  File "/home/sjk/anaconda3/envs/chpy/lib/python3.8/site-packages/detectron2/engine/train_loop.py", line 149, in train
    self.run_step()
  File "/home/sjk/anaconda3/envs/chpy/lib/python3.8/site-packages/detectron2/engine/defaults.py", line 494, in run_step
    self._trainer.run_step()
  File "/home/sjk/anaconda3/envs/chpy/lib/python3.8/site-packages/detectron2/engine/train_loop.py", line 273, in run_step
    loss_dict = self.model(data)
  File "/home/sjk/anaconda3/envs/chpy/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/1Tm2/CH/Meta-Faster-R-CNN/meta_faster_rcnn/modeling/fsod/fsod_rcnn.py", line 206, in forward
    pos_proposals, pos_anchors, pos_pred_objectness_logits, pos_gt_labels, pos_pred_anchor_deltas, pos_gt_boxes = self.proposal_generator(query_images, pos_features, pos_support_features_pool, query_gt_instances) # attention rpn
  File "/home/sjk/anaconda3/envs/chpy/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/1Tm2/CH/Meta-Faster-R-CNN/meta_faster_rcnn/modeling/fsod/fsod_rpn.py", line 490, in forward
    proposals = self.predict_proposals(
  File "/home/sjk/anaconda3/envs/chpy/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/home/1Tm2/CH/Meta-Faster-R-CNN/meta_faster_rcnn/modeling/fsod/fsod_rpn.py", line 523, in predict_proposals
    return find_top_rpn_proposals(
  File "/home/sjk/anaconda3/envs/chpy/lib/python3.8/site-packages/detectron2/modeling/proposal_generator/proposal_utils.py", line 103, in find_top_rpn_proposals
    raise FloatingPointError(
FloatingPointError: Predicted boxes or scores contain Inf/NaN. Training has diverged.

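In the log above, total_loss and loss_cls climb over the last few reported intervals before the FloatingPointError is raised. A minimal sketch of extracting that trend from a log in this layout follows. The regex assumes the `iter: N total_loss: X loss_cls: Y` field order seen in this thread; `parse_losses`, `first_spike`, and the `ratio` threshold are hypothetical names and values chosen for illustration, not part of detectron2.

```python
import re

# Matches the metric fields printed in this thread, e.g.
# "... iter: 319 total_loss: 2.897 loss_cls: 2.585 ..."
LINE_RE = re.compile(
    r"iter: (\d+)\s+total_loss: ([\d.eE+-]+)\s+loss_cls: ([\d.eE+-]+)"
)

def parse_losses(log_text):
    """Return a list of (iteration, total_loss, loss_cls) tuples."""
    return [(int(i), float(t), float(c)) for i, t, c in LINE_RE.findall(log_text)]

def first_spike(records, ratio=1.3):
    """Return the first iteration whose total_loss exceeds `ratio` times the
    lowest total_loss seen so far, or None if there is no such iteration.
    The default ratio is an arbitrary illustrative threshold."""
    best = float("inf")
    for iteration, total, _ in records:
        if total > ratio * best:
            return iteration
        best = min(best, total)
    return None

# A few entries copied from the log in this thread.
sample = (
    "iter: 179 total_loss: 0.8336 loss_cls: 0.4791 "
    "iter: 259 total_loss: 0.9736 loss_cls: 0.6806 "
    "iter: 279 total_loss: 1.277 loss_cls: 1.035 "
    "iter: 299 total_loss: 1.427 loss_cls: 1.128 "
    "iter: 319 total_loss: 2.897 loss_cls: 2.585"
)
records = parse_losses(sample)
print(first_spike(records))
```

On these entries the helper flags iteration 279, roughly 40 iterations before the run aborts, so the divergence is visible in loss_cls a little ahead of the exception.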
GuangxingHan commented 2 years ago

May I know your hyper-parameter configuration, e.g., the batch size? Did you change the default values?

MrCrightH commented 2 years ago

Due to my limited GPU memory, I had to change the batch size to 4 and the learning rate to 0.0005. Can you please tell me how to adjust it?

############################################
meta_training_coco_resnet101_stage_1.yaml:

_BASE_: "Base-FSOD-C4.yaml"
MODEL:
  WEIGHTS: "/home/1Tm2/CH/Meta-Faster-R-CNN/R-101.pkl"
  MASK_ON: False
  RESNETS:
    DEPTH: 101
  BACKBONE:
    FREEZE_AT: 2
  ROI_HEADS:
    SCORE_THRESH_TEST: 0.0
  RPN:
    PRE_NMS_TOPK_TEST: 12000
    POST_NMS_TOPK_TEST: 100
  FEWX_BASELINE: True
  WITH_ALIGNMENT: False
OUTPUT_DIR: './output/fsod/meta_training_coco_resnet101_stage_1'
DATASETS:
  TRAIN: ("coco_2014_train_nonvoc",)
  TEST: ("coco_2014_val",)
  TEST_SHOTS: (1,2,3,5,10,30)
INPUT:
  FS:
    SUPPORT_WAY: 2
    SUPPORT_SHOT: 30
  MIN_SIZE_TRAIN: (640, 672, 704, 736, 768, 800)
  MAX_SIZE_TRAIN: 1333
  MIN_SIZE_TEST: 600
  MAX_SIZE_TEST: 1000
SOLVER:
  IMS_PER_BATCH: 4 #8
  BASE_LR: 0.0005 #0.001
  STEPS: (30000, 40000)
  MAX_ITER: 40001
  WARMUP_ITERS: 1000
  WARMUP_FACTOR: 0.1
  CHECKPOINT_PERIOD: 10000
  HEAD_LR_FACTOR: 2.0
TEST:
  EVAL_PERIOD: 40000

##########################################
meta_training_coco_resnet101_stage_2.yaml:

_BASE_: "Base-FSOD-C4.yaml"
MODEL:
  WEIGHTS: "./output/fsod/meta_training_coco_resnet101_stage_1/model_final.pth"
  MASK_ON: False
  RESNETS:
    DEPTH: 101
  BACKBONE:
    FREEZE_AT: 2
  ROI_HEADS:
    SCORE_THRESH_TEST: 0.0
  RPN:
    PRE_NMS_TOPK_TEST: 12000
    POST_NMS_TOPK_TEST: 100
  FEWX_BASELINE: False
  WITH_ALIGNMENT: False
OUTPUT_DIR: './output/fsod/meta_training_coco_resnet101_stage_2'
DATASETS:
  TRAIN: ("coco_2014_train_nonvoc",)
  TEST: ("coco_2014_val",)
  TEST_SHOTS: (1,2,3,5,10,30)
INPUT:
  FS:
    SUPPORT_WAY: 2
    SUPPORT_SHOT: 30
  MIN_SIZE_TRAIN: (640, 672, 704, 736, 768, 800)
  MAX_SIZE_TRAIN: 1333
  MIN_SIZE_TEST: 600
  MAX_SIZE_TEST: 1000
SOLVER:
  IMS_PER_BATCH: 4 #8
  BASE_LR: 0.001
  STEPS: (15000, 20000)
  MAX_ITER: 20001
  WARMUP_ITERS: 500
  WARMUP_FACTOR: 0.1
  CHECKPOINT_PERIOD: 20001
  HEAD_LR_FACTOR: 2.0
TEST:
  EVAL_PERIOD: 10000
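
The lr values in the training log above can be checked against the stage-2 SOLVER settings. The sketch below assumes a linear warmup of the form `factor = WARMUP_FACTOR * (1 - alpha) + alpha` with `alpha = iter / WARMUP_ITERS`, which is my understanding of detectron2's default linear warmup; `warmup_lr` is a hypothetical helper written for this check, not a detectron2 function.

```python
def warmup_lr(base_lr, iteration, warmup_iters, warmup_factor):
    """Learning rate under linear warmup (assumed formula, see the note above)."""
    if iteration >= warmup_iters:
        return base_lr
    alpha = iteration / warmup_iters
    return base_lr * (warmup_factor * (1 - alpha) + alpha)

# Stage-2 values posted in this thread:
# BASE_LR 0.001, WARMUP_ITERS 500, WARMUP_FACTOR 0.1
for it in (19, 39, 319):
    print(it, round(warmup_lr(0.001, it, 500, 0.1), 7))
```

Under these assumptions the helper gives 0.0001342, 0.0001702, and 0.0006742 for iterations 19, 39, and 319, matching the logged lr values. That suggests the stage-2 run used BASE_LR: 0.001 with a batch size of 4, rather than the 0.0005 set in the stage-1 config, and that it diverged while the learning rate was still warming up.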

GuangxingHan commented 2 years ago

Unfortunately, our model works best with batch_size >= 8 in the second step. Using a smaller batch_size leads to unstable training. You can try decreasing BASE_LR, increasing WARMUP_ITERS, decreasing SUPPORT_SHOT, or other ways to compensate for the small batch_size, but the detection accuracy may not be guaranteed.
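
One common starting point for the first two suggestions is the linear scaling heuristic: scale BASE_LR by the batch-size ratio and stretch the iteration-based schedule by the inverse ratio. This is a general rule of thumb, not a recipe given by the authors, and it does not guarantee stable training here. The sketch assumes the stage-2 schedule values posted above are the defaults for a batch size of 8 (the `#8` comment in the config); `scale_solver` is a hypothetical helper.

```python
def scale_solver(ref, new_batch):
    """Rescale a SOLVER-like dict from ref['IMS_PER_BATCH'] to new_batch
    using the linear scaling heuristic (an assumption, not the authors' recipe)."""
    k = new_batch / ref["IMS_PER_BATCH"]
    return {
        "IMS_PER_BATCH": new_batch,
        "BASE_LR": ref["BASE_LR"] * k,                       # lower LR for a smaller batch
        "WARMUP_ITERS": int(round(ref["WARMUP_ITERS"] / k)),  # longer warmup
        "MAX_ITER": int(round(ref["MAX_ITER"] / k)),          # same number of images seen
        "STEPS": tuple(int(round(s / k)) for s in ref["STEPS"]),
    }

# Assumed reference: stage-2 values from this thread at batch size 8.
stage2_ref = {
    "IMS_PER_BATCH": 8,
    "BASE_LR": 0.001,
    "WARMUP_ITERS": 500,
    "MAX_ITER": 20001,
    "STEPS": (15000, 20000),
}
scaled = scale_solver(stage2_ref, 4)
print(scaled)
```

For a batch size of 4 this yields BASE_LR 0.0005 and WARMUP_ITERS 1000. If training still diverges, lowering BASE_LR further or reducing SUPPORT_SHOT, as suggested above, are the remaining knobs.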