MrCrightH opened this issue 2 years ago (status: Open)
Thanks for the feedback. Can you clarify which script you are using and which step you are in?
meta_training_coco_resnet101_stage_2.yaml. When I run this step, it only trains for a few hundred iterations before failing with the error `Predicted boxes or scores contain Inf/NaN`.
Are you trying to reproduce our experiments on the COCO dataset? This is weird. Can you show me the full training log so I can better understand what happened during your training?
Yes! I have finished the meta_training_coco stage_1 of the meta_training_coco_multi... , but when I run meta_training_coco_resnet101_stage_2, it shows the following:
[10/17 15:21:08] d2.data.datasets.coco INFO: Loading datasets/coco/new_annotations/final_split_non_voc_instances_train2014.json takes 3.08 seconds.
[10/17 15:21:09] d2.data.datasets.coco INFO: Loaded 117264 images in COCO format from datasets/coco/new_annotations/final_split_non_voc_instances_train2014.json
[10/17 15:21:11] d2.data.build INFO: Removed 54575 images with no usable annotations. 62689 images left.
[10/17 15:21:17] d2.data.build INFO: Removed 0 images with no usable annotations. 94195 images left.
[10/17 15:21:21] d2.data.build INFO: Distribution of instances among all 80 categories:

| category | #instances | category | #instances | category | #instances |
|---|---|---|---|---|---|
| person | 0 | bicycle | 0 | car | 0 |
| motorcycle | 0 | airplane | 0 | bus | 0 |
| train | 0 | truck | 4479 | boat | 0 |
| traffic light | 1522 | fire hydrant | 1172 | stop sign | 1119 |
| parking meter | 514 | bench | 3997 | bird | 0 |
| cat | 0 | dog | 0 | horse | 0 |
| sheep | 0 | cow | 0 | elephant | 3983 |
| bear | 1252 | zebra | 4064 | giraffe | 4530 |
| backpack | 2628 | umbrella | 3678 | handbag | 3022 |
| tie | 2625 | suitcase | 3054 | frisbee | 1352 |
| skis | 2567 | snowboard | 1282 | sports ball | 846 |
| kite | 1346 | baseball bat | 1015 | baseball gl.. | 759 |
| skateboard | 2906 | surfboard | 3389 | tennis racket | 1713 |
| bottle | 0 | wine glass | 2068 | cup | 6238 |
| fork | 2498 | knife | 2765 | spoon | 2040 |
| bowl | 5265 | banana | 3755 | apple | 2271 |
| sandwich | 2962 | orange | 2850 | broccoli | 4319 |
| carrot | 2959 | hot dog | 1794 | pizza | 3848 |
| donut | 4166 | cake | 3448 | chair | 0 |
| couch | 0 | potted plant | 0 | bed | 2985 |
| dining table | 0 | toilet | 3477 | tv | 0 |
| laptop | 2307 | mouse | 927 | remote | 1492 |
| keyboard | 1362 | cell phone | 2625 | microwave | 654 |
| oven | 1332 | toaster | 71 | sink | 2869 |
| refrigerator | 1283 | book | 3851 | clock | 2682 |
| vase | 2874 | scissors | 953 | teddy bear | 3334 |
| hair drier | 116 | toothbrush | 776 | | |
| total | 148030 | | | | |
[10/17 15:21:21] d2.data.common INFO: Serializing 94195 elements to byte tensors and concatenating them all ...
[10/17 15:21:22] d2.data.common INFO: Serialized dataset takes 50.18 MiB
[10/17 15:21:22] meta_faster_rcnn.data.build INFO: Using training sampler TrainingSampler
[10/17 15:21:22] fvcore.common.checkpoint INFO: [Checkpointer] Loading from ./output/fsod/meta_training_coco_resnet101_stage_1/model_final.pth ...
[10/17 15:21:23] fvcore.common.checkpoint WARNING: Some model parameters or buffers are not found in the checkpoint:
proposal_generator.rpn_head.anchor_deltas_cat.{bias, weight}
proposal_generator.rpn_head.anchor_deltas_diff.{bias, weight}
proposal_generator.rpn_head.cat_fc.0.{bias, weight}
proposal_generator.rpn_head.diff_fc.0.{bias, weight}
proposal_generator.rpn_head.objectness_logits_cat.{bias, weight}
proposal_generator.rpn_head.objectness_logits_diff.{bias, weight}
roi_heads.box_predictor.bbox_pred_cor.{bias, weight}
roi_heads.box_predictor.bbox_pred_fc.{bias, weight}
roi_heads.box_predictor.bbox_pred_gd.{bias, weight}
roi_heads.box_predictor.cls_score_gd.{bias, weight}
roi_heads.box_predictor.conv_1_gd.weight
roi_heads.box_predictor.conv_2_gd.weight
roi_heads.box_predictor.conv_3_gd.weight
roi_heads.box_predictor.norm.{bias, weight}
[10/17 15:21:23] d2.engine.train_loop INFO: Starting training from iteration 0
[10/17 15:22:04] d2.utils.events INFO: eta: 10:42:19 iter: 19 total_loss: 2.142 loss_cls: 1.69 loss_box_reg: 0.3216 loss_rpn_cls: 0.02834 loss_rpn_loc: 0.01064 time: 1.9260 data_time: 0.1603 lr: 0.0001342 max_mem: 19052M
[10/17 15:22:43] d2.utils.events INFO: eta: 10:42:48 iter: 39 total_loss: 1.361 loss_cls: 1.111 loss_box_reg: 0.261 loss_rpn_cls: 0.02525 loss_rpn_loc: 0.009109 time: 1.9303 data_time: 0.0777 lr: 0.0001702 max_mem: 19055M
[10/17 15:23:22] d2.utils.events INFO: eta: 10:44:28 iter: 59 total_loss: 1.251 loss_cls: 0.8241 loss_box_reg: 0.3086 loss_rpn_cls: 0.02666 loss_rpn_loc: 0.01125 time: 1.9439 data_time: 0.0875 lr: 0.0002062 max_mem: 19055M
[10/17 15:24:02] d2.utils.events INFO: eta: 10:46:26 iter: 79 total_loss: 1.308 loss_cls: 0.8696 loss_box_reg: 0.2948 loss_rpn_cls: 0.03027 loss_rpn_loc: 0.01091 time: 1.9531 data_time: 0.0828 lr: 0.0002422 max_mem: 19056M
[10/17 15:24:42] d2.utils.events INFO: eta: 10:50:33 iter: 99 total_loss: 0.9825 loss_cls: 0.605 loss_box_reg: 0.3051 loss_rpn_cls: 0.02744 loss_rpn_loc: 0.009897 time: 1.9619 data_time: 0.0792 lr: 0.0002782 max_mem: 19056M
[10/17 15:25:22] d2.utils.events INFO: eta: 10:51:56 iter: 119 total_loss: 1.068 loss_cls: 0.637 loss_box_reg: 0.3279 loss_rpn_cls: 0.02772 loss_rpn_loc: 0.008074 time: 1.9676 data_time: 0.0862 lr: 0.0003142 max_mem: 19056M
[10/17 15:26:01] d2.utils.events INFO: eta: 10:52:43 iter: 139 total_loss: 0.9612 loss_cls: 0.5965 loss_box_reg: 0.3174 loss_rpn_cls: 0.02713 loss_rpn_loc: 0.009668 time: 1.9712 data_time: 0.0804 lr: 0.0003502 max_mem: 19056M
[10/17 15:26:41] d2.utils.events INFO: eta: 10:52:03 iter: 159 total_loss: 0.9253 loss_cls: 0.5459 loss_box_reg: 0.33 loss_rpn_cls: 0.02211 loss_rpn_loc: 0.009601 time: 1.9721 data_time: 0.0765 lr: 0.0003862 max_mem: 19056M
[10/17 15:27:21] d2.utils.events INFO: eta: 10:52:04 iter: 179 total_loss: 0.8336 loss_cls: 0.4791 loss_box_reg: 0.3217 loss_rpn_cls: 0.02815 loss_rpn_loc: 0.01103 time: 1.9759 data_time: 0.0812 lr: 0.0004222 max_mem: 19056M
[10/17 15:28:01] d2.utils.events INFO: eta: 10:51:21 iter: 199 total_loss: 0.9207 loss_cls: 0.5247 loss_box_reg: 0.3523 loss_rpn_cls: 0.02278 loss_rpn_loc: 0.008193 time: 1.9766 data_time: 0.0805 lr: 0.0004582 max_mem: 19056M
[10/17 15:28:41] d2.utils.events INFO: eta: 10:51:00 iter: 219 total_loss: 0.8944 loss_cls: 0.501 loss_box_reg: 0.274 loss_rpn_cls: 0.02453 loss_rpn_loc: 0.01002 time: 1.9773 data_time: 0.0839 lr: 0.0004942 max_mem: 19058M
[10/17 15:29:20] d2.utils.events INFO: eta: 10:50:59 iter: 239 total_loss: 0.953 loss_cls: 0.646 loss_box_reg: 0.3021 loss_rpn_cls: 0.02511 loss_rpn_loc: 0.00783 time: 1.9780 data_time: 0.0844 lr: 0.0005302 max_mem: 19058M
[10/17 15:30:00] d2.utils.events INFO: eta: 10:51:08 iter: 259 total_loss: 0.9736 loss_cls: 0.6806 loss_box_reg: 0.2666 loss_rpn_cls: 0.02316 loss_rpn_loc: 0.00909 time: 1.9785 data_time: 0.0843 lr: 0.0005662 max_mem: 19058M
[10/17 15:30:40] d2.utils.events INFO: eta: 10:50:52 iter: 279 total_loss: 1.277 loss_cls: 1.035 loss_box_reg: 0.2421 loss_rpn_cls: 0.04444 loss_rpn_loc: 0.007862 time: 1.9794 data_time: 0.0851 lr: 0.0006022 max_mem: 19058M
[10/17 15:31:20] d2.utils.events INFO: eta: 10:50:21 iter: 299 total_loss: 1.427 loss_cls: 1.128 loss_box_reg: 0.2823 loss_rpn_cls: 0.02622 loss_rpn_loc: 0.008311 time: 1.9800 data_time: 0.0825 lr: 0.0006382 max_mem: 19058M
[10/17 15:31:59] d2.utils.events INFO: eta: 10:49:41 iter: 319 total_loss: 2.897 loss_cls: 2.585 loss_box_reg: 0.2703 loss_rpn_cls: 0.03668 loss_rpn_loc: 0.008904 time: 1.9800 data_time: 0.0794 lr: 0.0006742 max_mem: 19059M
[10/17 15:32:21] d2.engine.train_loop ERROR: Exception during training:
Traceback (most recent call last):
  File "/home/sjk/anaconda3/envs/chpy/lib/python3.8/site-packages/detectron2/engine/train_loop.py", line 149, in train
    self.run_step()
  File "/home/sjk/anaconda3/envs/chpy/lib/python3.8/site-packages/detectron2/engine/defaults.py", line 494, in run_step
    self._trainer.run_step()
  File "/home/sjk/anaconda3/envs/chpy/lib/python3.8/site-packages/detectron2/engine/train_loop.py", line 273, in run_step
    loss_dict = self.model(data)
  File "/home/sjk/anaconda3/envs/chpy/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/1Tm2/CH/Meta-Faster-R-CNN/meta_faster_rcnn/modeling/fsod/fsod_rcnn.py", line 206, in forward
    pos_proposals, pos_anchors, pos_pred_objectness_logits, pos_gt_labels, pos_pred_anchor_deltas, pos_gt_boxes = self.proposal_generator(query_images, pos_features, pos_support_features_pool, query_gt_instances) # attention rpn
  File "/home/sjk/anaconda3/envs/chpy/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/1Tm2/CH/Meta-Faster-R-CNN/meta_faster_rcnn/modeling/fsod/fsod_rpn.py", line 490, in forward
    proposals = self.predict_proposals(
  File "/home/sjk/anaconda3/envs/chpy/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/home/1Tm2/CH/Meta-Faster-R-CNN/meta_faster_rcnn/modeling/fsod/fsod_rpn.py", line 523, in predict_proposals
    return find_top_rpn_proposals(
  File "/home/sjk/anaconda3/envs/chpy/lib/python3.8/site-packages/detectron2/modeling/proposal_generator/proposal_utils.py", line 103, in find_top_rpn_proposals
    raise FloatingPointError(
FloatingPointError: Predicted boxes or scores contain Inf/NaN. Training has diverged.
May I know your model hyper-parameter configs, e.g., the batch size? Did you change the default values?
Due to my limited GPU memory, I had to change the batch size to 4 and the learning rate to 0.0005. Can you please tell me how I should adjust the settings?
############################################
meta_training_coco_resnet101_stage_1.yaml:
_BASE_: "Base-FSOD-C4.yaml"
MODEL:
  WEIGHTS: "/home/1Tm2/CH/Meta-Faster-R-CNN/R-101.pkl"
  MASK_ON: False
  RESNETS:
    DEPTH: 101
  BACKBONE:
    FREEZE_AT: 2
  ROI_HEADS:
    SCORE_THRESH_TEST: 0.0
  RPN:
    PRE_NMS_TOPK_TEST: 12000
    POST_NMS_TOPK_TEST: 100
  FEWX_BASELINE: True
  WITH_ALIGNMENT: False
OUTPUT_DIR: './output/fsod/meta_training_coco_resnet101_stage_1'
DATASETS:
  TRAIN: ("coco_2014_train_nonvoc",)
  TEST: ("coco_2014_val",)
  TEST_SHOTS: (1,2,3,5,10,30)
INPUT:
  FS:
    SUPPORT_WAY: 2
    SUPPORT_SHOT: 30
  MIN_SIZE_TRAIN: (640, 672, 704, 736, 768, 800)
  MAX_SIZE_TRAIN: 1333
  MIN_SIZE_TEST: 600
  MAX_SIZE_TEST: 1000
SOLVER:
  IMS_PER_BATCH: 4 #8
  BASE_LR: 0.0005 #0.001
  STEPS: (30000, 40000)
  MAX_ITER: 40001
  WARMUP_ITERS: 1000
  WARMUP_FACTOR: 0.1
  CHECKPOINT_PERIOD: 10000
  HEAD_LR_FACTOR: 2.0
##########################################
meta_training_coco_resnet101_stage_2.yaml:
_BASE_: "Base-FSOD-C4.yaml"
MODEL:
  WEIGHTS: "./output/fsod/meta_training_coco_resnet101_stage_1/model_final.pth"
  MASK_ON: False
  RESNETS:
    DEPTH: 101
  BACKBONE:
    FREEZE_AT: 2
  ROI_HEADS:
    SCORE_THRESH_TEST: 0.0
  RPN:
    PRE_NMS_TOPK_TEST: 12000
    POST_NMS_TOPK_TEST: 100
  FEWX_BASELINE: False
  WITH_ALIGNMENT: False
OUTPUT_DIR: './output/fsod/meta_training_coco_resnet101_stage_2'
DATASETS:
  TRAIN: ("coco_2014_train_nonvoc",)
  TEST: ("coco_2014_val",)
  TEST_SHOTS: (1,2,3,5,10,30)
INPUT:
  FS:
    SUPPORT_WAY: 2
    SUPPORT_SHOT: 30
  MIN_SIZE_TRAIN: (640, 672, 704, 736, 768, 800)
  MAX_SIZE_TRAIN: 1333
  MIN_SIZE_TEST: 600
  MAX_SIZE_TEST: 1000
SOLVER:
  IMS_PER_BATCH: 4 #8
  BASE_LR: 0.001
  STEPS: (15000, 20000)
  MAX_ITER: 20001
  WARMUP_ITERS: 500
  WARMUP_FACTOR: 0.1
  CHECKPOINT_PERIOD: 20001
  HEAD_LR_FACTOR: 2.0
TEST:
  EVAL_PERIOD: 10000
Unfortunately, our model works best with a batch size >= 8 in the second step; a smaller batch size leads to unstable training. You can try decreasing BASE_LR, increasing WARMUP_ITERS, decreasing SUPPORT_SHOT, or other remedies for the small batch size, but the detection accuracy may not be guaranteed.
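For reference, here is a minimal, untested sketch of how those suggestions could be expressed in the stage-2 config when IMS_PER_BATCH is 4. The specific numbers are only illustrative assumptions, and SOLVER.CLIP_GRADIENTS is detectron2's standard gradient-clipping option (it only takes effect if the optimizer is built through detectron2's build_optimizer), not something specific to this repo:

```yaml
# Sketch only: illustrative values, not validated settings for this repo.
SOLVER:
  IMS_PER_BATCH: 4
  BASE_LR: 0.0005        # roughly halve the LR when halving the batch size
  WARMUP_ITERS: 2000     # longer warmup than the stage-2 default of 500
  WARMUP_FACTOR: 0.1
  CLIP_GRADIENTS:        # standard detectron2 keys; may catch spikes before NaN
    ENABLED: True
    CLIP_TYPE: "value"
    CLIP_VALUE: 1.0
INPUT:
  FS:
    SUPPORT_WAY: 2
    SUPPORT_SHOT: 10     # fewer support shots also lowers GPU memory use
```

If the loss still spikes around the same iteration, clipping by norm (CLIP_TYPE: "norm") or lowering BASE_LR further are other things to try.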
File "/home/sjk/anaconda3/envs/chpy/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context return func(*args, **kwargs) File "/home/1Tm2/CH/Meta-Faster-R-CNN/meta_faster_rcnn/modeling/fsod/fsod_rpn.py", line 523, in predict_proposals return find_top_rpn_proposals( File "/home/sjk/anaconda3/envs/chpy/lib/python3.8/site-packages/detectron2/modeling/proposal_generator/proposal_utils.py", line 103, in find_top_rpn_proposals raise FloatingPointError( FloatingPointError: Predicted boxes or scores contain Inf/NaN. Training has diverged.