MrCrightH commented 2 years ago

    proposals = self.predict_proposals(
  File "/home/sjk/anaconda3/envs/chpy/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/home/1Tm2/CH/Meta-Faster-R-CNN/meta_faster_rcnn/modeling/fsod/fsod_rpn.py", line 523, in predict_proposals
    return find_top_rpn_proposals(
  File "/home/sjk/anaconda3/envs/chpy/lib/python3.8/site-packages/detectron2/modeling/proposal_generator/proposal_utils.py", line 103, in find_top_rpn_proposals
    raise FloatingPointError(
FloatingPointError: Predicted boxes or scores contain Inf/NaN. Training has diverged.

GuangxingHan commented 2 years ago

Thanks for the feedback. Can you clarify which script you are using and which step you are at?

MrCrightH commented 2 years ago

meta_training_coco_resnet101_stage_2.yaml. When I run this step, training gets through only a few steps before failing with the error "Predicted boxes or scores contain Inf/NaN".

GuangxingHan commented 2 years ago

Are you trying to reproduce our experiments on the COCO dataset? This is weird. Can you show me the full training log so I can better understand what happened?

MrCrightH commented 2 years ago

Yes! I have finished the meta_training_coco stage_1 of the meta_training_coco_multi... , but when I run meta_training_coco_resnet101_stage_2, it shows the following:

[10/17 15:21:08] d2.data.datasets.coco INFO: Loading datasets/coco/new_annotations/final_split_non_voc_instances_train2014.json takes 3.08 seconds.
[10/17 15:21:09] d2.data.datasets.coco INFO: Loaded 117264 images in COCO format from datasets/coco/new_annotations/final_split_non_voc_instances_train2014.json
[10/17 15:21:11] d2.data.build INFO: Removed 54575 images with no usable annotations. 62689 images left.
[10/17 15:21:17] d2.data.build INFO: Removed 0 images with no usable annotations. 94195 images left.
[10/17 15:21:21] d2.data.build INFO: Distribution of instances among all 80 categories:
category	#instances	category	#instances	category	#instances
person	0	bicycle	0	car	0
motorcycle	0	airplane	0	bus	0
train	0	truck	4479	boat	0
traffic light	1522	fire hydrant	1172	stop sign	1119
parking meter	514	bench	3997	bird	0
cat	0	dog	0	horse	0
sheep	0	cow	0	elephant	3983
bear	1252	zebra	4064	giraffe	4530
backpack	2628	umbrella	3678	handbag	3022
tie	2625	suitcase	3054	frisbee	1352
skis	2567	snowboard	1282	sports ball	846
kite	1346	baseball bat	1015	baseball gl..	759
skateboard	2906	surfboard	3389	tennis racket	1713
bottle	0	wine glass	2068	cup	6238
fork	2498	knife	2765	spoon	2040
bowl	5265	banana	3755	apple	2271
sandwich	2962	orange	2850	broccoli	4319
carrot	2959	hot dog	1794	pizza	3848
donut	4166	cake	3448	chair	0
couch	0	potted plant	0	bed	2985
dining table	0	toilet	3477	tv	0
laptop	2307	mouse	927	remote	1492
keyboard	1362	cell phone	2625	microwave	654
oven	1332	toaster	71	sink	2869
refrigerator	1283	book	3851	clock	2682
vase	2874	scissors	953	teddy bear	3334
hair drier	116	toothbrush	776
total	148030

[10/17 15:21:21] d2.data.common INFO: Serializing 94195 elements to byte tensors and concatenating them all ...
[10/17 15:21:22] d2.data.common INFO: Serialized dataset takes 50.18 MiB
[10/17 15:21:22] meta_faster_rcnn.data.build INFO: Using training sampler TrainingSampler
[10/17 15:21:22] fvcore.common.checkpoint INFO: [Checkpointer] Loading from ./output/fsod/meta_training_coco_resnet101_stage_1/model_final.pth ...
[10/17 15:21:23] fvcore.common.checkpoint WARNING: Some model parameters or buffers are not found in the checkpoint:
proposal_generator.rpn_head.anchor_deltas_cat.{bias, weight}
proposal_generator.rpn_head.anchor_deltas_diff.{bias, weight}
proposal_generator.rpn_head.cat_fc.0.{bias, weight}
proposal_generator.rpn_head.diff_fc.0.{bias, weight}
proposal_generator.rpn_head.objectness_logits_cat.{bias, weight}
proposal_generator.rpn_head.objectness_logits_diff.{bias, weight}
roi_heads.box_predictor.bbox_pred_cor.{bias, weight}
roi_heads.box_predictor.bbox_pred_fc.{bias, weight}
roi_heads.box_predictor.bbox_pred_gd.{bias, weight}
roi_heads.box_predictor.cls_score_gd.{bias, weight}
roi_heads.box_predictor.conv_1_gd.weight
roi_heads.box_predictor.conv_2_gd.weight
roi_heads.box_predictor.conv_3_gd.weight
roi_heads.box_predictor.norm.{bias, weight}
[10/17 15:21:23] d2.engine.train_loop INFO: Starting training from iteration 0
[10/17 15:22:04] d2.utils.events INFO: eta: 10:42:19 iter: 19 total_loss: 2.142 loss_cls: 1.69 loss_box_reg: 0.3216 loss_rpn_cls: 0.02834 loss_rpn_loc: 0.01064 time: 1.9260 data_time: 0.1603 lr: 0.0001342 max_mem: 19052M
[10/17 15:22:43] d2.utils.events INFO: eta: 10:42:48 iter: 39 total_loss: 1.361 loss_cls: 1.111 loss_box_reg: 0.261 loss_rpn_cls: 0.02525 loss_rpn_loc: 0.009109 time: 1.9303 data_time: 0.0777 lr: 0.0001702 max_mem: 19055M
[10/17 15:23:22] d2.utils.events INFO: eta: 10:44:28 iter: 59 total_loss: 1.251 loss_cls: 0.8241 loss_box_reg: 0.3086 loss_rpn_cls: 0.02666 loss_rpn_loc: 0.01125 time: 1.9439 data_time: 0.0875 lr: 0.0002062 max_mem: 19055M
[10/17 15:24:02] d2.utils.events INFO: eta: 10:46:26 iter: 79 total_loss: 1.308 loss_cls: 0.8696 loss_box_reg: 0.2948 loss_rpn_cls: 0.03027 loss_rpn_loc: 0.01091 time: 1.9531 data_time: 0.0828 lr: 0.0002422 max_mem: 19056M
[10/17 15:24:42] d2.utils.events INFO: eta: 10:50:33 iter: 99 total_loss: 0.9825 loss_cls: 0.605 loss_box_reg: 0.3051 loss_rpn_cls: 0.02744 loss_rpn_loc: 0.009897 time: 1.9619 data_time: 0.0792 lr: 0.0002782 max_mem: 19056M
[10/17 15:25:22] d2.utils.events INFO: eta: 10:51:56 iter: 119 total_loss: 1.068 loss_cls: 0.637 loss_box_reg: 0.3279 loss_rpn_cls: 0.02772 loss_rpn_loc: 0.008074 time: 1.9676 data_time: 0.0862 lr: 0.0003142 max_mem: 19056M
[10/17 15:26:01] d2.utils.events INFO: eta: 10:52:43 iter: 139 total_loss: 0.9612 loss_cls: 0.5965 loss_box_reg: 0.3174 loss_rpn_cls: 0.02713 loss_rpn_loc: 0.009668 time: 1.9712 data_time: 0.0804 lr: 0.0003502 max_mem: 19056M
[10/17 15:26:41] d2.utils.events INFO: eta: 10:52:03 iter: 159 total_loss: 0.9253 loss_cls: 0.5459 loss_box_reg: 0.33 loss_rpn_cls: 0.02211 loss_rpn_loc: 0.009601 time: 1.9721 data_time: 0.0765 lr: 0.0003862 max_mem: 19056M
[10/17 15:27:21] d2.utils.events INFO: eta: 10:52:04 iter: 179 total_loss: 0.8336 loss_cls: 0.4791 loss_box_reg: 0.3217 loss_rpn_cls: 0.02815 loss_rpn_loc: 0.01103 time: 1.9759 data_time: 0.0812 lr: 0.0004222 max_mem: 19056M
[10/17 15:28:01] d2.utils.events INFO: eta: 10:51:21 iter: 199 total_loss: 0.9207 loss_cls: 0.5247 loss_box_reg: 0.3523 loss_rpn_cls: 0.02278 loss_rpn_loc: 0.008193 time: 1.9766 data_time: 0.0805 lr: 0.0004582 max_mem: 19056M
[10/17 15:28:41] d2.utils.events INFO: eta: 10:51:00 iter: 219 total_loss: 0.8944 loss_cls: 0.501 loss_box_reg: 0.274 loss_rpn_cls: 0.02453 loss_rpn_loc: 0.01002 time: 1.9773 data_time: 0.0839 lr: 0.0004942 max_mem: 19058M
[10/17 15:29:20] d2.utils.events INFO: eta: 10:50:59 iter: 239 total_loss: 0.953 loss_cls: 0.646 loss_box_reg: 0.3021 loss_rpn_cls: 0.02511 loss_rpn_loc: 0.00783 time: 1.9780 data_time: 0.0844 lr: 0.0005302 max_mem: 19058M
[10/17 15:30:00] d2.utils.events INFO: eta: 10:51:08 iter: 259 total_loss: 0.9736 loss_cls: 0.6806 loss_box_reg: 0.2666 loss_rpn_cls: 0.02316 loss_rpn_loc: 0.00909 time: 1.9785 data_time: 0.0843 lr: 0.0005662 max_mem: 19058M
[10/17 15:30:40] d2.utils.events INFO: eta: 10:50:52 iter: 279 total_loss: 1.277 loss_cls: 1.035 loss_box_reg: 0.2421 loss_rpn_cls: 0.04444 loss_rpn_loc: 0.007862 time: 1.9794 data_time: 0.0851 lr: 0.0006022 max_mem: 19058M
[10/17 15:31:20] d2.utils.events INFO: eta: 10:50:21 iter: 299 total_loss: 1.427 loss_cls: 1.128 loss_box_reg: 0.2823 loss_rpn_cls: 0.02622 loss_rpn_loc: 0.008311 time: 1.9800 data_time: 0.0825 lr: 0.0006382 max_mem: 19058M
[10/17 15:31:59] d2.utils.events INFO: eta: 10:49:41 iter: 319 total_loss: 2.897 loss_cls: 2.585 loss_box_reg: 0.2703 loss_rpn_cls: 0.03668 loss_rpn_loc: 0.008904 time: 1.9800 data_time: 0.0794 lr: 0.0006742 max_mem: 19059M
[10/17 15:32:21] d2.engine.train_loop ERROR: Exception during training:
Traceback (most recent call last):
  File "/home/sjk/anaconda3/envs/chpy/lib/python3.8/site-packages/detectron2/engine/train_loop.py", line 149, in train
    self.run_step()
  File "/home/sjk/anaconda3/envs/chpy/lib/python3.8/site-packages/detectron2/engine/defaults.py", line 494, in run_step
    self._trainer.run_step()
  File "/home/sjk/anaconda3/envs/chpy/lib/python3.8/site-packages/detectron2/engine/train_loop.py", line 273, in run_step
    loss_dict = self.model(data)
  File "/home/sjk/anaconda3/envs/chpy/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/1Tm2/CH/Meta-Faster-R-CNN/meta_faster_rcnn/modeling/fsod/fsod_rcnn.py", line 206, in forward
    pos_proposals, pos_anchors, pos_pred_objectness_logits, pos_gt_labels, pos_pred_anchor_deltas, pos_gt_boxes = self.proposal_generator(query_images, pos_features, pos_support_features_pool, query_gt_instances) # attention rpn
  File "/home/sjk/anaconda3/envs/chpy/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/1Tm2/CH/Meta-Faster-R-CNN/meta_faster_rcnn/modeling/fsod/fsod_rpn.py", line 490, in forward
    proposals = self.predict_proposals(
  File "/home/sjk/anaconda3/envs/chpy/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/home/1Tm2/CH/Meta-Faster-R-CNN/meta_faster_rcnn/modeling/fsod/fsod_rpn.py", line 523, in predict_proposals
    return find_top_rpn_proposals(
  File "/home/sjk/anaconda3/envs/chpy/lib/python3.8/site-packages/detectron2/modeling/proposal_generator/proposal_utils.py", line 103, in find_top_rpn_proposals
    raise FloatingPointError(
FloatingPointError: Predicted boxes or scores contain Inf/NaN. Training has diverged.
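
In the log above, loss_cls climbs from about 0.5 to 2.585 over the last few readings before the exception. A minimal sketch of how such a log could be scanned for the first spike follows. The helper names `parse_metrics` and `first_spike`, the factor of 2 and the window of 5 readings are all illustrative assumptions, not part of detectron2 or this repository.

```python
import re

# Matches the "iter: N total_loss: X loss_cls: Y" part of a metrics line
# in the format shown in the log above.
METRIC_RE = re.compile(r"iter: (\d+)\s+total_loss: ([\d.]+)\s+loss_cls: ([\d.]+)")


def parse_metrics(log_text):
    """Return (iteration, total_loss, loss_cls) tuples found in the log text."""
    return [(int(i), float(t), float(c)) for i, t, c in METRIC_RE.findall(log_text)]


def first_spike(metrics, factor=2.0, window=5):
    """Return the first iteration whose loss_cls exceeds `factor` times the
    mean of the previous `window` readings, or None if there is none.
    Both thresholds are arbitrary choices for illustration."""
    for k in range(window, len(metrics)):
        recent = [m[2] for m in metrics[k - window:k]]
        baseline = sum(recent) / window
        if metrics[k][2] > factor * baseline:
            return metrics[k][0]
    return None


# Last seven readings from the log above, trimmed to the fields the regex uses.
sample = """
iter: 199 total_loss: 0.9207 loss_cls: 0.5247
iter: 219 total_loss: 0.8944 loss_cls: 0.501
iter: 239 total_loss: 0.953 loss_cls: 0.646
iter: 259 total_loss: 0.9736 loss_cls: 0.6806
iter: 279 total_loss: 1.277 loss_cls: 1.035
iter: 299 total_loss: 1.427 loss_cls: 1.128
iter: 319 total_loss: 2.897 loss_cls: 2.585
"""
metrics = parse_metrics(sample)
spike_iter = first_spike(metrics)
print(spike_iter)
```

With these readings the spike is flagged at iteration 319, the last reading before the FloatingPointError.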

GuangxingHan commented 2 years ago

May I know the model hyper-parameters configs, e.g., the batch size? Did you change the default value?

MrCrightH commented 2 years ago

Due to my limited GPU memory, I had to change the batch size to 4 and the learning rate to 0.0005. Can you please tell me how to adjust it?

############################################ meta_training_coco_resnet101_stage_1.yaml:
_BASE_: "Base-FSOD-C4.yaml"
MODEL:
  WEIGHTS: "/home/1Tm2/CH/Meta-Faster-R-CNN/R-101.pkl"
  MASK_ON: False
  RESNETS:
    DEPTH: 101
  BACKBONE:
    FREEZE_AT: 2
  ROI_HEADS:
    SCORE_THRESH_TEST: 0.0
  RPN:
    PRE_NMS_TOPK_TEST: 12000
    POST_NMS_TOPK_TEST: 100
  FEWX_BASELINE: True
  WITH_ALIGNMENT: False
OUTPUT_DIR: './output/fsod/meta_training_coco_resnet101_stage_1'
DATASETS:
  TRAIN: ("coco_2014_train_nonvoc",)
  TEST: ("coco_2014_val",)
  TEST_SHOTS: (1,2,3,5,10,30)
INPUT:
  FS:
    SUPPORT_WAY: 2
    SUPPORT_SHOT: 30
  MIN_SIZE_TRAIN: (640, 672, 704, 736, 768, 800)
  MAX_SIZE_TRAIN: 1333
  MIN_SIZE_TEST: 600
  MAX_SIZE_TEST: 1000
SOLVER:
  IMS_PER_BATCH: 4 #8
  BASE_LR: 0.0005 #0.001
  STEPS: (30000, 40000)
  MAX_ITER: 40001
  WARMUP_ITERS: 1000
  WARMUP_FACTOR: 0.1
  CHECKPOINT_PERIOD: 10000
  HEAD_LR_FACTOR: 2.0
TEST:
  EVAL_PERIOD: 40000

########################################## meta_training_coco_resnet101_stage_2.yaml:
_BASE_: "Base-FSOD-C4.yaml"
MODEL:
  WEIGHTS: "./output/fsod/meta_training_coco_resnet101_stage_1/model_final.pth"
  MASK_ON: False
  RESNETS:
    DEPTH: 101
  BACKBONE:
    FREEZE_AT: 2
  ROI_HEADS:
    SCORE_THRESH_TEST: 0.0
  RPN:
    PRE_NMS_TOPK_TEST: 12000
    POST_NMS_TOPK_TEST: 100
  FEWX_BASELINE: False
  WITH_ALIGNMENT: False
OUTPUT_DIR: './output/fsod/meta_training_coco_resnet101_stage_2'
DATASETS:
  TRAIN: ("coco_2014_train_nonvoc",)
  TEST: ("coco_2014_val",)
  TEST_SHOTS: (1,2,3,5,10,30)
INPUT:
  FS:
    SUPPORT_WAY: 2
    SUPPORT_SHOT: 30
  MIN_SIZE_TRAIN: (640, 672, 704, 736, 768, 800)
  MAX_SIZE_TRAIN: 1333
  MIN_SIZE_TEST: 600
  MAX_SIZE_TEST: 1000
SOLVER:
  IMS_PER_BATCH: 4 #8
  BASE_LR: 0.001
  STEPS: (15000, 20000)
  MAX_ITER: 20001
  WARMUP_ITERS: 500
  WARMUP_FACTOR: 0.1
  CHECKPOINT_PERIOD: 20001
  HEAD_LR_FACTOR: 2.0
TEST:
  EVAL_PERIOD: 10000
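
The stage 2 SOLVER settings can be cross-checked against the lr column of the training log. The sketch below assumes detectron2's default linear warm-up, since the config does not set a warm-up method, and it ignores the later step decay at STEPS. The function name `warmup_lr` is made up for illustration.

```python
def warmup_lr(iteration, base_lr, warmup_iters, warmup_factor):
    """Learning rate under linear warm-up: the multiplier ramps from
    warmup_factor at iteration 0 up to 1.0 at warmup_iters.
    Step decay after warm-up is deliberately left out of this sketch."""
    if iteration >= warmup_iters:
        return base_lr
    alpha = iteration / warmup_iters
    return base_lr * (warmup_factor * (1 - alpha) + alpha)


# Stage 2 values from the config: BASE_LR 0.001, WARMUP_ITERS 500, WARMUP_FACTOR 0.1
lr_at_19 = warmup_lr(19, 0.001, 500, 0.1)
lr_at_319 = warmup_lr(319, 0.001, 500, 0.1)
print(f"{lr_at_19:.7f} {lr_at_319:.7f}")
```

Under these assumptions the two values agree with the logged lr at iterations 19 and 319 (0.0001342 and 0.0006742), which suggests the run used BASE_LR 0.001 with a batch size of 4 and was still in warm-up when it diverged.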

GuangxingHan commented 2 years ago

Unfortunately, our model works best with batch_size >= 8 in the second step. Using a smaller batch_size leads to unstable training. You can try decreasing BASE_LR, increasing WARMUP_ITERS, decreasing SUPPORT_SHOT, or other ways of compensating for the small batch_size, but the detection accuracy is not guaranteed.
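
One common starting point for the BASE_LR and WARMUP_ITERS adjustments is the linear scaling heuristic: scale the learning rate with the batch size and stretch the schedule by the inverse ratio. This is a general rule of thumb rather than something the authors prescribe, and it does not guarantee stable training here. The sketch below treats batch size 8 with the stage 2 values shown in the config as the reference, which is an assumption based on the commented-out "#8" in the config. `scale_solver` is a hypothetical helper.

```python
def scale_solver(ref_batch, new_batch, base_lr, warmup_iters, steps, max_iter):
    """Apply the linear scaling heuristic to a set of SOLVER values.
    The learning rate scales with batch size; iteration counts scale inversely."""
    ratio = new_batch / ref_batch
    stretch = ref_batch / new_batch
    return {
        "IMS_PER_BATCH": new_batch,
        "BASE_LR": base_lr * ratio,
        "WARMUP_ITERS": round(warmup_iters * stretch),
        "STEPS": tuple(round(s * stretch) for s in steps),
        "MAX_ITER": round(max_iter * stretch),
    }


# Reference: stage 2 values from the config, assumed to be tuned for batch size 8.
scaled = scale_solver(8, 4, 0.001, 500, (15000, 20000), 20001)
for key, value in scaled.items():
    print(f"{key}: {value}")
```

For a batch size of 4 this gives BASE_LR 0.0005 and WARMUP_ITERS 1000 with a schedule twice as long. If training still diverges, lowering BASE_LR further or reducing SUPPORT_SHOT, as suggested above, would be the next things to try.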

GuangxingHan / Meta-Faster-R-CNN

NaN: Dear author, thanks for your great work. I am currently trying to run your code, but it always reports a NaN error. The error traceback follows; could you have a look? Thanks in advance! #15
