KaihuaTang / Scene-Graph-Benchmark.pytorch

A new codebase for popular Scene Graph Generation methods (2020). Visualization & Scene Graph Extraction on custom images/datasets are provided. It's also a PyTorch implementation of paper “Unbiased Scene Graph Generation from Biased Training CVPR 2020”
MIT License

Fail to reproduce the reported performance of VCTree model on the SGCLS task #136

Open wishforgood opened 3 years ago

wishforgood commented 3 years ago

❓ Questions and Help

What I have done: I set MODEL.ROI_RELATION_HEAD.PREDICTOR to VCTreePredictor and set CONTEXT_HIDDEN_DIM in configs/e2e_relation_X_101_32_8_FPN_1x.yaml to 1024. Following Training Example 1, I ran the following command:

python tools/relation_train_net.py --config-file "configs/e2e_relation_X_101_32_8_FPN_1x.yaml" MODEL.ROI_RELATION_HEAD.USE_GT_BOX True MODEL.ROI_RELATION_HEAD.USE_GT_OBJECT_LABEL False MODEL.ROI_RELATION_HEAD.PREDICTOR VCTreePredictor SOLVER.IMS_PER_BATCH 12 TEST.IMS_PER_BATCH 1 DTYPE "float16" SOLVER.MAX_ITER 50000 SOLVER.VAL_PERIOD 2000 SOLVER.CHECKPOINT_PERIOD 2000 GLOVE_DIR data/ MODEL.PRETRAINED_DETECTOR_CKPT checkpoints/pretrained_faster_rcnn/model_final.pth OUTPUT_DIR checkpoints/vctree-sgcls-exmp

Note that I only have one GPU, so I removed the multi-GPU options and used a batch size of 12. Also, I am actually saving the checkpoints in a different directory (not under Scene-Graph-Benchmark.pytorch), but I assume that is fine.

What problem I have got: there are gradient overflow reports like:

Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 32768.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 16384.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 8192.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 4096.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2048.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1024.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 512.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 256.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 128.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 64.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 32.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 16.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 8.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 4.0
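For context on what this log pattern means: with DTYPE "float16" the repo trains through apex amp, whose dynamic loss scaler skips the optimizer step and halves the scale on every overflow. A minimal pure-Python sketch of that halving behavior (the scale re-growth side of apex is omitted, and the starting value of 65536 is an assumption inferred from the first logged reduction to 32768):

```python
def update_loss_scale(scale, overflowed):
    """Sketch of dynamic loss scaling: on an overflow, the optimizer
    step is skipped and the scale is halved (apex also re-grows the
    scale after a run of clean steps, which is omitted here)."""
    return scale / 2.0 if overflowed else scale

scale = 65536.0  # assumed starting scale; the first logged value is 32768.0
history = []
for _ in range(14):  # 14 consecutive overflows, as in the log above
    scale = update_loss_scale(scale, overflowed=True)
    history.append(scale)
# history runs 32768.0, 16384.0, ..., 4.0, matching the log
```

A few overflows at the start of fp16 training are normal; the worrying sign here is that the scale keeps collapsing all the way down to 4.0, which suggests the overflows never stop.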

Besides, maskrcnn_benchmark reports mAPs near 0 in the first evaluation step of the detector (I wonder if I am actually training it from scratch?). After several iterations of training, the logged values for some of the parameters remain 0:

2021-06-28 08:28:02,984 maskrcnn_benchmark INFO: roi_heads.relation.union_feature_extractor.feature_extractor.pooler.reduce_channel.0.bias: 0.00042, (torch.Size([256]))
2021-06-28 08:28:02,985 maskrcnn_benchmark INFO: roi_heads.relation.predictor.context_layer.pos_embed.0.bias: 0.00001, (torch.Size([32]))
2021-06-28 08:28:02,985 maskrcnn_benchmark INFO: roi_heads.relation.predictor.context_layer.overlap_embed.0.weight: 0.00000, (torch.Size([128, 6]))
2021-06-28 08:28:02,985 maskrcnn_benchmark INFO: roi_heads.relation.predictor.context_layer.overlap_embed.0.bias: 0.00000, (torch.Size([128]))
2021-06-28 08:28:02,985 maskrcnn_benchmark INFO: roi_heads.relation.predictor.context_layer.overlap_embed.1.weight: 0.00000, (torch.Size([128]))
2021-06-28 08:28:02,985 maskrcnn_benchmark INFO: roi_heads.relation.predictor.context_layer.overlap_embed.1.bias: 0.00000, (torch.Size([128]))
2021-06-28 08:28:02,985 maskrcnn_benchmark INFO: roi_heads.relation.predictor.context_layer.box_embed.0.weight: 0.00000, (torch.Size([128, 9]))
2021-06-28 08:28:02,985 maskrcnn_benchmark INFO: roi_heads.relation.predictor.context_layer.box_embed.0.bias: 0.00000, (torch.Size([128]))
2021-06-28 08:28:02,986 maskrcnn_benchmark INFO: roi_heads.relation.predictor.context_layer.box_embed.1.weight: 0.00000, (torch.Size([128]))
2021-06-28 08:28:02,986 maskrcnn_benchmark INFO: roi_heads.relation.predictor.context_layer.box_embed.1.bias: 0.00000, (torch.Size([128]))
2021-06-28 08:28:02,986 maskrcnn_benchmark INFO: roi_heads.relation.predictor.context_layer.obj_reduce.weight: 0.00000, (torch.Size([128, 4096]))
2021-06-28 08:28:02,986 maskrcnn_benchmark INFO: roi_heads.relation.predictor.context_layer.obj_reduce.bias: 0.00000, (torch.Size([128]))
2021-06-28 08:28:02,986 maskrcnn_benchmark INFO: roi_heads.relation.predictor.context_layer.emb_reduce.weight: 0.00000, (torch.Size([128, 200]))
2021-06-28 08:28:02,986 maskrcnn_benchmark INFO: roi_heads.relation.predictor.context_layer.emb_reduce.bias: 0.00000, (torch.Size([128]))
2021-06-28 08:28:02,986 maskrcnn_benchmark INFO: roi_heads.relation.predictor.context_layer.score_pre.weight: 0.00000, (torch.Size([1024, 512]))
2021-06-28 08:28:02,986 maskrcnn_benchmark INFO: roi_heads.relation.predictor.context_layer.score_pre.bias: 0.00000, (torch.Size([1024]))
2021-06-28 08:28:02,987 maskrcnn_benchmark INFO: roi_heads.relation.predictor.context_layer.score_sub.weight: 0.00000, (torch.Size([1024, 1024]))
2021-06-28 08:28:02,987 maskrcnn_benchmark INFO: roi_heads.relation.predictor.context_layer.score_sub.bias: 0.00000, (torch.Size([1024]))
2021-06-28 08:28:02,987 maskrcnn_benchmark INFO: roi_heads.relation.predictor.context_layer.score_obj.weight: 0.00000, (torch.Size([1024, 1024]))
2021-06-28 08:28:02,987 maskrcnn_benchmark INFO: roi_heads.relation.predictor.context_layer.score_obj.bias: 0.00000, (torch.Size([1024]))
2021-06-28 08:28:02,987 maskrcnn_benchmark INFO: roi_heads.relation.predictor.context_layer.vision_prior.weight: 0.00000, (torch.Size([1, 3073]))
2021-06-28 08:28:02,987 maskrcnn_benchmark INFO: roi_heads.relation.predictor.context_layer.vision_prior.bias: 0.00000, (torch.Size([1]))
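For anyone triaging the same symptom, a quick way to list which parameters are stuck is to scan the log for entries whose reported norm is exactly zero. A minimal sketch (the regex assumes the maskrcnn_benchmark log format shown above; the helper name is my own):

```python
import re

# Matches lines like:
# "... INFO: roi_heads.relation.predictor...weight: 0.00000, (torch.Size([128, 6]))"
LOG_RE = re.compile(r"INFO: (?P<name>[\w.]+): (?P<norm>[\d.]+),")

def zero_grad_params(log_lines):
    """Return the parameter names whose reported norm is exactly 0."""
    zeros = []
    for line in log_lines:
        m = LOG_RE.search(line)
        if m and float(m.group("norm")) == 0.0:
            zeros.append(m.group("name"))
    return zeros

lines = [
    "... INFO: roi_heads.relation.predictor.context_layer.pos_embed.0.bias: 0.00001, (torch.Size([32]))",
    "... INFO: roi_heads.relation.predictor.context_layer.overlap_embed.0.weight: 0.00000, (torch.Size([128, 6]))",
]
zero_grad_params(lines)  # flags only the overlap_embed weight
```

Run over the full log, this would show whether the zeros are confined to the VCTree context layer or spread across the whole relation head.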

Despite these possible problems, I kept training the model, but after 50000 iterations I only got a test result of around R@100 = 37.68 on the test set. I have noticed that all the provided examples are about the Motifs-based model; could you please show the correct procedure for reproducing the reported performance of the VCTree model (42.77 | 46.67 | 47.64)?