KaihuaTang / Scene-Graph-Benchmark.pytorch

A new codebase for popular Scene Graph Generation methods (2020). Visualization & Scene Graph Extraction on custom images/datasets are provided. It's also a PyTorch implementation of paper “Unbiased Scene Graph Generation from Biased Training CVPR 2020”
MIT License

Fail to reproduce the reported performance of VCTree model on the SGCLS task #136

Open wishforgood opened 3 years ago

wishforgood commented 3 years ago

❓ Questions and Help

What I have done: I set MODEL.ROI_RELATION_HEAD.PREDICTOR to VCTreePredictor and set CONTEXT_HIDDEN_DIM in configs/e2e_relation_X_101_32_8_FPN_1x.yaml to 1024. Following Training Example 1, I ran the following command:

python tools/relation_train_net.py --config-file "configs/e2e_relation_X_101_32_8_FPN_1x.yaml" MODEL.ROI_RELATION_HEAD.USE_GT_BOX True MODEL.ROI_RELATION_HEAD.USE_GT_OBJECT_LABEL False MODEL.ROI_RELATION_HEAD.PREDICTOR VCTreePredictor SOLVER.IMS_PER_BATCH 12 TEST.IMS_PER_BATCH 1 DTYPE "float16" SOLVER.MAX_ITER 50000 SOLVER.VAL_PERIOD 2000 SOLVER.CHECKPOINT_PERIOD 2000 GLOVE_DIR data/ MODEL.PRETRAINED_DETECTOR_CKPT checkpoints/pretrained_faster_rcnn/model_final.pth OUTPUT_DIR checkpoints/vctree-sgcls-exmp

Note that I only have one GPU, so I removed the multi-GPU options and used a batch size of 12. Also, I am actually saving the checkpoints in a different directory (not under Scene-Graph-Benchmark.pytorch), but I assume that is fine.

What problem I have got: there are gradient overflow reports like:

Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 32768.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 16384.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 8192.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 4096.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2048.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1024.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 512.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 256.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 128.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 64.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 32.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 16.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 8.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 4.0
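For context on what this log pattern means: with DTYPE "float16" the repo trains through apex amp, whose dynamic loss scaler skips the optimizer step and halves the scale on every overflow. A minimal pure-Python sketch of that halving behavior (the scale re-growth side of apex is omitted, and the starting value of 65536 is an assumption inferred from the first logged reduction to 32768):

```python
def update_loss_scale(scale, overflowed):
    """Sketch of dynamic loss scaling: on an overflow, the optimizer
    step is skipped and the scale is halved (apex also re-grows the
    scale after a run of clean steps, which is omitted here)."""
    return scale / 2.0 if overflowed else scale

scale = 65536.0  # assumed starting scale; the first logged value is 32768.0
history = []
for _ in range(14):  # 14 consecutive overflows, as in the log above
    scale = update_loss_scale(scale, overflowed=True)
    history.append(scale)
# history runs 32768.0, 16384.0, ..., 4.0, matching the log
```

A few overflows at the start of fp16 training are normal; the worrying sign here is that the scale keeps collapsing all the way down to 4.0, which suggests the overflows never stop.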

Besides, maskrcnn_benchmark reports mAPs near 0 in the first evaluation step of the detector (I wonder if I am actually training it from scratch?). After several iterations of training, the logged values for some of the parameters remain 0:

2021-06-28 08:28:02,984 maskrcnn_benchmark INFO: roi_heads.relation.union_feature_extractor.feature_extractor.pooler.reduce_channel.0.bias: 0.00042, (torch.Size([256]))
2021-06-28 08:28:02,985 maskrcnn_benchmark INFO: roi_heads.relation.predictor.context_layer.pos_embed.0.bias: 0.00001, (torch.Size([32]))
2021-06-28 08:28:02,985 maskrcnn_benchmark INFO: roi_heads.relation.predictor.context_layer.overlap_embed.0.weight: 0.00000, (torch.Size([128, 6]))
2021-06-28 08:28:02,985 maskrcnn_benchmark INFO: roi_heads.relation.predictor.context_layer.overlap_embed.0.bias: 0.00000, (torch.Size([128]))
2021-06-28 08:28:02,985 maskrcnn_benchmark INFO: roi_heads.relation.predictor.context_layer.overlap_embed.1.weight: 0.00000, (torch.Size([128]))
2021-06-28 08:28:02,985 maskrcnn_benchmark INFO: roi_heads.relation.predictor.context_layer.overlap_embed.1.bias: 0.00000, (torch.Size([128]))
2021-06-28 08:28:02,985 maskrcnn_benchmark INFO: roi_heads.relation.predictor.context_layer.box_embed.0.weight: 0.00000, (torch.Size([128, 9]))
2021-06-28 08:28:02,985 maskrcnn_benchmark INFO: roi_heads.relation.predictor.context_layer.box_embed.0.bias: 0.00000, (torch.Size([128]))
2021-06-28 08:28:02,986 maskrcnn_benchmark INFO: roi_heads.relation.predictor.context_layer.box_embed.1.weight: 0.00000, (torch.Size([128]))
2021-06-28 08:28:02,986 maskrcnn_benchmark INFO: roi_heads.relation.predictor.context_layer.box_embed.1.bias: 0.00000, (torch.Size([128]))
2021-06-28 08:28:02,986 maskrcnn_benchmark INFO: roi_heads.relation.predictor.context_layer.obj_reduce.weight: 0.00000, (torch.Size([128, 4096]))
2021-06-28 08:28:02,986 maskrcnn_benchmark INFO: roi_heads.relation.predictor.context_layer.obj_reduce.bias: 0.00000, (torch.Size([128]))
2021-06-28 08:28:02,986 maskrcnn_benchmark INFO: roi_heads.relation.predictor.context_layer.emb_reduce.weight: 0.00000, (torch.Size([128, 200]))
2021-06-28 08:28:02,986 maskrcnn_benchmark INFO: roi_heads.relation.predictor.context_layer.emb_reduce.bias: 0.00000, (torch.Size([128]))
2021-06-28 08:28:02,986 maskrcnn_benchmark INFO: roi_heads.relation.predictor.context_layer.score_pre.weight: 0.00000, (torch.Size([1024, 512]))
2021-06-28 08:28:02,986 maskrcnn_benchmark INFO: roi_heads.relation.predictor.context_layer.score_pre.bias: 0.00000, (torch.Size([1024]))
2021-06-28 08:28:02,987 maskrcnn_benchmark INFO: roi_heads.relation.predictor.context_layer.score_sub.weight: 0.00000, (torch.Size([1024, 1024]))
2021-06-28 08:28:02,987 maskrcnn_benchmark INFO: roi_heads.relation.predictor.context_layer.score_sub.bias: 0.00000, (torch.Size([1024]))
2021-06-28 08:28:02,987 maskrcnn_benchmark INFO: roi_heads.relation.predictor.context_layer.score_obj.weight: 0.00000, (torch.Size([1024, 1024]))
2021-06-28 08:28:02,987 maskrcnn_benchmark INFO: roi_heads.relation.predictor.context_layer.score_obj.bias: 0.00000, (torch.Size([1024]))
2021-06-28 08:28:02,987 maskrcnn_benchmark INFO: roi_heads.relation.predictor.context_layer.vision_prior.weight: 0.00000, (torch.Size([1, 3073]))
2021-06-28 08:28:02,987 maskrcnn_benchmark INFO: roi_heads.relation.predictor.context_layer.vision_prior.bias: 0.00000, (torch.Size([1]))
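For anyone triaging the same symptom, a quick way to list which parameters are stuck is to scan the log for entries whose reported norm is exactly zero. A minimal sketch (the regex assumes the maskrcnn_benchmark log format shown above; the helper name is my own):

```python
import re

# Matches lines like:
# "... INFO: roi_heads.relation.predictor...weight: 0.00000, (torch.Size([128, 6]))"
LOG_RE = re.compile(r"INFO: (?P<name>[\w.]+): (?P<norm>[\d.]+),")

def zero_grad_params(log_lines):
    """Return the parameter names whose reported norm is exactly 0."""
    zeros = []
    for line in log_lines:
        m = LOG_RE.search(line)
        if m and float(m.group("norm")) == 0.0:
            zeros.append(m.group("name"))
    return zeros

lines = [
    "... INFO: roi_heads.relation.predictor.context_layer.pos_embed.0.bias: 0.00001, (torch.Size([32]))",
    "... INFO: roi_heads.relation.predictor.context_layer.overlap_embed.0.weight: 0.00000, (torch.Size([128, 6]))",
]
zero_grad_params(lines)  # flags only the overlap_embed weight
```

Run over the full log, this would show whether the zeros are confined to the VCTree context layer or spread across the whole relation head.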

Despite these possible problems, I kept training the model, but after 50000 iterations I only got a test result of around R@100 = 37.68 on the test set. I have noticed that all the provided examples are about the Motifs-based model; could you please show the correct procedure for reproducing the reported performance of the VCTree model (42.77 | 46.67 | 47.64)?