Runtime error during evaluation

stevehuanghe commented 4 years ago

Dear Ji,

I ran into this runtime error when trying to evaluate the model with the pretrained checkpoints:

python ./tools/test_net_rel.py --dataset vg --cfg configs/vg/e2e_faster_rcnn_VGG16_8_epochs_vg_v3_default_node_contrastive_loss_w_so_p_aware_margin_point2_so_weight_point5_no_spt.yaml --load_ckpt trained_models/vg_VGG16/model_step62722.pth --output_dir Outputs/vg_VGG16 --multi-gpu-testing --do_val

RuntimeError: Error(s) in loading state_dict for Generalized_RCNN: size mismatch for RelDN.prd_cls_feats.0.weight: copying a param of torch.Size([6144, 12288]) from checkpoint, where the shape is torch.Size([4096, 12288]) in current model. size mismatch for RelDN.prd_cls_feats.0.bias: copying a param of torch.Size([6144]) from checkpoint, where the shape is torch.Size([4096]) in current model.

Would you please help me with this issue? Thank you very much.
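The mismatch reported above can be listed programmatically before calling load_state_dict. The following is a minimal sketch, not code from the repository: find_shape_mismatches is a hypothetical helper that works on plain name-to-shape dictionaries, and the shapes are copied from the error message. It assumes the checkpoint and the model can each be reduced to such a dictionary.

```python
from typing import Dict, List, Tuple

Shape = Tuple[int, ...]


def find_shape_mismatches(
    ckpt_shapes: Dict[str, Shape], model_shapes: Dict[str, Shape]
) -> List[Tuple[str, Shape, Shape]]:
    """Return (name, checkpoint_shape, model_shape) for shared keys whose shapes differ."""
    mismatches = []
    for name in sorted(ckpt_shapes.keys() & model_shapes.keys()):
        if ckpt_shapes[name] != model_shapes[name]:
            mismatches.append((name, ckpt_shapes[name], model_shapes[name]))
    return mismatches


# Shapes copied from the error message above. With PyTorch, a dict like this
# can be built as {k: tuple(v.shape) for k, v in state_dict.items()}, assuming
# state_dict maps parameter names to tensors.
ckpt_shapes = {
    "RelDN.prd_cls_feats.0.weight": (6144, 12288),
    "RelDN.prd_cls_feats.0.bias": (6144,),
}
model_shapes = {
    "RelDN.prd_cls_feats.0.weight": (4096, 12288),
    "RelDN.prd_cls_feats.0.bias": (4096,),
}

for name, ckpt_shape, model_shape in find_shape_mismatches(ckpt_shapes, model_shapes):
    print(f"{name}: checkpoint {ckpt_shape} vs model {model_shape}")
```

If this layer is an nn.Linear, its weight is stored as (out_features, in_features), so the differing first dimension would mean the checkpoint was trained with a 6144-wide output for this layer while the current config builds a 4096-wide one.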

heygrandpa commented 4 years ago

I also faced the same problem. I tried changing the size of RelDN.prd_cls_feats.0.weight in lib\modeling_rel\reldn_heads.py from (6144, 12288) to (4096, 12288), but I can't get the same evaluation results as the paper. Did you find a solution for the issue?

simonJJJ commented 4 years ago

> I also faced the same problem. I tried changing the size of RelDN.prd_cls_feats.0.weight in lib\modeling_rel\reldn_heads.py from (6144, 12288) to (4096, 12288), but I can't get the same evaluation results as the paper. Did you find a solution for the issue?

Yes, I faced the same problem. I also changed (6144, 12288) to (4096, 12288), and my SGDET results are 16.01 for R@20, 23.32 for R@50 and 29.53 for R@100. That's far from the paper's results.

sandeep-ipk commented 4 years ago

@simonJJJ @heygrandpa @stevehuanghe @jz462 Did anyone find a solution to this, or is the fault in the pre-trained model itself?

jz462 commented 4 years ago

Hi everyone,

Sorry for the late reply. I've updated the link, which now contains a compatible VGG16 model that gives results on par with the paper. You can also download it here. Please let me know if it does not work or if you have further questions.

Ji

tfzhou commented 4 years ago

Hi Ji, thanks for the great work! The error still occurs with the updated models, but the ResNeXt model works well.

cao-nv commented 4 years ago

I evaluated the trained VGG16 model on the SGDET task on Visual Genome and got the following results:

R@20: 20.74
R@50: 29.36
R@100: 35.95

These results are somewhat different from the results in the paper. Did anyone get the same results?

cao-nv commented 4 years ago

Another problem: when I enable --multi-gpu-testing for inference, an error occurs: AssertionError: Range subprocess failed (exit code: 1). Could you recommend a way to solve this problem?

jz462 commented 4 years ago

Hi @cao-nv, yes, I confirm that these are valid reproduced results. A small suggestion: if you want to compare with our method, these results are definitely OK; if you plan to use our method to obtain scene graphs as features for downstream tasks, you don't have to struggle with the VGG16 backbone. ResNeXt is clearly better for that need.

About your multi-GPU issue: you need to make sure the value of CUDA_VISIBLE_DEVICES matches the GPUs you actually have on your machine, because our code determines the GPUs by looking only at CUDA_VISIBLE_DEVICES.

Ji
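The reply above says the code determines the GPUs by looking only at CUDA_VISIBLE_DEVICES. A minimal sketch of what such a lookup could look like follows; visible_gpu_ids is a hypothetical helper written for illustration, not the repository's actual function, and it assumes the variable holds a comma-separated list of device indices.

```python
import os
from typing import List, Mapping, Optional


def visible_gpu_ids(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Parse CUDA_VISIBLE_DEVICES into a list of device identifiers.

    Hypothetical helper: returns an empty list when the variable is unset
    or empty.
    """
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES", "")
    return [part.strip() for part in raw.split(",") if part.strip()]


# Example with an explicit mapping instead of the real environment.
ids = visible_gpu_ids({"CUDA_VISIBLE_DEVICES": "0,1,2,3"})
print(len(ids), ids)
```

Under this assumption, an unset variable yields no devices at all, and an index that does not exist on the machine would still be counted, which is one way a script relying only on this variable could disagree with the hardware.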

cao-nv commented 4 years ago

Thanks @jz462. About the multi-GPU issue: I share a server with 7 working GPUs with other users, so I often set the number of visible GPUs to 2 or 4. Is that OK, or must the number of visible GPUs be 7?

jz462 commented 4 years ago

@cao-nv It should be OK if you run export CUDA_VISIBLE_DEVICES=<g1,g2,...>, where g1,g2 are the indices of the GPUs you want to use. You can list as many of them as you want.

cao-nv commented 4 years ago

I get this annoying error every time the number of visible GPUs is not 1 and --multi-gpu-testing is enabled. Perhaps there is a problem with the subprocess: the return code is 1, but 0 is expected.
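An exit code of 1 only says that a child process failed; the underlying error is usually in the child's stderr. The sketch below shows one generic way to surface it with the standard library. It is not the repository's launcher: run_and_report is a hypothetical helper, and the child command is a stand-in that deliberately fails.

```python
import subprocess
import sys
from typing import List, Tuple


def run_and_report(cmd: List[str]) -> Tuple[int, str]:
    """Run cmd and return its exit code together with its captured stderr text."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode, result.stderr


# Stand-in child process that fails the way a worker might: it writes a
# message to stderr and exits with code 1.
child = [sys.executable, "-c", "import sys; sys.stderr.write('boom\\n'); sys.exit(1)"]
code, err = run_and_report(child)
if code != 0:
    print(f"child exited with {code}; stderr was: {err.strip()}")
```

Running the failing per-GPU command by hand in this way, or without multi-GPU testing on a single device, can show the traceback that the assertion message hides.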

ByZ0e commented 2 years ago

> I get this annoying error every time the number of visible GPUs is not 1 and --multi-gpu-testing is enabled. Perhaps there is a problem with the subprocess: the return code is 1, but 0 is expected.

Hi, did you solve this problem? I ran into the same error. Any suggestions?

cao-nv commented 2 years ago

> I get this annoying error every time the number of visible GPUs is not 1 and --multi-gpu-testing is enabled. Perhaps there is a problem with the subprocess: the return code is 1, but 0 is expected.

> Hi, did you solve this problem? I ran into the same error. Any suggestions?

Unfortunately, I didn't find a solution to the issue, so I moved to another scene graph generation model.

luckyyy00 commented 2 years ago

Hi Ji, do your newly trained models at https://drive.google.com/file/d/15w0q3Nuye2ieu_aUNdTS_FNvoVzM4RMF/view use the same detection model as the previously trained models?

NVIDIA / ContrastiveLosses4VRD

Runtime error during evaluation #10