Artanic30 / HOICLIP

CVPR 2023 Accepted Paper HOICLIP: Efficient Knowledge Transfer for HOI Detection with Vision-Language Models

Why is the following happening? RuntimeError: Error(s) in loading state_dict for GEN_VLKT: #13

Open hwuidue opened 11 months ago

hwuidue commented 11 months ago

Traceback (most recent call last):
  File "main.py", line 597, in <module>
    main(args)
  File "main.py", line 433, in main
    model_without_ddp.load_state_dict(checkpoint['model'], strict=True)
  File "C:\Users\wanni\anaconda3\envs\hoiclip\lib\site-packages\torch\nn\modules\module.py", line 2152, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for GEN_VLKT:

Artanic30 commented 10 months ago

It seems the checkpoint is incomplete or missing. Could you provide more of the error log?
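
A quick way to check whether a checkpoint file is complete is to load it and list its keys. This is only a minimal sketch, assuming the checkpoint was saved with a 'model' entry (as main.py expects) and using the path mentioned in the report below:

```python
import torch

# Load the checkpoint on CPU and see what it actually contains.
ckpt = torch.load("params/detr-r50-pre-2branch-hico.pth", map_location="cpu")
print(ckpt.keys())  # e.g. dict_keys(['model', ...])

state_dict = ckpt["model"]
print(len(state_dict), "tensors in checkpoint")
# Print a few parameter names to see whether HOICLIP-specific keys
# (e.g. "clip_model.*", "inter2verb.*") are present at all.
for name in list(state_dict)[:20]:
    print(name, tuple(state_dict[name].shape))
```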

hwuidue commented 10 months ago

Thank you very much for your reply! In visualization_hico.sh, when I use the pretrained model parameters provided via --pretrained params/detr-r50-pre-2branch-hico.pth, the following error is always displayed. Looking forward to your reply.

use clip text encoder to init classifier weight


VISUALIZATION


number of params: 42089226
init dataloader
train contains 37633 images and 117871 annotations
val contains 9546 images and 0 annotations
rare:138, non-rare:462
val contains 9546 images and 0 annotations
rare:138, non-rare:462
dataloader finished
model
Traceback (most recent call last):
  File "main.py", line 597, in <module>
    main(args)
  File "main.py", line 433, in main
    model_without_ddp.load_state_dict(checkpoint['model'], strict=True)
  File "C:\Users\wanni\anaconda3\envs\hoiclip\lib\site-packages\torch\nn\modules\module.py", line 2152, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for GEN_VLKT:
    Missing key(s) in state_dict: "logit_scale", "obj_logit_scale", "verb2hoi_proj", "verb2hoi_proj_eval", "query_embed_h.weight", "query_embed_o.weight", "pos_guided_embedd.weight", "inter2verb.layers.0.weight", "inter2verb.layers.0.bias", "inter2verb.layers.1.weight", "inter2verb.layers.1.bias", "inter2verb.layers.2.weight", "inter2verb.layers.2.bias", "clip_model.positional_embedding", "clip_model.text_projection", "clip_model.logit_scale", "clip_model.visual.class_embedding", "clip_model.visual.positional_embedding", "clip_model.visual.proj", "clip_model.visual.conv1.weight", "clip_model.visual.ln_pre.weight", "clip_model.visual.ln_pre.bias", "clip_model.visual.transformer.resblocks.0.attn.in_proj_weight", "clip_model.visual.transformer.resblocks.0.attn.in_proj_bias", "clip_model.visual.transformer.resblocks.0.attn.out_proj.weight", "clip_model.visual.transformer.resblocks.0.attn.out_proj.bias", "clip_model.visual.transformer.resblocks.0.ln_1.weight", "clip_model.visual.transformer.resblocks.0.ln_1.bias", "clip_model.visual.transformer.resblocks.0.mlp.c_fc.weight", "clip_model.visual.transformer.resblocks.0.mlp.c_fc.bias", "clip_model.visual.transformer.resblocks.0.mlp.c_proj.weight", "clip_model.visual.transformer.resblocks.0.mlp.c_proj.bias", "clip_model.visual.transformer.resblocks.0.ln_2.weight", "clip_model.visual.transformer.resblocks.0.ln_2.bias", "clip_model.visual.transformer.resblocks.1.attn.in_proj_weight", "clip_model.visual.transformer.resblocks.1.attn.in_proj_bias", "clip_model.visual.transformer.resblocks.1.attn.out_proj.weight", "clip_model.visual.transformer.resblocks.1.attn.out_proj.bias", "clip_model.visual.transformer.resblocks.1.ln_1.weight", "clip_model.visual.transformer.resblocks.1.ln_1.bias", "clip_model.visual.transformer.resblocks.1.mlp.c_fc.weight", "clip_model.visual.transformer.resblocks.1.mlp.c_fc.bias", "clip_model.visual.transformer.resblocks.1.mlp.c_proj.weight", "clip_model.visual.transformer.resblocks.1.mlp.c_proj.bias", "clip_model.visual.transformer.resblocks.1.ln_2.weight", "clip_model.visual.transformer.resblocks.1.ln_2.bias", "clip_model.visual.transformer.resblocks.2.attn.in_proj_weight", "clip_model.visual.transformer.resblocks.2.attn.in_proj_bias", "clip_model.visual.transformer.resblocks.2.attn.out_proj.weight", "clip_model.visual.transformer.resblocks.2.attn.out_proj.bias", "clip_model.visual.transformer.resblocks.2.ln_1.weight", "clip_model.visual.transformer.resblocks.2.ln_1.bias", "clip_model.visual.transformer.resblocks.2.mlp.c_fc.weight", "clip_model.visual.transformer.resblocks.2.mlp.c_fc.bias", "clip_model.visual.transformer.resblocks.2.mlp.c_proj.weight", "clip_model.visual.transformer.resblocks.2.mlp.c_proj.bias", "clip_model.visual.transformer.resblocks.2.ln_2.weight", "clip_model.visual.transformer.resblocks.2.ln_2.bias", "clip_model.visual.transformer.resblocks.3.attn.in_proj_weight", "clip_model.visual.transformer.resblocks.3.attn.in_proj_bias", "clip_model.visual.transformer.resblocks.3.attn.out_proj.weight", "clip_model.visual.transformer.resblocks.3.attn.out_proj.bias", "clip_model.visual.transformer.resblocks.3.ln_1.weight", "clip_model.visual.transformer.resblocks.3.ln_1.bias", "clip_model.visual.transformer.resblocks.3.mlp.c_fc.weight", "clip_model.visual.transformer.resblocks.3.mlp.c_fc.bias", "clip_model.visual.transformer.resblocks.3.mlp.c_proj.weight", "clip_model.visual.transformer.resblocks.3.mlp.c_proj.bias", "clip_model.visual.transformer.resblocks.3.ln_2.weight", "clip_model.visual.transformer.resblocks.3.ln_2.bias", "clip_model.visual.transformer.resblocks.4.attn.in_proj_weight", "clip_model.visual.transformer.resblocks.4.attn.in_proj_bias", "clip_model.visual.transformer.resblocks.4.attn.out_proj.weight", "clip_model.visual.transformer.resblocks.4.attn.out_proj.bias", "clip_model.visual.transformer.resblocks.4.ln_1.weight", "clip_model.visual.transformer.resblocks.4.ln_1.bias", "clip_model.visual.transformer.resblocks.4.mlp.c_fc.weight", "clip_model.visual.transformer.resblocks.4.mlp.c_fc.bias", "clip_model.visual.transformer.resblocks.4.mlp.c_proj.weight", "clip_model.visual.transformer.resblocks.4.mlp.c_proj.bias", "clip_model.visual.transformer.resblocks.4.ln_2.weight", ...

Artanic30 commented 10 months ago

Maybe you only loaded the pretrained DETR weights, not our HOICLIP weights. Did you set args.resume to one of our released models, and does args.resume point to a valid path containing a checkpoint_last.pth?
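
When debugging this kind of mismatch, it can help to see exactly which keys are missing or unexpected without failing hard. The following is a sketch only (not the repository's loading code); it assumes model_without_ddp is the model instance built in main.py and that checkpoint_last.pth is the released weight file you downloaded:

```python
import torch

# Example path; point this at the released HOICLIP checkpoint you downloaded.
checkpoint = torch.load("checkpoint_last.pth", map_location="cpu")

# strict=False does not raise; it returns the mismatched key names instead,
# which makes it easy to tell a DETR-only checkpoint (many missing HOICLIP keys)
# from a model built with different flags (many unexpected keys).
result = model_without_ddp.load_state_dict(checkpoint["model"], strict=False)
print("missing keys:", result.missing_keys[:10])
print("unexpected keys:", result.unexpected_keys[:10])
```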

hwuidue commented 10 months ago

Loading checkpoint_default.pth still gives the following error:

RuntimeError: Error(s) in loading state_dict for GEN_VLKT:
    Unexpected key(s) in state_dict: "transformer.clip_interaction_decoder.layers.0.self_attn.in_proj_weight", "transformer.clip_interaction_decoder.layers.0.self_attn.in_proj_bias", "transformer.clip_interaction_decoder.layers.0.self_attn.out_proj.weight", "transformer.clip_interaction_decoder.layers.0.self_attn.out_proj.bias", "transformer.clip_interaction_decoder.layers.0.multihead_attn.in_proj_weight", "transformer.clip_interaction_decoder.layers.0.multihead_attn.in_proj_bias", "transformer.clip_interaction_decoder.layers.0.multihead_attn.out_proj.weight", "transformer.clip_interaction_decoder.layers.0.multihead_attn.out_proj.bias", "transformer.clip_interaction_decoder.layers.0.linear1.weight", "transformer.clip_interaction_decoder.layers.0.linear1.bias", "transformer.clip_interaction_decoder.layers.0.linear2.weight", "transformer.clip_interaction_decoder.layers.0.linear2.bias", "transformer.clip_interaction_decoder.layers.0.norm1.weight", "transformer.clip_interaction_decoder.layers.0.norm1.bias", "transformer.clip_interaction_decoder.layers.0.norm2.weight", "transformer.clip_interaction_decoder.layers.0.norm2.bias", "transformer.clip_interaction_decoder.layers.0.norm3.weight", "transformer.clip_interaction_decoder.layers.0.norm3.bias", "transformer.clip_interaction_decoder.layers.0.norm4.weight", "transformer.clip_interaction_decoder.layers.0.norm4.bias", "transformer.clip_interaction_decoder.layers.1.self_attn.in_proj_weight", "transformer.clip_interaction_decoder.layers.1.self_attn.in_proj_bias", "transformer.clip_interaction_decoder.layers.1.self_attn.out_proj.weight", "transformer.clip_interaction_decoder.layers.1.self_attn.out_proj.bias", "transformer.clip_interaction_decoder.layers.1.multihead_attn.in_proj_weight", "transformer.clip_interaction_decoder.layers.1.multihead_attn.in_proj_bias", "transformer.clip_interaction_decoder.layers.1.multihead_attn.out_proj.weight", "transformer.clip_interaction_decoder.layers.1.multihead_attn.out_proj.bias", "transformer.clip_interaction_decoder.layers.1.linear1.weight", "transformer.clip_interaction_decoder.layers.1.linear1.bias", "transformer.clip_interaction_decoder.layers.1.linear2.weight", "transformer.clip_interaction_decoder.layers.1.linear2.bias", "transformer.clip_interaction_decoder.layers.1.norm1.weight", "transformer.clip_interaction_decoder.layers.1.norm1.bias", "transformer.clip_interaction_decoder.layers.1.norm2.weight", "transformer.clip_interaction_decoder.layers.1.norm2.bias", "transformer.clip_interaction_decoder.layers.1.norm3.weight", "transformer.clip_interaction_decoder.layers.1.norm3.bias", "transformer.clip_interaction_decoder.layers.1.norm4.weight", "transformer.clip_interaction_decoder.layers.1.norm4.bias", "transformer.clip_interaction_decoder.layers.2.self_attn.in_proj_weight", "transformer.clip_interaction_decoder.layers.2.self_attn.in_proj_bias", "transformer.clip_interaction_decoder.layers.2.self_attn.out_proj.weight", "transformer.clip_interaction_decoder.layers.2.self_attn.out_proj.bias", "transformer.clip_interaction_decoder.layers.2.multihead_attn.in_proj_weight", "transformer.clip_interaction_decoder.layers.2.multihead_attn.in_proj_bias", "transformer.clip_interaction_decoder.layers.2.multihead_attn.out_proj.weight", "transformer.clip_interaction_decoder.layers.2.multihead_attn.out_proj.bias", "transformer.clip_interaction_decoder.layers.2.linear1.weight", "transformer.clip_interaction_decoder.layers.2.linear1.bias", "transformer.clip_interaction_decoder.layers.2.linear2.weight", "transformer.clip_interaction_decoder.layers.2.linear2.bias", "transformer.clip_interaction_decoder.layers.2.norm1.weight", "transformer.clip_interaction_decoder.layers.2.norm1.bias", "transformer.clip_interaction_decoder.layers.2.norm2.weight", "transformer.clip_interaction_decoder.layers.2.norm2.bias", "transformer.clip_interaction_decoder.layers.2.norm3.weight", "transformer.clip_interaction_decoder.layers.2.norm3.bias", "transformer.clip_interaction_decoder.layers.2.norm4.weight", "transformer.clip_interaction_decoder.layers.2.norm4.bias", "transformer.clip_interaction_decoder.norm.weight", "transformer.clip_interaction_decoder.norm.bias", "transformer.inter_guided_embedd.weight", "transformer.queries2spacial_proj.weight", "transformer.queries2spacial_proj.bias", "transformer.queries2spacial_proj_norm.weight", "transformer.queries2spacial_proj_norm.bias", "transformer.obj_class_fc.weight", "transformer.obj_class_fc.bias", "transformer.obj_class_ln.weight", "transformer.obj_class_ln.bias".

Artanic30 commented 10 months ago

Oops! I found a bug in the visualization code: it was actually visualizing the results of GEN-VLKT. I fixed it at models/visualization_hoiclip/gen_vlkt.py:17 by changing from .gen import build_gen to from .et_gen import build_gen. Please try again with the latest code.
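
For reference, the change at models/visualization_hoiclip/gen_vlkt.py:17 is a one-line import swap (the comments below are just annotations, not part of the file):

```python
# models/visualization_hoiclip/gen_vlkt.py, line 17
# before: built the GEN-VLKT transformer, so the script visualized GEN-VLKT results
# from .gen import build_gen
# after: build the transformer used for HOICLIP visualization
from .et_gen import build_gen
```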

hwuidue commented 10 months ago

Traceback (most recent call last):
  File "main.py", line 597, in <module>
    main(args)
  File "main.py", line 447, in main
    return self._call_impl(*args, **kwargs)
  File "C:\Users\wanni\anaconda3\envs\hoiclip\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "E:\PycharmProject\hoiclip_new\HOICLIP-main\models\visualization_hoiclip\gen_vlkt.py", line 185, in forward
    h_hs, o_hs, inter_hs, clip_cls_feature, clip_hoi_score, clip_visual, weight, weight_2 = self.transformer(self.input_proj(src), mask,
  File "C:\Users\wanni\anaconda3\envs\hoiclip\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\wanni\anaconda3\envs\hoiclip\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "E:\PycharmProject\hoiclip_new\HOICLIP-main\models\visualization_hoiclip\et_gen.py", line 121, in forward
    clip_cls_feature, clip_visual = clip_model.encode_image(clip_src)
ValueError: too many values to unpack (expected 2)

Artanic30 commented 10 months ago

Sorry for another bug. It should work with the latest code now.
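
If you cannot pull the latest code immediately, one workaround is to unpack the CLIP image features defensively. This is only a sketch of the idea, not the repository's actual fix; it assumes that on the older code encode_image in this visualization path returns more than two values, which is what the ValueError suggests:

```python
# In models/visualization_hoiclip/et_gen.py (older code), instead of:
#   clip_cls_feature, clip_visual = clip_model.encode_image(clip_src)
# capture everything and keep only the two tensors the caller needs:
outputs = clip_model.encode_image(clip_src)
clip_cls_feature, clip_visual = outputs[0], outputs[1]
```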

hwuidue commented 10 months ago

Thanks to the author — looking forward to trying the updated code.

ashi701 commented 10 months ago

(hoiclip) ayushi@gpusrv002:/DATA/scene_graph/ayushi/HOICLIP$ python main.py --pretrained checkpoint_rf_uc.pth --dataset_file hico --hoi_path data/hico_20160224_det --num_obj_classes 80 --num_verb_classes 117 --backbone resnet50 --num_queries 64 --dec_layers 3 --eval --zero_shot_type default --with_clip_label --with_obj_clip_label --use_nms_filter
Not using distributed mode
setting up seeds
git: sha: 06b65177e59395a4d10e0220c95e54aa6686d54c, status: has uncommited changes, branch: main

Namespace(lr=0.0001, lr_backbone=1e-05, lr_clip=1e-05, batch_size=2, weight_decay=0.0001, epochs=150, lr_drop=100, clip_max_norm=0.1, eval_each=4, eval_each_lr_drop=2, frozen_weights=None, backbone='resnet50', dilation=False, position_embedding='sine', enc_layers=6, dec_layers=3, dim_feedforward=2048, hidden_dim=256, dropout=0.1, nheads=8, num_queries=64, pre_norm=False, masks=False, hoi=False, num_obj_classes=80, num_verb_classes=117, pretrained='checkpoint_rf_uc.pth', subject_category_id=0, verb_loss_type='focal', aux_loss=True, with_mimic=False, set_cost_class=1, set_cost_bbox=2.5, set_cost_giou=1, set_cost_obj_class=1, set_cost_verb_class=1, set_cost_hoi=1, mask_loss_coef=1, dice_loss_coef=1, bbox_loss_coef=2.5, giou_loss_coef=1, obj_loss_coef=1, verb_loss_coef=2, hoi_loss_coef=2, mimic_loss_coef=20, alpha=0.5, eos_coef=0.1, dataset_file='hico', coco_path=None, coco_panoptic_path=None, remove_difficult=False, hoi_path='data/hico_20160224_det', output_dir='', device='cuda', seed=42, resume='', start_epoch=0, eval=True, num_workers=2, world_size=1, dist_url='env://', use_nms_filter=True, thres_nms=0.7, nms_alpha=1, nms_beta=0.5, json_file='results.json', ft_clip_with_small_lr=False, with_clip_label=True, with_obj_clip_label=True, clip_model='ViT-B/32', fix_clip=False, clip_embed_dim=512, zero_shot_type='default', del_unseen=False, fix_backbone_mode=[], use_ddp=1, with_random_shuffle=2, gradient_accumulation_steps=1, opt_sched='multiStep', no_clip_cls_init=False, enable_amp=False, opt_level='O2', fix_clip_label=False, with_rec_loss=False, rec_loss_coef=2, no_training=False, dataset_root='GEN', model_name='GEN', eval_location=False, enable_cp=False, no_fix_clip_linear=False, analysis=False, alternative=1, eval_each_ap=False, topk_hoi=10, inter_dec_layers=3, verb_pth='', verb_weight=0.5, frac=-1.0, validation_split=-1.0, lr_drop_gamma=0.1, training_free_enhancement_path='', distributed=False) /home/ayushi/.conda/envs/hoiclip/lib/python3.11/site-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead. warnings.warn( /home/ayushi/.conda/envs/hoiclip/lib/python3.11/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or None for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing weights=None. warnings.warn(msg)


GEN


number of params: 193327284
init dataloader
train contains 37633 images and 117871 annotations
val contains 9546 images and 0 annotations
rare:138, non-rare:462
val contains 9546 images and 0 annotations
rare:138, non-rare:462
dataloader finished
Traceback (most recent call last):
  File "/DATA/scene_graph/ayushi/HOICLIP/main.py", line 588, in <module>
    main(args)
  File "/DATA/scene_graph/ayushi/HOICLIP/main.py", line 425, in main
    model_without_ddp.load_state_dict(checkpoint['model'], strict=True)
  File "/home/ayushi/.conda/envs/hoiclip/lib/python3.11/site-packages/torch/nn/modules/module.py", line 2152, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for GEN_VLKT:
    Unexpected key(s) in state_dict: "verb2hoi_proj", "verb2hoi_proj_eval", "inter2verb.layers.0.weight", "inter2verb.layers.0.bias", "inter2verb.layers.1.weight", "inter2verb.layers.1.bias", "inter2verb.layers.2.weight", "inter2verb.layers.2.bias", "verb_projection.weight", "eval_visual_projection.weight", "transformer.clip_interaction_decoder.layers.0.self_attn.in_proj_weight", "transformer.clip_interaction_decoder.layers.0.self_attn.in_proj_bias", "transformer.clip_interaction_decoder.layers.0.self_attn.out_proj.weight", "transformer.clip_interaction_decoder.layers.0.self_attn.out_proj.bias", "transformer.clip_interaction_decoder.layers.0.multihead_attn.in_proj_weight", "transformer.clip_interaction_decoder.layers.0.multihead_attn.in_proj_bias", "transformer.clip_interaction_decoder.layers.0.multihead_attn.out_proj.weight", "transformer.clip_interaction_decoder.layers.0.multihead_attn.out_proj.bias", "transformer.clip_interaction_decoder.layers.0.linear1.weight", "transformer.clip_interaction_decoder.layers.0.linear1.bias", "transformer.clip_interaction_decoder.layers.0.linear2.weight", "transformer.clip_interaction_decoder.layers.0.linear2.bias", "transformer.clip_interaction_decoder.layers.0.norm1.weight", "transformer.clip_interaction_decoder.layers.0.norm1.bias", "transformer.clip_interaction_decoder.layers.0.norm2.weight", "transformer.clip_interaction_decoder.layers.0.norm2.bias", "transformer.clip_interaction_decoder.layers.0.norm3.weight", "transformer.clip_interaction_decoder.layers.0.norm3.bias", "transformer.clip_interaction_decoder.layers.0.norm4.weight", "transformer.clip_interaction_decoder.layers.0.norm4.bias", "transformer.clip_interaction_decoder.layers.1.self_attn.in_proj_weight", "transformer.clip_interaction_decoder.layers.1.self_attn.in_proj_bias", "transformer.clip_interaction_decoder.layers.1.self_attn.out_proj.weight", "transformer.clip_interaction_decoder.layers.1.self_attn.out_proj.bias", "transformer.clip_interaction_decoder.layers.1.multihead_attn.in_proj_weight", "transformer.clip_interaction_decoder.layers.1.multihead_attn.in_proj_bias", "transformer.clip_interaction_decoder.layers.1.multihead_attn.out_proj.weight", "transformer.clip_interaction_decoder.layers.1.multihead_attn.out_proj.bias", "transformer.clip_interaction_decoder.layers.1.linear1.weight", "transformer.clip_interaction_decoder.layers.1.linear1.bias", "transformer.clip_interaction_decoder.layers.1.linear2.weight", "transformer.clip_interaction_decoder.layers.1.linear2.bias", "transformer.clip_interaction_decoder.layers.1.norm1.weight", "transformer.clip_interaction_decoder.layers.1.norm1.bias", "transformer.clip_interaction_decoder.layers.1.norm2.weight", "transformer.clip_interaction_decoder.layers.1.norm2.bias", "transformer.clip_interaction_decoder.layers.1.norm3.weight", "transformer.clip_interaction_decoder.layers.1.norm3.bias", "transformer.clip_interaction_decoder.layers.1.norm4.weight", "transformer.clip_interaction_decoder.layers.1.norm4.bias", "transformer.clip_interaction_decoder.layers.2.self_attn.in_proj_weight", "transformer.clip_interaction_decoder.layers.2.self_attn.in_proj_bias", "transformer.clip_interaction_decoder.layers.2.self_attn.out_proj.weight", "transformer.clip_interaction_decoder.layers.2.self_attn.out_proj.bias", "transformer.clip_interaction_decoder.layers.2.multihead_attn.in_proj_weight", "transformer.clip_interaction_decoder.layers.2.multihead_attn.in_proj_bias", "transformer.clip_interaction_decoder.layers.2.multihead_attn.out_proj.weight", "transformer.clip_interaction_decoder.layers.2.multihead_attn.out_proj.bias", "transformer.clip_interaction_decoder.layers.2.linear1.weight", "transformer.clip_interaction_decoder.layers.2.linear1.bias", "transformer.clip_interaction_decoder.layers.2.linear2.weight", "transformer.clip_interaction_decoder.layers.2.linear2.bias", "transformer.clip_interaction_decoder.layers.2.norm1.weight", "transformer.clip_interaction_decoder.layers.2.norm1.bias", "transformer.clip_interaction_decoder.layers.2.norm2.weight", "transformer.clip_interaction_decoder.layers.2.norm2.bias", "transformer.clip_interaction_decoder.layers.2.norm3.weight", "transformer.clip_interaction_decoder.layers.2.norm3.bias", "transformer.clip_interaction_decoder.layers.2.norm4.weight", "transformer.clip_interaction_decoder.layers.2.norm4.bias", "transformer.clip_interaction_decoder.norm.weight", "transformer.clip_interaction_decoder.norm.bias", "transformer.inter_guided_embedd.weight", "transformer.queries2spacial_proj.weight", "transformer.queries2spacial_proj.bias", "transformer.queries2spacial_proj_norm.weight", "transformer.queries2spacial_proj_norm.bias", "transformer.obj_class_fc.weight", "transformer.obj_class_fc.bias", "transformer.obj_class_ln.weight", "transformer.obj_class_ln.bias".
    size mismatch for visual_projection.weight: copying a param with shape torch.Size([480, 512]) from checkpoint, the shape in current model is torch.Size([600, 512]).
    size mismatch for visual_projection.bias: copying a param with shape torch.Size([480]) from checkpoint, the shape in current model is torch.Size([600]).

I am getting this error even with the updated code.

Artanic30 commented 8 months ago

--zero_shot_type default

--zero_shot_type default means you build the model for the default setting, where there are 600 HOI classes for prediction. However, you are loading weights from a zero-shot setting model (checkpoint_rf_uc.pth), which causes the error.
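
The size mismatch in the log reflects exactly this: the RF-UC checkpoint stores a 480-class prediction head, while a model built with --zero_shot_type default expects 600 classes. A minimal sketch to confirm which setting a checkpoint was trained for (assuming the standard checkpoint layout with a 'model' entry, and using the file name from the log above):

```python
import torch

ckpt = torch.load("checkpoint_rf_uc.pth", map_location="cpu")["model"]

# The first dimension of the HOI classification head tells you how many
# interaction classes the checkpoint was trained with.
num_hoi_classes = ckpt["visual_projection.weight"].shape[0]
print(num_hoi_classes)  # 480 here -> a zero-shot (unseen-composition) checkpoint,
                        # not the 600-class default-setting model.
```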