OpenGVLab / unmasked_teacher

[ICCV2023 Oral] Unmasked Teacher: Towards Training-Efficient Video Foundation Models
https://arxiv.org/abs/2303.16058
MIT License

Is there a problem with how the weights are loaded? #10

Closed: gyftsy closed this issue 7 months ago

gyftsy commented 10 months ago

Hi, I'm building a zero-shot text-video similarity demo on top of your weights and code. While loading the weights, the log prints the messages below. Judging from my test samples so far, the results are not ideal, so I'd like to confirm with you: I see that loading the weights logs some unmatched keys. Is that normal? I don't have a public dataset on my side; I'm testing with my own samples.
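For context on where that message comes from: the `_IncompatibleKeys(...)` line in the log below is the return value of PyTorch's `load_state_dict(strict=False)`, which does not raise on mismatched keys but reports them. A minimal, self-contained sketch of that mechanism (the stand-in `nn.Linear` model and the `clip_teacher.proj` entry are hypothetical, purely for illustration); a grouping of the actual unexpected keys is sketched after the log:

```python
import torch
import torch.nn as nn

# Stand-in for the real UMT model; hypothetical, for illustration only.
model = nn.Linear(4, 2)

# A checkpoint with one extra key, mimicking pretraining-only tensors that
# exist in the checkpoint file but have no counterpart in the eval-time model.
state_dict = model.state_dict()
state_dict["clip_teacher.proj"] = torch.zeros(4, 4)  # hypothetical extra tensor

msg = model.load_state_dict(state_dict, strict=False)
print(msg.missing_keys)     # []  -> every parameter the model defines was found
print(msg.unexpected_keys)  # ['clip_teacher.proj'] -> in the file, silently dropped
```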

2023-09-07T19:37:18 | __main__: config: 
{'data_dir': 'your_data_path/anno', 'data_root': 'your_data_path/anno/videos_images', 'anno_root_pt': 'your_data_path/anno/anno_pretrain', 'anno_root_downstream': 'your_data_path/anno/anno_downstream', 'TextEncoders': {'bert': {'name': 'bert_base', 'pretrained': '/mnt/dolphinfs/ssd_pool/docker/user/hadoop-seccv/gengyifei/gitcode/pretrain_models/LAVIS/bert-base-uncased', 'config': '/mnt/dolphinfs/ssd_pool/docker/user/hadoop-seccv/gengyifei/gitcode/pretrain_models/unmasked_teacher/multi_modality/configs/config_bert.json', 'd_model': 768, 'fusion_layer': 9}, 'bert_large': {'name': 'bert_large', 'pretrained': 'bert-large-uncased', 'config': 'configs/config_bert_large.json', 'd_model': 1024, 'fusion_layer': 19}}, 'train_file': ['your_data_path/anno/anno_downstream/msrvtt_ret_train9k.json', 'your_msrvtt_path', 'video'], 'test_file': {'test': ['/mnt/dolphinfs/ssd_pool/docker/user/hadoop-seccv/gengyifei/gitcode/pretrain_models/unmasked_teacher/dataset/secure_wenshen_test/secure_wenshen_test.json', '/mnt/dolphinfs/ssd_pool/docker/user/hadoop-seccv/gengyifei/gitcode/pretrain_models/unmasked_teacher/dataset/secure_wenshen_test/video', 'video']}, 'test_types': ['test'], 'num_workers': 6, 'stop_key': 'test/', 'is_paragraph_retrieval': False, 'num_frames': 4, 'num_frames_test': 4, 'batch_size': 32, 'max_txt_l': 32, 'inputs': {'image_res': 224, 'video_input': {'num_frames': 4, 'sample_type': 'rand', 'num_frames_test': 4, 'sample_type_test': 'middle', 'random_aug': False}, 'max_txt_l': {'image': 32, 'video': 32}, 'batch_size': {'image': 32, 'video': 32}, 'batch_size_test': {'image': 32, 'video': 32}}, 'text_enc': 'bert', 'model': {'model_cls': 'UMT', 'vision_encoder': {'name': 'vit_b16', 'img_size': 224, 'patch_size': 16, 'd_model': 768, 'encoder_embed_dim': 768, 'encoder_depth': 12, 'encoder_num_heads': 12, 'drop_path_rate': 0.2, 'num_frames': 4, 'tubelet_size': 1, 'use_checkpoint': True, 'checkpoint_num': 12, 'clip_decoder_embed_dim': 768, 'clip_output_dim': 512, 'clip_return_layer': 0, 'clip_student_return_interval': 1, 'pretrained': '/mnt/dolphinfs/ssd_pool/docker/user/hadoop-seccv/gengyifei/gitcode/pretrain_models/umt_weights/b16_ptk710_f8_res224.pth', 'clip_teacher': 'none', 'clip_img_size': 224, 'clip_return_interval': 1, 'video_mask_type': 'attention', 'video_mask_ratio': 0.0, 'video_double_mask_ratio': 0.0, 'image_mask_type': 'attention', 'image_mask_ratio': 0.0, 'image_double_mask_ratio': 0.0, 'keep_temporal': True}, 'text_encoder': {'name': 'bert_base', 'pretrained': '/mnt/dolphinfs/ssd_pool/docker/user/hadoop-seccv/gengyifei/gitcode/pretrain_models/LAVIS/bert-base-uncased', 'config': '/mnt/dolphinfs/ssd_pool/docker/user/hadoop-seccv/gengyifei/gitcode/pretrain_models/unmasked_teacher/multi_modality/configs/config_bert.json', 'd_model': 768, 'fusion_layer': 9}, 'multimodal': {'enable': True}, 'embed_dim': 512, 'temp': 0.07}, 'criterion': {'loss_weight': {'vtc': 1.0, 'mlm': 0.0, 'vtm': 1.0, 'uta': 0.0}, 'vtm_hard_neg': True, 'mlm_masking_prob': 0.5, 'uta_norm_type': 'l2', 'uta_loss_type': 'l2'}, 'optimizer': {'opt': 'adamW', 'lr': 2e-05, 'opt_betas': [0.9, 0.999], 'weight_decay': 0.02, 'max_grad_norm': -1, 'different_lr': {'enable': False, 'module_names': [], 'lr': 0.001}}, 'scheduler': {'sched': 'cosine', 'epochs': 7, 'min_lr_multi': 0.01, 'warmup_epochs': 1}, 'evaluate': True, 'deep_fusion': False, 'evaluation': {'eval_frame_ensemble': 'concat', 'eval_x_only': False, 'k_test': 128, 'eval_offload': True}, 'fp16': True, 'gradient_checkpointing': True, 'wandb': {'enable': False, 'entity': 'user', 
'project': 'umt'}, 'dist_url': 'env://', 'device': 'cuda', 'mode': 'pt', 'output_dir': '/mnt/dolphinfs/ssd_pool/docker/user/hadoop-seccv/gengyifei/gitcode/pretrain_models/umt_weights/eval', 'resume': False, 'debug': False, 'log_freq': 100, 'seed': 42, 'zero_shot': True, 'save_latest': True, 'auto_resume': False, 'pretrained_path': '/mnt/dolphinfs/ssd_pool/docker/user/hadoop-seccv/gengyifei/gitcode/pretrain_models/umt_weights/b16_25m.pth', 'distributed': False}
2023-09-07T19:37:18 | __main__: train_file: ['your_data_path/anno/anno_downstream/msrvtt_ret_train9k.json', 'your_msrvtt_path', 'video']
2023-09-07T19:37:18 | tasks.pretrain: Creating dataset for ret
Loading /mnt/dolphinfs/ssd_pool/docker/user/hadoop-seccv/gengyifei/gitcode/pretrain_models/unmasked_teacher/dataset/secure_wensh
2023-09-07T19:37:18 | tasks.shared_utils: Creating model
2023-09-07T19:37:20 | models.umt: Build vision_encoder: vit_b16
2023-09-07T19:37:21 | models.backbones.vit.vit: Num of patches: 784
2023-09-07T19:37:21 | models.backbones.vit.vit: Use checkpoint: True
2023-09-07T19:37:21 | models.backbones.vit.vit: Checkpoint number: 12
2023-09-07T19:37:21 | models.backbones.vit.vit: Student return index: []
2023-09-07T19:37:25 | models.backbones.vit.vit: Loading pretrained weights from /mnt/dolphinfs/ssd_pool/docker/user/hadoop-seccv/gengyifei/gitcode/pretrain_models/umt_weights/b16_ptk710_f8_res224.pth
2023-09-07T19:37:26 | models.umt: Build text_encoder bert_base
2023-09-07T19:37:28 | models.backbones.bert.xbert: build bert with cross_module: ca
2023-09-07T19:37:28 | models.criterions: Norm type: l2
2023-09-07T19:37:28 | models.criterions: Loss type: l2
2023-09-07T19:37:34 | utils.optimizer: diff_names: [], diff_lr: None
2023-09-07T19:37:34 | utils.optimizer: param temp: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.patch_embed.proj.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.patch_embed.proj.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.0.norm1.weight: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.0.norm1.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.0.attn.q_bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.0.attn.v_bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.0.attn.qkv.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.0.attn.proj.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.0.attn.proj.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.0.norm2.weight: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.0.norm2.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.0.mlp.fc1.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.0.mlp.fc1.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.0.mlp.fc2.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.0.mlp.fc2.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.1.norm1.weight: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.1.norm1.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.1.attn.q_bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.1.attn.v_bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.1.attn.qkv.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.1.attn.proj.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.1.attn.proj.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.1.norm2.weight: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.1.norm2.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.1.mlp.fc1.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.1.mlp.fc1.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.1.mlp.fc2.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.1.mlp.fc2.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.2.norm1.weight: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.2.norm1.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.2.attn.q_bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.2.attn.v_bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.2.attn.qkv.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.2.attn.proj.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.2.attn.proj.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.2.norm2.weight: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.2.norm2.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.2.mlp.fc1.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.2.mlp.fc1.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.2.mlp.fc2.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.2.mlp.fc2.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.3.norm1.weight: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.3.norm1.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.3.attn.q_bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.3.attn.v_bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.3.attn.qkv.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.3.attn.proj.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.3.attn.proj.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.3.norm2.weight: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.3.norm2.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.3.mlp.fc1.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.3.mlp.fc1.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.3.mlp.fc2.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.3.mlp.fc2.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.4.norm1.weight: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.4.norm1.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.4.attn.q_bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.4.attn.v_bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.4.attn.qkv.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.4.attn.proj.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.4.attn.proj.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.4.norm2.weight: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.4.norm2.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.4.mlp.fc1.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.4.mlp.fc1.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.4.mlp.fc2.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.4.mlp.fc2.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.5.norm1.weight: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.5.norm1.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.5.attn.q_bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.5.attn.v_bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.5.attn.qkv.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.5.attn.proj.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.5.attn.proj.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.5.norm2.weight: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.5.norm2.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.5.mlp.fc1.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.5.mlp.fc1.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.5.mlp.fc2.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.5.mlp.fc2.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.6.norm1.weight: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.6.norm1.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.6.attn.q_bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.6.attn.v_bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.6.attn.qkv.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.6.attn.proj.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.6.attn.proj.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.6.norm2.weight: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.6.norm2.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.6.mlp.fc1.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.6.mlp.fc1.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.6.mlp.fc2.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.6.mlp.fc2.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.7.norm1.weight: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.7.norm1.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.7.attn.q_bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.7.attn.v_bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.7.attn.qkv.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.7.attn.proj.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.7.attn.proj.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.7.norm2.weight: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.7.norm2.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.7.mlp.fc1.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.7.mlp.fc1.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.7.mlp.fc2.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.7.mlp.fc2.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.8.norm1.weight: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.8.norm1.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.8.attn.q_bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.8.attn.v_bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.8.attn.qkv.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.8.attn.proj.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.8.attn.proj.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.8.norm2.weight: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.8.norm2.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.8.mlp.fc1.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.8.mlp.fc1.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.8.mlp.fc2.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.8.mlp.fc2.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.9.norm1.weight: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.9.norm1.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.9.attn.q_bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.9.attn.v_bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.9.attn.qkv.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.9.attn.proj.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.9.attn.proj.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.9.norm2.weight: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.9.norm2.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.9.mlp.fc1.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.9.mlp.fc1.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.9.mlp.fc2.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.9.mlp.fc2.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.10.norm1.weight: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.10.norm1.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.10.attn.q_bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.10.attn.v_bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.10.attn.qkv.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.10.attn.proj.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.10.attn.proj.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.10.norm2.weight: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.10.norm2.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.10.mlp.fc1.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.10.mlp.fc1.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.10.mlp.fc2.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.10.mlp.fc2.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.11.norm1.weight: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.11.norm1.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.11.attn.q_bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.11.attn.v_bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.11.attn.qkv.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.11.attn.proj.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.11.attn.proj.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.11.norm2.weight: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.11.norm2.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.11.mlp.fc1.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.11.mlp.fc1.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.11.mlp.fc2.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.blocks.11.mlp.fc2.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.norm.weight: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.encoder.norm.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.pool_norm.weight: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_encoder.pool_norm.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.embeddings.word_embeddings.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.embeddings.position_embeddings.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.embeddings.token_type_embeddings.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.embeddings.LayerNorm.weight: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.embeddings.LayerNorm.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.0.attention.self.query.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.0.attention.self.query.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.0.attention.self.key.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.0.attention.self.key.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.0.attention.self.value.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.0.attention.self.value.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.0.attention.output.dense.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.0.attention.output.dense.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.0.attention.output.LayerNorm.weight: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.0.attention.output.LayerNorm.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.0.intermediate.dense.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.0.intermediate.dense.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.0.output.dense.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.0.output.dense.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.0.output.LayerNorm.weight: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.0.output.LayerNorm.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.1.attention.self.query.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.1.attention.self.query.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.1.attention.self.key.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.1.attention.self.key.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.1.attention.self.value.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.1.attention.self.value.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.1.attention.output.dense.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.1.attention.output.dense.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.1.attention.output.LayerNorm.weight: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.1.attention.output.LayerNorm.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.1.intermediate.dense.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.1.intermediate.dense.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.1.output.dense.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.1.output.dense.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.1.output.LayerNorm.weight: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.1.output.LayerNorm.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.2.attention.self.query.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.2.attention.self.query.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.2.attention.self.key.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.2.attention.self.key.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.2.attention.self.value.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.2.attention.self.value.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.2.attention.output.dense.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.2.attention.output.dense.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.2.attention.output.LayerNorm.weight: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.2.attention.output.LayerNorm.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.2.intermediate.dense.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.2.intermediate.dense.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.2.output.dense.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.2.output.dense.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.2.output.LayerNorm.weight: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.2.output.LayerNorm.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.3.attention.self.query.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.3.attention.self.query.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.3.attention.self.key.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.3.attention.self.key.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.3.attention.self.value.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.3.attention.self.value.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.3.attention.output.dense.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.3.attention.output.dense.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.3.attention.output.LayerNorm.weight: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.3.attention.output.LayerNorm.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.3.intermediate.dense.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.3.intermediate.dense.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.3.output.dense.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.3.output.dense.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.3.output.LayerNorm.weight: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.3.output.LayerNorm.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.4.attention.self.query.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.4.attention.self.query.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.4.attention.self.key.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.4.attention.self.key.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.4.attention.self.value.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.4.attention.self.value.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.4.attention.output.dense.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.4.attention.output.dense.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.4.attention.output.LayerNorm.weight: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.4.attention.output.LayerNorm.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.4.intermediate.dense.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.4.intermediate.dense.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.4.output.dense.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.4.output.dense.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.4.output.LayerNorm.weight: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.4.output.LayerNorm.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.5.attention.self.query.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.5.attention.self.query.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.5.attention.self.key.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.5.attention.self.key.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.5.attention.self.value.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.5.attention.self.value.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.5.attention.output.dense.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.5.attention.output.dense.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.5.attention.output.LayerNorm.weight: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.5.attention.output.LayerNorm.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.5.intermediate.dense.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.5.intermediate.dense.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.5.output.dense.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.5.output.dense.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.5.output.LayerNorm.weight: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.5.output.LayerNorm.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.6.attention.self.query.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.6.attention.self.query.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.6.attention.self.key.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.6.attention.self.key.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.6.attention.self.value.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.6.attention.self.value.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.6.attention.output.dense.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.6.attention.output.dense.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.6.attention.output.LayerNorm.weight: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.6.attention.output.LayerNorm.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.6.intermediate.dense.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.6.intermediate.dense.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.6.output.dense.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.6.output.dense.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.6.output.LayerNorm.weight: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.6.output.LayerNorm.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.7.attention.self.query.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.7.attention.self.query.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.7.attention.self.key.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.7.attention.self.key.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.7.attention.self.value.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.7.attention.self.value.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.7.attention.output.dense.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.7.attention.output.dense.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.7.attention.output.LayerNorm.weight: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.7.attention.output.LayerNorm.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.7.intermediate.dense.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.7.intermediate.dense.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.7.output.dense.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.7.output.dense.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.7.output.LayerNorm.weight: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.7.output.LayerNorm.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.8.attention.self.query.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.8.attention.self.query.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.8.attention.self.key.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.8.attention.self.key.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.8.attention.self.value.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.8.attention.self.value.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.8.attention.output.dense.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.8.attention.output.dense.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.8.attention.output.LayerNorm.weight: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.8.attention.output.LayerNorm.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.8.intermediate.dense.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.8.intermediate.dense.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.8.output.dense.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.8.output.dense.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.8.output.LayerNorm.weight: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.8.output.LayerNorm.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.9.attention.self.query.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.9.attention.self.query.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.9.attention.self.key.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.9.attention.self.key.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.9.attention.self.value.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.9.attention.self.value.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.9.attention.output.dense.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.9.attention.output.dense.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.9.attention.output.LayerNorm.weight: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.9.attention.output.LayerNorm.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.9.crossattention.self.query.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.9.crossattention.self.query.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.9.crossattention.self.key.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.9.crossattention.self.key.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.9.crossattention.self.value.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.9.crossattention.self.value.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.9.crossattention.output.dense.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.9.crossattention.output.dense.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.9.crossattention.output.LayerNorm.weight: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.9.crossattention.output.LayerNorm.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.9.intermediate.dense.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.9.intermediate.dense.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.9.output.dense.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.9.output.dense.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.9.output.LayerNorm.weight: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.9.output.LayerNorm.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.10.attention.self.query.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.10.attention.self.query.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.10.attention.self.key.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.10.attention.self.key.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.10.attention.self.value.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.10.attention.self.value.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.10.attention.output.dense.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.10.attention.output.dense.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.10.attention.output.LayerNorm.weight: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.10.attention.output.LayerNorm.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.10.crossattention.self.query.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.10.crossattention.self.query.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.10.crossattention.self.key.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.10.crossattention.self.key.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.10.crossattention.self.value.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.10.crossattention.self.value.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.10.crossattention.output.dense.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.10.crossattention.output.dense.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.10.crossattention.output.LayerNorm.weight: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.10.crossattention.output.LayerNorm.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.10.intermediate.dense.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.10.intermediate.dense.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.10.output.dense.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.10.output.dense.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.10.output.LayerNorm.weight: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.10.output.LayerNorm.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.11.attention.self.query.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.11.attention.self.query.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.11.attention.self.key.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.11.attention.self.key.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.11.attention.self.value.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.11.attention.self.value.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.11.attention.output.dense.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.11.attention.output.dense.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.11.attention.output.LayerNorm.weight: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.11.attention.output.LayerNorm.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.11.crossattention.self.query.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.11.crossattention.self.query.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.11.crossattention.self.key.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.11.crossattention.self.key.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.11.crossattention.self.value.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.11.crossattention.self.value.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.11.crossattention.output.dense.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.11.crossattention.output.dense.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.11.crossattention.output.LayerNorm.weight: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.11.crossattention.output.LayerNorm.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.11.intermediate.dense.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.11.intermediate.dense.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.11.output.dense.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.11.output.dense.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.11.output.LayerNorm.weight: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_encoder.encoder.layer.11.output.LayerNorm.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_proj.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param vision_proj.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_proj.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param text_proj.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param itm_head.weight: wd: 0.02, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: param itm_head.bias: wd: 0, lr: 2e-05
2023-09-07T19:37:34 | utils.optimizer: optimizer -- lr=2e-05 wd=0.02 len(p)=140
2023-09-07T19:37:34 | utils.optimizer: optimizer -- lr=2e-05 wd=0 len(p)=256
2023-09-07T19:37:36 | tasks.shared_utils: _IncompatibleKeys(missing_keys=[], unexpected_keys=['clip_teacher.class_embedding', 'clip_teacher.positional_embedding', 'clip_teacher.proj', 'clip_teacher.conv1.weight', 'clip_teacher.ln_pre.weight', 'clip_teacher.ln_pre.bias', 'clip_teacher.transformer.resblocks.{0..11}.{attn,ln_1,mlp,ln_2}.* (attention, LayerNorm, and MLP weights and biases for all 12 blocks), 'clip_teacher.ln_post.weight', 'clip_teacher.ln_post.bias', 'vision_encoder.clip_decoder.{0..5}.{head,norm}.{weight,bias}', 'text_encoder.cls.predictions.* (MLM head)'])
2023-09-07T19:37:36 | tasks.shared_utils: Loaded checkpoint from /mnt/dolphinfs/ssd_pool/docker/user/hadoop-seccv/gengyifei/gitcode/pretrain_models/umt_weights/b16_25m.pth
2023-09-07T19:37:36 | __main__: Start evaluation
2023-09-07T19:37:36 | tasks.retrieval_utils: Start evaluation for media_type=video
2023-09-07T19:37:36 | tasks.retrieval_utils: Computing dual encoder features...
WARNING 2023-09-07T19:37:46 | py.warnings: /home/hadoop-seccv/ssd/anaconda3/envs/vl_cp/lib/python3.9/site-packages/torch/utils/checkpoint.py:25: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")

2023-09-07T19:37:46 | utils.basic_utils: extracting image feats  [0/1]  eta: 0:00:09    time: 9.4264  data: 5.4320  max mem: 1309 res mem: 1396
2023-09-07T19:37:46 | utils.basic_utils: extracting image feats Total time: 0:00:09 (9.4336 s / it)
2023-09-07T19:37:46 | tasks.retrieval_utils: Finished feature extraction
2023-09-07T19:37:46 | tasks.retrieval_utils: Computing ITC scores [dot-product]
2023-09-07T19:37:46 | tasks.retrieval_utils: Computing ITC scores [dot-product], done!
2023-09-07T19:37:46 | tasks.retrieval_utils: Rerank dual-encoder results with cross-encoder...
2023-09-07T19:37:46 | tasks.retrieval_utils: i2t_scores.shape torch.Size([4, 4])
2023-09-07T19:37:46 | tasks.retrieval_utils: n_clip_per_video=1, with eval_frame_ensemble=concat
2023-09-07T19:37:46 | utils.basic_utils: Evaluation:  [0/4]  eta: 0:00:00    time: 0.0259  data: 0.0002  max mem: 1309 res mem: 1400
2023-09-07T19:37:46 | utils.basic_utils: Evaluation:  [3/4]  eta: 0:00:00    time: 0.0130  data: 0.0000  max mem: 1309 res mem: 1400
2023-09-07T19:37:46 | utils.basic_utils: Evaluation: Total time: 0:00:00 (0.0142 s / it)
2023-09-07T19:37:46 | tasks.retrieval_utils: t2i_scores.shape torch.Size([4, 4])
2023-09-07T19:37:46 | utils.basic_utils: Evaluation:  [0/4]  eta: 0:00:00    time: 0.0232  data: 0.0001  max mem: 1309 res mem: 1400
2023-09-07T19:37:46 | utils.basic_utils: Evaluation:  [3/4]  eta: 0:00:00    time: 0.0177  data: 0.0000  max mem: 1309 res mem: 1400
2023-09-07T19:37:46 | utils.basic_utils: Evaluation: Total time: 0:00:00 (0.0186 s / it)
2023-09-07T19:37:46 | tasks.retrieval_utils: Evaluation time 0:00:10
Andy1621 commented 10 months ago

The unexpected_keys are normal: the clip_teacher weights were saved alongside the model during pre-training. As for the poor results, the data is the most likely factor; without concrete samples it is hard to diagnose. I have uploaded a log of zero-shot MSR-VTT retrieval that you can compare against. The code had not been cleaned up at that time, so some function, path, and model names differ: https://drive.google.com/file/d/1bMv2uLu0kNsQgcelHKUu7g_JAAMcriBN/view?usp=sharing
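For reference, here is a minimal sketch of the kind of loading that produces that message (assuming `model` is the instantiated UMT retrieval model; the actual logic lives in `tasks/shared_utils.py` and may differ in detail):

```python
import torch

# Hypothetical sketch: "model" is assumed to be the instantiated UMT model.
# The full pre-training checkpoint also contains frozen clip_teacher weights
# that are only needed during pre-training, not for retrieval evaluation.
ckpt = torch.load("b16_25m.pth", map_location="cpu")
state_dict = ckpt.get("model", ckpt)  # some checkpoints nest under "model"

# strict=False tolerates extra keys; this is exactly what produces the
# unexpected_keys entries in the log above.
msg = model.load_state_dict(state_dict, strict=False)
print("missing:", msg.missing_keys)        # expected: []
print("unexpected:", msg.unexpected_keys)  # expected: clip_teacher.*, ...
```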

gyftsy commented 10 months ago

https://github.com/OpenGVLab/unmasked_teacher/assets/43568835/89cbbf2b-608f-43e0-8170-f7cc7b165117 I used this video with the caption "dance". I tested several different samples and found that the video-text similarity is consistently around 0.2-0.3, which seems odd, but I still haven't located the problem. I am using the b16 model with the base BERT. If possible, could you test the similarity for this video-caption pair, i.e. the result of `i2t_scores, t2i_scores = get_sim(model.vision_proj(_pooled_image_feats), model.text_proj(text_feats[:, 0]))` in the code? Thanks!
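For context, my understanding of what that call computes is roughly the following sketch (the function name, shapes, and toy inputs here are my assumptions, not the repo's exact implementation):

```python
import torch
import torch.nn.functional as F

def get_sim_sketch(video_embeds: torch.Tensor, text_embeds: torch.Tensor):
    """Cosine similarity between projected video and text embeddings.

    video_embeds: [num_videos, embed_dim], e.g. model.vision_proj(pooled_feats)
    text_embeds:  [num_texts, embed_dim],  e.g. model.text_proj(text_feats[:, 0])
    """
    v = F.normalize(video_embeds, dim=-1)
    t = F.normalize(text_embeds, dim=-1)
    i2t_scores = v @ t.T          # [num_videos, num_texts], each in [-1, 1]
    return i2t_scores, i2t_scores.T  # t2i is just the transpose

# Toy usage with random features, only to show the shapes from the log above:
i2t, t2i = get_sim_sketch(torch.randn(4, 512), torch.randn(4, 512))
print(i2t.shape, t2i.shape)  # torch.Size([4, 4]) torch.Size([4, 4])
```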

gyftsy commented 10 months ago

Alternatively, if convenient, you can add me on WeChat: 835781085. I have put together a video-text similarity demo that currently runs, though the results are not ideal, and I can share it with everyone if needed.

Andy1621 commented 10 months ago

> https://github.com/OpenGVLab/unmasked_teacher/assets/43568835/89cbbf2b-608f-43e0-8170-f7cc7b165117 I used this video with the caption "dance" ... could you test the similarity for this video-caption pair? Thanks!

A similarity of 0.2-0.3 is normal for this model. We used UMT to measure the similarity distribution over the InternVid dataset, and it largely falls in the same range.
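If you want to sanity-check your own data against that range, here is a hypothetical sketch (the numbers below are placeholders, not InternVid statistics):

```python
import torch

# sims: cosine similarities of matched video-text pairs, collected with a
# get_sim-style computation over your own data; placeholder values shown.
sims = torch.tensor([0.31, 0.27, 0.22, 0.29])
print(f"mean={sims.mean():.3f}  "
      f"p10={sims.quantile(0.10):.3f}  p90={sims.quantile(0.90):.3f}")
```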

gyftsy commented 10 months ago

So can I understand it like this: for this model, similar video-text pairs score roughly 0.3x and dissimilar pairs roughly 0.2x?

Andy1621 commented 10 months ago

You probably should not judge by the absolute value; look at the relative values instead, and treat the relatively higher ones as similar.
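In other words, compare candidates against each other rather than against a fixed threshold. A minimal sketch (reusing the config's `temp` of 0.07 as the softmax temperature is my assumption; the evaluation code may not apply it this way):

```python
import torch

# Similarities of one video against several candidate captions
# (placeholder values in the 0.2-0.3 range discussed above).
i2t_row = torch.tensor([0.31, 0.24, 0.22, 0.26])
probs = torch.softmax(i2t_row / 0.07, dim=-1)  # sharpen relative gaps
print("best caption:", probs.argmax().item())
print("relative scores:", [round(p, 3) for p in probs.tolist()])
```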