pritamqu opened 2 months ago
Hi! Could you please provide the full log?
Sure, please see the attachments: msrvtt.log, msvd.log. Thanks!
Hey, thanks for checking. But I think the difference in the number of steps (126 vs. 251) is just because I am using 4 GPUs and you used 8 GPUs. That should not cause an error in the calculation, right?
Hi! I remember that at test time the model runs on a single GPU, since the evaluation uses model_without_ddp and the test_loader does not have a distributed sampler. Thus the numbers should be the same whether you use a single GPU or multiple GPUs.
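To illustrate the point (a minimal standalone PyTorch sketch, not the repo's exact code): without a DistributedSampler, every rank's test loader iterates the full dataset, so the evaluation should not depend on the number of GPUs.

```python
# Minimal sketch (not the repo's exact code): without a DistributedSampler,
# every rank's test loader iterates the FULL dataset.
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.arange(1000))  # stand-in for the 1K test set

# A train loader would shard the data across ranks, e.g.:
#   sampler = torch.utils.data.distributed.DistributedSampler(dataset)
# The test loader has no sampler, so every rank sees all 1000 samples:
test_loader = DataLoader(dataset, batch_size=32, shuffle=False)
print(sum(batch[0].numel() for batch in test_loader))  # 1000 on any rank
```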
Hey - thanks for your response.
So you're extracting features in single-GPU mode, since there is no DDP sampler on the loader during the test, but then you're using multiple GPUs, if I am not wrong! Please see this block of code: https://github.com/OpenGVLab/unmasked_teacher/blob/4fb4049f5a87919882e68ccc427615ae7dab1c33/multi_modality/tasks/retrieval_utils.py#L133C1-L151C1
```python
# computes only part of the scores at each GPU, gather at the end
logger.info("Rerank dual-encoder results with cross-encoder...")
num_tasks = get_world_size()
rank = get_rank()
# only uses the part associated with the raw eval set
# compute image2text #
step = num_images // num_tasks + 1
start = rank * step
end = min(num_images, start + step)

text_encoder = model.get_text_encoder()
iterator = metric_logger.log_every(i2t_scores[start:end], 100, header)
logger.info(f"i2t_scores.shape {i2t_scores[start:end].shape}")

# generate score for each clip, and aggregate all clip scores for a video
n_clip_per_video = (
    image_feats.shape[1] if not config.deep_fusion else image_feats[0].shape[1]
)
```
The `iterator` size depends on the number of GPUs used; the total is 1K test samples for MSRVTT, which is why the size is 251 for me (4 GPUs) and 126 for you (8 GPUs, based on your bash script). Nevertheless, this should not be an issue, considering the rest of the calculation is okay.
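For concreteness, here is what that slicing produces for the 1K-video MSRVTT test set (standalone arithmetic only, not repo code), reproducing the 251- vs. 126-step iterator lengths from our logs:

```python
# Per-rank slice sizes of i2t_scores[start:end] for a 1K test set.
num_images = 1000
for num_tasks in (4, 8):
    step = num_images // num_tasks + 1
    sizes = [min(num_images, rank * step + step) - rank * step
             for rank in range(num_tasks)]
    print(f"{num_tasks} GPUs -> step={step}, per-rank rows={sizes}")
# 4 GPUs -> step=251, per-rank rows=[251, 251, 251, 247]
# 8 GPUs -> step=126, per-rank rows=[126, 126, 126, 126, 126, 126, 126, 118]
```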
I am also attaching the config file that is generated - could you please check whether any of the settings differ from yours? Alternatively, if you don't mind sharing your config file, I am wondering whether there is a minor difference in the setup. zs_umt_msrvtt.json
To rule out any error caused by the distributed job submission, I also ran on a single GPU; here is the result on MSRVTT, which is the same as before.
| | txt_r1 | txt_r5 | txt_r10 | txt_r_mean | img_r1 | img_r5 | img_r10 | img_r_mean | r_mean |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| test/ | 20.0 | 40.9 | 51.4 | 37.43 | 24.5 | 41.1 | 46.5 | 37.37 | 37.40 |
| test_emb/ | 6.7 | 23.9 | 36.0 | 22.20 | 2.6 | 11.6 | 17.2 | 10.47 | 16.33 |
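For reference, my understanding of how r@K columns like these are computed (a generic sketch with a hypothetical `recall_at_k` helper, not the repo's exact implementation), assuming a (num_queries, num_targets) score matrix with the ground-truth match on the diagonal:

```python
import numpy as np

def recall_at_k(scores: np.ndarray, k: int) -> float:
    """Percentage of queries whose ground-truth target ranks in the top k."""
    order = (-scores).argsort(axis=1)         # best-scoring target first
    gt = np.arange(scores.shape[0])[:, None]  # ground truth on the diagonal
    return 100.0 * (order[:, :k] == gt).any(axis=1).mean()

scores = np.random.rand(1000, 1000)           # e.g. a t2v similarity matrix
print([round(recall_at_k(scores, k), 1) for k in (1, 5, 10)])
```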
If both of our config files are the same, would you mind rerunning your script entirely on a single GPU, e.g., by just setting NUM_GPUS=1 here?
Thanks for your help, appreciate it.
Sorry for the late response; I have been busy with work. Here is my evaluation on a single GPU:
| | txt_r1 | txt_r5 | txt_r10 | txt_r_mean | img_r1 | img_r5 | img_r10 | img_r_mean | r_mean |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| test/ | 30.3 | 50.7 | 61.4 | 47.47 | 35.2 | 57.8 | 66.1 | 53.03 | 50.25 |
| test_emb/ | 26.2 | 47.6 | 56.4 | 43.40 | 27.7 | 51.4 | 61.3 | 46.80 | 45.10 |
Note that the evaluation runs on the unorganized repo, so some parameter names may be different. Here is the generated config.json:
```json
{
  "data_dir": "repo_path/anno",
  "data_root": "repo_path/anno/videos_images",
  "anno_root_pt": "repo_path/anno/anno_pretrain",
  "anno_root_downstream": "repo_path/anno/anno_downstream",
  "VisionEncoders": {
    "beit": {
      "name": "beit_base",
      "pretrained": "microsoft/beit-base-patch16-224-pt22k-ft22k",
      "d_model": 768
    },
    "beit_large": {
      "name": "beit_large",
      "pretrained": "microsoft/beit-large-patch16-224-pt22k-ft22k",
      "d_model": 1024
    }
  },
  "TextEncoders": {
    "bert": {
      "name": "bert_base",
      "pretrained": "bert-base-uncased",
      "config": "configs/config_bert.json",
      "d_model": 768,
      "fusion_layer": 9
    },
    "bert_large": {
      "name": "bert_large",
      "pretrained": "bert-large-uncased",
      "config": "configs/config_bert_large.json",
      "d_model": 1024,
      "fusion_layer": 19
    }
  },
  "train_file": [
    "repo_path/anno/anno_downstream/msrvtt_ret_train9k.json",
    "pvideo2:s3://msr-vtt/MSRVTT_Videos",
    "video"
  ],
  "test_file": {
    "test": [
      "repo_path/anno/anno_downstream/msrvtt_ret_test1k.json",
      "pvideo2:s3://msr-vtt/MSRVTT_Videos",
      "video"
    ]
  },
  "test_types": [
    "test"
  ],
  "num_workers": 6,
  "stop_key": "test/",
  "is_paragraph_retrieval": false,
  "num_frames": 4,
  "num_frames_test": 4,
  "batch_size": 32,
  "max_txt_l": 32,
  "inputs": {
    "image_res": 224,
    "video_input": {
      "num_frames": 4,
      "sample_type": "rand",
      "num_frames_test": 4,
      "sample_type_test": "middle",
      "random_aug": false
    },
    "max_txt_l": {
      "image": 32,
      "video": 32
    },
    "batch_size": {
      "image": 32,
      "video": 32
    },
    "batch_size_test": {
      "image": 32,
      "video": 32
    }
  },
  "text_enc": "bert",
  "model": {
    "model_cls": "VindLU_VIT",
    "vision_encoder": {
      "name": "vit_b16",
      "img_size": 224,
      "patch_size": 16,
      "d_model": 768,
      "encoder_embed_dim": 768,
      "encoder_depth": 12,
      "encoder_num_heads": 12,
      "drop_path_rate": 0.2,
      "num_frames": 4,
      "tubelet_size": 1,
      "use_checkpoint": true,
      "checkpoint_num": 12,
      "clip_decoder_embed_dim": 768,
      "clip_output_dim": 512,
      "clip_return_layer": 0,
      "clip_student_return_interval": 1,
      "pretrained": "repo_path/anno/pretained_model/clipmae_vit_b16_k710_e200.pth",
      "clip_teacher": "none",
      "clip_img_size": 224,
      "clip_return_interval": 1,
      "video_mask_type": "attention",
      "video_mask_ratio": 0.0,
      "video_double_mask_ratio": 0.0,
      "image_mask_type": "attention",
      "image_mask_ratio": 0.0,
      "image_double_mask_ratio": 0.0,
      "keep_temporal": true
    },
    "text_encoder": {
      "name": "bert_base",
      "pretrained": "bert-base-uncased",
      "config": "configs/config_bert.json",
      "d_model": 768,
      "fusion_layer": 9
    },
    "multimodal": {
      "enable": true
    },
    "embed_dim": 512,
    "temp": 0.07
  },
  "criterion": {
    "loss_weight": {
      "vtc": 1.0,
      "mlm": 0.0,
      "vtm": 1.0,
      "mvm": 0.0,
      "mac": 0.0
    },
    "vtm_hard_neg": true,
    "mlm_masking_prob": 0.5,
    "mac_norm_type": "l2",
    "mac_loss_type": "l2"
  },
  "optimizer": {
    "opt": "adamW",
    "lr": 2e-05,
    "opt_betas": [
      0.9,
      0.999
    ],
    "weight_decay": 0.02,
    "max_grad_norm": -1,
    "different_lr": {
      "enable": false,
      "module_names": [],
      "lr": 0.001
    }
  },
  "scheduler": {
    "sched": "cosine",
    "epochs": 7,
    "min_lr_multi": 0.01,
    "warmup_epochs": 1
  },
  "evaluate": true,
  "deep_fusion": false,
  "evaluation": {
    "eval_frame_ensemble": "concat",
    "eval_x_only": false,
    "k_test": 128,
    "eval_offload": false
  },
  "fp16": true,
  "gradient_checkpointing": true,
  "wandb": {
    "enable": false,
    "entity": "likunchang",
    "project": "vindlu_ret"
  },
  "dist_url": "env://",
  "device": "cuda",
  "mode": "pt",
  "output_dir": "exp/exp_zs/msrvtt_zs/debug",
  "resume": false,
  "debug": false,
  "log_freq": 100,
  "seed": 42,
  "zero_shot": true,
  "save_latest": true,
  "auto_resume": true,
  "pretrained_path": "repo_path/exp/exp_pretrain_vit/vit_k710pre_d512_w25m_debug/vit_k710pre_d512_w25m_im0.5/ckpt_best.pth",
  "rank": 0,
  "world_size": 1,
  "gpu": 0,
  "distributed": true,
  "dist_backend": "nccl"
}
```
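To spot any difference between our two setups quickly, one option is to flatten and diff the two generated configs. A minimal sketch, where `flatten` is a hypothetical helper and the file names are placeholders for the attached zs_umt_msrvtt.json and this config.json:

```python
import json

def flatten(d, prefix=""):
    """Yield dotted-key/value pairs from a nested dict (hypothetical helper)."""
    for k, v in d.items():
        if isinstance(v, dict):
            yield from flatten(v, f"{prefix}{k}.")
        else:
            yield f"{prefix}{k}", v

# Placeholder paths: substitute the two attached config files.
a = dict(flatten(json.load(open("zs_umt_msrvtt.json"))))
b = dict(flatten(json.load(open("config.json"))))
for key in sorted(a.keys() | b.keys()):
    if a.get(key) != b.get(key):
        print(f"{key}: {a.get(key)!r} vs {b.get(key)!r}")
```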
Hey - I am unable to reproduce the reported zero-shot results. So far I have tried MSRVTT and MSVD; I would appreciate it if you could kindly take a look. Here is what I got after running these two scripts. I kept the setup intact except for setting eval_offload=False (otherwise it was throwing an error; anyway, this should not be a problem):

- `multi_modality/exp/zero_shot/ret_msrvtt/l16_25m.sh` (MSRVTT)
- `multi_modality/exp/zero_shot/ret_msvd/l16_25m.sh` (MSVD)
Note: the txt_* columns are video-to-text (V2T) retrieval, and the img_* columns are text-to-video (T2V).
These results should be comparable to the reported results of l16_25m (highlighted in red), but there is a large discrepancy. Please let me know in case I misinterpreted something.