OpenGVLab / unmasked_teacher

[ICCV2023 Oral] Unmasked Teacher: Towards Training-Efficient Video Foundation Models
https://arxiv.org/abs/2303.16058
MIT License

unable to reproduce zero-shot results #49

Open pritamqu opened 2 months ago

pritamqu commented 2 months ago

Hey - I am unable to reproduce the reported zero-shot results. So far I have tried MSRVTT and MSVD; I would appreciate it if you could kindly take a look.

Here is what I got after running these two scripts. I kept the setup intact, except that I set eval_offload=False (otherwise it was throwing an error); that should not be a problem.

multi_modality/exp/zero_shot/ret_msrvtt/l16_25m.sh MSRVTT

           txt_r1  txt_r5  txt_r10  txt_r_mean  img_r1  img_r5  img_r10  img_r_mean  r_mean
test/        20.0    40.9     51.4       37.43    24.5    41.1     46.5       37.37   37.40
test_emb/     6.7    23.9     36.0       22.20     2.6    11.6     17.2       10.47   16.33

multi_modality/exp/zero_shot/ret_msvd/l16_25m.sh MSVD

           txt_r1  txt_r5  txt_r10  txt_r_mean  img_r1  img_r5  img_r10  img_r_mean  r_mean
test/       50.15   74.33    83.58       69.35   36.56   57.49    63.18       52.41   60.88
test_emb/   17.31   47.61    63.43       42.79    4.82   14.35    20.70       13.29   28.04

The txt_* columns are V2T, the img_* columns are T2V.

MSRVTT   R@1     R@5     R@10
T2V      24.5    41.1    46.5
V2T      20.0    40.9    51.4

MSVD     R@1     R@5     R@10
T2V      36.56   57.49   63.18
V2T      50.15   74.33   83.58

These results should be comparable to the reported l16_25m results (highlighted in red in the attached screenshot), but there is a large discrepancy. Please let me know in case I have misinterpreted something.

[screenshot: reported zero-shot retrieval results for l16_25m]

Andy1621 commented 2 months ago

Hi! Could you please provide the full log?

pritamqu commented 2 months ago

Sure, please see the attachments: msrvtt.log, msvd.log. Thanks!

Andy1621 commented 2 months ago

Hi! I have checked my log and found that the video numbers are different. I'm not sure whether there is a bug in the code or in the annotation.

[screenshots: log excerpts showing the differing video numbers]

My annotation can be found here: MSRVTT, MSVD.

pritamqu commented 2 months ago

Hey, thanks for checking. I think the difference in the number of steps (126 vs. 251) is simply because I am using 4 GPUs while you used 8. That should not cause an error in the calculation, though, right?

Andy1621 commented 2 months ago

Hi! I remember that, during testing, the model runs on a single GPU, since the evaluation uses model_without_ddp and the test_loader does not have a distributed sampler.

Thus the numbers should be the same whether you use a single GPU or multiple GPUs.
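
A minimal illustration of that behaviour (a stand-in dataset, not the repo's loader):

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    # Stand-in test set of 1000 dummy samples. With no DistributedSampler
    # attached, every rank iterates the full dataset, so the number of
    # feature-extraction steps does not depend on the number of GPUs.
    dataset = TensorDataset(torch.zeros(1000, 3))
    loader = DataLoader(dataset, batch_size=32, shuffle=False)  # no sampler
    print(len(loader))  # 32 batches on every rank, regardless of world size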

pritamqu commented 1 month ago

Hey - thanks for your response.

So you are extracting features in single-GPU mode, since the loader has no DDP sampler during the test, but the reranking then uses multiple GPUs, if I am not mistaken. Please see this block of code: https://github.com/OpenGVLab/unmasked_teacher/blob/4fb4049f5a87919882e68ccc427615ae7dab1c33/multi_modality/tasks/retrieval_utils.py#L133C1-L151C1

    # computes only part of the scores at each GPU, gather at the end
    logger.info("Rerank dual-encoder results with cross-encoder...")
    num_tasks = get_world_size()
    rank = get_rank()
    # only uses the part associated with the raw eval set
    # compute image2text #
    step = num_images // num_tasks + 1
    start = rank * step
    end = min(num_images, start + step)

    text_encoder = model.get_text_encoder()
    iterator = metric_logger.log_every(i2t_scores[start:end], 100, header)
    logger.info(f"i2t_scores.shape {i2t_scores[start:end].shape}")

    # generate score for each clip, and aggregate all clip scores for a video
    n_clip_per_video = (
        image_feats.shape[1] if not config.deep_fusion else image_feats[0].shape[1]
    )

The iterator size depends on the number of GPUs used. There are 1K test samples for MSRVTT, which is why the size is 251 for me (4 GPUs) and 126 for you (8 GPUs, based on your bash script). Nevertheless, this should not be an issue as long as the rest of the calculation is correct.
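
For reference, here is a quick check of the sharding arithmetic from the block above (assuming num_images = 1000 for the MSRVTT 1K test set):

    # Reproduces the step/start/end computation from retrieval_utils.py.
    num_images = 1000
    for num_tasks in (4, 8):
        step = num_images // num_tasks + 1
        shards = [(rank * step, min(num_images, rank * step + step))
                  for rank in range(num_tasks)]
        print(num_tasks, step, shards[0])
    # 4 GPUs -> step = 251 (the 251-step iterator I see)
    # 8 GPUs -> step = 126 (the 126-step iterator in your log)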

I am also attaching the config file that was generated. Could you please check whether any of the settings differ from yours? Alternatively, if you don't mind sharing your config file, I wonder whether there is a minor difference in the setup: zs_umt_msrvtt.json
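
In case it helps with the comparison, here is a small sketch for diffing two generated configs (the file names are the ones attached in this thread; the flattening is just to handle nested keys):

    import json

    def flatten(d, prefix=""):
        # Flatten nested dicts into dotted keys for easy comparison.
        out = {}
        for k, v in d.items():
            key = f"{prefix}.{k}" if prefix else k
            if isinstance(v, dict):
                out.update(flatten(v, key))
            else:
                out[key] = v
        return out

    mine = flatten(json.load(open("zs_umt_msrvtt.json")))
    yours = flatten(json.load(open("config.json")))
    for key in sorted(set(mine) | set(yours)):
        if mine.get(key) != yours.get(key):
            print(f"{key}: {mine.get(key)!r} vs {yours.get(key)!r}")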

To rule out any error caused by the distributed job submission, I also ran on a single GPU. Here is the result on MSRVTT, which is the same as before:

           txt_r1  txt_r5  txt_r10  txt_r_mean  img_r1  img_r5  img_r10  img_r_mean  r_mean
test/        20.0    40.9     51.4       37.43    24.5    41.1     46.5       37.37   37.40
test_emb/     6.7    23.9     36.0       22.20     2.6    11.6     17.2       10.47   16.33

If both of our config files are the same, would you mind rerunning your script entirely on a single GPU, e.g., by just setting NUM_GPUS=1 here?

Thanks for your help, appreciate it.

Andy1621 commented 1 month ago

Sorry for the late response; work has been busy. Here is my evaluation on a single GPU:

           txt_r1  txt_r5  txt_r10  txt_r_mean  img_r1  img_r5  img_r10  img_r_mean  r_mean
test/        30.3    50.7     61.4       47.47    35.2    57.8     66.1       53.03   50.25
test_emb/    26.2    47.6     56.4       43.40    27.7    51.4     61.3       46.80   45.10

train.log

Andy1621 commented 1 month ago

Note that the evaluation runs on the unorganized (pre-release) repo, so some parameter names may differ. Here is the generated config.json:

{
  "data_dir": "repo_path/anno",
  "data_root": "repo_path/anno/videos_images",
  "anno_root_pt": "repo_path/anno/anno_pretrain",
  "anno_root_downstream": "repo_path/anno/anno_downstream",
  "VisionEncoders": {
    "beit": {
      "name": "beit_base",
      "pretrained": "microsoft/beit-base-patch16-224-pt22k-ft22k",
      "d_model": 768
    },
    "beit_large": {
      "name": "beit_large",
      "pretrained": "microsoft/beit-large-patch16-224-pt22k-ft22k",
      "d_model": 1024
    }
  },
  "TextEncoders": {
    "bert": {
      "name": "bert_base",
      "pretrained": "bert-base-uncased",
      "config": "configs/config_bert.json",
      "d_model": 768,
      "fusion_layer": 9
    },
    "bert_large": {
      "name": "bert_large",
      "pretrained": "bert-large-uncased",
      "config": "configs/config_bert_large.json",
      "d_model": 1024,
      "fusion_layer": 19
    }
  },
  "train_file": [
    "repo_path/anno/anno_downstream/msrvtt_ret_train9k.json",
    "pvideo2:s3://msr-vtt/MSRVTT_Videos",
    "video"
  ],
  "test_file": {
    "test": [
      "repo_path/anno/anno_downstream/msrvtt_ret_test1k.json",
      "pvideo2:s3://msr-vtt/MSRVTT_Videos",
      "video"
    ]
  },
  "test_types": [
    "test"
  ],
  "num_workers": 6,
  "stop_key": "test/",
  "is_paragraph_retrieval": false,
  "num_frames": 4,
  "num_frames_test": 4,
  "batch_size": 32,
  "max_txt_l": 32,
  "inputs": {
    "image_res": 224,
    "video_input": {
      "num_frames": 4,
      "sample_type": "rand",
      "num_frames_test": 4,
      "sample_type_test": "middle",
      "random_aug": false
    },
    "max_txt_l": {
      "image": 32,
      "video": 32
    },
    "batch_size": {
      "image": 32,
      "video": 32
    },
    "batch_size_test": {
      "image": 32,
      "video": 32
    }
  },
  "text_enc": "bert",
  "model": {
    "model_cls": "VindLU_VIT",
    "vision_encoder": {
      "name": "vit_b16",
      "img_size": 224,
      "patch_size": 16,
      "d_model": 768,
      "encoder_embed_dim": 768,
      "encoder_depth": 12,
      "encoder_num_heads": 12,
      "drop_path_rate": 0.2,
      "num_frames": 4,
      "tubelet_size": 1,
      "use_checkpoint": true,
      "checkpoint_num": 12,
      "clip_decoder_embed_dim": 768,
      "clip_output_dim": 512,
      "clip_return_layer": 0,
      "clip_student_return_interval": 1,
      "pretrained": "repo_path/anno/pretained_model/clipmae_vit_b16_k710_e200.pth",
      "clip_teacher": "none",
      "clip_img_size": 224,
      "clip_return_interval": 1,
      "video_mask_type": "attention",
      "video_mask_ratio": 0.0,
      "video_double_mask_ratio": 0.0,
      "image_mask_type": "attention",
      "image_mask_ratio": 0.0,
      "image_double_mask_ratio": 0.0,
      "keep_temporal": true
    },
    "text_encoder": {
      "name": "bert_base",
      "pretrained": "bert-base-uncased",
      "config": "configs/config_bert.json",
      "d_model": 768,
      "fusion_layer": 9
    },
    "multimodal": {
      "enable": true
    },
    "embed_dim": 512,
    "temp": 0.07
  },
  "criterion": {
    "loss_weight": {
      "vtc": 1.0,
      "mlm": 0.0,
      "vtm": 1.0,
      "mvm": 0.0,
      "mac": 0.0
    },
    "vtm_hard_neg": true,
    "mlm_masking_prob": 0.5,
    "mac_norm_type": "l2",
    "mac_loss_type": "l2"
  },
  "optimizer": {
    "opt": "adamW",
    "lr": 2e-05,
    "opt_betas": [
      0.9,
      0.999
    ],
    "weight_decay": 0.02,
    "max_grad_norm": -1,
    "different_lr": {
      "enable": false,
      "module_names": [],
      "lr": 0.001
    }
  },
  "scheduler": {
    "sched": "cosine",
    "epochs": 7,
    "min_lr_multi": 0.01,
    "warmup_epochs": 1
  },
  "evaluate": true,
  "deep_fusion": false,
  "evaluation": {
    "eval_frame_ensemble": "concat",
    "eval_x_only": false,
    "k_test": 128,
    "eval_offload": false
  },
  "fp16": true,
  "gradient_checkpointing": true,
  "wandb": {
    "enable": false,
    "entity": "likunchang",
    "project": "vindlu_ret"
  },
  "dist_url": "env://",
  "device": "cuda",
  "mode": "pt",
  "output_dir": "exp/exp_zs/msrvtt_zs/debug",
  "resume": false,
  "debug": false,
  "log_freq": 100,
  "seed": 42,
  "zero_shot": true,
  "save_latest": true,
  "auto_resume": true,
  "pretrained_path": "repo_path/exp/exp_pretrain_vit/vit_k710pre_d512_w25m_debug/vit_k710pre_d512_w25m_im0.5/ckpt_best.pth",
  "rank": 0,
  "world_size": 1,
  "gpu": 0,
  "distributed": true,
  "dist_backend": "nccl"
}