OpenGVLab / VideoMamba

[ECCV2024] VideoMamba: State Space Model for Efficient Video Understanding
https://arxiv.org/abs/2403.06977
Apache License 2.0

Bad zero-shot text-to-video retrieval Results on MSRVTT #21

Open LikeGiver opened 6 months ago

LikeGiver commented 6 months ago

These are the results I've got on MSRVTT, which are far worse than the paper's results:

[Screenshot: my MSRVTT zero-shot retrieval results, 2024-03-30]

There must be something wrong in my test process; here's how I got these results:

  1. I tried to run the text-to-video retrieval part; I used the following script to avoid using slurm:
    
    export MASTER_PORT=$((12000 + $RANDOM % 20000))
    export OMP_NUM_THREADS=1
    echo "PYTHONPATH: ${PYTHONPATH}"
    which_python=$(which python)
    echo "which python: ${which_python}"
    export PYTHONPATH=${PYTHONPATH}:${which_python}
    export PYTHONPATH=${PYTHONPATH}:.
    echo "PYTHONPATH: ${PYTHONPATH}"

    JOB_NAME='m16_5m'
    OUTPUT_DIR="$(dirname $0)/$JOB_NAME"
    LOG_DIR="$(dirname $0)/logs/${JOB_NAME}"
    NUM_GPUS=1
    NUM_CPU=1

    python tasks/retrieval.py \
        $(dirname $0)/config.py \
        output_dir ${OUTPUT_DIR} \
        evaluate True \
        zero_shot True \
        pretrained_path /home/ubuntu/data/user01/codes/VideoMamba/videomamba_m16_25M_f8_res224.pth


2. Then I set up config.py.
It's confusing that I need to use the params from **videomamba_m16_k400_mask_pt_f8_res224.pth** (set via 'pretrained' in the corresponding config.py). I had only downloaded **videomamba_m16_25M_f8_res224.pth** at first, but the error message told me to change the 'pretrained' value in 'videomamba/video_mm/exp_zs/msrvtt/config.py', whose default value is:
<img width="852" alt="Screenshot 2024-03-30 at 21 51 42" src="https://github.com/OpenGVLab/VideoMamba/assets/42694182/a41a1eaf-fd1d-4e73-8386-295f8e470387">
I've tried loading only the model params from videomamba_m16_25M_f8_res224.pth, but it does have the key 'pos_embed' (see the checkpoint-inspection sketch after this list):
<img width="872" alt="Screenshot 2024-03-30 at 21 55 25" src="https://github.com/OpenGVLab/VideoMamba/assets/42694182/eef659f2-743b-45ad-8034-a424435f6f14">
and there are some debug messages that worry me:
<img width="861" alt="Screenshot 2024-03-30 at 22 00 15" src="https://github.com/OpenGVLab/VideoMamba/assets/42694182/3dc5fc4b-43c8-47c3-b5cb-7f313b7d4145">
3. I got the dataset directly from [VindLU](https://github.com/klauscc/VindLU/?tab=readme-ov-file), and I forcibly patched a few bugs in the inference code to make it run (the bimamba type and CUDA driver version issues), but I don't think those are the cause.
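
For reference, here is a rough sketch of how I check which keys a checkpoint contains (it assumes the .pth file holds a plain state_dict, possibly wrapped under a 'model', 'module', or 'state_dict' key; the path is my local one):

```python
import torch
from collections import Counter

ckpt_path = "/home/ubuntu/data/user01/codes/VideoMamba/videomamba_m16_25M_f8_res224.pth"

# Load on CPU; the checkpoint may wrap the state_dict under a top-level key.
ckpt = torch.load(ckpt_path, map_location="cpu")
state_dict = ckpt
if isinstance(ckpt, dict):
    for wrapper in ("model", "module", "state_dict"):
        if wrapper in ckpt:
            state_dict = ckpt[wrapper]
            break

# Count keys by top-level prefix (e.g. 'vision_encoder', 'text_encoder').
print(Counter(k.split(".")[0] for k in state_dict))

# See where (if anywhere) a positional embedding lives in this checkpoint.
print([k for k in state_dict if "pos_embed" in k][:10])
```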

Forgive me if I'm making some basic errors; I haven't read the paper thoroughly yet, but I'm going to.
Andy1621 commented 6 months ago

Hi! I have tried loading only the multimodal pretraining checkpoint (setting pretrained=None), and it runs normally.

[Screenshots: evaluation logs showing normal retrieval results]
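
Concretely, the change lives in the vision_encoder section of the experiment config. A rough sketch of what that looks like (field names follow the config dump below; the exact layout of exp_zs/msrvtt/config.py may differ):

```python
# exp_zs/msrvtt/config.py (sketch, not the complete file)
model = dict(
    vision_encoder=dict(
        name="videomamba_middle",
        # Skip loading the unimodal K400 checkpoint into the vision encoder here;
        # the full multimodal weights come from pretrained_path
        # (videomamba_m16_25M_f8_res224.pth) passed to tasks/retrieval.py.
        pretrained=None,
        # ... remaining vision_encoder fields unchanged ...
    ),
    # ... text_encoder, multimodal, embed_dim, temp unchanged ...
)
```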
LikeGiver commented 6 months ago

> Hi! I have tried loading only the multimodal pretraining checkpoint (setting pretrained=None), and it runs normally.

Thank you for your timely response!

You mean setting model.vision_encoder.pretrained=None? It's kind of counterintuitive, and I still get poor results (maybe just the score of random guesses). Comparing with your debug messages, I think something is wrong in the model loading, especially the vision part: the unexpected_keys list in your debug log is really short and contains nothing like 'vision_encoder.XX', e.g. 'vision_encoder.layers.0.mixer.A_b_log'.

[Screenshot: my debug output, 2024-03-31]
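
To make the comparison concrete, below is a rough sketch of the kind of check I mean: load the 25M checkpoint with strict=False and look at how many vision_encoder.* keys end up missing or unexpected. The `model` variable here is an assumption (it stands for the already-built UMT_VIDEOMAMBA model from tasks/retrieval.py), not code taken from the repo:

```python
import torch

ckpt = torch.load(
    "/home/ubuntu/data/user01/codes/VideoMamba/videomamba_m16_25M_f8_res224.pth",
    map_location="cpu",
)
state_dict = ckpt.get("model", ckpt) if isinstance(ckpt, dict) else ckpt

# `model` is assumed to be the UMT_VIDEOMAMBA model built as in tasks/retrieval.py.
msg = model.load_state_dict(state_dict, strict=False)
vis_missing = [k for k in msg.missing_keys if k.startswith("vision_encoder.")]
vis_unexpected = [k for k in msg.unexpected_keys if k.startswith("vision_encoder.")]
print(f"missing vision_encoder keys: {len(vis_missing)}")
print(f"unexpected vision_encoder keys: {len(vis_unexpected)}")
print(vis_unexpected[:5])
```

If nearly all vision_encoder keys load cleanly, the vision branch should be fine; if most of them show up as missing or unexpected, the checkpoint prefixes don't match the model and the scores would be close to random guessing.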

Here is my config debug message (by the way, I only downloaded bert-base-uncased):

2024-03-31T01:48:34 | utils.config_utils: config: {
  data_dir: /home/ubuntu/data/user01/codes/VideoMamba/vindlu-data
  data_root: /home/ubuntu/data/user01/codes/VideoMamba/vindlu-data/videos_images
  anno_root_pt: /home/ubuntu/data/user01/codes/VideoMamba/vindlu-data/anno_pretrain
  anno_root_downstream: /home/ubuntu/data/user01/codes/VideoMamba/vindlu-data/anno_downstream
  TextEncoders: {
      bert: {
          name: bert_base
          pretrained: bert-base-uncased
          config: configs/config_bert.json
          d_model: 768
          fusion_layer: 9 }
      bert_large: {
          name: bert_large
          pretrained: bert-large-uncased
          config: configs/config_bert_large.json
          d_model: 1024
          fusion_layer: 19 } }
  train_file: ['/home/ubuntu/data/user01/codes/VideoMamba/vindlu-data/anno_downstream/msrvtt_ret_train9k.json', '/home/ubuntu/data/user01/codes/VideoMamba/MSRVTT/videos/all', 'video']
  test_file: {
      test: ['/home/ubuntu/data/user01/codes/VideoMamba/vindlu-data/anno_downstream/msrvtt_ret_test1k.json', '/home/ubuntu/data/user01/codes/VideoMamba/MSRVTT/videos/all', 'video'] }
  test_types: ['test']
  num_workers: 6
  stop_key: test/
  is_paragraph_retrieval: False
  num_frames: 8
  num_frames_test: 8
  batch_size: 64
  max_txt_l: 32
  inputs: {
      image_res: 224
      video_input: {
          num_frames: 8
          sample_type: rand
          num_frames_test: 8
          sample_type_test: middle
          random_aug: False }
      max_txt_l: {
          image: 32
          video: 32 }
      batch_size: {
          image: 64
          video: 64 }
      batch_size_test: {
          image: 64
          video: 64 } }
  text_enc: bert
  model: {
      model_cls: UMT_VIDEOMAMBA
      vision_encoder: {
          name: videomamba_middle
          img_size: 224
          patch_size: 16
          depth: 32
          embed_dim: 576
          drop_path_rate: 0.25
          ssm_cfg: None
          norm_epsilon: 1e-05
          fused_add_norm: True
          rms_norm: True
          residual_in_fp32: True
          bimamba_type: v2
          pool_type: cls+avg
          kernel_size: 1
          num_frames: 8
          ckpt_num_frame: 8
          use_checkpoint: False
          checkpoint_num: 0
          clip_decoder_embed_dim: 576
          clip_output_dim: 512
          clip_norm_type: l2
          clip_return_layer: 1
          clip_student_return_interval: 1
          pretrained: None                           # <------------- I've set it to None
          clip_teacher: none
          clip_img_size: 224
          clip_return_interval: 1
          video_mask_type: none
          video_mask_ratio: 0.0
          video_double_mask_ratio: 0.0
          image_mask_type: none
          image_mask_ratio: 0.0
          image_double_mask_ratio: 0.0
          keep_temporal: True }
      text_encoder: {
          name: bert_base
          pretrained: bert-base-uncased
          config: configs/config_bert.json
          d_model: 768
          fusion_layer: 9 }
      multimodal: {
          enable: True }
      embed_dim: 512
      temp: 0.07 }
  criterion: {
      loss_weight: {
          vtc: 1.0
          mlm: 1.0
          vtm: 1.0
          uta: 0.0 }
      vtm_hard_neg: True
      mlm_masking_prob: 0.5
      uta_norm_type: l2
      uta_loss_type: l2 }
  optimizer: {
      opt: adamW
      lr: 1e-05
      opt_betas: [0.9, 0.999]
      weight_decay: 0.02
      max_grad_norm: -1
      different_lr: {
          enable: False
          module_names: []
          lr: 0.004 } }
  scheduler: {
      sched: cosine
      epochs: 2
      min_lr_multi: 0.01
      warmup_epochs: 0.2 }
  evaluate: True
  deep_fusion: False
  evaluation: {
      eval_frame_ensemble: concat
      eval_x_only: False
      k_test: 128
      eval_offload: False }
  fp16: True
  bf16: True
  gradient_checkpointing: True
  wandb: {
      enable: False
      entity: likunchang
      project: umt_videomamba }
  dist_url: env://
  device: cuda
  mode: pt
  output_dir: ./exp_zs/msrvtt/m16_5m
  resume: False
  debug: True
  log_freq: 1
  seed: 42
  zero_shot: True
  save_latest: True
  auto_resume: True
  pretrained_path: /home/ubuntu/data/user01/codes/VideoMamba/videomamba_m16_25M_f8_res224.pth
  distributed: False }
Andy1621 commented 6 months ago

I'm not sure whether you have changed the code; I simply used the files I uploaded to GitHub. The bimamba_type bug has been fixed, and it does not affect the logic.

2024-03-30T23:48:51 | INFO | utils.config_utils : config: {
  data_dir: your_data_path
  data_root: your_data_path/videos_images
  anno_root_pt: your_data_path/anno_pretrain
  anno_root_downstream: your_data_path/anno_downstream
  TextEncoders: {
      bert: {
          name: bert_base
          pretrained: bert-base-uncased
          config: configs/config_bert.json
          d_model: 768
          fusion_layer: 9 }
      bert_large: {
          name: bert_large
          pretrained: bert-large-uncased
          config: configs/config_bert_large.json
          d_model: 1024
          fusion_layer: 19 } }
  train_file: ['your_data_path/anno_downstream/msrvtt_ret_train9k.json', 'p2:s3://MSR-VTT/MSRVTT_Videos', 'video']
  test_file: {
      test: ['your_data_path/anno_downstream/msrvtt_ret_test1k.json', 'p2:s3://MSR-VTT/MSRVTT_Videos', 'video'] }
  test_types: ['test']
  num_workers: 6
  stop_key: test/
  is_paragraph_retrieval: False
  num_frames: 8
  num_frames_test: 8
  batch_size: 64
  max_txt_l: 32
  inputs: {
      image_res: 224
      video_input: {
          num_frames: 8
          sample_type: rand
          num_frames_test: 8
          sample_type_test: middle
          random_aug: False }
      max_txt_l: {
          image: 32
          video: 32 }
      batch_size: {
          image: 64
          video: 64 }
      batch_size_test: {
          image: 64
          video: 64 } }
  text_enc: bert
  model: {
      model_cls: UMT_VIDEOMAMBA
      vision_encoder: {
          name: videomamba_middle
          img_size: 224
          patch_size: 16
          depth: 32
          embed_dim: 576
          drop_path_rate: 0.25
          ssm_cfg: None
          norm_epsilon: 1e-05
          fused_add_norm: True
          rms_norm: True
          residual_in_fp32: True
          bimamba: True
          pool_type: cls+avg
          kernel_size: 1
          num_frames: 8
          ckpt_num_frame: 8
          use_checkpoint: False
          checkpoint_num: 0
          clip_decoder_embed_dim: 576
          clip_output_dim: 512
          clip_norm_type: l2
          clip_return_layer: 1
          clip_student_return_interval: 1
          pretrained: your_model_path/videomamba_m16_k400_mask_pt_f8_res224.pth
          clip_teacher: none
          clip_img_size: 224
          clip_return_interval: 1
          video_mask_type: none
          video_mask_ratio: 0.0
          video_double_mask_ratio: 0.0
          image_mask_type: none
          image_mask_ratio: 0.0
          image_double_mask_ratio: 0.0
          keep_temporal: True }
      text_encoder: {
          name: bert_base
          pretrained: bert-base-uncased
          config: configs/config_bert.json
          d_model: 768
          fusion_layer: 9 }
      multimodal: {
          enable: True }
      embed_dim: 512
      temp: 0.07 }
  criterion: {
      loss_weight: {
          vtc: 1.0
          mlm: 1.0
          vtm: 1.0
          uta: 0.0 }
      vtm_hard_neg: True
      mlm_masking_prob: 0.5
      uta_norm_type: l2
      uta_loss_type: l2 }
  optimizer: {
      opt: adamW
      lr: 1e-05
      opt_betas: [0.9, 0.999]
      weight_decay: 0.02
      max_grad_norm: -1
      different_lr: {
          enable: False
          module_names: []
          lr: 0.004 } }
  scheduler: {
      sched: cosine
      epochs: 2
      min_lr_multi: 0.01
      warmup_epochs: 0.2 }
  evaluate: True
  deep_fusion: False
  evaluation: {
      eval_frame_ensemble: concat
      eval_x_only: False
      k_test: 128
      eval_offload: False }
  fp16: True
  bf16: True
  gradient_checkpointing: True
  wandb: {
      enable: False
      entity: likunchang
      project: umt_videomamba }
  dist_url: env://
  device: cuda
  mode: pt
  output_dir: exp_zs/debug/m16_5m
  resume: False
  debug: False
  log_freq: 1
  seed: 42
  zero_shot: True
  save_latest: True
  auto_resume: True
  pretrained_path: your_model_path/videomamba_m16_25M_f8_res224.pth
  rank: 0
  world_size: 1
  gpu: 0
  distributed: True
  dist_backend: nccl }
NyleSiddiqui commented 5 months ago

Hello,

Thank you for your fast responses and for answering my previous questions. I just wanted to quickly ask whether you are able to replicate the reported ActivityNet results with this repo's code. Thanks to your help above, I was able to replicate the MSRVTT results, so I am familiar with the repo, loading the correct weights, and running evaluation. However, when I load the exact same weights and run the ActivityNet zero-shot code, I get these bad results. I know I am not providing a full log, but could you similarly re-run your ActivityNet eval code and confirm that nothing is wrong there? I even did a clean re-pull of this repo, made only path changes, and still got the same results. I am using the 25M multi-modal weights. Thanks!

[Screenshot: ActivityNet zero-shot retrieval results]

Andy1621 commented 5 months ago

@NyleSiddiqui Please check the log here. It runs normally in my environment.

NyleSiddiqui commented 5 months ago

> @NyleSiddiqui Please check the log here. It runs normally in my environment.

Thank you for checking for me! It must be something on my end; I will use the log to debug.

NyleSiddiqui commented 5 months ago

> @NyleSiddiqui Please check the log here. It runs normally in my environment.

Was this log produced with the code in this repo, or with your own local environment? My concern is that there may be a bug in the repo code that is not in your local environment, especially since I can replicate your MSRVTT results with the same code I am using for ANet, and there are very few changes (basically only the config and data paths) when switching from MSRVTT to ANet.