LikeGiver opened 6 months ago
Hi! I have tried loading only the multimodal pretraining checkpoint (setting pretrained=None), and it runs normally.
Thank you for your timely response!
You mean setting model.vision_encoder.pretrained=None? That's kind of counterintuitive, and I still get poor results (maybe just the score of random guesses). Comparing with your debug messages, I think something is wrong in the model loading, especially the vision part: the unexpected_keys list in your debug log is very short and contains no 'vision_encoder.XX' entries, e.g. 'vision_encoder.layers.0.mixer.A_b_log'.
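As a sanity check on the loading step, here is a minimal sketch of mine (assuming model is the UMT_VIDEOMAMBA module the repo builds, and that the checkpoint may nest its weights under a 'model' key) that prints exactly which vision_encoder keys fail to match:

import torch

# Sketch: inspect how much of the vision tower actually loads.
# `model` is assumed to be the UMT_VIDEOMAMBA module built by the repo;
# ckpt.get("model", ckpt) handles checkpoints nesting weights under 'model'.
ckpt = torch.load("videomamba_m16_25M_f8_res224.pth", map_location="cpu")
state_dict = ckpt.get("model", ckpt)

result = model.load_state_dict(state_dict, strict=False)
missing = [k for k in result.missing_keys if k.startswith("vision_encoder")]
unexpected = [k for k in result.unexpected_keys if k.startswith("vision_encoder")]
print(f"missing vision keys: {len(missing)}, e.g. {missing[:3]}")
print(f"unexpected vision keys: {len(unexpected)}, e.g. {unexpected[:3]}")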
Here is my config debug message (by the way, I have only downloaded bert-base-uncased):
2024-03-31T01:48:34 | utils.config_utils: config: {
data_dir: /home/ubuntu/data/user01/codes/VideoMamba/vindlu-data
data_root: /home/ubuntu/data/user01/codes/VideoMamba/vindlu-data/videos_images
anno_root_pt: /home/ubuntu/data/user01/codes/VideoMamba/vindlu-data/anno_pretrain
anno_root_downstream: /home/ubuntu/data/user01/codes/VideoMamba/vindlu-data/anno_downstream
TextEncoders: {
bert: {
name: bert_base
pretrained: bert-base-uncased
config: configs/config_bert.json
d_model: 768
fusion_layer: 9 }
bert_large: {
name: bert_large
pretrained: bert-large-uncased
config: configs/config_bert_large.json
d_model: 1024
fusion_layer: 19 } }
train_file: ['/home/ubuntu/data/user01/codes/VideoMamba/vindlu-data/anno_downstream/msrvtt_ret_train9k.json', '/home/ubuntu/data/user01/codes/VideoMamba/MSRVTT/videos/all', 'video']
test_file: {
test: ['/home/ubuntu/data/user01/codes/VideoMamba/vindlu-data/anno_downstream/msrvtt_ret_test1k.json', '/home/ubuntu/data/user01/codes/VideoMamba/MSRVTT/videos/all', 'video'] }
test_types: ['test']
num_workers: 6
stop_key: test/
is_paragraph_retrieval: False
num_frames: 8
num_frames_test: 8
batch_size: 64
max_txt_l: 32
inputs: {
image_res: 224
video_input: {
num_frames: 8
sample_type: rand
num_frames_test: 8
sample_type_test: middle
random_aug: False }
max_txt_l: {
image: 32
video: 32 }
batch_size: {
image: 64
video: 64 }
batch_size_test: {
image: 64
video: 64 } }
text_enc: bert
model: {
model_cls: UMT_VIDEOMAMBA
vision_encoder: {
name: videomamba_middle
img_size: 224
patch_size: 16
depth: 32
embed_dim: 576
drop_path_rate: 0.25
ssm_cfg: None
norm_epsilon: 1e-05
fused_add_norm: True
rms_norm: True
residual_in_fp32: True
bimamba_type: v2
pool_type: cls+avg
kernel_size: 1
num_frames: 8
ckpt_num_frame: 8
use_checkpoint: False
checkpoint_num: 0
clip_decoder_embed_dim: 576
clip_output_dim: 512
clip_norm_type: l2
clip_return_layer: 1
clip_student_return_interval: 1
pretrained: None # <------------- I've set it to None
clip_teacher: none
clip_img_size: 224
clip_return_interval: 1
video_mask_type: none
video_mask_ratio: 0.0
video_double_mask_ratio: 0.0
image_mask_type: none
image_mask_ratio: 0.0
image_double_mask_ratio: 0.0
keep_temporal: True }
text_encoder: {
name: bert_base
pretrained: bert-base-uncased
config: configs/config_bert.json
d_model: 768
fusion_layer: 9 }
multimodal: {
enable: True }
embed_dim: 512
temp: 0.07 }
criterion: {
loss_weight: {
vtc: 1.0
mlm: 1.0
vtm: 1.0
uta: 0.0 }
vtm_hard_neg: True
mlm_masking_prob: 0.5
uta_norm_type: l2
uta_loss_type: l2 }
optimizer: {
opt: adamW
lr: 1e-05
opt_betas: [0.9, 0.999]
weight_decay: 0.02
max_grad_norm: -1
different_lr: {
enable: False
module_names: []
lr: 0.004 } }
scheduler: {
sched: cosine
epochs: 2
min_lr_multi: 0.01
warmup_epochs: 0.2 }
evaluate: True
deep_fusion: False
evaluation: {
eval_frame_ensemble: concat
eval_x_only: False
k_test: 128
eval_offload: False }
fp16: True
bf16: True
gradient_checkpointing: True
wandb: {
enable: False
entity: likunchang
project: umt_videomamba }
dist_url: env://
device: cuda
mode: pt
output_dir: ./exp_zs/msrvtt/m16_5m
resume: False
debug: True
log_freq: 1
seed: 42
zero_shot: True
save_latest: True
auto_resume: True
pretrained_path: /home/ubuntu/data/user01/codes/VideoMamba/videomamba_m16_25M_f8_res224.pth
distributed: False }
I'm not sure whether you have changed the code; I simply use the files I uploaded to GitHub. The bimamba_type bug has been fixed, and it does not affect the logic.
2024-03-30T23:48:51 | INFO | utils.config_utils : config: {
data_dir: your_data_path
data_root: your_data_path/videos_images
anno_root_pt: your_data_path/anno_pretrain
anno_root_downstream: your_data_path/anno_downstream
TextEncoders: {
bert: {
name: bert_base
pretrained: bert-base-uncased
config: configs/config_bert.json
d_model: 768
fusion_layer: 9 }
bert_large: {
name: bert_large
pretrained: bert-large-uncased
config: configs/config_bert_large.json
d_model: 1024
fusion_layer: 19 } }
train_file: ['your_data_path/anno_downstream/msrvtt_ret_train9k.json', 'p2:s3://MSR-VTT/MSRVTT_Videos', 'video']
test_file: {
test: ['your_data_path/anno_downstream/msrvtt_ret_test1k.json', 'p2:s3://MSR-VTT/MSRVTT_Videos', 'video'] }
test_types: ['test']
num_workers: 6
stop_key: test/
is_paragraph_retrieval: False
num_frames: 8
num_frames_test: 8
batch_size: 64
max_txt_l: 32
inputs: {
image_res: 224
video_input: {
num_frames: 8
sample_type: rand
num_frames_test: 8
sample_type_test: middle
random_aug: False }
max_txt_l: {
image: 32
video: 32 }
batch_size: {
image: 64
video: 64 }
batch_size_test: {
image: 64
video: 64 } }
text_enc: bert
model: {
model_cls: UMT_VIDEOMAMBA
vision_encoder: {
name: videomamba_middle
img_size: 224
patch_size: 16
depth: 32
embed_dim: 576
drop_path_rate: 0.25
ssm_cfg: None
norm_epsilon: 1e-05
fused_add_norm: True
rms_norm: True
residual_in_fp32: True
bimamba: True
pool_type: cls+avg
kernel_size: 1
num_frames: 8
ckpt_num_frame: 8
use_checkpoint: False
checkpoint_num: 0
clip_decoder_embed_dim: 576
clip_output_dim: 512
clip_norm_type: l2
clip_return_layer: 1
clip_student_return_interval: 1
pretrained: your_model_path/videomamba_m16_k400_mask_pt_f8_res224.pth
clip_teacher: none
clip_img_size: 224
clip_return_interval: 1
video_mask_type: none
video_mask_ratio: 0.0
video_double_mask_ratio: 0.0
image_mask_type: none
image_mask_ratio: 0.0
image_double_mask_ratio: 0.0
keep_temporal: True }
text_encoder: {
name: bert_base
pretrained: bert-base-uncased
config: configs/config_bert.json
d_model: 768
fusion_layer: 9 }
multimodal: {
enable: True }
embed_dim: 512
temp: 0.07 }
criterion: {
loss_weight: {
vtc: 1.0
mlm: 1.0
vtm: 1.0
uta: 0.0 }
vtm_hard_neg: True
mlm_masking_prob: 0.5
uta_norm_type: l2
uta_loss_type: l2 }
optimizer: {
opt: adamW
lr: 1e-05
opt_betas: [0.9, 0.999]
weight_decay: 0.02
max_grad_norm: -1
different_lr: {
enable: False
module_names: []
lr: 0.004 } }
scheduler: {
sched: cosine
epochs: 2
min_lr_multi: 0.01
warmup_epochs: 0.2 }
evaluate: True
deep_fusion: False
evaluation: {
eval_frame_ensemble: concat
eval_x_only: False
k_test: 128
eval_offload: False }
fp16: True
bf16: True
gradient_checkpointing: True
wandb: {
enable: False
entity: likunchang
project: umt_videomamba }
dist_url: env://
device: cuda
mode: pt
output_dir: exp_zs/debug/m16_5m
resume: False
debug: False
log_freq: 1
seed: 42
zero_shot: True
save_latest: True
auto_resume: True
pretrained_path: your_model_path/videomamba_m16_25M_f8_res224.pth
rank: 0
world_size: 1
gpu: 0
distributed: True
dist_backend: nccl }
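For anyone comparing the two config dumps above, a throwaway line diff such as the sketch below (my own helper; 'config_user.txt' and 'config_author.txt' are hypothetical files holding each pasted dump) makes the differing keys, e.g. bimamba_type: v2 versus bimamba: True, easy to spot:

import difflib
import sys

# Sketch: diff the two pasted config dumps line by line.
# The file names are hypothetical; save each log block to a text file first.
with open("config_user.txt") as f_user, open("config_author.txt") as f_author:
    diff = difflib.unified_diff(
        f_user.readlines(), f_author.readlines(),
        fromfile="config_user.txt", tofile="config_author.txt",
    )
    sys.stdout.writelines(diff)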
Hello,
Thank you for your fast responses and answering my previous questions. I just wanted to quickly ask if you are able to replicate the reported ActivityNet results using this repo's code? I was able to replicate MSRVTT results thanks to your help above, so I am familiar with the repo, how to load the correct weights, and perform evaluation. However, when loading the same exact weights and running the activitynet zero-shot code, I get these bad results. I know I am not providing a full log, but can you similarly re-run your activitynet eval code and confirm that nothing is wrong there? I even did a clean re-pull of this repo, made only path changes, and still got these same results. I am using the 25M multi-modal weights. Thanks!
@NyleSiddiqui Please check the log here. It runs normally in my environment.
Thank you for checking for me! Must be something on my end, I will use the log to debug
> @NyleSiddiqui Please check the log here. It runs normally in my environment.
Was this log produced using the code in this repo, or your own local environment? My concern is that there may be a bug in the repo code that is not in your local environment, ESPECIALLY since I am able to replicate your results on MSRVTT with the same code I am using for ANet, and there are very few changes (basically only the config and data paths) when switching from MSRVTT to ANet.
These are the results I got on MSRVTT, which are far worse than the paper's numbers:
There must be something wrong in my test process, and here is how I ran the evaluation:
JOB_NAME='m16_5m'
OUTPUT_DIR="$(dirname $0)/$JOB_NAME"
LOG_DIR="$(dirname $0)/logs/${JOB_NAME}"
NUM_GPUS=1
NUM_CPU=1

python tasks/retrieval.py \
    $(dirname $0)/config.py \
    output_dir ${OUTPUT_DIR} \
    evaluate True \
    zero_shot True \
    pretrained_path /home/ubuntu/data/user01/codes/VideoMamba/videomamba_m16_25M_f8_res224.pth
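Before digging further into the eval script, a quick sanity check (a sketch of mine, using the same checkpoint path as in the command above; the 'model' key handling is an assumption about the file layout) can confirm the 25M multimodal file actually contains both towers:

import torch
from collections import Counter

# Sketch: count parameter keys per top-level module in the checkpoint.
# For the 25M multimodal file, both vision_encoder.* and text_encoder.*
# prefixes should appear; ckpt.get("model", ckpt) handles checkpoints
# that nest their weights under a 'model' key.
ckpt = torch.load(
    "/home/ubuntu/data/user01/codes/VideoMamba/videomamba_m16_25M_f8_res224.pth",
    map_location="cpu",
)
state_dict = ckpt.get("model", ckpt)
print(Counter(key.split(".")[0] for key in state_dict))

Also note that, per the discussion above, zero-shot evaluation should load only this multimodal checkpoint, e.g. by additionally passing model.vision_encoder.pretrained None on the command line.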