ArrowLuo / CLIP4Clip

An official implementation for "CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval"
https://arxiv.org/abs/2104.08860
MIT License
879 stars, 123 forks

Confusion about the performance of MSR-VTT #63

Closed StOnEGiggity closed 2 years ago

StOnEGiggity commented 2 years ago

Hi,

Thanks for your excellent work. I have a few questions from re-implementing CLIP4Clip on the MSR-VTT dataset.

First, I changed sim_header to seqTransf to reproduce the best t2v result in the paper, but I obtained an R@1 of 43.4% instead of the reported 44.5%. I also tried different learning rates (e.g., 5e-9, 1e-8, 5e-7, 1e-6, 5e-6), but 1e-7 was still the best. Is there any difference between the default config and the configuration used in the paper? By the way, I used 4 V100 GPUs and a batch size of 128 for training.

Second, I also tried the compression option to speed up training. Although it achieved an R@1 of 44% on the compressed val set, I observed a performance drop on the original val set without additional fine-tuning. I understand there may be a gap between the two video sets. So, is it reasonable to train models on compressed videos?

Thanks a lot. If there are any mistakes in my comment, please tell me directly :)

ArrowLuo commented 2 years ago

Hi @StOnEGiggity, thanks for your question. I suggest reading https://github.com/openai/CLIP/issues/114, material on CUDA's nondeterministic behavior, and https://github.com/ArrowLuo/CLIP4Clip/issues/25. In short, stable reproduction is indeed a problem. The different results on the compressed and original videos may be caused by it.
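To make the nondeterminism point concrete, here is a minimal sketch of the usual seeding steps for a PyTorch run. The helper name seed_everything is hypothetical and is not part of CLIP4Clip. NumPy and PyTorch are seeded only if they are installed. Even with these settings, some CUDA kernels can remain nondeterministic, so this should narrow run-to-run variation rather than eliminate it.

```python
import random


def seed_everything(seed: int = 42) -> None:
    """Seed the RNGs we can reach (hypothetical helper, not from the repo)."""
    random.seed(seed)

    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed

    try:
        import torch
    except ImportError:
        return  # PyTorch not installed; stop here

    torch.manual_seed(seed)            # CPU generator
    torch.cuda.manual_seed_all(seed)   # all GPUs; safe to call without CUDA
    # Prefer deterministic cuDNN kernels and disable autotuning, which can
    # pick different algorithms from run to run.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


# Two seeded draws from Python's RNG are identical.
seed_everything(0)
first = random.random()
seed_everything(0)
second = random.random()
print(first == second)
```

Calling the helper once at the start of each process, before the model and data loaders are built, is the usual placement.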

StOnEGiggity commented 2 years ago

Thanks for your quick reply. I noticed some differences between your log and mine because I use multiple machines for training. The num_steps is different, which leads to a different lr_scheduler. I will try to reproduce the result by changing the lr_scheduler manually. Cheers :)
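The effect described here can be sketched as follows. A different effective batch size changes the total number of optimization steps, and a schedule defined over training progress then yields a different learning rate at the same step index. This uses a generic linear-warmup plus cosine-decay schedule as an assumption; it is not confirmed to match the repository's scheduler. The function names, sample count, and batch sizes are illustrative only.

```python
import math


def total_steps(num_samples: int, global_batch_size: int, epochs: int) -> int:
    """Optimizer steps for a whole run (assumes no gradient accumulation)."""
    return math.ceil(num_samples / global_batch_size) * epochs


def lr_at(step: int, num_steps: int, base_lr: float,
          warmup_proportion: float = 0.1) -> float:
    """Generic linear warmup followed by cosine decay over training progress."""
    progress = step / num_steps
    if progress < warmup_proportion:
        return base_lr * progress / warmup_proportion
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Illustrative numbers: the same data and epochs, but the effective batch
# size doubles when a second machine is added.
steps_one_machine = total_steps(180_000, 128, 5)
steps_two_machines = total_steps(180_000, 256, 5)

# The same step index sits at a different point of each schedule.
print(steps_one_machine, lr_at(1000, steps_one_machine, 1e-4))
print(steps_two_machines, lr_at(1000, steps_two_machines, 1e-4))
```

Under these assumptions, matching a reference log means matching the total step count (or the schedule's progress), not just the nominal learning rate.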

Jeff-LiangF commented 2 years ago

Hey @ArrowLuo,

Thanks for your great repo! I am trying to reproduce your MSRVTT results on train.9k.csv. I basically follow your script:

python -m torch.distributed.launch --nproc_per_node=4 \
main_task_retrieval.py --do_train --num_thread_reader=4 \
--epochs=5 --batch_size=128 --n_display=50 \
--train_csv ${DATA_PATH}/MSRVTT_train.9k.csv \
--val_csv ${DATA_PATH}/MSRVTT_JSFUSION_test.csv \
--data_path ${DATA_PATH}/MSRVTT_data.json \
--features_path /data/MSRVTT/videos/all \
--output_dir ckpts/ckpt_msrvtt_retrieval_looseType \
--lr 1e-4 --max_words 32 --max_frames 12 --batch_size_val 16 \
--datatype msrvtt --expand_msrvtt_sentences \
--feature_framerate 1 --coef_lr 1e-3 \
--freeze_layer_num 0 --slice_framepos 2 \
--loose_type --linear_patch 2d --sim_header meanP \
--pretrained_clip_name ViT-B/32

However, it is frustrating to see a very poor text-to-video recall (R@1 35.9) but a very high video-to-text recall (V2T$R@1: 64.1). Do you have any idea why this happened? The following is a snapshot of my training log:

image
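For readers comparing the two numbers, here is a minimal sketch of how R@1 can differ between the two retrieval directions on the same similarity matrix. It assumes exactly one caption per video, with the correct match on the diagonal. It is an illustration only, not the repository's evaluation code, and the function names and toy matrix are made up.

```python
def recall_at_k(sim, k=1):
    """Percent of queries whose true match (index i for query i) ranks in the top k.

    sim[i][j] is the similarity between query i and candidate j.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Rank of the true match = number of candidates scoring strictly higher.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(sim)


def transpose(sim):
    """Swap the roles of queries and candidates."""
    return [list(col) for col in zip(*sim)]


# Toy matrix: rows are text queries, columns are videos.
# Video 0 scores highly for every caption, which hurts text-to-video only.
sim = [
    [0.9, 0.1, 0.1],
    [0.8, 0.6, 0.1],
    [0.7, 0.2, 0.5],
]

t2v_r1 = recall_at_k(sim)             # each text retrieves a video
v2t_r1 = recall_at_k(transpose(sim))  # each video retrieves a text
print(t2v_r1, v2t_r1)
```

In this toy case one candidate that scores highly against every query lowers recall in one direction only, so a gap between the two directions is not impossible by construction. Whether that explains the numbers above cannot be determined from this thread.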

SCZwangxiao commented 1 year ago

@Jeff-LiangF, have you managed to reproduce the results? I got similarly bad results.

deepalchemist commented 7 months ago

It seems that setting num_thread_reader>0 results in worse accuracy than num_thread_reader=0, but I don't know why. If you know, please tell me.