ArrowLuo / CLIP4Clip

An official implementation for "CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval"
https://arxiv.org/abs/2104.08860
MIT License

Some questions about the results on MSRVTT with `sim_header seqTransf`. #25

Closed sqiangcao99 closed 3 years ago

sqiangcao99 commented 3 years ago

When I use the following configuration to train the model on MSRVTT Training-9K, the best result I get is:

07/27/2021 13:11:01 - INFO - sim matrix size: 1000, 1000
07/27/2021 13:11:01 - INFO - Length-T: 1000, Length-V:1000
07/27/2021 13:11:01 - INFO - Text-to-Video:
07/27/2021 13:11:01 - INFO - >>> R@1: 43.2 - R@5: 71.0 - R@10: 79.4 - Median R: 2.0 - Mean R: 15.4
07/27/2021 13:11:01 - INFO - Video-to-Text:
07/27/2021 13:11:01 - INFO - >>> V2T$R@1: 43.1 - V2T$R@5: 71.2 - V2T$R@10: 80.7 - V2T$Median R: 2.0 - V2T$Mean R: 11.9

This is worse than the R@1: 44.5 reported in the paper. Did I miss some details? Here is the configuration:

    CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 \
      --master_addr=127.0.0.2 --master_port 29552 main_task_retrieval.py \
      --num_thread_reader=4 --epochs=5 --batch_size=128 --n_display=20 \
      --train_csv /home/hadoop-vacv/cephfs/data/caoshuqiang/data/jobs/MSRVTT/csv/msrvtt_data/MSRVTT_train.9k.csv \
      --val_csv /home/hadoop-vacv/cephfs/data/caoshuqiang/data/jobs/MSRVTT/csv/msrvtt_data/MSRVTT_JSFUSION_test.csv \
      --data_path /home/hadoop-vacv/cephfs/data/caoshuqiang/data/jobs/MSRVTT/csv/msrvtt_data/MSRVTT_data.json \
      --features_path /home/hadoop-vacv/cephfs/data/caoshuqiang/data/jobs/MSRVTT/MSRVTT_Videos \
      --output_dir /home/hadoop-vacv/cephfs/data/caoshuqiang/code/vicab/newexp/hope/clip_raw \
      --lr 1e-4 --max_words 32 --max_frames 12 --batch_size_val 12 --datatype msrvtt \
      --expand_msrvtt_sentences --feature_framerate 1 --coef_lr 1e-3 --freeze_layer_num 0 \
      --slice_framepos 2 --loose_type --linear_patch 2d --sim_header seqTransf --do_train

ArrowLuo commented 3 years ago

Hi @sqiangcao99, I cannot find any setting that differs from ours except the number of GPUs, and I am not sure whether that affects the performance or whether the discrepancy is within a reasonable range. How are your results with the other sim_header options, and could you post the log of this run before the first epoch?

sqiangcao99 commented 3 years ago

07/27/2021 09:21:49 - INFO - Effective parameters:
07/27/2021 09:21:49 - INFO - <<< batch_size: 128
07/27/2021 09:21:49 - INFO - <<< batch_size_val: 12
07/27/2021 09:21:49 - INFO - <<< cache_dir:
07/27/2021 09:21:49 - INFO - <<< coef_lr: 0.001
07/27/2021 09:21:49 - INFO - <<< cross_model: cross-base
07/27/2021 09:21:49 - INFO - <<< cross_num_hidden_layers: 4
07/27/2021 09:21:49 - INFO - <<< datatype: msrvtt
07/27/2021 09:21:49 - INFO - <<< do_eval: False
07/27/2021 09:21:49 - INFO - device: cuda:1 n_gpu: 2
07/27/2021 09:21:49 - INFO - <<< do_lower_case: False
07/27/2021 09:21:49 - INFO - <<< do_pretrain: False
07/27/2021 09:21:49 - INFO - <<< do_train: True
07/27/2021 09:21:49 - INFO - <<< epochs: 5
07/27/2021 09:21:49 - INFO - <<< eval_frame_order: 0
07/27/2021 09:21:49 - INFO - <<< expand_msrvtt_sentences: True
07/27/2021 09:21:49 - INFO - <<< feature_framerate: 1
07/27/2021 09:21:49 - INFO - <<< fp16: False
07/27/2021 09:21:49 - INFO - <<< fp16_opt_level: O1
07/27/2021 09:21:49 - INFO - <<< freeze_layer_num: 0
07/27/2021 09:21:49 - INFO - <<< gradient_accumulation_steps: 1
07/27/2021 09:21:49 - INFO - <<< hard_negative_rate: 0.5
07/27/2021 09:21:49 - INFO - <<< init_model: None
07/27/2021 09:21:49 - INFO - <<< linear_patch: 2d
07/27/2021 09:21:49 - INFO - <<< local_rank: 0
07/27/2021 09:21:49 - INFO - <<< loose_type: True
07/27/2021 09:21:49 - INFO - <<< lr: 0.0001
07/27/2021 09:21:49 - INFO - <<< lr_decay: 0.9
07/27/2021 09:21:49 - INFO - <<< margin: 0.1
07/27/2021 09:21:49 - INFO - <<< max_frames: 12
07/27/2021 09:21:49 - INFO - <<< max_words: 32
07/27/2021 09:21:49 - INFO - <<< n_display: 20
07/27/2021 09:21:49 - INFO - <<< n_gpu: 1
07/27/2021 09:21:49 - INFO - <<< n_pair: 1
07/27/2021 09:21:49 - INFO - <<< negative_weighting: 1
07/27/2021 09:21:49 - INFO - <<< num_thread_reader: 4
07/27/2021 09:21:49 - INFO - <<< rank: 0
07/27/2021 09:21:49 - INFO - <<< sampled_use_mil: False
07/27/2021 09:21:49 - INFO - <<< seed: 42
07/27/2021 09:21:49 - INFO - <<< sim_header: seqTransf
07/27/2021 09:21:49 - INFO - <<< slice_framepos: 2
07/27/2021 09:21:49 - INFO - <<< task_type: retrieval
07/27/2021 09:21:50 - INFO - <<< text_num_hidden_layers: 12
07/27/2021 09:21:50 - INFO - <<< train_csv: MSRVTT_train.9k.csv
07/27/2021 09:21:50 - INFO - <<< train_frame_order: 0
07/27/2021 09:21:50 - INFO - <<< use_mil: False
07/27/2021 09:21:50 - INFO - <<< val_csv: MSRVTT_JSFUSION_test.csv
07/27/2021 09:21:50 - INFO - <<< video_dim: 1024
07/27/2021 09:21:50 - INFO - <<< visual_num_hidden_layers: 12
07/27/2021 09:21:50 - INFO - <<< warmup_proportion: 0.1
07/27/2021 09:21:50 - INFO - <<< world_size: 2
07/27/2021 09:21:50 - INFO - device: cuda:0 n_gpu: 2
07/27/2021 09:21:51 - INFO - loading archive file clip_raw/modules/cross-base
07/27/2021 09:21:51 - INFO - Model config {
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 512,
  "initializer_range": 0.02,
  "intermediate_size": 2048,
  "max_position_embeddings": 77,
  "num_attention_heads": 8,
  "num_hidden_layers": 4,
  "type_vocab_size": 2,
  "vocab_size": 512
}
07/27/2021 09:21:51 - INFO - Weight doesn't exsits. modules/cross-base/cross_pytorch_model.bin
07/27/2021 09:21:51 - WARNING - Stage-One:True, Stage-Two:False
07/27/2021 09:21:51 - WARNING - Test retrieval by loose type.
07/27/2021 09:21:51 - WARNING - embed_dim: 512
07/27/2021 09:21:51 - WARNING - image_resolution: 224
07/27/2021 09:21:51 - WARNING - vision_layers: 12
07/27/2021 09:21:51 - WARNING - vision_width: 768
07/27/2021 09:21:51 - WARNING - vision_patch_size: 32
07/27/2021 09:21:51 - WARNING - context_length: 77
07/27/2021 09:21:51 - WARNING - vocab_size: 49408
07/27/2021 09:21:51 - WARNING - transformer_width: 512
07/27/2021 09:21:51 - WARNING - transformer_heads: 8
07/27/2021 09:21:51 - WARNING - transformer_layers: 12
07/27/2021 09:21:51 - WARNING - linear_patch: 2d
07/27/2021 09:21:51 - WARNING - cut_top_layer: 0
07/27/2021 09:21:54 - WARNING - sim_header: seqTransf
07/27/2021 09:22:05 - INFO - --------------------
07/27/2021 09:22:05 - INFO - Weights from pretrained model not used in CLIP4Clip: clip.input_resolution clip.context_length clip.vocab_size
07/27/2021 09:22:06 - INFO - Running test
07/27/2021 09:22:06 - INFO - Num examples = 1000
07/27/2021 09:22:06 - INFO - Batch size = 12
07/27/2021 09:22:06 - INFO - Num steps = 84
07/27/2021 09:22:06 - INFO - Running val
07/27/2021 09:22:06 - INFO - Num examples = 1000
07/27/2021 09:22:25 - INFO - Running training
07/27/2021 09:22:25 - INFO - Num examples = 180000
07/27/2021 09:22:25 - INFO - Batch size = 128
07/27/2021 09:22:25 - INFO - Num steps = 7030
07/27/2021 09:23:12 - INFO - Epoch: 1/5, Step: 20/1406, Lr: 0.000000003-0.000002845, Loss: 1.888217, Time/step: 2.334767
07/27/2021 09:23:47 - INFO - Epoch: 1/5, Step: 40/1406, Lr: 0.000000006-0.000005690, Loss: 1.840597, Time/step: 1.778786
07/27/2021 09:24:23 - INFO - Epoch: 1/5, Step: 60/1406, Lr: 0.000000009-0.000008535, Loss: 1.790529, Time/step: 1.789728
......
07/27/2021 10:05:17 - INFO - Epoch 1/5 Finished, Train Loss: 1.002715
07/27/2021 10:05:22 - INFO - Model saved to pytorch_model.bin.0
07/27/2021 10:07:57 - INFO - sim matrix size: 1000, 1000
07/27/2021 10:07:57 - INFO - Length-T: 1000, Length-V:1000
07/27/2021 10:07:57 - INFO - Text-to-Video:
07/27/2021 10:07:57 - INFO - >>> R@1: 41.7 - R@5: 69.9 - R@10: 80.4 - Median R: 2.0 - Mean R: 15.3
07/27/2021 10:07:57 - INFO - Video-to-Text:
07/27/2021 10:07:57 - INFO - >>> V2T$R@1: 41.4 - V2T$R@5: 68.5 - V2T$R@10: 79.8 - V2T$Median R: 2.0 - V2T$Mean R: 13.0
......

ArrowLuo commented 3 years ago

Hi @sqiangcao99, I do not find any essential difference. Maybe you can test with the same number of GPUs (I am not sure it matters). Below is our log before the first epoch for your reference (the Video-to-Text: log line was added before releasing the code). If you make any new progress on this problem, you are welcome to share it with me. Thanks.

......
2021-04-12 22:01:18,769:INFO:   <<< n_display: 50
......
2021-04-12 22:01:18,772:INFO:   <<< world_size: 4

......
2021-04-12 22:01:53,082:INFO: ***** Running test *****
2021-04-12 22:01:53,082:INFO:   Num examples = 1000
2021-04-12 22:01:53,082:INFO:   Batch size = 32
2021-04-12 22:01:53,082:INFO:   Num steps = 32
2021-04-12 22:02:08,860:INFO: ***** Running training *****
2021-04-12 22:02:08,861:INFO:   Num examples = 180000
2021-04-12 22:02:08,861:INFO:   Batch size = 128
2021-04-12 22:02:08,861:INFO:   Num steps = 7030
2021-04-12 22:05:06,570:INFO: Epoch: 1/5, Step: 50/1406, Lr: 0.000000007-0.000007112, Loss: 1.656809, Time/step: 3.554149
2021-04-12 22:07:54,349:INFO: Epoch: 1/5, Step: 100/1406, Lr: 0.000000014-0.000014225, Loss: 1.740155, Time/step: 3.355569
2021-04-12 22:10:39,252:INFO: Epoch: 1/5, Step: 150/1406, Lr: 0.000000021-0.000021337, Loss: 1.050747, Time/step: 3.298057
2021-04-12 22:13:23,374:INFO: Epoch: 1/5, Step: 200/1406, Lr: 0.000000028-0.000028450, Loss: 1.297645, Time/step: 3.282416
2021-04-12 22:16:01,482:INFO: Epoch: 1/5, Step: 250/1406, Lr: 0.000000036-0.000035562, Loss: 1.209267, Time/step: 3.162147
2021-04-12 22:18:38,089:INFO: Epoch: 1/5, Step: 300/1406, Lr: 0.000000043-0.000042674, Loss: 1.197618, Time/step: 3.132139
......
2021-04-12 23:12:04,353:INFO: Epoch: 1/5, Step: 1300/1406, Lr: 0.000000092-0.000091797, Loss: 0.667563, Time/step: 3.341208
2021-04-12 23:14:54,517:INFO: Epoch: 1/5, Step: 1350/1406, Lr: 0.000000091-0.000091174, Loss: 0.668952, Time/step: 3.403264
2021-04-12 23:17:39,297:INFO: Epoch: 1/5, Step: 1400/1406, Lr: 0.000000091-0.000090530, Loss: 0.393518, Time/step: 3.295593
2021-04-12 23:18:00,145:INFO: Epoch 1/5 Finished, Train Loss: 1.004058
2021-04-12 23:20:18,768:INFO: sim matrix size: 1000, 1000
2021-04-12 23:20:18,870:INFO:    Length-T: 1000, Length-V:1000
2021-04-12 23:20:18,870:INFO: Text-to-Video:
2021-04-12 23:20:18,871:INFO:   >>>  R@1: 42.3 - R@5: 70.5 - R@10: 79.8 - Median R: 2.0 - Mean R: 16.2
2021-04-12 23:20:18,871:INFO:   >>>  V2T$R@1: 42.3 - V2T$R@5: 70.1 - V2T$R@10: 80.1 - V2T$Median R: 2.0 - V2T$Mean R: 12.7
sqiangcao99 commented 3 years ago

Thank you for your continued attention to this issue. I have tried with 4 GPUs, and the result is still not the same. Could it be due to the CUDA version or the dataset?

Driver Version: 450.51.06 CUDA Version: 11.0

···
07/28/2021 16:15:25 - INFO -     <<< rank: 0
07/28/2021 16:15:25 - INFO -     <<< sampled_use_mil: False
07/28/2021 16:15:25 - INFO -     <<< seed: 42
07/28/2021 16:15:25 - INFO -     <<< sim_header: seqTransf
07/28/2021 16:15:25 - INFO -     <<< slice_framepos: 2
07/28/2021 16:15:25 - INFO -     <<< task_type: retrieval
07/28/2021 16:15:25 - INFO -     <<< text_num_hidden_layers: 12
07/28/2021 16:15:25 - INFO -     <<< train_csv: csv/msrvtt_data/MSRVTT_train.9k.csv
07/28/2021 16:15:25 - INFO -     <<< train_frame_order: 0
07/28/2021 16:15:25 - INFO -     <<< use_mil: False
07/28/2021 16:15:25 - INFO -     <<< val_csv: csv/msrvtt_data/MSRVTT_JSFUSION_test.csv
07/28/2021 16:15:25 - INFO -     <<< video_dim: 1024
07/28/2021 16:15:25 - INFO -     <<< visual_num_hidden_layers: 12
07/28/2021 16:15:25 - INFO -     <<< warmup_proportion: 0.1
07/28/2021 16:15:25 - INFO -     <<< world_size: 4
···
07/28/2021 16:16:03 - INFO -     Num steps = 7030
07/28/2021 16:17:17 - INFO -   Epoch: 1/5, Step: 50/1406, Lr: 0.000000007-0.000007112, Loss: 1.702823, Time/step: 1.466081
07/28/2021 16:18:25 - INFO -   Epoch: 1/5, Step: 100/1406, Lr: 0.000000014-0.000014225, Loss: 1.731421, Time/step: 1.370271
07/28/2021 16:19:33 - INFO -   Epoch: 1/5, Step: 150/1406, Lr: 0.000000021-0.000021337, Loss: 1.066895, Time/step: 1.357439
07/28/2021 16:20:42 - INFO -   Epoch: 1/5, Step: 200/1406, Lr: 0.000000028-0.000028450, Loss: 1.292294, Time/step: 1.369586
07/28/2021 16:21:50 - INFO -   Epoch: 1/5, Step: 250/1406, Lr: 0.000000036-0.000035562, Loss: 1.193302, Time/step: 1.368164
····
07/28/2021 16:39:22 - INFO -   Epoch: 1/5, Step: 950/1406, Lr: 0.000000096-0.000095561, Loss: 0.918803, Time/step: 1.856508
07/28/2021 16:40:57 - INFO -   Epoch: 1/5, Step: 1000/1406, Lr: 0.000000095-0.000095090, Loss: 1.007762, Time/step: 1.902733
07/28/2021 16:42:31 - INFO -   Epoch: 1/5, Step: 1050/1406, Lr: 0.000000095-0.000094596, Loss: 0.818108, Time/step: 1.882588
07/28/2021 16:44:06 - INFO -   Epoch: 1/5, Step: 1100/1406, Lr: 0.000000094-0.000094080, Loss: 0.636207, Time/step: 1.898808
07/28/2021 16:45:39 - INFO -   Epoch: 1/5, Step: 1150/1406, Lr: 0.000000094-0.000093541, Loss: 0.688115, Time/step: 1.855371
07/28/2021 16:47:12 - INFO -   Epoch: 1/5, Step: 1200/1406, Lr: 0.000000093-0.000092981, Loss: 0.807981, Time/step: 1.857128
07/28/2021 16:48:36 - INFO -   Epoch: 1/5, Step: 1250/1406, Lr: 0.000000092-0.000092400, Loss: 0.832460, Time/step: 1.679541
07/28/2021 16:49:44 - INFO -   Epoch: 1/5, Step: 1300/1406, Lr: 0.000000092-0.000091797, Loss: 0.682100, Time/step: 1.369774
07/28/2021 16:50:53 - INFO -   Epoch: 1/5, Step: 1350/1406, Lr: 0.000000091-0.000091174, Loss: 0.665236, Time/step: 1.374193
07/28/2021 16:52:01 - INFO -   Epoch: 1/5, Step: 1400/1406, Lr: 0.000000091-0.000090530, Loss: 0.414655, Time/step: 1.362176
07/28/2021 16:52:09 - INFO -   Epoch 1/5 Finished, Train Loss: 1.002616
···
07/28/2021 16:54:20 - INFO -   sim matrix size: 1000, 1000
07/28/2021 16:54:20 - INFO -     Length-T: 1000, Length-V:1000
07/28/2021 16:54:20 - INFO -   Text-to-Video:
07/28/2021 16:54:20 - INFO -    >>>  R@1: 41.1 - R@5: 69.8 - R@10: 80.3 - Median R: 2.0 - Mean R: 15.3
07/28/2021 16:54:20 - INFO -   Video-to-Text:
07/28/2021 16:54:20 - INFO -    >>>  V2T$R@1: 41.7 - V2T$R@5: 68.5 - V2T$R@10: 80.2 - V2T$Median R: 2.0 - V2T$Mean R: 13.1
sqiangcao99 commented 3 years ago

By the way, I have tried to speed up training by saving the values returned by the dataset class to disk the first time they are computed:

        # Decode the raw video once, then cache the result on disk so later
        # epochs can skip the expensive decoding step.
        video, video_mask = self._get_rawvideo(choice_video_ids)
        if video_id not in self.saved_video:
            self.saved_video.append(video_id)
            video_info = {'video': video, 'video_mask': video_mask}
            save_path = os.path.join(self.save_path, video_id)
            np.save(save_path, video_info)  # np.save appends the .npy suffix
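
For completeness, a minimal sketch of the matching load path. This is illustrative only: `load_cached_video`, the `.npy` suffix handling, and the cache-miss fallback are my own assumptions, not code from this repository.

    import os
    import numpy as np

    def load_cached_video(save_dir, video_id):
        # Return the cached (video, video_mask) pair, or None so the caller
        # can fall back to decoding the raw video on a cache miss.
        cache_file = os.path.join(save_dir, video_id + '.npy')
        if not os.path.exists(cache_file):
            return None
        video_info = np.load(cache_file, allow_pickle=True).item()
        return video_info['video'], video_info['video_mask']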
ArrowLuo commented 3 years ago

Hi @sqiangcao99, we have almost the same CUDA driver: Driver Version: 450.80.02, CUDA Version: 11.0.

The datasets are the same, too. Curiously, the settings now look identical, yet there is still a small performance gap, and your train loss is even lower than ours. How about the other sim_header options? Are those results acceptable?

You can also use LMDB to speed up loading; it is memory-friendly. Thanks for your suggestion.
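
A minimal LMDB caching sketch along those lines (illustrative only; the key layout, `map_size`, and the pickle serialization are assumptions rather than code from this repository):

    import lmdb
    import pickle

    # Hypothetical cache location; set map_size large enough for the decoded frames.
    env = lmdb.open('msrvtt_frame_cache', map_size=1 << 40)

    def put_video(video_id, video, video_mask):
        # Store the decoded frames and mask under the video id (write transaction).
        with env.begin(write=True) as txn:
            txn.put(video_id.encode(), pickle.dumps((video, video_mask)))

    def get_video(video_id):
        # Return (video, video_mask), or None if this id has not been cached yet.
        with env.begin() as txn:
            buf = txn.get(video_id.encode())
        return pickle.loads(buf) if buf is not None else None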

sqiangcao99 commented 3 years ago

I tried meanP. When I set the number of epochs to 5, the results are also worse, but when I set it to 3, the results get better.

With 3 epochs:
2021-06-13 01:58:59,920:INFO: Text-to-Video:
2021-06-13 01:58:59,921:INFO:   >>>  R@1: 43.0 - R@5: 70.3 - R@10: 80.4 - Median R: 2.0 - Mean R: 15.8
2021-06-13 01:58:59,921:INFO: Video-to-Text:
2021-06-13 01:58:59,921:INFO:   >>>  V2T$R@1: 42.6 - V2T$R@5: 70.8 - V2T$R@10: 81.4 - V2T$Median R: 2.0 - V2T$Mean R: 11.9

With 5 epochs:
2021-06-04 22:36:14,397:INFO:   >>>  R@1: 42.2 - R@5: 71.2 - R@10: 80.6 - Median R: 2.0 - Mean R: 15.8
2021-06-04 22:36:14,397:INFO: Video-to-Text:
2021-06-04 22:36:14,398:INFO:   >>>  V2T$R@1: 41.8 - V2T$R@5: 70.6 - V2T$R@10: 81.1 - V2T$Median R: 2.0 - V2T$Mean R: 11.8
2021-06-04 22:36:14,399:INFO: The best model is: None, the R1 is: 42.2000
ArrowLuo commented 3 years ago

Oh, then it is not exactly the same as ours. I do not know whether such a gap is normal for a reproduction. It is strange if you did not change any of our code, and I have no further ideas about this problem for now.

If you want to compare your results with ours in your research, one option is to report your own reproduced numbers, since they are obtained in the same environment and on the same dataset. Thanks for sharing and for the discussion.

sqiangcao99 commented 3 years ago

Thank you so much for helping me. I have learned a lot.

starmemda commented 3 years ago

> Thank you so much for helping me. I have learned a lot.

It's strange that I can't reproduce the result either. Maybe we can get in touch and discuss where the problem is. My QQ number is 1471659527.