ArrowLuo / CLIP4Clip

An official implementation for "CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval"
https://arxiv.org/abs/2104.08860
MIT License

The reproduction results of `meanP` and `seqTransf` on `DiDeMo` dataset are much worse than those in the paper #62

Closed. HanielF closed this issue 2 years ago

HanielF commented 2 years ago

I trained CLIP4Clip on the DiDeMo dataset, and the text-to-video R@1 is much worse than the value reported in the paper.

The paper reports a text-to-video R@1 of 43.4 on DiDeMo when the similarity calculator is meanP, and 42.8 when it is seqTransf.

In my reproduction, the best t2v R@1 is only 40.5 with meanP and only 40.2 with seqTransf.

All settings are the same as in the paper.

HanielF commented 2 years ago

The training log of meanP is as follows:

2022-02-25 16:59:39,551:INFO: Effective parameters:
2022-02-25 16:59:39,551:INFO:   <<< batch_size: 128
2022-02-25 16:59:39,551:INFO: device: cuda:2 n_gpu: 8
2022-02-25 16:59:39,551:INFO: device: cuda:4 n_gpu: 8
2022-02-25 16:59:39,551:INFO: device: cuda:1 n_gpu: 8
2022-02-25 16:59:39,551:INFO:   <<< batch_size_val: 16
2022-02-25 16:59:39,551:INFO:   <<< cache_dir: 
2022-02-25 16:59:39,551:INFO:   <<< coef_lr: 0.001
2022-02-25 16:59:39,552:INFO:   <<< cross_model: cross-base
2022-02-25 16:59:39,552:INFO:   <<< cross_num_hidden_layers: 4
2022-02-25 16:59:39,552:INFO:   <<< data_path: /root/wanghaoran09/xudi03/CLIP4Clip/data/DiDeMo
2022-02-25 16:59:39,552:INFO:   <<< datatype: didemo
2022-02-25 16:59:39,552:INFO:   <<< do_eval: False
2022-02-25 16:59:39,552:INFO:   <<< do_lower_case: False
2022-02-25 16:59:39,552:INFO:   <<< do_pretrain: False
2022-02-25 16:59:39,552:INFO: device: cuda:3 n_gpu: 8
2022-02-25 16:59:39,552:INFO:   <<< do_train: True
2022-02-25 16:59:39,552:INFO:   <<< epochs: 10
2022-02-25 16:59:39,552:INFO:   <<< eval_frame_order: 0
2022-02-25 16:59:39,552:INFO:   <<< expand_msrvtt_sentences: False
2022-02-25 16:59:39,552:INFO: device: cuda:6 n_gpu: 8
2022-02-25 16:59:39,552:INFO:   <<< feature_framerate: 1
2022-02-25 16:59:39,552:INFO:   <<< features_path: /root/wanghaoran09/xudi03/CLIP4Clip/data/DiDeMo/DiDeMo_Compress
2022-02-25 16:59:39,552:INFO:   <<< fp16: False
2022-02-25 16:59:39,552:INFO:   <<< fp16_opt_level: O1
2022-02-25 16:59:39,552:INFO:   <<< freeze_layer_num: 0
2022-02-25 16:59:39,552:INFO:   <<< gradient_accumulation_steps: 1
2022-02-25 16:59:39,552:INFO:   <<< hard_negative_rate: 0.5
2022-02-25 16:59:39,552:INFO: device: cuda:7 n_gpu: 8
2022-02-25 16:59:39,552:INFO:   <<< init_model: None
2022-02-25 16:59:39,552:INFO:   <<< linear_patch: 2d
2022-02-25 16:59:39,552:INFO:   <<< local_rank: 0
2022-02-25 16:59:39,552:INFO:   <<< loose_type: True
2022-02-25 16:59:39,553:INFO:   <<< lr: 0.0001
2022-02-25 16:59:39,553:INFO:   <<< lr_decay: 0.9
2022-02-25 16:59:39,553:INFO:   <<< margin: 0.1
2022-02-25 16:59:39,553:INFO:   <<< max_frames: 64
2022-02-25 16:59:39,553:INFO:   <<< max_words: 64
2022-02-25 16:59:39,553:INFO:   <<< n_display: 10
2022-02-25 16:59:39,553:INFO:   <<< n_gpu: 1
2022-02-25 16:59:39,553:INFO:   <<< n_pair: 1
2022-02-25 16:59:39,553:INFO:   <<< negative_weighting: 1
2022-02-25 16:59:39,553:INFO:   <<< num_thread_reader: 2
2022-02-25 16:59:39,553:INFO:   <<< output_dir: ckpts/ckpt_didemo_retrieval_looseType_DiDeMo_meanP
2022-02-25 16:59:39,553:INFO:   <<< pretrained_clip_name: ViT-B/32
2022-02-25 16:59:39,553:INFO:   <<< rank: 0
2022-02-25 16:59:39,553:INFO:   <<< resume_model: None
2022-02-25 16:59:39,553:INFO:   <<< sampled_use_mil: False
2022-02-25 16:59:39,553:INFO:   <<< seed: 42
2022-02-25 16:59:39,553:INFO:   <<< sim_header: meanP
2022-02-25 16:59:39,553:INFO:   <<< slice_framepos: 2
2022-02-25 16:59:39,553:INFO:   <<< task_type: retrieval
2022-02-25 16:59:39,553:INFO:   <<< text_num_hidden_layers: 12
2022-02-25 16:59:39,553:INFO:   <<< train_csv: data/.train.csv
2022-02-25 16:59:39,553:INFO:   <<< train_frame_order: 0
2022-02-25 16:59:39,553:INFO:   <<< use_mil: False
2022-02-25 16:59:39,553:INFO:   <<< val_csv: data/.val.csv
2022-02-25 16:59:39,554:INFO:   <<< video_dim: 1024
2022-02-25 16:59:39,554:INFO:   <<< visual_num_hidden_layers: 12
2022-02-25 16:59:39,554:INFO:   <<< warmup_proportion: 0.1
2022-02-25 16:59:39,554:INFO:   <<< world_size: 8
2022-02-25 16:59:39,554:INFO: device: cuda:0 n_gpu: 8
2022-02-25 16:59:40,579:INFO: loading archive file /root/wanghaoran09/xudi03/CLIP4Clip-master/modules/cross-base
2022-02-25 16:59:40,579:INFO: Model config {
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 512,
  "initializer_range": 0.02,
  "intermediate_size": 2048,
  "max_position_embeddings": 128,
  "num_attention_heads": 8,
  "num_hidden_layers": 4,
  "type_vocab_size": 2,
  "vocab_size": 512
}

2022-02-25 16:59:40,579:INFO: Weight doesn't exsits. /root/wanghaoran09/xudi03/CLIP4Clip-master/modules/cross-base/cross_pytorch_model.bin
2022-02-25 16:59:40,579:WARNING: Stage-One:True, Stage-Two:False
2022-02-25 16:59:40,579:WARNING: Test retrieval by loose type.
2022-02-25 16:59:40,580:WARNING:         embed_dim: 512
2022-02-25 16:59:40,580:WARNING:         image_resolution: 224
2022-02-25 16:59:40,580:WARNING:         vision_layers: 12
2022-02-25 16:59:40,580:WARNING:         vision_width: 768
2022-02-25 16:59:40,580:WARNING:         vision_patch_size: 32
2022-02-25 16:59:40,580:WARNING:         context_length: 77
2022-02-25 16:59:40,580:WARNING:         vocab_size: 49408
2022-02-25 16:59:40,580:WARNING:         transformer_width: 512
2022-02-25 16:59:40,580:WARNING:         transformer_heads: 8
2022-02-25 16:59:40,580:WARNING:         transformer_layers: 12
2022-02-25 16:59:40,580:WARNING:                 linear_patch: 2d
2022-02-25 16:59:40,580:WARNING:         cut_top_layer: 0
2022-02-25 16:59:42,763:WARNING:         sim_header: meanP
2022-02-25 16:59:52,833:INFO: --------------------
2022-02-25 16:59:52,833:INFO: Weights from pretrained model not used in CLIP4Clip: 
   clip.input_resolution
   clip.context_length
   clip.vocab_size
........
2022-02-25 18:10:42,342:INFO: Epoch 4/10 Finished, Train Loss: 0.245592
2022-02-25 18:10:44,235:INFO: Model saved to ckpts/ckpt_didemo_retrieval_looseType_DiDeMo_meanP/pytorch_model.bin.3
2022-02-25 18:10:44,235:INFO: Optimizer saved to ckpts/ckpt_didemo_retrieval_looseType_DiDeMo_meanP/pytorch_opt.bin.3
2022-02-25 18:16:58,927:INFO: sim matrix size: 1003, 1003
2022-02-25 18:16:59,051:INFO:    Length-T: 1003, Length-V:1003
2022-02-25 18:16:59,051:INFO: Text-to-Video:
2022-02-25 18:16:59,051:INFO:   >>>  R@1: 40.5 - R@5: 68.3 - R@10: 77.3 - Median R: 2.0 - Mean R: 18.7
2022-02-25 18:16:59,051:INFO: Video-to-Text:
2022-02-25 18:16:59,052:INFO:   >>>  V2T$R@1: 40.4 - V2T$R@5: 67.6 - V2T$R@10: 77.9 - V2T$Median R: 2.0 - V2T$Mean R: 12.4
..........
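For readers unfamiliar with the metrics in the log above, here is a minimal sketch of how R@1, R@5, R@10, Median R and Mean R can be derived from a square similarity matrix. It assumes that row i holds the scores of query i against all candidates and that the correct match sits on the diagonal. The function name `retrieval_metrics` and the toy matrix are illustrative only; the repository's own metric code may differ, for example in how ties are handled.

```python
from statistics import median


def retrieval_metrics(sim):
    """Compute retrieval metrics from a square similarity matrix.

    sim[i][j] is the similarity between query i and candidate j.
    Assumption: the ground-truth match of query i is candidate i.
    """
    ranks = []
    for i, row in enumerate(sim):
        # Rank of the correct candidate: 1 + number of candidates scoring strictly higher.
        ranks.append(1 + sum(1 for score in row if score > row[i]))

    n = len(ranks)

    def recall_at(k):
        # Percentage of queries whose correct candidate is ranked within the top k.
        return 100.0 * sum(1 for r in ranks if r <= k) / n

    return {
        "R@1": recall_at(1),
        "R@5": recall_at(5),
        "R@10": recall_at(10),
        "Median R": median(ranks),
        "Mean R": sum(ranks) / n,
    }


# Toy 3x3 example (hypothetical scores): queries 0 and 2 rank their match first,
# query 1 ranks its match second.
toy_sim = [
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.1],
    [0.2, 0.1, 0.7],
]
metrics = retrieval_metrics(toy_sim)
print(metrics)
```

Under this convention, text-to-video metrics use texts as rows, and video-to-text metrics can be obtained by passing the transposed matrix.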
ArrowLuo commented 2 years ago

Hi @HanielF, I cannot tell from the log above what is going wrong with your run. Do you also get worse results with 5 epochs? And can you print the number of training samples and test samples?

HanielF commented 2 years ago

@ArrowLuo Thanks for your reply. I have uploaded the full logs below; nothing is changed except that directories containing my name are masked. 14 videos are damaged and cannot be read correctly, so I removed them from the DiDeMo dataset.

log_meanP.txt log_seqT.txt

ArrowLuo commented 2 years ago

Hi @HanielF, can you run your command with 5 epochs again?

HanielF commented 2 years ago

If we only look at the first five epochs, will the result be different? I will run it again, but I don’t think the number of epochs is the main factor behind the poor results.

HanielF commented 2 years ago

I notice that the number of warmup steps is determined by warmup_proportion and num_train_optimization_steps, so a larger number of epochs increases the number of warmup steps. I will set epochs to 5 and train again. Thanks for your suggestion!
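To make the dependency concrete, here is a minimal sketch of the arithmetic. It assumes the common convention that the number of warmup steps is warmup_proportion times the total number of optimization steps; the repository's scheduler may differ in detail. The helper `warmup_schedule` and the training-set size of 8000 are hypothetical, while batch_size 128, warmup_proportion 0.1 and gradient_accumulation_steps 1 are taken from the log.

```python
import math


def warmup_schedule(num_samples, batch_size, epochs,
                    warmup_proportion=0.1, gradient_accumulation_steps=1):
    """Return (warmup_steps, num_train_optimization_steps).

    Assumption: warmup lasts for warmup_proportion of all optimizer updates.
    """
    # Batches per epoch (the last partial batch counts as one step).
    steps_per_epoch = math.ceil(num_samples / batch_size)
    # Optimizer updates per epoch after gradient accumulation.
    updates_per_epoch = math.ceil(steps_per_epoch / gradient_accumulation_steps)
    num_train_optimization_steps = updates_per_epoch * epochs
    warmup_steps = int(warmup_proportion * num_train_optimization_steps)
    return warmup_steps, num_train_optimization_steps


# Hypothetical training-set size; only the ratio between the two runs matters here.
NUM_SAMPLES = 8000

warmup_10, total_10 = warmup_schedule(NUM_SAMPLES, batch_size=128, epochs=10)
warmup_5, total_5 = warmup_schedule(NUM_SAMPLES, batch_size=128, epochs=5)

print(f"10 epochs: {warmup_10} warmup steps out of {total_10}")
print(f" 5 epochs: {warmup_5} warmup steps out of {total_5}")
```

With these assumptions, halving the number of epochs roughly halves the warmup length, so the learning rate reaches its peak earlier and the overall schedule is compressed.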

HanielF commented 2 years ago

After changing the number of epochs from 10 to 5, the best t2v R@1 is 41.6 with the seqTransf configuration. The metrics are better than before. Thanks for your help! @ArrowLuo

jianghaojun commented 2 years ago

> After changing the number of epochs from 10 to 5, the best t2v R@1 is 41.6 with the seqTransf configuration. The metrics are better than before. Thanks for your help! @ArrowLuo

The result is still worse than the R@1 of 42.8 reported for seqTransf in Table 5 of the paper. @HanielF Do you know the reason for this gap?

lucas0214 commented 5 months ago

Hello, excuse me. I read your paper, "Progressive Spatio-Temporal Prototype Matching for Text-Video Retrieval". However, when I tried to reproduce the code on DiDeMo, I could not find train_data.json and the other JSON files in the dataset. I obtained all the data from CLIP4Clip. The relevant code is:

    video_json_path_dict = {}
    video_json_path_dict["train"] = os.path.join(self.data_path, "train_data_mp4.json")
    video_json_path_dict["val"] = os.path.join(self.data_path, "test_data_mp4.json")
    video_json_path_dict["test"] = os.path.join(self.data_path, "test_data_mp4.json")