The training log of meanP is as follows:
2022-02-25 16:59:39,551:INFO: Effective parameters:
2022-02-25 16:59:39,551:INFO: <<< batch_size: 128
2022-02-25 16:59:39,551:INFO: device: cuda:2 n_gpu: 8
2022-02-25 16:59:39,551:INFO: device: cuda:4 n_gpu: 8
2022-02-25 16:59:39,551:INFO: device: cuda:1 n_gpu: 8
2022-02-25 16:59:39,551:INFO: <<< batch_size_val: 16
2022-02-25 16:59:39,551:INFO: <<< cache_dir:
2022-02-25 16:59:39,551:INFO: <<< coef_lr: 0.001
2022-02-25 16:59:39,552:INFO: <<< cross_model: cross-base
2022-02-25 16:59:39,552:INFO: <<< cross_num_hidden_layers: 4
2022-02-25 16:59:39,552:INFO: <<< data_path: /root/wanghaoran09/xudi03/CLIP4Clip/data/DiDeMo
2022-02-25 16:59:39,552:INFO: <<< datatype: didemo
2022-02-25 16:59:39,552:INFO: <<< do_eval: False
2022-02-25 16:59:39,552:INFO: <<< do_lower_case: False
2022-02-25 16:59:39,552:INFO: <<< do_pretrain: False
2022-02-25 16:59:39,552:INFO: device: cuda:3 n_gpu: 8
2022-02-25 16:59:39,552:INFO: <<< do_train: True
2022-02-25 16:59:39,552:INFO: <<< epochs: 10
2022-02-25 16:59:39,552:INFO: <<< eval_frame_order: 0
2022-02-25 16:59:39,552:INFO: <<< expand_msrvtt_sentences: False
2022-02-25 16:59:39,552:INFO: device: cuda:6 n_gpu: 8
2022-02-25 16:59:39,552:INFO: <<< feature_framerate: 1
2022-02-25 16:59:39,552:INFO: <<< features_path: /root/wanghaoran09/xudi03/CLIP4Clip/data/DiDeMo/DiDeMo_Compress
2022-02-25 16:59:39,552:INFO: <<< fp16: False
2022-02-25 16:59:39,552:INFO: <<< fp16_opt_level: O1
2022-02-25 16:59:39,552:INFO: <<< freeze_layer_num: 0
2022-02-25 16:59:39,552:INFO: <<< gradient_accumulation_steps: 1
2022-02-25 16:59:39,552:INFO: <<< hard_negative_rate: 0.5
2022-02-25 16:59:39,552:INFO: device: cuda:7 n_gpu: 8
2022-02-25 16:59:39,552:INFO: <<< init_model: None
2022-02-25 16:59:39,552:INFO: <<< linear_patch: 2d
2022-02-25 16:59:39,552:INFO: <<< local_rank: 0
2022-02-25 16:59:39,552:INFO: <<< loose_type: True
2022-02-25 16:59:39,553:INFO: <<< lr: 0.0001
2022-02-25 16:59:39,553:INFO: <<< lr_decay: 0.9
2022-02-25 16:59:39,553:INFO: <<< margin: 0.1
2022-02-25 16:59:39,553:INFO: <<< max_frames: 64
2022-02-25 16:59:39,553:INFO: <<< max_words: 64
2022-02-25 16:59:39,553:INFO: <<< n_display: 10
2022-02-25 16:59:39,553:INFO: <<< n_gpu: 1
2022-02-25 16:59:39,553:INFO: <<< n_pair: 1
2022-02-25 16:59:39,553:INFO: <<< negative_weighting: 1
2022-02-25 16:59:39,553:INFO: <<< num_thread_reader: 2
2022-02-25 16:59:39,553:INFO: <<< output_dir: ckpts/ckpt_didemo_retrieval_looseType_DiDeMo_meanP
2022-02-25 16:59:39,553:INFO: <<< pretrained_clip_name: ViT-B/32
2022-02-25 16:59:39,553:INFO: <<< rank: 0
2022-02-25 16:59:39,553:INFO: <<< resume_model: None
2022-02-25 16:59:39,553:INFO: <<< sampled_use_mil: False
2022-02-25 16:59:39,553:INFO: <<< seed: 42
2022-02-25 16:59:39,553:INFO: <<< sim_header: meanP
2022-02-25 16:59:39,553:INFO: <<< slice_framepos: 2
2022-02-25 16:59:39,553:INFO: <<< task_type: retrieval
2022-02-25 16:59:39,553:INFO: <<< text_num_hidden_layers: 12
2022-02-25 16:59:39,553:INFO: <<< train_csv: data/.train.csv
2022-02-25 16:59:39,553:INFO: <<< train_frame_order: 0
2022-02-25 16:59:39,553:INFO: <<< use_mil: False
2022-02-25 16:59:39,553:INFO: <<< val_csv: data/.val.csv
2022-02-25 16:59:39,554:INFO: <<< video_dim: 1024
2022-02-25 16:59:39,554:INFO: <<< visual_num_hidden_layers: 12
2022-02-25 16:59:39,554:INFO: <<< warmup_proportion: 0.1
2022-02-25 16:59:39,554:INFO: <<< world_size: 8
2022-02-25 16:59:39,554:INFO: device: cuda:0 n_gpu: 8
2022-02-25 16:59:40,579:INFO: loading archive file /root/wanghaoran09/xudi03/CLIP4Clip-master/modules/cross-base
2022-02-25 16:59:40,579:INFO: Model config {
"attention_probs_dropout_prob": 0.1,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 512,
"initializer_range": 0.02,
"intermediate_size": 2048,
"max_position_embeddings": 128,
"num_attention_heads": 8,
"num_hidden_layers": 4,
"type_vocab_size": 2,
"vocab_size": 512
}
2022-02-25 16:59:40,579:INFO: Weight doesn't exsits. /root/wanghaoran09/xudi03/CLIP4Clip-master/modules/cross-base/cross_pytorch_model.bin
2022-02-25 16:59:40,579:WARNING: Stage-One:True, Stage-Two:False
2022-02-25 16:59:40,579:WARNING: Test retrieval by loose type.
2022-02-25 16:59:40,580:WARNING: embed_dim: 512
2022-02-25 16:59:40,580:WARNING: image_resolution: 224
2022-02-25 16:59:40,580:WARNING: vision_layers: 12
2022-02-25 16:59:40,580:WARNING: vision_width: 768
2022-02-25 16:59:40,580:WARNING: vision_patch_size: 32
2022-02-25 16:59:40,580:WARNING: context_length: 77
2022-02-25 16:59:40,580:WARNING: vocab_size: 49408
2022-02-25 16:59:40,580:WARNING: transformer_width: 512
2022-02-25 16:59:40,580:WARNING: transformer_heads: 8
2022-02-25 16:59:40,580:WARNING: transformer_layers: 12
2022-02-25 16:59:40,580:WARNING: linear_patch: 2d
2022-02-25 16:59:40,580:WARNING: cut_top_layer: 0
2022-02-25 16:59:42,763:WARNING: sim_header: meanP
2022-02-25 16:59:52,833:INFO: --------------------
2022-02-25 16:59:52,833:INFO: Weights from pretrained model not used in CLIP4Clip:
clip.input_resolution
clip.context_length
clip.vocab_size
........
2022-02-25 18:10:42,342:INFO: Epoch 4/10 Finished, Train Loss: 0.245592
2022-02-25 18:10:44,235:INFO: Model saved to ckpts/ckpt_didemo_retrieval_looseType_DiDeMo_meanP/pytorch_model.bin.3
2022-02-25 18:10:44,235:INFO: Optimizer saved to ckpts/ckpt_didemo_retrieval_looseType_DiDeMo_meanP/pytorch_opt.bin.3
2022-02-25 18:16:58,927:INFO: sim matrix size: 1003, 1003
2022-02-25 18:16:59,051:INFO: Length-T: 1003, Length-V:1003
2022-02-25 18:16:59,051:INFO: Text-to-Video:
2022-02-25 18:16:59,051:INFO: >>> R@1: 40.5 - R@5: 68.3 - R@10: 77.3 - Median R: 2.0 - Mean R: 18.7
2022-02-25 18:16:59,051:INFO: Video-to-Text:
2022-02-25 18:16:59,052:INFO: >>> V2T$R@1: 40.4 - V2T$R@5: 67.6 - V2T$R@10: 77.9 - V2T$Median R: 2.0 - V2T$Mean R: 12.4
..........
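For reference, the R@K / Median R / Mean R numbers above can be computed from the logged similarity matrix roughly as in the following minimal NumPy sketch. It assumes the ground-truth text-video pairs lie on the diagonal; retrieval_metrics is an illustrative helper, not the repository's exact code.

import numpy as np

def retrieval_metrics(sim):
    # sim: (N, N) text-video similarity matrix; the matching video for text i is video i.
    order = np.argsort(-sim, axis=1)  # candidate videos sorted by descending similarity
    n = sim.shape[0]
    # 0-based rank at which the ground-truth video appears for each text query
    gt_rank = np.array([np.where(order[i] == i)[0][0] for i in range(n)])
    return {"R@1": 100.0 * np.mean(gt_rank < 1),
            "R@5": 100.0 * np.mean(gt_rank < 5),
            "R@10": 100.0 * np.mean(gt_rank < 10),
            "Median R": float(np.median(gt_rank + 1)),
            "Mean R": float(np.mean(gt_rank + 1))}

# Text-to-video scores use sim directly; video-to-text uses the transpose:
# t2v = retrieval_metrics(sim); v2t = retrieval_metrics(sim.T)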
Hi @HanielF, I am not sure what is wrong with your run from the above log. Does your run with 5 epochs obtain worse results? And can you print the number of training samples and test samples?
@ArrowLuo Thanks for your reply. I uploaded the full log below; nothing is changed except that directories named after me are masked. 14 videos are damaged and cannot be read correctly, so I removed them from the DiDeMo dataset.
Hi @HanielF, can you run your command with 5 epochs again?
If we only focus on the first five epochs, will the result be different? I will run it again, but I don't think the epoch number is the main factor leading to the poor results.
I notice that the number of warmup steps is determined by warmup_proportion and num_train_optimization_steps, so a larger epoch number will increase the number of warmup steps (see the sketch below). I will set the epochs to 5 and train again. Thanks for your suggestion!
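As a rough illustration of that effect (the variable names and sample count below are assumptions for the sketch, not the repository's exact code):

warmup_proportion = 0.1   # from the log above
batch_size = 128
gradient_accumulation_steps = 1
num_train_samples = 8000  # placeholder; substitute the real DiDeMo training set size

def warmup_steps(epochs):
    # The optimizer warms up for warmup_proportion of all optimization steps,
    # so more epochs mean proportionally more warmup steps.
    steps_per_epoch = num_train_samples // (batch_size * gradient_accumulation_steps)
    num_train_optimization_steps = steps_per_epoch * epochs
    return int(num_train_optimization_steps * warmup_proportion)

print(warmup_steps(10))  # 62
print(warmup_steps(5))   # 31 -- halving the epochs halves the warmup length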
After changing the epoch number from 10 to 5, the max t2v R@1 is 41.6 under the seqTransf configuration. The metrics are better than before. Thanks for your help! @ArrowLuo
The result is still worse than the R@1 of 42.8 (seqTransf, reported in Table 5 of the paper).
@HanielF Do you know the reason for this gap?
Hello, excuse me. I read your paper Progressive Spatio-Temporal Prototype Matching for Text-Video Retrieval. However, when I reproduced the DiDeMo code, I could not find train_data.json and the other JSON files in the dataset; I obtained all the data from CLIP4Clip.

video_json_path_dict = {}
video_json_path_dict["train"] = os.path.join(self.data_path, "train_data_mp4.json")
video_json_path_dict["val"] = os.path.join(self.data_path, "test_data_mp4.json")
video_json_path_dict["test"] = os.path.join(self.data_path, "test_data_mp4.json")
I trained CLIP4Clip on the DiDeMo dataset, and the text-to-video R@1 is much worse than that reported in the paper.
The metric reported in the paper is 43.4 on DiDeMo when the similarity calculator is meanP, and 42.8 when the head is seqTransf. But according to my reproduction results, the max t2v R@1 based on meanP only reaches 40.5, and it only reaches 40.2 based on seqTransf. All settings remain the same as in the paper.