OpenGVLab / InternVideo

[ECCV2024] Video Foundation Models & Data for Multimodal Understanding
Apache License 2.0

Not able to reproduce the fine-tuning results of InternVideo1 #143

Open hardlipay opened 2 months ago

hardlipay commented 2 months ago

Hello, nice job! I cannot reproduce the MSRVTT fine-tuned model, even though I set every argument as in the log.

I also checked each possible problem, such as the dataloader and the weights, but I still cannot reproduce it.

I also see in the log that you used a pretrained model which could be different from mine. Is that right?

It is this one. Could you provide the weights for me? Thank you!

pretrained_path: /mnt/lustre/share_data/liyizhuo/projects/all-in-one/outputs/outputs_cotrain/models/clip_kc_new_L14_vtc_cap_3plusM_step400k_bz1792/ensemble.ckpt

hardlipay commented 2 months ago

Here is my log:

2024-07-11 11:21:38,063:INFO: Effective parameters:
2024-07-11 11:21:38,063:INFO: device: cuda:4 n_gpu: 8
2024-07-11 11:21:38,063:INFO: device: cuda:5 n_gpu: 8
2024-07-11 11:21:38,063:INFO: device: cuda:3 n_gpu: 8
2024-07-11 11:21:38,063:INFO: device: cuda:7 n_gpu: 8
2024-07-11 11:21:38,063:INFO: device: cuda:6 n_gpu: 8
2024-07-11 11:21:38,063:INFO: device: cuda:1 n_gpu: 8
2024-07-11 11:21:38,063:INFO: device: cuda:2 n_gpu: 8
2024-07-11 11:21:38,063:INFO: <<< batch_size: 512
2024-07-11 11:21:38,063:INFO: <<< batch_size_val: 16
2024-07-11 11:21:38,063:INFO: <<< cache_dir:
2024-07-11 11:21:38,063:INFO: <<< cdcr: 0
2024-07-11 11:21:38,063:INFO: <<< clip_evl: False
2024-07-11 11:21:38,063:INFO: <<< coef_lr: 0.005
2024-07-11 11:21:38,063:INFO: <<< cross_model: cross-base
2024-07-11 11:21:38,063:INFO: <<< cross_num_hidden_layers: 4
2024-07-11 11:21:38,063:INFO: <<< data_path: ./msrvtt_data/MSRVTT_data.json
2024-07-11 11:21:38,063:INFO: <<< datatype: msrvtt
2024-07-11 11:21:38,063:INFO: <<< dist_url: tcp://127.0.0.1:29500
2024-07-11 11:21:38,063:INFO: <<< do_eval: False
2024-07-11 11:21:38,063:INFO: <<< do_lower_case: False
2024-07-11 11:21:38,063:INFO: <<< do_pretrain: False
2024-07-11 11:21:38,063:INFO: <<< do_train: True
2024-07-11 11:21:38,063:INFO: <<< epochs: 5
2024-07-11 11:21:38,063:INFO: <<< eval_frame_order: 0
2024-07-11 11:21:38,063:INFO: <<< expand_msrvtt_sentences: True
2024-07-11 11:21:38,063:INFO: <<< feature_framerate: 1
2024-07-11 11:21:38,063:INFO: <<< features_path: /openbayes/home/InternVideo/dataset/11_new
2024-07-11 11:21:38,063:INFO: <<< fp16: False
2024-07-11 11:21:38,063:INFO: <<< fp16_opt_level: O1
2024-07-11 11:21:38,064:INFO: <<< freeze_layer_num: 0
2024-07-11 11:21:38,064:INFO: <<< gpu: 0
2024-07-11 11:21:38,064:INFO: <<< gradient_accumulation_steps: 1
2024-07-11 11:21:38,064:INFO: <<< hard_negative_rate: 0.5
2024-07-11 11:21:38,064:INFO: <<< init_model: None
2024-07-11 11:21:38,064:INFO: <<< interaction: no
2024-07-11 11:21:38,064:INFO: <<< linear_patch: 2d
2024-07-11 11:21:38,064:INFO: <<< local_rank: 0
2024-07-11 11:21:38,064:INFO: <<< loose_type: True
2024-07-11 11:21:38,064:INFO: <<< lr: 0.001
2024-07-11 11:21:38,064:INFO: <<< lr_decay: 0.9
2024-07-11 11:21:38,064:INFO: <<< margin: 0.1
2024-07-11 11:21:38,064:INFO: <<< max_frames: 12
2024-07-11 11:21:38,064:INFO: <<< max_words: 77
2024-07-11 11:21:38,064:INFO: <<< mergeclip: False
2024-07-11 11:21:38,064:INFO: <<< mergeweight: 0.5
2024-07-11 11:21:38,064:INFO: <<< n_display: 50
2024-07-11 11:21:38,064:INFO: <<< n_gpu: 1
2024-07-11 11:21:38,064:INFO: <<< n_pair: 1
2024-07-11 11:21:38,064:INFO: <<< negative_weighting: 1
2024-07-11 11:21:38,064:INFO: <<< num_thread_reader: 16
2024-07-11 11:21:38,064:INFO: <<< output_dir: ./ret_mgpu_mydata_finetune_use_pretrained_path
2024-07-11 11:21:38,064:INFO: <<< pretrained_clip_name: /openbayes/home/InternVideo/dataset/pretrained_weights/clip/ViT-B-32.pt
2024-07-11 11:21:38,064:INFO: <<< pretrained_path: /openbayes/home/InternVideo/dataset/pretrained_weights/InternVideo-MM-L-14.ckpt
2024-07-11 11:21:38,064:INFO: <<< rank: 0
2024-07-11 11:21:38,064:INFO: <<< resume_model: None
2024-07-11 11:21:38,064:INFO: <<< sampled_use_mil: False
2024-07-11 11:21:38,064:INFO: <<< seed: 42
2024-07-11 11:21:38,064:INFO: <<< sim_header: meanP
2024-07-11 11:21:38,064:INFO: <<< slice_framepos: 2
2024-07-11 11:21:38,064:INFO: <<< task_type: retrieval
2024-07-11 11:21:38,064:INFO: <<< text_num_hidden_layers: 12
2024-07-11 11:21:38,064:INFO: <<< train_csv: ./msrvtt_data/MSRVTT_train.9k.csv
2024-07-11 11:21:38,064:INFO: <<< train_frame_order: 0
2024-07-11 11:21:38,064:INFO: <<< use_mil: False
2024-07-11 11:21:38,064:INFO: <<< val_csv: ./msrvtt_data/MSRVTT_JSFUSION_test.csv
2024-07-11 11:21:38,064:INFO: <<< video_dim: 1024
2024-07-11 11:21:38,064:INFO: <<< visual_num_hidden_layers: 12
2024-07-11 11:21:38,064:INFO: <<< warmup_proportion: 0.1
2024-07-11 11:21:38,065:INFO: <<< world_size: 8
2024-07-11 11:21:38,065:INFO: <<< wti_arch: 0
2024-07-11 11:21:38,065:INFO: device: cuda:0 n_gpu: 8
2024-07-11 11:21:39,913:INFO: loading archive file /output/InternVideo/InternVideo1/Downstream/Video-Text-Retrieval/modules/cross-base
2024-07-11 11:21:39,913:INFO: Model config {
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 512,
  "initializer_range": 0.02,
  "intermediate_size": 2048,
  "max_position_embeddings": 128,
  "num_attention_heads": 8,
  "num_hidden_layers": 4,
  "type_vocab_size": 2,
  "vocab_size": 512
}
.......
024-07-11 11:21:55,630:INFO: ***** Running training *****
2024-07-11 11:21:55,631:INFO: Num examples = 180000
2024-07-11 11:21:55,631:INFO: Batch size = 512
2024-07-11 11:21:55,631:INFO: Num steps = 1755
2024-07-11 11:22:32,338:INFO: Reducer buckets have been rebuilt in this iteration.
2024-07-11 11:22:32,344:INFO: Reducer buckets have been rebuilt in this iteration.
2024-07-11 11:22:32,421:INFO: Reducer buckets have been rebuilt in this iteration.
2024-07-11 11:22:32,422:INFO: Reducer buckets have been rebuilt in this iteration.
2024-07-11 11:22:32,423:INFO: Reducer buckets have been rebuilt in this iteration.
2024-07-11 11:22:32,423:INFO: Reducer buckets have been rebuilt in this iteration.
2024-07-11 11:22:32,426:INFO: Reducer buckets have been rebuilt in this iteration.
2024-07-11 11:22:32,425:INFO: Reducer buckets have been rebuilt in this iteration.
2024-07-11 11:24:07,584:INFO: Epoch: 1/5, Step: 50/351, Lr: , Loss: 2.227273, Time/step: 2.635337
2024-07-11 11:25:32,854:INFO: Epoch: 1/5, Step: 100/351, Lr: , Loss: 2.295935, Time/step: 1.705309
2024-07-11 11:27:00,094:INFO: Epoch: 1/5, Step: 150/351, Lr: , Loss: 2.706636, Time/step: 1.744746
2024-07-11 11:28:26,450:INFO: Epoch: 1/5, Step: 200/351, Lr: , Loss: 2.736752, Time/step: 1.727095
2024-07-11 11:29:53,309:INFO: Epoch: 1/5, Step: 250/351, Lr: , Loss: 2.657757, Time/step: 1.737177
2024-07-11 11:31:19,335:INFO: Epoch: 1/5, Step: 300/351, Lr: , Loss: 2.597078, Time/step: 1.720514
2024-07-11 11:32:48,897:INFO: Epoch: 1/5, Step: 350/351, Lr: , Loss: 2.315990, Time/step: 1.791219
2024-07-11 11:32:51,493:INFO: Epoch 1/5 Finished, Train Loss: 2.562695
2024-07-11 11:32:56,494:INFO: Model saved to ./ret_mgpu_mydata_finetune_use_pretrained_path/pytorch_model.bin
2024-07-11 11:32:56,495:INFO: Optimizer saved to ./ret_mgpu_mydata_finetune_use_pretrained_path/pytorch_opt.bin
2024-07-11 11:33:16,583:INFO: sim matrix size: 1000, 1000
2024-07-11 11:33:16,768:INFO: Length-T: 1000, Length-V:1000
2024-07-11 11:33:16,768:INFO: ------------------------------------------------------------
2024-07-11 11:33:16,768:INFO: DSL Text-to-Video:
2024-07-11 11:33:16,768:INFO: >>> R@1: 18.7 - R@5: 43.2 - R@10: 55.7 - Median R: 8.0 - Mean R: 45.0
2024-07-11 11:33:16,768:INFO: DSL Video-to-Text:
2024-07-11 11:33:16,768:INFO: >>> V2T$R@1: 18.0 - V2T$R@5: 42.6 - V2T$R@10: 54.4 - V2T$Median R: 8.5 - V2T$Mean R: 43.7
2024-07-11 11:33:16,768:INFO: ------------------------------------------------------------
2024-07-11 11:33:16,768:INFO: Text-to-Video:
2024-07-11 11:33:16,769:INFO: >>> R@1: 15.7 - R@5: 39.4 - R@10: 52.6 - Median R: 9.0 - Mean R: 48.4
2024-07-11 11:33:16,769:INFO: Video-to-Text:
2024-07-11 11:33:16,769:INFO: >>> V2T$R@1: 15.1 - V2T$R@5: 39.3 - V2T$R@10: 50.7 - V2T$Median R: 10.0 - V2T$Mean R: 48.8
2024-07-11 11:33:16,770:INFO: The best model is: ./ret_mgpu_mydata_finetune_use_pretrained_path/pytorch_model.bin, the R1 is: 15.7000
2024-07-11 11:34:40,019:INFO: Epoch: 2/5, Step: 49/351, Lr: , Loss: 1.967601, Time/step: 1.661535
2024-07-11 11:36:04,756:INFO: Epoch: 2/5, Step: 99/351, Lr: , Loss: 1.975278, Time/step: 1.694740
2024-07-11 11:37:29,064:INFO: Epoch: 2/5, Step: 149/351, Lr: , Loss: 1.755215, Time/step: 1.686095
2024-07-11 11:38:54,342:INFO: Epoch: 2/5, Step: 199/351, Lr: , Loss: 1.821315, Time/step: 1.705289
2024-07-11 11:40:19,421:INFO: Epoch: 2/5, Step: 249/351, Lr: , Loss: 1.728463, Time/step: 1.701576
2024-07-11 11:41:45,084:INFO: Epoch: 2/5, Step: 299/351, Lr: , Loss: 1.622554, Time/step: 1.713192
2024-07-11 11:43:11,891:INFO: Epoch: 2/5, Step: 349/351, Lr: , Loss: 1.491672, Time/step: 1.736059
2024-07-11 11:43:16,443:INFO: Epoch 2/5 Finished, Train Loss: 1.796257
2024-07-11 11:43:19,608:INFO: Model saved to ./ret_mgpu_mydata_finetune_use_pretrained_path/pytorch_model.bin
2024-07-11 11:43:19,609:INFO: Optimizer saved to ./ret_mgpu_mydata_finetune_use_pretrained_path/pytorch_opt.bin
2024-07-11 11:43:36,846:INFO: sim matrix size: 1000, 1000
2024-07-11 11:43:37,040:INFO: Length-T: 1000, Length-V:1000
2024-07-11 11:43:37,040:INFO: ------------------------------------------------------------
2024-07-11 11:43:37,040:INFO: DSL Text-to-Video:
2024-07-11 11:43:37,040:INFO: >>> R@1: 20.9 - R@5: 45.5 - R@10: 56.9 - Median R: 7.0 - Mean R: 38.2
2024-07-11 11:43:37,040:INFO: DSL Video-to-Text:
2024-07-11 11:43:37,040:INFO: >>> V2T$R@1: 21.0 - V2T$R@5: 45.2 - V2T$R@10: 55.9 - V2T$Median R: 7.5 - V2T$Mean R: 38.9
2024-07-11 11:43:37,040:INFO: ------------------------------------------------------------
2024-07-11 11:43:37,040:INFO: Text-to-Video:
2024-07-11 11:43:37,040:INFO: >>> R@1: 20.3 - R@5: 43.1 - R@10: 55.2 - Median R: 8.0 - Mean R: 42.8
2024-07-11 11:43:37,040:INFO: Video-to-Text:
2024-07-11 11:43:37,040:INFO: >>> V2T$R@1: 18.8 - V2T$R@5: 42.0 - V2T$R@10: 54.3 - V2T$Median R: 8.0 - V2T$Mean R: 46.8
2024-07-11 11:43:37,042:INFO: The best model is: ./ret_mgpu_mydata_finetune_use_pretrained_path/pytorch_model.bin, the R1 is: 20.3000
2024-07-11 11:44:59,654:INFO: Epoch: 3/5, Step: 48/351, Lr: , Loss: 1.288813, Time/step: 1.648480
2024-07-11 11:46:27,554:INFO: Epoch: 3/5, Step: 98/351, Lr: , Loss: 1.066350, Time/step: 1.757993
2024-07-11 11:47:53,040:INFO: Epoch: 3/5, Step: 148/351, Lr: , Loss: 1.266637, Time/step: 1.709701
2024-07-11 11:49:20,329:INFO: Epoch: 3/5, Step: 198/351, Lr: , Loss: 1.178898, Time/step: 1.745760
2024-07-11 11:50:46,582:INFO: Epoch: 3/5, Step: 248/351, Lr: , Loss: 1.115819, Time/step: 1.724987
2024-07-11 11:52:14,526:INFO: Epoch: 3/5, Step: 298/351, Lr: , Loss: 1.098051, Time/step: 1.758858
2024-07-11 11:53:43,083:INFO: Epoch: 3/5, Step: 348/351, Lr: , Loss: 1.026722, Time/step: 1.771078
2024-07-11 11:53:49,132:INFO: Epoch 3/5 Finished, Train Loss: 1.159029
2024-07-11 11:53:52,302:INFO: Model saved to ./ret_mgpu_mydata_finetune_use_pretrained_path/pytorch_model.bin
2024-07-11 11:53:52,302:INFO: Optimizer saved to ./ret_mgpu_mydata_finetune_use_pretrained_path/pytorch_opt.bin
2024-07-11 11:54:09,578:INFO: sim matrix size: 1000, 1000
2024-07-11 11:54:09,769:INFO: Length-T: 1000, Length-V:1000
2024-07-11 11:54:09,769:INFO: ------------------------------------------------------------
2024-07-11 11:54:09,769:INFO: DSL Text-to-Video:
2024-07-11 11:54:09,769:INFO: >>> R@1: 24.2 - R@5: 47.6 - R@10: 61.6 - Median R: 6.0 - Mean R: 37.8
2024-07-11 11:54:09,769:INFO: DSL Video-to-Text:
2024-07-11 11:54:09,769:INFO: >>> V2T$R@1: 23.7 - V2T$R@5: 48.1 - V2T$R@10: 61.1 - V2T$Median R: 6.0 - V2T$Mean R: 37.0
2024-07-11 11:54:09,769:INFO: ------------------------------------------------------------
2024-07-11 11:54:09,769:INFO: Text-to-Video:
2024-07-11 11:54:09,769:INFO: >>> R@1: 21.7 - R@5: 48.2 - R@10: 59.9 - Median R: 6.0 - Mean R: 40.4
2024-07-11 11:54:09,769:INFO: Video-to-Text:
2024-07-11 11:54:09,769:INFO: >>> V2T$R@1: 19.0 - V2T$R@5: 44.7 - V2T$R@10: 56.5 - V2T$Median R: 7.0 - V2T$Mean R: 46.6
2024-07-11 11:54:09,771:INFO: The best model is: ./ret_mgpu_mydata_finetune_use_pretrained_path/pytorch_model.bin, the R1 is: 21.7000
2024-07-11 11:55:30,040:INFO: Epoch: 4/5, Step: 47/351, Lr: , Loss: 0.729010, Time/step: 1.601594
2024-07-11 11:56:57,121:INFO: Epoch: 4/5, Step: 97/351, Lr: , Loss: 0.806269, Time/step: 1.741610
2024-07-11 11:58:22,053:INFO: Epoch: 4/5, Step: 147/351, Lr: , Loss: 0.769861, Time/step: 1.698622
2024-07-11 11:59:49,861:INFO: Epoch: 4/5, Step: 197/351, Lr: , Loss: 0.791283, Time/step: 1.756077
2024-07-11 12:01:16,158:INFO: Epoch: 4/5, Step: 247/351, Lr: , Loss: 0.659234, Time/step: 1.725929
2024-07-11 12:02:43,180:INFO: Epoch: 4/5, Step: 297/351, Lr: , Loss: 0.681899, Time/step: 1.740434
2024-07-11 12:04:11,277:INFO: Epoch: 4/5, Step: 347/351, Lr: , Loss: 0.799178, Time/step: 1.761938
2024-07-11 12:04:19,146:INFO: Epoch 4/5 Finished, Train Loss: 0.740242
2024-07-11 12:04:22,161:INFO: Model saved to ./ret_mgpu_mydata_finetune_use_pretrained_path/pytorch_model.bin
2024-07-11 12:04:22,162:INFO: Optimizer saved to ./ret_mgpu_mydata_finetune_use_pretrained_path/pytorch_opt.bin
2024-07-11 12:04:39,896:INFO: sim matrix size: 1000, 1000
2024-07-11 12:04:40,081:INFO: Length-T: 1000, Length-V:1000
2024-07-11 12:04:40,082:INFO: ------------------------------------------------------------
2024-07-11 12:04:40,082:INFO: DSL Text-to-Video:
2024-07-11 12:04:40,082:INFO: >>> R@1: 23.2 - R@5: 49.4 - R@10: 60.7 - Median R: 6.0 - Mean R: 38.2
2024-07-11 12:04:40,082:INFO: DSL Video-to-Text:
2024-07-11 12:04:40,082:INFO: >>> V2T$R@1: 22.0 - V2T$R@5: 48.3 - V2T$R@10: 61.3 - V2T$Median R: 6.0 - V2T$Mean R: 37.8
2024-07-11 12:04:40,082:INFO: ------------------------------------------------------------
2024-07-11 12:04:40,082:INFO: Text-to-Video:
2024-07-11 12:04:40,082:INFO: >>> R@1: 22.0 - R@5: 49.5 - R@10: 60.6 - Median R: 6.0 - Mean R: 41.4
2024-07-11 12:04:40,082:INFO: Video-to-Text:
2024-07-11 12:04:40,082:INFO: >>> V2T$R@1: 20.3 - V2T$R@5: 44.4 - V2T$R@10: 56.2 - V2T$Median R: 8.0 - V2T$Mean R: 48.2
2024-07-11 12:04:40,083:INFO: The best model is: ./ret_mgpu_mydata_finetune_use_pretrained_path/pytorch_model.bin, the R1 is: 22.0000
2024-07-11 12:05:58,897:INFO: Epoch: 5/5, Step: 46/351, Lr: , Loss: 0.583870, Time/step: 1.572555
2024-07-11 12:07:24,675:INFO: Epoch: 5/5, Step: 96/351, Lr: , Loss: 0.448119, Time/step: 1.715497
2024-07-11 12:08:49,759:INFO: Epoch: 5/5, Step: 146/351, Lr: , Loss: 0.492565, Time/step: 1.701642
2024-07-11 12:10:16,849:INFO: Epoch: 5/5, Step: 196/351, Lr: , Loss: 0.521736, Time/step: 1.741791
2024-07-11 12:11:42,517:INFO: Epoch: 5/5, Step: 246/351, Lr: , Loss: 0.593410, Time/step: 1.713338
2024-07-11 12:13:10,964:INFO: Epoch: 5/5, Step: 296/351, Lr: , Loss: 0.528416, Time/step: 1.768792
2024-07-11 12:14:38,727:INFO: Epoch: 5/5, Step: 346/351, Lr: , Loss: 0.520307, Time/step: 1.755247
2024-07-11 12:14:48,573:INFO: Epoch 5/5 Finished, Train Loss: 0.503294
2024-07-11 12:14:50,565:INFO: Model saved to ./ret_mgpu_mydata_finetune_use_pretrained_path/pytorch_model.bin
2024-07-11 12:14:50,565:INFO: Optimizer saved to ./ret_mgpu_mydata_finetune_use_pretrained_path/pytorch_opt.bin
2024-07-11 12:14:58,720:INFO: sim matrix size: 1000, 1000
2024-07-11 12:14:58,905:INFO: Length-T: 1000, Length-V:1000
2024-07-11 12:14:58,905:INFO: ------------------------------------------------------------
2024-07-11 12:14:58,905:INFO: DSL Text-to-Video:
2024-07-11 12:14:58,905:INFO: >>> R@1: 22.6 - R@5: 47.3 - R@10: 59.7 - Median R: 6.0 - Mean R: 39.5
2024-07-11 12:14:58,905:INFO: DSL Video-to-Text:
2024-07-11 12:14:58,905:INFO: >>> V2T$R@1: 21.3 - V2T$R@5: 46.5 - V2T$R@10: 59.7 - V2T$Median R: 7.0 - V2T$Mean R: 40.6
2024-07-11 12:14:58,905:INFO: ------------------------------------------------------------
2024-07-11 12:14:58,905:INFO: Text-to-Video:
2024-07-11 12:14:58,905:INFO: >>> R@1: 22.0 - R@5: 48.5 - R@10: 59.5 - Median R: 6.0 - Mean R: 43.5
2024-07-11 12:14:58,905:INFO: Video-to-Text:
2024-07-11 12:14:58,905:INFO: >>> V2T$R@1: 17.6 - V2T$R@5: 41.8 - V2T$R@10: 53.3 - V2T$Median R: 9.0 - V2T$Mean R: 54.3
2024-07-11 12:14:58,906:INFO: The best model is: ./ret_mgpu_mydata_finetune_use_pretrained_path/pytorch_model.bin, the R1 is: 22.0000
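
For readers comparing numbers against this log, the sketch below shows how R@K, Median R and Mean R are commonly derived from a text-to-video similarity matrix such as the 1000 x 1000 one reported above. It is not the repository's evaluation code: the function name `retrieval_metrics` is hypothetical, and it assumes the correct video for text query `i` sits at column `i`, which may not match how the repository pairs captions and videos.

```python
from statistics import median


def retrieval_metrics(sim):
    """Compute recall and rank statistics from a square similarity matrix.

    sim[i][j] is the similarity between query i and candidate j.
    Assumption: the ground-truth match for query i is candidate i.
    """
    ranks = []
    for i, row in enumerate(sim):
        # Rank of the true match = 1 + number of candidates scoring strictly higher.
        ranks.append(1 + sum(1 for score in row if score > row[i]))

    n = len(ranks)
    return {
        "R@1": 100.0 * sum(r <= 1 for r in ranks) / n,
        "R@5": 100.0 * sum(r <= 5 for r in ranks) / n,
        "R@10": 100.0 * sum(r <= 10 for r in ranks) / n,
        "Median R": float(median(ranks)),
        "Mean R": sum(ranks) / n,
    }


# Toy 3 x 3 example: queries 0 and 2 rank their match first, query 1 ranks it second.
toy_sim = [
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.1],
    [0.2, 0.1, 0.7],
]
metrics = retrieval_metrics(toy_sim)
print(metrics)
```

Under the same diagonal assumption, the video-to-text direction can be obtained by passing the transposed matrix, for example `retrieval_metrics([list(col) for col in zip(*toy_sim)])`.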