Closed: AAUfoa closed this issue 2 years ago
Hi @AAUfoa, please ignore cross_pytorch_model.bin, which is not used in this project; we left it out when we released the code. Points 2 and 3 that you mentioned above are normal. I am not sure what caused the divergence of your loss; could you post your command and part of the loss log here?
It is also normal that each GPU has two subprocesses, because we evaluate in parallel with cached features. See here. Best~
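For intuition, the parallel evaluation described above amounts to each process scoring its share of the cached text features against all video features, after which the per-worker slices are gathered into the full similarity matrix. A simplified, framework-free sketch with hypothetical names (not the repo's actual code):

```python
def dot(u, v):
    """Plain dot product standing in for the model's similarity score."""
    return sum(a * b for a, b in zip(u, v))

def sim_slice(text_feats, video_feats, rank, world_size):
    """Score only this worker's share of the text queries against every video."""
    rows = range(rank, len(text_feats), world_size)  # strided split across workers
    return [(i, [dot(text_feats[i], v) for v in video_feats]) for i in rows]

def gather(slices, n):
    """Reassemble the full n-row similarity matrix from the workers' slices."""
    sim = [None] * n
    for part in slices:
        for i, row in part:
            sim[i] = row
    return sim
```

In the real code the slices would be computed on different GPUs and gathered across processes, but the bookkeeping is the same.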
Thanks for the reply.
Hi @AAUfoa, you'd better kill all subprocesses before rerunning the command; there is no other place that can spawn a new process. Validation is time-consuming, so we use multiple GPUs to speed it up.
Thanks for the reply~ "you'd better kill all subprocesses before rerunning the command; there is no other place that can spawn a new process. Validation is time-consuming, so we use multiple GPUs to speed it up."
Here is part of the log:
2022-05-18 18:14:11,322:INFO: device: cuda:2 n_gpu: 8
2022-05-18 18:14:11,368:INFO: <<< batch_size: 64
2022-05-18 18:14:11,427:INFO: <<< batch_size_val: 8
2022-05-18 18:14:11,444:INFO: <<< cache_dir:
2022-05-18 18:14:11,459:INFO: <<< coef_lr: 1.0
2022-05-18 18:14:11,468:INFO: <<< cross_model: cross-base
2022-05-18 18:14:11,477:INFO: <<< cross_num_hidden_layers: 4
2022-05-18 18:14:11,486:INFO: <<< data_path: xxx/NLG/DATA/msrvtt_data/MSRVTT_data.json
2022-05-18 18:14:11,495:INFO: <<< datatype: msrvtt
2022-05-18 18:14:11,505:INFO: <<< do_eval: False
2022-05-18 18:14:11,514:INFO: <<< do_lower_case: False
2022-05-18 18:14:11,523:INFO: <<< do_pretrain: False
2022-05-18 18:14:11,533:INFO: <<< do_train: True
2022-05-18 18:14:11,542:INFO: <<< epochs: 100
2022-05-18 18:14:11,551:INFO: <<< eval_frame_order: 0
2022-05-18 18:14:11,561:INFO: <<< expand_msrvtt_sentences: False
2022-05-18 18:14:11,570:INFO: <<< feature_framerate: 1
2022-05-18 18:14:11,579:INFO: <<< features_path: xxx/NLG/DATA/
2022-05-18 18:14:11,589:INFO: <<< fp1
2022-05-18 18:14:11,598:INFO: <<< fp16_opt_level: O1
2022-05-18 18:14:11,607:INFO: <<< freeze_layer_num: 0
2022-05-18 18:14:11,617:INFO: <<< gradient_accumulation_steps: 1
2022-05-18 18:14:11,626:INFO: <<< hard_negative_rate: 0.5
2022-05-18 18:14:11,635:INFO: <<< init_model: None
2022-05-18 18:14:11,645:INFO: <<< linear_patch: 2d
2022-05-18 18:14:11,654:INFO: <<< local_rank: 0
2022-05-18 18:14:11,663:INFO: <<< loose_type: True
2022-05-18 18:14:11,672:INFO: <<< lr: 0.0001
2022-05-18 18:14:11,689:INFO: <<< lr_decay: 0.9
2022-05-18 18:14:11,708:INFO: <<< margin: 0.1
2022-05-18 18:14:11,717:INFO: <<< max_frames: 20
2022-05-18 18:14:11,726:INFO: <<< max_words: 20
2022-05-18 18:14:11,736:INFO: <<< n_display: 10
2022-05-18 18:14:11,745:INFO: <<< n_gpu: None
2022-05-18 18:14:11,754:INFO: <<< n_pair: 1
2022-05-18 18:14:11,763:INFO: <<< negative_weighting: 1
2022-05-18 18:14:11,771:INFO: <<< num_thread_reader: 1
2022-05-18 18:14:11,780:INFO: <<< output_dir: ckpts/ckpt_msrvtt_retrieval_looseType
2022-05-18 18:14:11,789:INFO: <<< pretrained_clip_name: ViT-B/16
2022-05-18 18:14:11,798:INFO: <<< rank: 8
2022-05-18 18:14:11,806:INFO: <<< resume_model: None
2022-05-18 18:14:11,816:INFO: <<< sampled_use_mil: False
2022-05-18 18:14:11,825:INFO: <<< seed: 42
2022-05-18 18:14:11,834:INFO: <<< sim_header: meanP
2022-05-18 18:14:11,842:INFO: <<< slice_framepos: 0
2022-05-18 18:14:11,851:INFO: <<< task_type: retrieval
2022-05-18 18:14:11,870:INFO: <<< task_type: retrieval
2022-05-18 18:14:11,879:INFO: <<< text_num_hidden_layers: 12
2022-05-18 18:14:11,888:INFO: <<< train_csv: xxx
2022-05-18 18:14:11,906:INFO: <<< use_mil: False
2022-05-18 18:14:11,914:INFO: <<< val_csv:
2022-05-18 18:14:11,939:INFO: device: cuda:0 n_gpu: 8
2022-05-18 18:14:11,941:INFO: <<< warmup_proportion: 0.1
2022-05-18 18:14:11,950:INFO: <<< world_size: 16
2022-05-18 18:14:11,959:INFO: device: cuda:0 n_gpu: 8
2022-05-18 18:14:14,147:INFO: loadin
2022-05-18 18:14:14,158:INFO: Model config { "attention_probs_dropout_prob": 0.1, "hidden_act": "gelu", "hidden_dropout_prob": 0.1, "hidden_size": 512, "initializer_range": 0.02, "intermediate_size": 2048, "max_position_embeddings": 128, "num_attention_heads": 8, "num_hidden_layers": 4, "type_vocab_size": 2, "vocab_size": 512 }
2022-05-18 18:14:14,174:INFO: Weight doesn't exsits. xxs/cross-base/cross_pytorch_model.bin
2022-05-18 18:14:14,190:WARNING: Stage-One:True, Stage-Two:False
2022-05-18 18:14:14,205:WARNING: Test retrieval by loose type.
2022-05-18 18:14:14,230:WARNING: embed_dim: 512
2022-05-18 18:14:14,239:WARNING: image_resolution: 224
2022-05-18 18:14:14,248:WARNING: vision_layers: 12
2022-05-18 18:14:14,256:WARNING: vision_width: 768
2022-05-18 18:14:14,265:WARNING: vision_patch_size: 16
2022-05-18 18:14:14,274:WARNING: context_length: 77
2022-05-18 18:14:14,283:WARNING: vocab_size: 49408
2022-05-18 18:14:14,291:WARNING: transformer_width: 512
2022-05-18 18:14:14,300:WARNING: transformer_heads: 8
2022-05-18 18:14:14,309:WARNING: transformer_layers: 12
2022-05-18 18:14:14,318:WARNING: linear_patch: 2d
2022-05-18 18:14:14,327:WARNING: cut_top_layer: 0
2022-05-18 18:14:16,489:WARNING: sim_header: meanP
2022-05-18 18:14:24,620:INFO: --------------------
2022-05-18 18:14:24,629:INFO: Weights from pretrained model not used in CLIP4Clip: clip.input_resolution clip.context_length clip.vocab_size
2022-05-18 18:14:24,747:INFO: Running test
2022-05-18 18:14:24,757:INFO: Num examples = 1000
2022-05-18 18:14:24,767:INFO: Batch size = 8
2022-05-18 18:14:24,777:INFO: Num steps = 125
2022-05-18 18:14:24,786:INFO: Running val
2022-05-18 18:14:24,795:INFO: Num examples = 1000
2022-05-18 18:14:25,336:INFO: Running training
2022-05-18 18:14:25,345:INFO: Num examples = 9000
2022-05-18 18:14:25,355:INFO: Batch size = 64
2022-05-18 18:14:25,365:INFO: Num steps = 7000
2022-05-18 18:15:21,744:INFO: Epoch: 1/100, Step: 10/70, Lr: 0.000001429, Loss: 1.948961, Time/step: 5.636896
2022-05-18 18:16:11,168:INFO: Epoch: 1/100, Step: 20/70, Lr: 0.000002857, Loss: 2.375628, Time/step: 4.941260
2022-05-18 18:16:59,667:INFO: Epoch: 1/100, Step: 30/70, Lr: 0.000004286, Loss: 3.105366, Time/step: 4.848969
2022-05-18 18:17:49,011:INFO: Epoch: 1/100, Step: 40/70, Lr: 0.000005714, Loss: 4.813864, Time/step: 4.933422
2022-05-18 18:18:37,438:INFO: Epoch: 1/100, Step: 50/70, Lr: 0.000007143, Loss: 4.856089, Time/step: 4.841641
2022-05-18 18:19:28,204:INFO: Epoch: 1/100, Step: 60/70, Lr: 0.000008571, Loss: 4.836089, Time/step: 5.075454
2022-05-18 18:20:18,657:INFO: Epoch: 1/100, Step: 70/70, Lr: 0.000010000, Loss: 4.832625, Time/step: 5.044148
2022-05-18 18:20:18,818:INFO: Epoch 1/100 Finished, Train Loss: 3.743831
2022-05-18 18:20:57,046:INFO: Model saved to ckpts/ckpt_msrvtt_retrieval_looseType/pytorch_model.bin.0
2022-05-18 18:20:57,056:INFO: Optimizer saved to ckpts/ckpt_msrvtt_retrieval_looseType/pytorch_opt.bin.0
2022-05-18 18:20:57,065:INFO: Eval on val dataset
2022-05-18 18:27:06,559:INFO: sim matrix size: 1000, 1000
2022-05-18 18:27:06,678:INFO: Length-T: 1000, Length-V:1000
2022-05-18 18:27:06,688:INFO: Text-to-Video:
2022-05-18 18:27:06,697:INFO: >>> R@1: 0.1 - R@5: 0.6 - R@10: 1.6 - Median R: 400.0 - Mean R: 426.8
2022-05-18 18:27:06,706:INFO: Video-to-Text:
2022-05-18 18:27:06,714:INFO: >>> V2T$R@1: 0.1 - V2T$R@5: 0.5 - V2T$R@10: 1.0 - V2T$Median R: 486.0 - V2T$Mean R: 491.6
2022-05-18 18:27:06,725:INFO: The best model is: ckpts/ckpt_msrvtt_retrieval_looseType/pytorch_model.bin.0, the R1 is: 0.0999
2022-05-18 18:27:15,255:INFO: sim matrix size: 1000, 1000
2022-05-18 18:27:15,371:INFO: Length-T: 1000, Length-V:1000
2022-05-18 18:27:15,380:INFO: Text-to-Video:
2022-05-18 18:27:15,389:INFO: >>> R@1: 0.1 - R@5: 0.6 - R@10: 1.6 - Median R: 400.0 - Mean R: 426.8
2022-05-18 18:27:15,398:INFO: Video-to-Text:
2022-05-18 18:27:15,407:INFO: >>> V2T$R@1: 0.1 - V2T$R@5: 0.5 - V2T$R@10: 1.0 - V2T$Median R: 486.0 - V2T$Mean R: 491.6
2022-05-18 18:27:15,417:INFO: The best model is: ckpts/ckpt_msrvtt_retrieval_looseType/pytorch_model.bin.0, the R1 is: 0.0999
2022-05-18 18:28:08,105:INFO: Epoch: 2/100, Step: 10/70, Lr: 0.000011429, Loss: 4.852544, Time/step: 6.062759
2022-05-18 18:28:57,516:INFO: Epoch: 2/100, Step: 20/70, Lr: 0.000012857, Loss: 4.852169, Time/step: 4.940400
2022-05-18 18:29:46,217:INFO: Epoch: 2/100, Step: 30/70, Lr: 0.000014286, Loss: 4.852092, Time/step: 4.869041
2022-05-18 18:30:36,893:INFO: Epoch: 2/100, Step: 40/70, Lr: 0.000015714, Loss: 4.852071, Time/step: 5.066491
2022-05-18 18:31:25,697:INFO: Epoch: 2/100, Step: 50/70, Lr: 0.000017143, Loss: 4.852052, Time/step: 4.894175
2022-05-18 18:32:16,210:INFO: Epoch: 2/100, Step: 60/70, Lr: 0.000018571, Loss: 4.852050, Time/step: 5.051291
2022-05-18 18:33:04,678:INFO: Epoch: 2/100, Step: 70/70, Lr: 0.000020000, Loss: 4.852046, Time/step: 4.845736
2022-05-18 18:33:04,841:INFO: Epoch 2/100 Finished, Train Loss: 4.866794
2022-05-18 18:33:42,105:INFO: Model saved to ckpts/ckpt_msrvtt_retrieval_looseType/pytorch_model.bin.1
2022-05-18 18:33:42,116:INFO: Optimizer saved to ckpts/ckpt_msrvtt_retrieval_looseType/pytorch_opt.bin.1
2022-05-18 18:33:42,125:INFO: Eval on val dataset
2022-05-18 18:39:58,879:INFO: sim matrix size: 1000, 1000
2022-05-18 18:39:58,995:INFO: Length-T: 1000, Length-V:1000
2022-05-18 18:39:59,004:INFO: Text-to-Video:
2022-05-18 18:39:59,013:INFO: >>> R@1: 0.1 - R@5: 0.5 - R@10: 1.0 - Median R: 498.0 - Mean R: 501.2
2022-05-18 18:39:59,022:INFO: Video-to-Text:
2022-05-18 18:39:59,046:INFO: >>> V2T$R@1: 0.1 - V2T$R@5: 0.5 - V2T$R@10: 0.9 - V2T$Median R: 495.5 - V2T$Mean R: 496.0
2022-05-18 18:39:59,057:INFO: The best model is: ckpts/ckpt_msrvtt_retrieval_looseType/pytorch_model.bin.0, the R1 is: 0.0999
2022-05-18 18:49:20,642:INFO: Model saved to ckpts/ckpt_msrvtt_retrieval_looseType/pytorch_model.bin.1
2022-05-18 18:49:20,653:INFO: Optimizer saved to ckpts/ckpt_msrvtt_retrieval_looseType/pytorch_opt.bin.1
2022-05-18 18:49:20,662:INFO: Eval on val dataset
2022-05-18 18:55:33,331:INFO: sim matrix size: 1000, 1000
2022-05-18 18:55:33,451:INFO: Length-T: 1000, Length-V:1000
2022-05-18 18:55:33,461:INFO: Text-to-Video:
2022-05-18 18:55:33,470:INFO: >>> R@1: 0.1 - R@5: 0.5 - R@10: 1.0 - Median R: 498.0 - Mean R: 501.2
2022-05-18 18:55:33,479:INFO: Video-to-Text:
2022-05-18 18:55:33,489:INFO: >>> V2T$R@1: 0.1 - V2T$R@5: 0.5 - V2T$R@10: 0.9 - V2T$Median R: 495.5 - V2T$Mean R: 496.0
2022-05-18 23:49:55,059:INFO: Epoch: 26/100, Step: 10/70, Lr: 0.000085196, Loss: 4.852087, Time/step: 5.116916
2022-05-18 23:50:43,774:INFO: Epoch: 26/100, Step: 20/70, Lr: 0.000085037, Loss: 4.852105, Time/step: 4.870379
2022-05-18 23:51:33,284:INFO: Epoch: 26/100, Step: 30/70, Lr: 0.000084876, Loss: 4.852119, Time/step: 4.955218
2022-05-18 23:52:23,826:INFO: Epoch: 26/100, Step: 40/70, Lr: 0.000084715, Loss: 4.852106, Time/step: 5.059421
2022-05-18 23:53:14,460:INFO: Epoch: 26/100, Step: 50/70, Lr: 0.000084553, Loss: 4.852108, Time/step: 5.071075
2022-05-18 23:54:03,428:INFO: Epoch: 26/100, Step: 60/70, Lr: 0.000084391, Loss: 4.852115, Time/step: 4.895767
2022-05-18 23:54:52,247:INFO: Epoch: 26/100, Step: 70/70, Lr: 0.000084227, Loss: 4.852139, Time/step: 4.884241
2022-05-18 23:54:52,422:INFO: Epoch 26/100 Finished, Train Loss: 4.852116
2022-05-18 23:55:31,717:INFO: Model saved to ckpts/ckpt_msrvtt_retrieval_looseType/pytorch_model.bin.25
2022-05-18 23:55:31,726:INFO: Optimizer saved to ckpts/ckpt_msrvtt_retrieval_looseType/pytorch_opt.bin.25
2022-05-18 23:55:31,735:INFO: Eval on val dataset
2022-05-19 00:01:44,568:INFO: sim matrix size: 1000, 1000
2022-05-19 00:01:44,682:INFO: Length-T: 1000, Length-V:1000
2022-05-19 00:01:44,692:INFO: Text-to-Video:
2022-05-19 00:01:44,701:INFO: >>> R@1: 0.1 - R@5: 0.5 - R@10: 1.0 - Median R: 497.0 - Mean R: 499.2
2022-05-19 00:01:44,710:INFO: Video-to-Text:
2022-05-19 00:01:44,720:INFO: >>> V2T$R@1: 0.1 - V2T$R@5: 0.5 - V2T$R@10: 0.9 - V2T$Median R: 493.0 - V2T$Mean R: 497.7
2022-05-19 00:01:44,739:INFO: The best model is: ckpts/ckpt_msrvtt_retrieval_looseType/pytorch_model.bin.8, the R1 is: 0.2913
2022-05-19 00:01:48,270:INFO: sim matrix size: 1000, 1000
2022-05-19 00:01:48,385:INFO: Length-T: 1000, Length-V:1000
2022-05-19 00:01:48,394:INFO: Text-to-Video:
2022-05-19 00:01:48,403:INFO: >>> R@1: 0.1 - R@5: 0.5 - R@10: 1.0 - Median R: 497.0 - Mean R: 499.2
2022-05-19 00:01:48,412:INFO: Video-to-Text:
2022-05-19 00:01:48,421:INFO: >>> V2T$R@1: 0.1 - V2T$R@5: 0.5 - V2T$R@10: 0.9 - V2T$Median R: 493.0 - V2T$Mean R: 497.7
2022-05-19 00:01:48,431:INFO: The best model is: ckpts/ckpt_msrvtt_retrieval_looseType/pytorch_model.bin.8, the R1 is: 0.2913
2022-05-19 00:02:40,958:INFO: Epoch: 27/100, Step: 10/70, Lr: 0.000084063, Loss: 4.852103, Time/step: 5.179591
2022-05-19 00:03:30,785:INFO: Epoch: 27/100, Step: 20/70, Lr: 0.000083899, Loss: 4.852144, Time/step: 4.981609
2022-05-19 00:04:18,871:INFO: Epoch: 27/100, Step: 30/70, Lr: 0.000083734, Loss: 4.852153, Time/step: 4.807607
2022-05-19 00:05:09,239:INFO: Epoch: 27/100, Step: 40/70, Lr: 0.000083568, Loss: 4.852136, Time/step: 5.034721
2022-05-19 00:05:59,481:INFO: Epoch: 27/100, Step: 50/70, Lr: 0.000083401, Loss: 4.852140, Time/step: 5.023072
2022-05-19 00:06:48,757:INFO: Epoch: 27/100, Step: 60/70, Lr: 0.000083234, Loss: 4.852110, Time/step: 4.926563
2022-05-19 00:07:39,039:INFO: Epoch: 27/100, Step: 70/70, Lr: 0.000083066, Loss: 4.852122, Time/step: 5.027252
2022-05-19 00:07:39,214:INFO: Epoch 27/100 Finished, Train Loss: 4.852127
2022-05-19 00:08:17,912:INFO: Model saved to ckpts/ckpt_msrvtt_retrieval_looseType/pytorch_model.bin.26
2022-05-19 00:08:17,921:INFO: Optimizer saved to ckpts/ckpt_msrvtt_retrieval_looseType/pytorch_opt.bin.26
2022-05-19 00:08:17,931:INFO: Eval on val dataset
2022-05-19 00:14:28,018:INFO: sim matrix size: 1000, 1000
2022-05-19 00:14:28,132:INFO: Length-T: 1000, Length-V:1000
2022-05-19 00:14:28,143:INFO: Text-to-Video:
2022-05-19 00:14:28,158:INFO: >>> R@1: 0.1 - R@5: 0.5 - R@10: 1.0 - Median R: 495.0 - Mean R: 496.6
2022-05-19 00:14:28,174:INFO: Video-to-Text:
2022-05-19 00:14:28,183:INFO: >>> V2T$R@1: 0.1 - V2T$R@5: 0.5 - V2T$R@10: 0.9 - V2T$Median R: 505.5 - V2T$Mean R: 501.8
2022-05-19 00:14:28,194:INFO: The best model is: ckpts/ckpt_msrvtt_retrieval_looseType/pytorch_model.bin.8, the R1 is: 0.2913
2022-05-19 00:14:32,927:INFO: sim matrix size: 1000, 1000
2022-05-19 00:14:33,040:INFO: Length-T: 1000, Length-V:1000
2022-05-19 00:14:33,049:INFO: Text-to-Video:
Hi @AAUfoa, I do not think your hyper-parameters are the same as ours. At a glance:
1) Fine-tuning does not need 100 epochs; we use 5 epochs in our experiments.
2) We set coef_lr to 1e-3.
The fine-tuning process is sensitive to these hyper-parameters. More details can be found in our paper.
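For reference, coef_lr scales down the learning rate of the pretrained CLIP weights relative to the newly added layers. A minimal sketch of that grouping (the name-prefix test and function name are assumptions for illustration, not the repo's exact code):

```python
def build_param_groups(named_params, base_lr, coef_lr):
    """Split parameters into pretrained CLIP weights vs. new layers.

    CLIP weights get base_lr * coef_lr; everything else gets base_lr.
    `named_params` is an iterable of (name, parameter) pairs, as produced by
    a model's named_parameters().
    """
    clip_params, other_params = [], []
    for name, param in named_params:
        # Assumption: pretrained weights live under a "clip." prefix.
        (clip_params if name.startswith("clip.") else other_params).append(param)
    return [
        {"params": clip_params, "lr": base_lr * coef_lr},   # pretrained backbone
        {"params": other_params, "lr": base_lr},            # newly added modules
    ]
```

With lr = 1e-4 and coef_lr = 1e-3, the CLIP backbone then trains at roughly 1e-7, which is consistent with the tiny Lr values (around 7e-8) visible in the later, successful log.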
I am not sure what is wrong with the parallel inference.
Thanks,
I adjusted the parameters; the loss now decreases and the reported results are roughly reproduced. Here is the log:
05/20/2022 17:56:47 - INFO - The best model is: ckpts/ckpt_msrvtt_retrieval_looseType/pytorch_model.bin.2, the R1 is: 39.1000
05/20/2022 17:57:40 - INFO - Epoch: 4/10, Step: 10/70, Lr: 0.000000078, Loss: 0.895030, Time/step: 4.597697
05/20/2022 17:58:29 - INFO - Epoch: 4/10, Step: 20/70, Lr: 0.000000076, Loss: 0.823528, Time/step: 4.913576
05/20/2022 17:59:17 - INFO - Epoch: 4/10, Step: 30/70, Lr: 0.000000074, Loss: 1.059354, Time/step: 4.766794
05/20/2022 18:00:07 - INFO - Epoch: 4/10, Step: 40/70, Lr: 0.000000072, Loss: 0.845383, Time/step: 5.004029
05/20/2022 18:00:56 - INFO - Epoch: 4/10, Step: 50/70, Lr: 0.000000070, Loss: 0.942577, Time/step: 4.901309
05/20/2022 18:01:45 - INFO - Epoch: 4/10, Step: 60/70, Lr: 0.000000068, Loss: 0.861079, Time/step: 4.896887
05/20/2022 18:02:31 - INFO - Epoch: 4/10, Step: 70/70, Lr: 0.000000065, Loss: 0.863185, Time/step: 4.618109
05/20/2022 18:02:31 - INFO - Epoch 4/10 Finished, Train Loss: 0.942608
05/20/2022 18:03:10 - INFO - Model saved to ckpts/ckpt_msrvtt_retrieval_looseType/pytorch_model.bin.3
05/20/2022 18:03:10 - INFO - Optimizer saved to ckpts/ckpt_msrvtt_retrieval_looseType/pytorch_opt.bin.3
05/20/2022 18:03:10 - INFO - Eval on val dataset
05/20/2022 18:09:01 - INFO - sim matrix size: 1000, 1000
05/20/2022 18:09:01 - INFO - Length-T: 1000, Length-V:1000
05/20/2022 18:09:01 - INFO - Text-to-Video:
05/20/2022 18:09:01 - INFO - >>> R@1: 41.4 - R@5: 67.9 - R@10: 78.3 - Median R: 2.0 - Mean R: 16.2
05/20/2022 18:09:01 - INFO - Video-to-Text:
05/20/2022 18:09:01 - INFO - >>> V2T$R@1: 40.8 - V2T$R@5: 69.8 - V2T$R@10: 79.1 - V2T$Median R: 2.0 - V2T$Mean R: 12.8
05/20/2022 18:09:01 - INFO - The best model is: ckpts/ckpt_msrvtt_retrieval_looseType/pytorch_model.bin.3, the R1 is: 41.4000
05/20/2022 18:09:58 - INFO - Epoch: 5/10, Step: 10/70, Lr: 0.000000063, Loss: 0.840977, Time/step: 5.674717
05/20/2022 18:10:49 - INFO - Epoch: 5/10, Step: 20/70, Lr: 0.000000061, Loss: 0.870136, Time/step: 5.116187
05/20/2022 18:11:37 - INFO - Epoch: 5/10, Step: 30/70, Lr: 0.000000059, Loss: 1.019610, Time/step: 4.853574
05/20/2022 18:12:29 - INFO - Epoch: 5/10, Step: 40/70, Lr: 0.000000057, Loss: 0.837152, Time/step: 5.183086
05/20/2022 18:13:20 - INFO - Epoch: 5/10, Step: 50/70, Lr: 0.000000054, Loss: 0.743972, Time/step: 5.097485
05/20/2022 18:14:11 - INFO - Epoch: 5/10, Step: 60/70, Lr: 0.000000052, Loss: 0.763641, Time/step: 5.043708
05/20/2022 18:14:55 - INFO - Epoch: 5/10, Step: 70/70, Lr: 0.000000050, Loss: 0.929697, Time/step: 4.430235
05/20/2022 18:14:55 - INFO - Epoch 5/10 Finished, Train Loss: 0.884258
05/20/2022 18:15:34 - INFO - Model saved to ckpts/ckpt_msrvtt_retrieval_looseType/pytorch_model.bin.4
05/20/2022 18:15:34 - INFO - Optimizer saved to ckpts/ckpt_msrvtt_retrieval_looseType/pytorch_opt.bin.4
05/20/2022 18:15:34 - INFO - Eval on val dataset
05/20/2022 18:21:26 - INFO - sim matrix size: 1000, 1000
05/20/2022 18:21:26 - INFO - Length-T: 1000, Length-V:1000
05/20/2022 18:21:26 - INFO - Text-to-Video:
05/20/2022 18:21:26 - INFO - >>> R@1: 40.5 - R@5: 69.9 - R@10: 79.4 - Median R: 2.0 - Mean R: 16.1
05/20/2022 18:21:26 - INFO - Video-to-Text:
05/20/2022 18:21:26 - INFO - >>> V2T$R@1: 41.6 - V2T$R@5: 71.5 - V2T$R@10: 80.4 - V2T$Median R: 2.0 - V2T$Mean R: 12.3
05/20/2022 18:21:26 - INFO - The best model is: ckpts/ckpt_msrvtt_retrieval_looseType/pytorch_model.bin.3,
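For anyone wondering how the numbers in these logs are produced: R@K counts how often the ground-truth item ranks in the top K of its row of the similarity matrix, and Median/Mean R summarize those ranks. A small pure-Python sketch (assuming the ground truth for query i is item i; the repo's actual implementation may differ in details such as tie-breaking and the median of an even-length list):

```python
def retrieval_metrics(sim):
    """sim[i][j]: similarity of text i to video j; ground truth on the diagonal."""
    ranks = []
    for i, row in enumerate(sim):
        # rank 0 means the correct video scored highest for this caption
        order = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        ranks.append(order.index(i))
    n = len(ranks)
    return {
        "R@1": 100.0 * sum(r < 1 for r in ranks) / n,
        "R@5": 100.0 * sum(r < 5 for r in ranks) / n,
        "R@10": 100.0 * sum(r < 10 for r in ranks) / n,
        "MedianR": sorted(ranks)[n // 2] + 1,   # reported 1-based
        "MeanR": sum(ranks) / n + 1,
    }
```

So a healthy run shows Median R near 1-2 (as in the 41.4 R@1 log), while a diverged run shows Median R near 500 on a 1000-item set, i.e. the ranking is no better than random.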
Two machines with 16 GPUs in total; below is the process info on one machine. The problem is still unsolved, which puzzles me. Could you check the situation on a machine on your side?
root 42166 21.8 0.8 482005316 2110888 pts/0 Sl 18:27 0:44 /usr/bin/python -u my_main_task_retrieval.py --local_ran
root 42224 22.3 0.8 481992768 2079488 pts/0 Sl 18:27 0:45 /usr/bin/python -u my_main_task_retrieval.py --local_ran
root 42225 23.6 0.8 481979304 2104892 pts/0 Sl 18:27 0:48 /usr/bin/python -u my_main_task_retrieval.py --local_ran
root 42227 20.5 1.1 482698336 2787712 pts/0 Sl 18:27 0:42 /usr/bin/python -u my_main_task_retrieval.py --local_ran
root 42228 19.6 1.1 482698336 2783744 pts/0 Sl 18:27 0:40 /usr/bin/python -u my_main_task_retrieval.py --local_ran
root 42230 22.9 1.1 482679488 2783324 pts/0 Sl 18:27 0:47 /usr/bin/python -u my_main_task_retrieval.py --local_ran
root 42231 21.4 1.1 482679492 2782736 pts/0 Sl 18:27 0:44 /usr/bin/python -u my_main_task_retrieval.py --local_ran
root 50220 0.0 0.0 12868 3056 pts/2 Ss+ May16 0:00 bash
root 50716 0.0 0.0 1607428 163784 pts/0 S 17:19 0:00 python -m torch.distributed.launch --nproc_per_node=8 --
root 50729 87.0 1.3 479805560 3355884 pts/0 Sl 17:19 61:50 /usr/bin/python -u my_main_task_retrieval.py --local_ran
root 50730 98.6 1.3 479788428 3326112 pts/0 Rl 17:19 70:04 /usr/bin/python -u my_main_task_retrieval.py --local_ran
root 50731 95.9 1.0 479054636 2569396 pts/0 Rl 17:19 68:11 /usr/bin/python -u my_main_task_retrieval.py --local_ran
root 50732 93.6 1.0 479039996 2557408 pts/0 Rl 17:19 66:33 /usr/bin/python -u my_main_task_retrieval.py --local_ran
root 50733 98.2 1.3 479788428 3303080 pts/0 Rl 17:19 69:50 /usr/bin/python -u my_main_task_retrieval.py --local_ran
root 50734 97.5 1.3 479769584 3286080 pts/0 Rl 17:19 69:18 /usr/bin/python -u my_main_task_retrieval.py --local_ran
root 50735 94.9 1.3 479769580 3285680 pts/0 Rl 17:19 67:29 /usr/bin/python -u my_main_task_retrieval.py --local_ran
Hi @AAUfoa, we tested the multi-node setting before and everything was OK, so I am not sure what is going wrong across your machines.
Unfortunately, I have no multi-node cluster to test it now :(.
Thank you for your timely help :)
From my recent experience, this is a fantastic public project for beginners getting into multi-modal work. We have also tried some other projects similar to yours (which also cite this work), and I think this is the best one: not only for the ease of use and the timely, patient help, but also for the clear design of the network, which can be transferred to other tasks as well. Just out of curiosity, I could not find it in any conference or journal. Why not submit it? Good luck!
Hi @AAUfoa, thanks for your attention. The paper is still under review (journal). Best~
@AAUfoa Hello, sorry to bother you. Do you know where to download the weights of CLIP4Clip? I have not found them. Thanks!
How do you adjust the hyperparameters? Did you lower the learning rate? I'm also encountering issues with the loss not converging when training on a new dataset.
Also, I am running distributed training on 2 machines with 8 GPUs each. As I understand it, each machine should run 8 processes, 16 in total (matching the GPU count). But when running this project, each machine first launches 8 processes and then spawns another 8 child processes, so there are 16 processes per machine, 8 of which are sleeping. I have not been able to solve this; has anyone else run into the same problem?
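A common source of such "extra" sleeping processes is workers forked by the training processes themselves (for example data-loading or cached-feature evaluation helpers), which idle between tasks. A toy illustration of a parent forking an idle helper (purely illustrative, not this project's code):

```python
import multiprocessing as mp
import os


def helper(queue):
    # Child process: reports its PID, then exits. A real evaluation or
    # data-loading worker would instead sleep waiting for work, which is
    # exactly what shows up as an extra "S"-state process in `ps`.
    queue.put(os.getpid())


def spawn_helper():
    """Fork one helper; return (parent_pid, child_pid) to show they differ."""
    queue = mp.Queue()
    proc = mp.Process(target=helper, args=(queue,))
    proc.start()
    child_pid = queue.get()
    proc.join()
    return os.getpid(), child_pid


if __name__ == "__main__":
    parent, child = spawn_helper()
    print(parent, child)  # two distinct PIDs, though only the parent does GPU work
```

So 8 training processes per machine each forking one helper would look exactly like the 8 running + 8 sleeping processes described above; checking the parent PID of the sleeping processes should confirm whether this is the case.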