Closed celestialxevermore closed 2 years ago
Hi @celestialxevermore, I hope you have solved this problem, because I have no idea what causes it. Maybe you can try a lower torch version, e.g., 1.7.0, to test. Good luck~
I ran into the same issue, although the code trains normally on MSRVTT. Have you solved the problem? I would highly appreciate it if you could share your method with me, @celestialxevermore.
@weiwuxian1998 I'm not sure because it's been too long, but I tried matching the cudatoolkit version to the PyTorch version.
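Since both suggestions in this thread come down to checking which PyTorch and CUDA versions are actually in use, here is a minimal sketch for printing them. The function name `describe_environment` is made up for illustration, and the script only reports versions; it does not tell you which combination fixes the hang.

```python
import importlib.util
import platform


def describe_environment():
    """Collect the version facts that matter when matching PyTorch to CUDA."""
    info = {"python": platform.python_version(), "torch": None}
    # Only import torch if it is actually installed in this environment.
    if importlib.util.find_spec("torch") is None:
        return info
    import torch

    info["torch"] = torch.__version__
    # CUDA version the torch binary was built against (None for CPU-only builds).
    info["cuda_build"] = torch.version.cuda
    info["cuda_available"] = torch.cuda.is_available()
    info["gpu_count"] = torch.cuda.device_count()
    return info


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

Running it inside the same conda environment used for training shows whether the installed torch build and the visible GPUs are what you expect.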
Dear author, I hope you are having a good day.
Thanks to your extensive help, I have almost succeeded in training, but I think there is another issue.
Before asking for your help again, here is my setup.
I have only two GPUs, so I set --nproc_per_node=2.
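With --nproc_per_node=2 the launcher starts two processes, and each one needs to know its local rank. As the FutureWarning in the log of this post says, torch.distributed.launch passes it as a --local_rank argument, while torchrun exposes it through the LOCAL_RANK environment variable. The sketch below reads either one. The helper name `resolve_local_rank` is hypothetical and is not taken from main_task_retrieval.py; accepting the `--local-rank` spelling as well is a leniency added here as an assumption.

```python
import argparse
import os


def resolve_local_rank(argv=None, environ=None):
    """Return the local rank from --local_rank if given, else from LOCAL_RANK."""
    environ = os.environ if environ is None else environ
    parser = argparse.ArgumentParser()
    # Both spellings map to args.local_rank (dest comes from the first option string).
    parser.add_argument("--local_rank", "--local-rank", type=int, default=None)
    # parse_known_args ignores the script's other flags such as --do_train.
    args, _unknown = parser.parse_known_args(argv)
    if args.local_rank is not None:
        return args.local_rank
    # Fall back to the environment variable; default to 0 for single-process runs.
    return int(environ.get("LOCAL_RANK", 0))


if __name__ == "__main__":
    print("local rank:", resolve_local_rank())
```

An explicit command-line value wins over the environment variable here, which is a design choice of this sketch rather than something either launcher requires.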
Running code.
DATA_PATH=/home/key2317/CLIP4Clip/msvd_data
python -m torch.distributed.launch --nproc_per_node=2 \
main_task_retrieval.py --do_train --num_thread_reader=2 \
--epochs=5 --batch_size=128 --n_display=50 \
--data_path ${DATA_PATH} \
--features_path ${DATA_PATH}/MSVD_Videos \
--output_dir ckpts/ckpt_msvd_retrieval_looseType \
--lr 1e-4 --max_words 32 --max_frames 12 --batch_size_val 16 \
--datatype msvd \
--feature_framerate 1 --coef_lr 1e-3 \
--freeze_layer_num 0 --slice_framepos 2 \
--loose_type --linear_patch 2d --sim_header meanP \
--pretrained_clip_name ViT-B/32
torch and CUDA versions: torch 1.11.0, cudatoolkit 11.3.1 -> I do not see a version mismatch here.
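The error log in this post ends with both ranks timing out on an NCCL BROADCAST after the 1800000 ms (30 minute) limit. One way to get more information about such a hang is to rerun with NCCL's own logging turned on, and, as an experiment, with peer-to-peer GPU transfers disabled. NCCL_DEBUG and NCCL_P2P_DISABLE are real NCCL environment variables; the helper name `nccl_debug_env` is made up for this sketch, and whether either setting changes the behaviour on this machine is unknown.

```python
import os


def nccl_debug_env(base_env=None, disable_p2p=False):
    """Return a copy of the environment with NCCL diagnostics enabled."""
    env = dict(os.environ if base_env is None else base_env)
    # Make NCCL print its initialisation and transport choices.
    env["NCCL_DEBUG"] = "INFO"
    if disable_p2p:
        # Experiment only: forces NCCL to avoid direct GPU-to-GPU (P2P) transfers.
        env["NCCL_P2P_DISABLE"] = "1"
    return env


if __name__ == "__main__":
    env = nccl_debug_env(disable_p2p=True)
    # The dict can be passed as env= to subprocess.run when launching training.
    for name in ("NCCL_DEBUG", "NCCL_P2P_DISABLE"):
        print(f"{name}={env[name]}")
```

The same two variables can also be exported in the shell before the launch command; the Python form is only convenient when the run is started from a script.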
Error Log.
(CLIP4Clip) key2317@super:~/CLIP4Clip$ DATAPATH=/home/key2317/CLIP4Clip/msvddata
(CLIP4Clip) key2317@super:~/CLIP4Clip_$ python -m torch.distributed.launch --nproc_per_node=2 \
/home/key2317/anaconda3/envs/CLIP4Clip_/lib/python3.8/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated and will be removed in future. Use torchrun. Note that --use_env is set by default in torchrun. If your script expects --local_rank argument to be set, please change it to read from os.environ['LOCAL_RANK'] instead. See https://pytorch.org/docs/stable/distributed.html#launch-utility for further instructions
warnings.warn(
WARNING:torch.distributed.run:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
06/09/2022 02:14:54 - INFO - device: cuda:1 n_gpu: 2
06/09/2022 02:14:54 - INFO - Effective parameters:
06/09/2022 02:14:54 - INFO - <<< batch_size: 128
06/09/2022 02:14:54 - INFO - <<< batch_size_val: 16
06/09/2022 02:14:54 - INFO - <<< cache_dir:
06/09/2022 02:14:54 - INFO - <<< coef_lr: 0.001
06/09/2022 02:14:54 - INFO - <<< cross_model: cross-base
06/09/2022 02:14:54 - INFO - <<< cross_num_hidden_layers: 4
06/09/2022 02:14:54 - INFO - <<< datapath: /home/key2317/CLIP4Clip/msvd_data
06/09/2022 02:14:54 - INFO - <<< datatype: msvd
06/09/2022 02:14:54 - INFO - <<< do_eval: False
06/09/2022 02:14:54 - INFO - <<< do_lower_case: False
06/09/2022 02:14:54 - INFO - <<< do_pretrain: False
06/09/2022 02:14:54 - INFO - <<< do_train: True
06/09/2022 02:14:54 - INFO - <<< epochs: 5
06/09/2022 02:14:54 - INFO - <<< eval_frame_order: 0
06/09/2022 02:14:54 - INFO - <<< expand_msrvtt_sentences: False
06/09/2022 02:14:54 - INFO - <<< feature_framerate: 1
06/09/2022 02:14:54 - INFO - <<< featurespath: /home/key2317/CLIP4Clip/msvd_data/MSVD_Videos
06/09/2022 02:14:54 - INFO - <<< fp16: False
06/09/2022 02:14:54 - INFO - <<< fp16_opt_level: O1
06/09/2022 02:14:54 - INFO - <<< freeze_layer_num: 0
06/09/2022 02:14:54 - INFO - <<< gradient_accumulation_steps: 1
06/09/2022 02:14:54 - INFO - <<< hard_negative_rate: 0.5
06/09/2022 02:14:54 - INFO - <<< init_model: None
06/09/2022 02:14:54 - INFO - <<< linear_patch: 2d
06/09/2022 02:14:54 - INFO - <<< local_rank: 0
06/09/2022 02:14:54 - INFO - <<< loose_type: True
06/09/2022 02:14:54 - INFO - <<< lr: 0.0001
06/09/2022 02:14:54 - INFO - <<< lr_decay: 0.9
06/09/2022 02:14:54 - INFO - <<< margin: 0.1
06/09/2022 02:14:54 - INFO - <<< max_frames: 12
06/09/2022 02:14:54 - INFO - <<< max_words: 32
06/09/2022 02:14:54 - INFO - <<< n_display: 50
06/09/2022 02:14:54 - INFO - <<< n_gpu: 1
06/09/2022 02:14:54 - INFO - <<< n_pair: 1
06/09/2022 02:14:54 - INFO - <<< negative_weighting: 1
06/09/2022 02:14:54 - INFO - <<< num_thread_reader: 2
06/09/2022 02:14:54 - INFO - <<< output_dir: ckpts/ckpt_msvd_retrieval_looseType
06/09/2022 02:14:54 - INFO - <<< pretrained_clip_name: ViT-B/32
06/09/2022 02:14:54 - INFO - <<< rank: 0
06/09/2022 02:14:54 - INFO - <<< resume_model: None
06/09/2022 02:14:54 - INFO - <<< sampled_use_mil: False
06/09/2022 02:14:54 - INFO - <<< seed: 42
06/09/2022 02:14:54 - INFO - <<< sim_header: meanP
06/09/2022 02:14:54 - INFO - <<< slice_framepos: 2
06/09/2022 02:14:54 - INFO - <<< task_type: retrieval
06/09/2022 02:14:54 - INFO - <<< text_num_hidden_layers: 12
06/09/2022 02:14:54 - INFO - <<< train_csv: data/.train.csv
06/09/2022 02:14:54 - INFO - <<< train_frame_order: 0
06/09/2022 02:14:54 - INFO - <<< use_mil: False
06/09/2022 02:14:54 - INFO - <<< val_csv: data/.val.csv
06/09/2022 02:14:54 - INFO - <<< video_dim: 1024
06/09/2022 02:14:54 - INFO - <<< visual_num_hidden_layers: 12
06/09/2022 02:14:54 - INFO - <<< warmup_proportion: 0.1
06/09/2022 02:14:54 - INFO - <<< world_size: 2
06/09/2022 02:14:54 - INFO - device: cuda:0 ngpu: 2
<<<<<<<<<<<<<<<<<<<<<<<<<>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
<<<<<<<<<<<<<<<<<<<<<<<<<>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
06/09/2022 02:14:55 - INFO - loading archive file /home/key2317/CLIP4Clip/modules/cross-base
06/09/2022 02:14:55 - INFO - Model config { "attention_probs_dropout_prob": 0.1, "hidden_act": "gelu", "hidden_dropout_prob": 0.1, "hidden_size": 512, "initializer_range": 0.02, "intermediate_size": 2048, "max_position_embeddings": 128, "num_attention_heads": 8, "num_hidden_layers": 4, "type_vocab_size": 2, "vocab_size": 512 }
06/09/2022 02:14:55 - INFO - Weight doesn't exsits. /home/key2317/CLIP4Clip_/modules/cross-base/cross_pytorch_model.bin
06/09/2022 02:14:55 - WARNING - Stage-One:True, Stage-Two:False
06/09/2022 02:14:55 - WARNING - Test retrieval by loose type.
06/09/2022 02:14:55 - WARNING - embed_dim: 512
06/09/2022 02:14:55 - WARNING - image_resolution: 224
06/09/2022 02:14:55 - WARNING - vision_layers: 12
06/09/2022 02:14:55 - WARNING - vision_width: 768
06/09/2022 02:14:55 - WARNING - vision_patch_size: 32
06/09/2022 02:14:55 - WARNING - context_length: 77
06/09/2022 02:14:55 - WARNING - vocab_size: 49408
06/09/2022 02:14:55 - WARNING - transformer_width: 512
06/09/2022 02:14:55 - WARNING - transformer_heads: 8
06/09/2022 02:14:55 - WARNING - transformer_layers: 12
06/09/2022 02:14:55 - WARNING - linear_patch: 2d
06/09/2022 02:14:55 - WARNING - cut_top_layer: 0
<<<<<<<<<<<<<<<<<<<<<<<>>>>>>>>>>>>>>>>>>>>>>>>>>>>
<<<<<<<<<<<<<<<<<<<<<<<>>>>>>>>>>>>>>>>>>>>>>>>>>>>
06/09/2022 02:14:57 - WARNING - sim_header: meanP
<<<<<<<<<<<<<<<<<<<<<< before to device >>>>>>>>>>>>>>>>>>>>>>>>>>
06/09/2022 02:15:03 - INFO - --------------------
06/09/2022 02:15:03 - INFO - Weights from pretrained model not used in CLIP4Clip:
clip.input_resolution
clip.context_length
clip.vocabsize
<<<<<<<<<<<<<<<<<<<<<< before to device >>>>>>>>>>>>>>>>>>>>>>>>>>
<<<<<<<<<<<<<<<<<<<<<< after to device >>>>>>>>>>>>>>>>>>>>>>>>>>
For test, sentence number: 27763
For test, video number: 670
Video number: 670
Total Paire: 27763
/home/key2317/anaconda3/envs/CLIP4Clip/lib/python3.8/site-packages/torchvision/transforms/transforms.py:332: UserWarning: Argument interpolation should be of type InterpolationMode instead of int. Please, use InterpolationMode enum.
warnings.warn(
<<<<<<<<<<<<<<<<<<<<<< after to device >>>>>>>>>>>>>>>>>>>>>>>>>>
For val, sentence number: 4290
For val, video number: 100
Video number: 100
Total Paire: 4290
For test, sentence number: 27763
For test, video number: 670
Video number: 670
Total Paire: 27763
/home/key2317/anaconda3/envs/CLIP4Clip_/lib/python3.8/site-packages/torchvision/transforms/transforms.py:332: UserWarning: Argument interpolation should be of type InterpolationMode instead of int. Please, use InterpolationMode enum.
warnings.warn(
Video number: 1200
Total Paire: 48774
<<<<<<<<<<<<<<<<<<<<<< prep_optimizer >>>>>>>>>>>>>>>>>>>>>>>>>>>>
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< local rank : [1],1 >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
For val, sentence number: 4290
For val, video number: 100
Video number: 100
Total Paire: 4290
06/09/2022 02:15:06 - INFO - Running test
06/09/2022 02:15:06 - INFO - Num examples = 27763
06/09/2022 02:15:06 - INFO - Batch size = 16
06/09/2022 02:15:06 - INFO - Num steps = 1736
06/09/2022 02:15:06 - INFO - Running val
06/09/2022 02:15:06 - INFO - Num examples = 4290
Video number: 1200
Total Paire: 48774
<<<<<<<<<<<<<<<<<<<<<< prep_optimizer >>>>>>>>>>>>>>>>>>>>>>>>>>>>
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< local rank : [0],0 >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
<<<<<<<<<<<<<<<<<<<<<<<<<<<<< ddp 오류 >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
06/09/2022 02:15:07 - INFO - Running training
06/09/2022 02:15:07 - INFO - Num examples = 48774
06/09/2022 02:15:07 - INFO - Batch size = 128
06/09/2022 02:15:07 - INFO - Num steps = 1905
[E ProcessGroupNCCL.cpp:719] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1807965 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:719] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1808017 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:406] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 507951 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) localrank: 1 (pid: 507952) of binary: /home/key2317/anaconda3/envs/CLIP4Clip/bin/python
Traceback (most recent call last):
File "/home/key2317/anaconda3/envs/CLIP4Clip_/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, mainglobals, None,
File "/home/key2317/anaconda3/envs/CLIP4Clip/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, runglobals)
File "/home/key2317/anaconda3/envs/CLIP4Clip/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in
main()
File "/home/key2317/anaconda3/envs/CLIP4Clip/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/home/key2317/anaconda3/envs/CLIP4Clip/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/home/key2317/anaconda3/envs/CLIP4Clip_/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
elasticlaunch(
File "/home/key2317/anaconda3/envs/CLIP4Clip/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in call
return launch_agent(self._config, self.entrypoint, list(args))
File "/home/key2317/anaconda3/envs/CLIP4Clip/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
main_task_retrieval.py FAILED
Failures: