ArrowLuo / CLIP4Clip

An official implementation for "CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval"
https://arxiv.org/abs/2104.08860
MIT License

New issue about NCCL: torch.distributed.elastic.multiprocessing.errors.ChildFailedError #78

Closed celestialxevermore closed 2 years ago

celestialxevermore commented 2 years ago

Dear author, I hope you have a good day.

Thanks to your generous help, I have almost gotten training to work, but I have run into another issue.

Before asking for your help again, here is my setup.

  1. GPU server

     +-------------------------------+----------------------+----------------------+
     |   4  NVIDIA RTX A6000    On   | 00000000:D1:00.0 Off |                  Off |
     | 30%   31C    P8    16W / 300W |      3MiB / 49140MiB |      0%      Default |
     |                               |                      |                  N/A |
     +-------------------------------+----------------------+----------------------+
     |   5  NVIDIA RTX A6000    On   | 00000000:D5:00.0 Off |                  Off |
     | 30%   32C    P8    23W / 300W |      3MiB / 49140MiB |      0%      Default |
     |                               |                      |                  N/A |
     +-------------------------------+----------------------+----------------------+

I have only these two GPUs, so I set --nproc_per_node=2.

  2. Running code

     DATA_PATH=/home/key2317/CLIP4Clip/msvd_data
     python -m torch.distributed.launch --nproc_per_node=2 \
     main_task_retrieval.py --do_train --num_thread_reader=2 \
     --epochs=5 --batch_size=128 --n_display=50 \
     --data_path ${DATA_PATH} \
     --features_path ${DATA_PATH}/MSVD_Videos \
     --output_dir ckpts/ckpt_msvd_retrieval_looseType \
     --lr 1e-4 --max_words 32 --max_frames 12 --batch_size_val 16 \
     --datatype msvd \
     --feature_framerate 1 --coef_lr 1e-3 \
     --freeze_layer_num 0 --slice_framepos 2 \
     --loose_type --linear_patch 2d --sim_header meanP \
     --pretrained_clip_name ViT-B/32

  3. torch and CUDA versions: torch 1.11.0, cudatoolkit 11.3.1. I did not see any version mismatch.

  4. Error log

(CLIP4Clip) key2317@super:~/CLIP4Clip$ DATA_PATH=/home/key2317/CLIP4Clip/msvd_data
(CLIP4Clip) key2317@super:~/CLIP4Clip_$ python -m torch.distributed.launch --nproc_per_node=2 \
main_task_retrieval.py --do_train --num_thread_reader=2 \
--epochs=5 --batch_size=128 --n_display=50 \
--data_path ${DATA_PATH} \
--features_path ${DATA_PATH}/MSVD_Videos \
--output_dir ckpts/ckpt_msvd_retrieval_looseType \
--lr 1e-4 --max_words 32 --max_frames 12 --batch_size_val 16 \
--datatype msvd \
--feature_framerate 1 --coef_lr 1e-3 \
--freeze_layer_num 0 --slice_framepos 2 \
--loose_type --linear_patch 2d --sim_header meanP \
--pretrained_clip_name ViT-B/32

/home/key2317/anaconda3/envs/CLIP4Clip_/lib/python3.8/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated and will be removed in future. Use torchrun. Note that --use_env is set by default in torchrun. If your script expects --local_rank argument to be set, please change it to read from os.environ['LOCAL_RANK'] instead. See https://pytorch.org/docs/stable/distributed.html#launch-utility for further instructions

warnings.warn(
WARNING:torch.distributed.run:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


06/09/2022 02:14:54 - INFO - device: cuda:1 n_gpu: 2
06/09/2022 02:14:54 - INFO - Effective parameters:
06/09/2022 02:14:54 - INFO - <<< batch_size: 128
06/09/2022 02:14:54 - INFO - <<< batch_size_val: 16
06/09/2022 02:14:54 - INFO - <<< cache_dir:
06/09/2022 02:14:54 - INFO - <<< coef_lr: 0.001
06/09/2022 02:14:54 - INFO - <<< cross_model: cross-base
06/09/2022 02:14:54 - INFO - <<< cross_num_hidden_layers: 4
06/09/2022 02:14:54 - INFO - <<< data_path: /home/key2317/CLIP4Clip/msvd_data
06/09/2022 02:14:54 - INFO - <<< datatype: msvd
06/09/2022 02:14:54 - INFO - <<< do_eval: False
06/09/2022 02:14:54 - INFO - <<< do_lower_case: False
06/09/2022 02:14:54 - INFO - <<< do_pretrain: False
06/09/2022 02:14:54 - INFO - <<< do_train: True
06/09/2022 02:14:54 - INFO - <<< epochs: 5
06/09/2022 02:14:54 - INFO - <<< eval_frame_order: 0
06/09/2022 02:14:54 - INFO - <<< expand_msrvtt_sentences: False
06/09/2022 02:14:54 - INFO - <<< feature_framerate: 1
06/09/2022 02:14:54 - INFO - <<< features_path: /home/key2317/CLIP4Clip/msvd_data/MSVD_Videos
06/09/2022 02:14:54 - INFO - <<< fp16: False
06/09/2022 02:14:54 - INFO - <<< fp16_opt_level: O1
06/09/2022 02:14:54 - INFO - <<< freeze_layer_num: 0
06/09/2022 02:14:54 - INFO - <<< gradient_accumulation_steps: 1
06/09/2022 02:14:54 - INFO - <<< hard_negative_rate: 0.5
06/09/2022 02:14:54 - INFO - <<< init_model: None
06/09/2022 02:14:54 - INFO - <<< linear_patch: 2d
06/09/2022 02:14:54 - INFO - <<< local_rank: 0
06/09/2022 02:14:54 - INFO - <<< loose_type: True
06/09/2022 02:14:54 - INFO - <<< lr: 0.0001
06/09/2022 02:14:54 - INFO - <<< lr_decay: 0.9
06/09/2022 02:14:54 - INFO - <<< margin: 0.1
06/09/2022 02:14:54 - INFO - <<< max_frames: 12
06/09/2022 02:14:54 - INFO - <<< max_words: 32
06/09/2022 02:14:54 - INFO - <<< n_display: 50
06/09/2022 02:14:54 - INFO - <<< n_gpu: 1
06/09/2022 02:14:54 - INFO - <<< n_pair: 1
06/09/2022 02:14:54 - INFO - <<< negative_weighting: 1
06/09/2022 02:14:54 - INFO - <<< num_thread_reader: 2
06/09/2022 02:14:54 - INFO - <<< output_dir: ckpts/ckpt_msvd_retrieval_looseType
06/09/2022 02:14:54 - INFO - <<< pretrained_clip_name: ViT-B/32
06/09/2022 02:14:54 - INFO - <<< rank: 0
06/09/2022 02:14:54 - INFO - <<< resume_model: None
06/09/2022 02:14:54 - INFO - <<< sampled_use_mil: False
06/09/2022 02:14:54 - INFO - <<< seed: 42
06/09/2022 02:14:54 - INFO - <<< sim_header: meanP
06/09/2022 02:14:54 - INFO - <<< slice_framepos: 2
06/09/2022 02:14:54 - INFO - <<< task_type: retrieval
06/09/2022 02:14:54 - INFO - <<< text_num_hidden_layers: 12
06/09/2022 02:14:54 - INFO - <<< train_csv: data/.train.csv
06/09/2022 02:14:54 - INFO - <<< train_frame_order: 0
06/09/2022 02:14:54 - INFO - <<< use_mil: False
06/09/2022 02:14:54 - INFO - <<< val_csv: data/.val.csv
06/09/2022 02:14:54 - INFO - <<< video_dim: 1024
06/09/2022 02:14:54 - INFO - <<< visual_num_hidden_layers: 12
06/09/2022 02:14:54 - INFO - <<< warmup_proportion: 0.1
06/09/2022 02:14:54 - INFO - <<< world_size: 2
06/09/2022 02:14:54 - INFO - device: cuda:0 n_gpu: 2
<<<<<<<<<<<<<<<<<<<<<<<<<>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
<<<<<<<<<<<<<<<<<<<<<<<<<>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
06/09/2022 02:14:55 - INFO - loading archive file /home/key2317/CLIP4Clip/modules/cross-base
06/09/2022 02:14:55 - INFO - Model config {
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 512,
  "initializer_range": 0.02,
  "intermediate_size": 2048,
  "max_position_embeddings": 128,
  "num_attention_heads": 8,
  "num_hidden_layers": 4,
  "type_vocab_size": 2,
  "vocab_size": 512
}

06/09/2022 02:14:55 - INFO - Weight doesn't exsits. /home/key2317/CLIP4Clip_/modules/cross-base/cross_pytorch_model.bin
06/09/2022 02:14:55 - WARNING - Stage-One:True, Stage-Two:False
06/09/2022 02:14:55 - WARNING - Test retrieval by loose type.
06/09/2022 02:14:55 - WARNING - embed_dim: 512
06/09/2022 02:14:55 - WARNING - image_resolution: 224
06/09/2022 02:14:55 - WARNING - vision_layers: 12
06/09/2022 02:14:55 - WARNING - vision_width: 768
06/09/2022 02:14:55 - WARNING - vision_patch_size: 32
06/09/2022 02:14:55 - WARNING - context_length: 77
06/09/2022 02:14:55 - WARNING - vocab_size: 49408
06/09/2022 02:14:55 - WARNING - transformer_width: 512
06/09/2022 02:14:55 - WARNING - transformer_heads: 8
06/09/2022 02:14:55 - WARNING - transformer_layers: 12
06/09/2022 02:14:55 - WARNING - linear_patch: 2d
06/09/2022 02:14:55 - WARNING - cut_top_layer: 0
<<<<<<<<<<<<<<<<<<<<<<<>>>>>>>>>>>>>>>>>>>>>>>>>>>>
<<<<<<<<<<<<<<<<<<<<<<<>>>>>>>>>>>>>>>>>>>>>>>>>>>>
06/09/2022 02:14:57 - WARNING - sim_header: meanP
<<<<<<<<<<<<<<<<<<<<<< before to device >>>>>>>>>>>>>>>>>>>>>>>>>>
06/09/2022 02:15:03 - INFO - --------------------
06/09/2022 02:15:03 - INFO - Weights from pretrained model not used in CLIP4Clip: clip.input_resolution clip.context_length clip.vocab_size
<<<<<<<<<<<<<<<<<<<<<< before to device >>>>>>>>>>>>>>>>>>>>>>>>>>
<<<<<<<<<<<<<<<<<<<<<< after to device >>>>>>>>>>>>>>>>>>>>>>>>>>
For test, sentence number: 27763
For test, video number: 670
Video number: 670
Total Paire: 27763
/home/key2317/anaconda3/envs/CLIP4Clip/lib/python3.8/site-packages/torchvision/transforms/transforms.py:332: UserWarning: Argument interpolation should be of type InterpolationMode instead of int. Please, use InterpolationMode enum.
  warnings.warn(
<<<<<<<<<<<<<<<<<<<<<< after to device >>>>>>>>>>>>>>>>>>>>>>>>>>
For val, sentence number: 4290
For val, video number: 100
Video number: 100
Total Paire: 4290
For test, sentence number: 27763
For test, video number: 670
Video number: 670
Total Paire: 27763
/home/key2317/anaconda3/envs/CLIP4Clip_/lib/python3.8/site-packages/torchvision/transforms/transforms.py:332: UserWarning: Argument interpolation should be of type InterpolationMode instead of int. Please, use InterpolationMode enum.
  warnings.warn(
Video number: 1200
Total Paire: 48774
<<<<<<<<<<<<<<<<<<<<<< prep_optimizer >>>>>>>>>>>>>>>>>>>>>>>>>>>>
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< local rank : [1],1 >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
For val, sentence number: 4290
For val, video number: 100
Video number: 100
Total Paire: 4290
06/09/2022 02:15:06 - INFO - Running test
06/09/2022 02:15:06 - INFO - Num examples = 27763
06/09/2022 02:15:06 - INFO - Batch size = 16
06/09/2022 02:15:06 - INFO - Num steps = 1736
06/09/2022 02:15:06 - INFO - Running val
06/09/2022 02:15:06 - INFO - Num examples = 4290
Video number: 1200
Total Paire: 48774
<<<<<<<<<<<<<<<<<<<<<< prep_optimizer >>>>>>>>>>>>>>>>>>>>>>>>>>>>
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< local rank : [0],0 >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
<<<<<<<<<<<<<<<<<<<<<<<<<<<<< ddp error >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
06/09/2022 02:15:07 - INFO - Running training
06/09/2022 02:15:07 - INFO - Num examples = 48774
06/09/2022 02:15:07 - INFO - Batch size = 128
06/09/2022 02:15:07 - INFO - Num steps = 1905
[E ProcessGroupNCCL.cpp:719] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1807965 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:719] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1808017 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:406] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 507951 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 1 (pid: 507952) of binary: /home/key2317/anaconda3/envs/CLIP4Clip/bin/python
Traceback (most recent call last):
  File "/home/key2317/anaconda3/envs/CLIP4Clip_/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/key2317/anaconda3/envs/CLIP4Clip/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/key2317/anaconda3/envs/CLIP4Clip/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/key2317/anaconda3/envs/CLIP4Clip/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/key2317/anaconda3/envs/CLIP4Clip/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/key2317/anaconda3/envs/CLIP4Clip_/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
    elastic_launch(
  File "/home/key2317/anaconda3/envs/CLIP4Clip/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/key2317/anaconda3/envs/CLIP4Clip/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

main_task_retrieval.py FAILED
-------------------------------------------------------
Failures:
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-06-09_02:45:20
  host      : super
  rank      : 1 (local_rank: 1)
  exitcode  : -6 (pid: 507952)
  error_file:
  traceback : Signal 6 (SIGABRT) received by PID 507952
=======================================================

I can see messages in the log such as

Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1807965 milliseconds before timing out.

and

Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.

so I suspect a problem on the GPU side, but I cannot figure out what is going wrong. The hand-made log lines of the form <<<<<<<<<<<<<<<<<<< >>>>>>>>>>>>>>>>>>>>> come from print('<<<>>') calls that I added to trace how far the program gets. Thank you, as always.

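
One way to narrow down a hung BROADCAST like the one in this log might be to rerun with NCCL's own logging enabled and with the GPUs pinned explicitly. The following is only a sketch under assumptions: it assumes the two A6000 cards are the devices with indices 4 and 5 from the nvidia-smi output, and it treats disabling peer-to-peer transfers as an experiment, not as a known fix for this issue. The helper name `nccl_debug_env` is hypothetical; `CUDA_VISIBLE_DEVICES`, `NCCL_DEBUG`, and `NCCL_P2P_DISABLE` are existing CUDA/NCCL environment variables.

```python
import os


def nccl_debug_env(gpu_ids, base_env=None):
    """Return a copy of the environment with NCCL diagnostics switched on.

    gpu_ids:  physical GPU indices as listed by nvidia-smi, e.g. [4, 5].
    base_env: environment to start from; defaults to os.environ.
    (Hypothetical helper, not part of CLIP4Clip or PyTorch.)
    """
    env = dict(os.environ if base_env is None else base_env)
    # Expose only the listed GPUs; inside the job they appear as cuda:0, cuda:1, ...
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    # Ask NCCL to log its initialization and the transports it selects.
    env["NCCL_DEBUG"] = "INFO"
    # Experiment (assumption): turn off direct GPU-to-GPU transfers to see
    # whether the collective still hangs without them.
    env["NCCL_P2P_DISABLE"] = "1"
    return env


if __name__ == "__main__":
    env = nccl_debug_env([4, 5])
    for key in ("CUDA_VISIBLE_DEVICES", "NCCL_DEBUG", "NCCL_P2P_DISABLE"):
        print(f"export {key}={env[key]}")
```

The printed `export` lines can be pasted into the shell before the launch command, or the returned dictionary can be passed as `env=` to `subprocess.run`. If the run gets past the first broadcast with peer-to-peer disabled, that would point toward the GPU interconnect and away from the training code.
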
ArrowLuo commented 2 years ago

Hi @celestialxevermore, I hope you have solved this problem, because I have no idea what causes it. You could try a lower torch version, e.g., 1.7.0, to test. Good luck~
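
When testing a different torch version, it may help to record which versions are actually loaded in the environment before and after the change. A minimal sketch, assuming nothing about the CLIP4Clip code itself: the function name `report_versions` is hypothetical, and it returns empty values when torch is not installed so that it can run anywhere.

```python
def report_versions():
    """Collect version details that are usually relevant to NCCL problems.

    Hypothetical helper; returns None values if torch cannot be imported.
    """
    info = {"torch": None, "cuda_build": None, "nccl": None, "gpu_count": 0}
    try:
        import torch
    except ImportError:
        return info

    info["torch"] = torch.__version__
    # CUDA version this PyTorch build was compiled against (None on CPU-only builds).
    info["cuda_build"] = torch.version.cuda
    if torch.cuda.is_available():
        info["gpu_count"] = torch.cuda.device_count()
        try:
            # NCCL version bundled with this PyTorch build.
            info["nccl"] = torch.cuda.nccl.version()
        except Exception:
            # Some builds ship without NCCL; leave the entry as None.
            pass
    return info


if __name__ == "__main__":
    for name, value in report_versions().items():
        print(f"{name}: {value}")
```

Running this once in each conda environment gives a small record to compare, which makes it easier to tell whether a version change is what made the difference.
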

weiwuxian1998 commented 1 year ago

I ran into the same issue, although the code trains normally on msrvtt. Have you solved the problem? I would highly appreciate it if you could share your solution with me, @celestialxevermore.

celestialxevermore commented 1 year ago

@weiwuxian1998 I'm not sure, because it has been a long time, but I tried matching the cudatoolkit version with the PyTorch version.
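
Matching the two versions can be checked mechanically by comparing the major.minor part of the CUDA version PyTorch was built with against the installed cudatoolkit version. The sketch below is illustrative only: the function names are hypothetical, and the example strings are assumptions apart from the cudatoolkit 11.3.1 mentioned earlier in this thread. In a real environment the first value would come from `torch.version.cuda`.

```python
def major_minor(version):
    """Return (major, minor) as integers from a string like '11.3.1'."""
    parts = version.strip().split(".")
    return int(parts[0]), int(parts[1])


def cuda_versions_match(torch_cuda, toolkit):
    """True if both version strings agree on major and minor numbers.

    Hypothetical helper: torch_cuda is what torch.version.cuda reports,
    toolkit is the installed cudatoolkit version.
    """
    return major_minor(torch_cuda) == major_minor(toolkit)


if __name__ == "__main__":
    # Assumed example: a PyTorch build reporting CUDA 11.3 next to cudatoolkit 11.3.1.
    print(cuda_versions_match("11.3", "11.3.1"))
    # Assumed example of a mismatch: a CUDA 10.2 build next to cudatoolkit 11.3.1.
    print(cuda_versions_match("10.2", "11.3.1"))
```

A matching pair does not rule out other causes of an NCCL timeout; it only removes one variable from the search.
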