ArrowLuo / CLIP4Clip

An official implementation for "CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval"
https://arxiv.org/abs/2104.08860
MIT License

torch.distributed.init_process_group(backend="nccl") error and some other errors #73

Closed celestialxevermore closed 2 years ago

celestialxevermore commented 2 years ago

Dear author,

Thank you for your help about four months ago.

I tried to reply to your earlier comment, but my issue had already been closed. I am very sorry about that.

Now I have another issue when running the framework on the MSVD dataset.

I had successfully run your framework several times, but these days I am hitting a new problem in my environment.

If you don't mind, could I ask for your help one more time?

My error logs are below; I have access to two GPU servers.

Each section below is the error log raised on one server.

If you need any further information, please leave me a comment.

Sincerely,

Server 16

(CLIP4Clip) key2317@super:~/CLIP4Clip$ main_task_retrieval.py DATA_PATH=/home/key2317/CLIP4Clip/msvd_data/ VISIBLE_DEVICES=3,4,0,5 python -m torch.distributed.launch --nproc_per_node=4 \
    main_task_retrieval.py --do_train --num_thread_reader=2 \
    --epochs=5 --batch_size=128 --n_display=50 \
    --data_path ${DATA_PATH} \
    --features_path ${DATA_PATH}/MSVD_Videos \
    --output_dir ckpts/ckpt_msvd_retrieval_looseType \
    --lr 1e-4 --max_words 32 --max_frames 12 --batch_size_val 16 \
    --datatype msvd \
    --feature_framerate 1 --coef_lr 1e-3 \
    --freeze_layer_num 0 --slice_framepos 2 \
    --loose_type --linear_patch 2d --sim_header meanP \
    --pretrained_clip_name ViT-B/32
main_task_retrieval.py: command not found

(CLIP4Clip) key2317@super:~/CLIP4Clip$ CUDA_VISIBLE_DEVICES=3,4,0,5 python -m torch.distributed.launch --nproc_per_node=4 \
    main_task_retrieval.py --do_train --num_thread_reader=2 \
    [... same arguments as above ...] \
    --pretrained_clip_name ViT-B/32
Traceback (most recent call last):
  File "/home/key2317/anaconda3/envs/CLIP4Clip/lib/python3.8/runpy.py", line 185, in _run_module_as_main
    mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
  File "/home/key2317/anaconda3/envs/CLIP4Clip/lib/python3.8/runpy.py", line 111, in _get_module_details
    __import__(pkg_name)
  File "/home/key2317/anaconda3/envs/CLIP4Clip/lib/python3.8/site-packages/torch/__init__.py", line 189, in <module>
    _load_global_deps()
  File "/home/key2317/anaconda3/envs/CLIP4Clip/lib/python3.8/site-packages/torch/__init__.py", line 142, in _load_global_deps
    ctypes.CDLL(lib_path, mode=ctypes.RTLD_GLOBAL)
  File "/home/key2317/anaconda3/envs/CLIP4Clip/lib/python3.8/ctypes/__init__.py", line 373, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: /home/key2317/anaconda3/envs/CLIP4Clip/lib/python3.8/site-packages/torch/lib/../../../../libcublas.so.11: undefined symbol: free_gemm_select, version libcublasLt.so.11
(CLIP4Clip) key2317@super:~/CLIP4Clip$

Server 19

(CLIP4Clip) key2317@ubuntu:~/video-multimodal/CLIP4Clip$ python -m torch.distributed.launch --nproc_per_node=4 \
    main_task_retrieval.py --do_train --num_thread_reader=2 \
    --epochs=5 --batch_size=128 --n_display=50 \
    --data_path ${DATA_PATH} \
    --features_path ${DATA_PATH}/MSVD_Videos \
    --output_dir ckpts/ckpt_msvd_retrieval_looseType \
    --lr 1e-4 --max_words 32 --max_frames 12 --batch_size_val 16 \
    --datatype msvd \
    --feature_framerate 1 --coef_lr 1e-3 \
    --freeze_layer_num 0 --slice_framepos 2 \
    --loose_type --linear_patch 2d --sim_header meanP \
    --pretrained_clip_name ViT-B/32

Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.

Traceback (most recent call last):
  File "main_task_retrieval.py", line 29, in <module>
    torch.distributed.init_process_group(backend="nccl")
  File "/home/key2317/anaconda3/envs/CLIP4Clip/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 436, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/home/key2317/anaconda3/envs/CLIP4Clip/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 179, in _env_rendezvous_handler
    store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
RuntimeError: Address already in use

Traceback (most recent call last):
  File "/home/key2317/anaconda3/envs/CLIP4Clip/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/key2317/anaconda3/envs/CLIP4Clip/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/key2317/anaconda3/envs/CLIP4Clip/lib/python3.8/site-packages/torch/distributed/launch.py", line 260, in <module>
    main()
  File "/home/key2317/anaconda3/envs/CLIP4Clip/lib/python3.8/site-packages/torch/distributed/launch.py", line 255, in main
    raise subprocess.CalledProcessError(returncode=process.returncode,
subprocess.CalledProcessError: Command '['/home/key2317/anaconda3/envs/CLIP4Clip/bin/python', '-u', 'main_task_retrieval.py', '--local_rank=3', '--do_train', '--num_thread_reader=2', '--epochs=5', '--batch_size=128', '--n_display=50', '--data_path', '--features_path', '/MSVD_Videos', '--output_dir', 'ckpts/ckpt_msvd_retrieval_looseType', '--lr', '1e-4', '--max_words', '32', '--max_frames', '12', '--batch_size_val', '16', '--datatype', 'msvd', '--feature_framerate', '1', '--coef_lr', '1e-3', '--freeze_layer_num', '0', '--slice_framepos', '2', '--loose_type', '--linear_patch', '2d', '--sim_header', 'meanP', '--pretrained_clip_name', 'ViT-B/32']' returned non-zero exit status 1.

(CLIP4Clip) key2317@ubuntu:~/video-multimodal/CLIP4Clip$
Traceback (most recent call last):
  File "main_task_retrieval.py", line 29, in <module>
    torch.distributed.init_process_group(backend="nccl")
  File "/home/key2317/anaconda3/envs/CLIP4Clip/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 455, in init_process_group
    barrier()
  File "/home/key2317/anaconda3/envs/CLIP4Clip/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1960, in barrier
    work = _default_pg.barrier()
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1607370172916/work/torch/lib/c10d/ProcessGroupNCCL.cpp:784, unhandled system error, NCCL version 2.7.8
[The same NCCL traceback is printed twice more, once by each of the other two worker processes.]

ArrowLuo commented 2 years ago

Hi @celestialxevermore, thanks for your attention. On Server 19 the error is Address already in use, which suggests you should manually kill the previous processes still occupying the GPUs and rerun. On Server 16 I think the error is caused by your torch environment, not by the code. You can refer to here for the undefined symbol: free_gemm_select, version libcublasLt.so.11 error.
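For the Server 19 case, a minimal sketch of both options (the grep pattern, the kill target, the port number, and the NCCL_DEBUG line are only illustrative; --master_port is a standard option of torch.distributed.launch):

# find and kill leftover worker processes from the previous launch
ps aux | grep main_task_retrieval.py
kill -9 <pid>          # <pid> is a placeholder for each leftover process id

# or rerun on a free rendezvous port so the stale TCPStore does not clash
python -m torch.distributed.launch --nproc_per_node=4 --master_port 29501 \
    main_task_retrieval.py --do_train ...   # remaining arguments unchanged

# if the NCCL "unhandled system error" persists, ask NCCL for more detail
export NCCL_DEBUG=INFO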

celestialxevermore commented 2 years ago

@ArrowLuo Dear author, I really appreciate your reply. I will try it right away!

celestialxevermore commented 2 years ago

@ArrowLuo Dear author, after creating a new conda virtual environment, I ran this line: pip install torch==1.7.1+cu110 torchvision==0.8.2+cu110 torchaudio===0.7.2 -f https://download.pytorch.org/whl/torch_stable.html, which is the command given on the GitHub page you recommended.
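For reference, a standard one-liner (not specific to this repo) can confirm that the reinstalled wheel imports cleanly and sees the GPUs; for this install it should print something like 1.7.1+cu110 and 11.0, with is_available() returning True:

python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available(), torch.cuda.device_count())"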

Then I ran my code again, but I cannot see any sign of a process actually being executed, as shown below.

(CLIP4Clip) key2317@super:~/CLIP4Clip$ main_task_retrieval.py DATA_PATH=/home/key2317/CLIP4Clip/msvd_data/ VISIBLE_DEVICES=3,4,0,5 python -m torch.distributed.launch --nproc_per_node=4 \
    main_task_retrieval.py --do_train --num_thread_reader=2 \
    [... same arguments as in the Server 16 log above ...] \
    --pretrained_clip_name ViT-B/32
main_task_retrieval.py: command not found

(CLIP4Clip) key2317@super:~/CLIP4Clip$ CUDA_VISIBLE_DEVICES=3,4,0,5 python -m torch.distributed.launch --nproc_per_node=4 \
    main_task_retrieval.py --do_train --num_thread_reader=2 \
    [... same arguments as above ...] \
    --pretrained_clip_name ViT-B/32

Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
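As a side note, the main_task_retrieval.py: command not found message appears to happen because main_task_retrieval.py was typed as the first word of the shell line, with the DATA_PATH and VISIBLE_DEVICES assignments pasted onto the same line. A minimal sketch of setting the variable first and then launching (the path and device IDs are taken from the log above; the remaining flags are the ones in the full command shown earlier):

DATA_PATH=/home/key2317/CLIP4Clip/msvd_data/
CUDA_VISIBLE_DEVICES=3,4,0,5 python -m torch.distributed.launch --nproc_per_node=4 \
    main_task_retrieval.py --do_train --num_thread_reader=2 \
    --data_path ${DATA_PATH} \
    --features_path ${DATA_PATH}/MSVD_Videos \
    --output_dir ckpts/ckpt_msvd_retrieval_looseType \
    --datatype msvd --pretrained_clip_name ViT-B/32   # plus the other flags from the command above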

The spec of Server 16 is as follows.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.43.04    Driver Version: 515.43.04    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A10         On    | 00000000:4F:00.0 Off |                    0 |
|  0%   45C    P0    61W / 150W |   2733MiB / 23028MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A10         On    | 00000000:52:00.0 Off |                    0 |
|  0%   47C    P0    64W / 150W |   1379MiB / 23028MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A10         On    | 00000000:57:00.0 Off |                    0 |
|  0%   47C    P0    64W / 150W |   1379MiB / 23028MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA RTX A6000   On    | 00000000:D1:00.0 Off |                  Off |
| 30%   42C    P2    87W / 300W |   1422MiB / 49140MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA RTX A5000   On    | 00000000:D5:00.0 Off |                  Off |
| 30%   50C    P2    86W / 230W |   1350MiB / 24564MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA RTX A6000   On    | 00000000:D6:00.0 Off |                  Off |
| 40%   67C    P2   228W / 300W |  40946MiB / 49140MiB |     51%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+

ArrowLuo commented 2 years ago

Hi @celestialxevermore, what does the log show now? You can check whether the Memory-Usage stays at zero. I am not sure why there is no process; maybe top is also useful.
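A few standard commands make this easy to watch while the job is supposedly running; the grep pattern simply reuses the script name from the logs above:

watch -n 1 nvidia-smi                    # does Memory-Usage / GPU-Util ever move?
ps aux | grep main_task_retrieval.py     # did the launcher actually spawn the worker processes?
top                                      # per-process CPU and memory, as suggested above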

celestialxevermore commented 2 years ago

@ArrowLuo OK, thanks. I will check again. The log stops at


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


and there is no response at all after that :( Anyway, another teammate of mine is also trying to find out what is wrong. Thank you, sir!