celestialxevermore closed this issue 2 years ago.
Hi @celestialxevermore, thanks for your attention. On Server 19 the error is "Address already in use", which suggests that you should manually kill the previous process on the GPUs and rerun. On Server 16, I think the error is caused by your torch env rather than the code. For the error "undefined symbol: free_gemm_select, version libcublasLt.so.11", you can refer to here.
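For anyone hitting the same thing, a minimal sketch of what "kill the previous process and rerun" can look like (29500 is just the torch.distributed.launch default master port, and the PID is illustrative):

# see which stale python workers still hold GPU memory or the rendezvous port
nvidia-smi                    # note the PIDs in the Processes table
fuser -v 29500/tcp            # or: lsof -i :29500
kill -9 <PID>                 # kill the stale worker(s) found above

# or avoid the port conflict entirely by launching on a different master port
CUDA_VISIBLE_DEVICES=3,4,0,5 python -m torch.distributed.launch \
  --nproc_per_node=4 --master_port 29501 main_task_retrieval.py ...

# for the later NCCL "unhandled system error" (not mentioned above, just a generic debugging aid),
# NCCL_DEBUG=INFO makes NCCL print the underlying cause
NCCL_DEBUG=INFO CUDA_VISIBLE_DEVICES=3,4,0,5 python -m torch.distributed.launch \
  --nproc_per_node=4 main_task_retrieval.py ...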
@ArrowLuo Dear Author, I really appreciate your recommendation. I will try it right away!
@ArrowLuo Dear Author, after creating the conda virtual env, I ran this line: pip install torch==1.7.1+cu110 torchvision==0.8.2+cu110 torchaudio===0.7.2 -f https://download.pytorch.org/whl/torch_stable.html, which was written on the GitHub site you recommended.
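(A quick, generic way to sanity-check that such an install actually sees the GPUs, not a step from the README, is something like:)

python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available(), torch.cuda.device_count())"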
Then I ran my code, but no running process appears at all; it looks like this:

(CLIP4Clip) key2317@super:~/CLIP4Clip$ main_task_retrieval.py DATA_PATH=/home/key2317/CLIP4Clip/msvd_data/ VISIBLE_DEVICES=3,4,0,5 python -m torch.distributed.launch --nproc_per_node=4 \
main_task_retrieval.py --do_train --num_thread_reader=2 \
--epochs=5 --batch_size=128 --n_display=50 \
--data_path ${DATA_PATH} \
--features_path ${DATA_PATH}/MSVD_Videos \
--output_dir ckpts/ckpt_msvd_retrieval_looseType \
--lr 1e-4 --max_words 32 --max_frames 12 --batch_size_val 16 \
--datatype msvd \
--feature_framerate 1 --coef_lr 1e-3 \
--freeze_layer_num 0 --slice_framepos 2 \
--loose_type --linear_patch 2d --sim_header meanP \
--pretrained_clip_name ViT-B/32
main_task_retrieval.py: command not found
(CLIP4Clip) key2317@super:~/CLIP4Clip$ CUDA_VISIBLE_DEVICES=3,4,0,5 python -m torch.distributed.launch --nproc_per_node=4 \
main_task_retrieval.py --do_train --num_thread_reader=2 \
--epochs=5 --batch_size=128 --n_display=50 \
--data_path ${DATA_PATH} \
--features_path ${DATA_PATH}/MSVD_Videos \
--output_dir ckpts/ckpt_msvd_retrieval_looseType \
--lr 1e-4 --max_words 32 --max_frames 12 --batch_size_val 16 \
--datatype msvd \
--feature_framerate 1 --coef_lr 1e-3 \
--freeze_layer_num 0 --slice_framepos 2 \
--loose_type --linear_patch 2d --sim_header meanP \
--pretrained_clip_name ViT-B/32
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
The spec of Server 16 is as follows.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.43.04    Driver Version: 515.43.04    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A10          On   | 00000000:4F:00.0 Off |                    0 |
|  0%   45C    P0    61W / 150W |   2733MiB / 23028MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A10          On   | 00000000:52:00.0 Off |                    0 |
|  0%   47C    P0    64W / 150W |   1379MiB / 23028MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A10          On   | 00000000:57:00.0 Off |                    0 |
|  0%   47C    P0    64W / 150W |   1379MiB / 23028MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA RTX A6000    On   | 00000000:D1:00.0 Off |                  Off |
| 30%   42C    P2    87W / 300W |   1422MiB / 49140MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA RTX A5000    On   | 00000000:D5:00.0 Off |                  Off |
| 30%   50C    P2    86W / 230W |   1350MiB / 24564MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA RTX A6000    On   | 00000000:D6:00.0 Off |                  Off |
| 40%   67C    P2   228W / 300W |  40946MiB / 49140MiB |     51%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
Hi @celestialxevermore, how about the log? You can check whether the Memory-Usage is always zero. I am not sure why there is no process. Maybe top is also useful.
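For example (just illustrative monitoring commands, adjust the user and script name as needed):

watch -n 1 nvidia-smi                   # watch whether Memory-Usage on GPUs 3,4,0,5 ever rises
top -u key2317                          # see whether the python workers are alive and using CPU
ps aux | grep main_task_retrieval.py    # confirm the launched worker processes actually exist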
@ArrowLuo OK, thanks. I will check again. The log stops at
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
and there is no response at all after that :( Anyway, my other teammate is also trying to find out what is wrong. Thank you, Sir!
Dear author,
Thank you for helping me about four months ago.
I had tried to reply to your helpful comments, but my issue was already closed. I am really sorry about that.
Now I have another issue when running the framework with the MSVD dataset.
I had succeeded in running your framework several times, but these days I am hitting another problem in my environment.
If you don't mind, may I borrow your hands one more time?
Below are my error logs; I have access to two GPU servers.
Each section is the error log raised on one server.
If you need anything else from me, please leave a comment.
Sincerely,
Server 16
(CLIP4Clip) key2317@super:~/CLIP4Clip$ main_task_retrieval.py DATA_PATH=/home/key2317/CLIP4Clip/msvd_data/ VISIBLE_DEVICES=3,4,0,5 python -m torch.distributed.launch --nproc_per_node=4 \
main_task_retrieval.py --do_train --num_thread_reader=2 \
--epochs=5 --batch_size=128 --n_display=50 \
--data_path ${DATA_PATH} \
--features_path ${DATA_PATH}/MSVD_Videos \
--output_dir ckpts/ckpt_msvd_retrieval_looseType \
--lr 1e-4 --max_words 32 --max_frames 12 --batch_size_val 16 \
--datatype msvd \
--feature_framerate 1 --coef_lr 1e-3 \
--freeze_layer_num 0 --slice_framepos 2 \
--loose_type --linear_patch 2d --sim_header meanP \
--pretrained_clip_name ViT-B/32
main_task_retrieval.py: command not found
(CLIP4Clip) key2317@super:~/CLIP4Clip$ CUDA_VISIBLE_DEVICES=3,4,0,5 python -m torch.distributed.launch --nproc_per_node=4 \
Server 19
(CLIP4Clip) key2317@ubuntu:~/video-multimodal/CLIP4Clip$ python -m torch.distributed.launch --nproc_per_node=4 \
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
Traceback (most recent call last):
File "main_task_retrieval.py", line 29, in <module>
torch.distributed.init_process_group(backend="nccl")
File "/home/key2317/anaconda3/envs/CLIP4Clip/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 436, in init_process_group
store, rank, world_size = next(rendezvous_iterator)
File "/home/key2317/anaconda3/envs/CLIP4Clip/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 179, in _env_rendezvous_handler
store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
RuntimeError: Address already in use
Traceback (most recent call last):
File "/home/key2317/anaconda3/envs/CLIP4Clip/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/key2317/anaconda3/envs/CLIP4Clip/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/key2317/anaconda3/envs/CLIP4Clip/lib/python3.8/site-packages/torch/distributed/launch.py", line 260, in
main()
File "/home/key2317/anaconda3/envs/CLIP4Clip/lib/python3.8/site-packages/torch/distributed/launch.py", line 255, in main
raise subprocess.CalledProcessError(returncode=process.returncode,
subprocess.CalledProcessError: Command '['/home/key2317/anaconda3/envs/CLIP4Clip/bin/python', '-u', 'main_task_retrieval.py', '--local_rank=3', '--do_train', '--num_thread_reader=2', '--epochs=5', '--batch_size=128', '--n_display=50', '--data_path', '--features_path', '/MSVD_Videos', '--output_dir', 'ckpts/ckpt_msvd_retrieval_looseType', '--lr', '1e-4', '--max_words', '32', '--max_frames', '12', '--batch_size_val', '16', '--datatype', 'msvd', '--feature_framerate', '1', '--coef_lr', '1e-3', '--freeze_layer_num', '0', '--slice_framepos', '2', '--loose_type', '--linear_patch', '2d', '--sim_header', 'meanP', '--pretrained_clip_name', 'ViT-B/32']' returned non-zero exit status 1.
(CLIP4Clip) key2317@ubuntu:~/video-multimodal/CLIP4Clip$ Traceback (most recent call last):
File "main_task_retrieval.py", line 29, in
torch.distributed.init_process_group(backend="nccl")
File "/home/key2317/anaconda3/envs/CLIP4Clip/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 455, in init_process_group
barrier()
File "/home/key2317/anaconda3/envs/CLIP4Clip/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1960, in barrier
work = _default_pg.barrier()
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1607370172916/work/torch/lib/c10d/ProcessGroupNCCL.cpp:784, unhandled system error, NCCL version 2.7.8
Traceback (most recent call last):
File "main_task_retrieval.py", line 29, in
torch.distributed.init_process_group(backend="nccl")
File "/home/key2317/anaconda3/envs/CLIP4Clip/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 455, in init_process_group
barrier()
File "/home/key2317/anaconda3/envs/CLIP4Clip/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1960, in barrier
work = _default_pg.barrier()
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1607370172916/work/torch/lib/c10d/ProcessGroupNCCL.cpp:784, unhandled system error, NCCL version 2.7.8
Traceback (most recent call last):
File "main_task_retrieval.py", line 29, in
torch.distributed.init_process_group(backend="nccl")
File "/home/key2317/anaconda3/envs/CLIP4Clip/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 455, in init_process_group
barrier()
File "/home/key2317/anaconda3/envs/CLIP4Clip/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1960, in barrier
work = _default_pg.barrier()
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1607370172916/work/torch/lib/c10d/ProcessGroupNCCL.cpp:784, unhandled system error, NCCL version 2.7.8