ArrowLuo / CLIP4Clip

An official implementation for "CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval"
https://arxiv.org/abs/2104.08860
MIT License

Some errors : subprocess.CalledProcessError #61

Closed: celestialxevermore closed this issue 2 years ago

celestialxevermore commented 2 years ago

Dear Author, thank you very much for open-sourcing this work.

I'm a novice without much experience running frameworks like this, and today I have been stuck on the following error:

subprocess.CalledProcessError: Command '['/home/key2317/anaconda3/envs/CLIP4Clip/bin/python', '-u', 'main_task_retrieval.py', '--local_rank=3', '--do_train', '--num_thread_reader=0', '--epochs=5', '--batch_size=128', '--n_display=50', '--train_csv', '/home/key2317/CLIP4Clip/msrvtt_data/MSRVTT_train.9k.csv', '--val_csv', '/home/key2317/CLIP4Clip/msrvtt_data/MSRVTT_JSFUSION_test.csv', '--data_path', '/home/key2317/CLIP4Clip/msrvtt_data/MSRVTT_data.json', '--features_path', '/home/key2317/CLIP4Clip/msrvtt_data/MSRVTT_Videos', '--output_dir', 'ckpts/ckpt_msrvtt_retrieval_looseType', '--lr', '1e-4', '--max_words', '32', '--max_frames', '12', '--batch_size_val', '16', '--datatype', 'msrvtt', '--expand_msrvtt_sentences', '--feature_framerate', '1', '--coef_lr', '1e-3', '--freeze_layer_num', '0', '--slice_framepos', '2', '--loose_type', '--linear_patch', '2d', '--sim_header', 'meanP', '--pretrained_clip_name', 'ViT-B/32']' returned non-zero exit status 1.

After I entered the launch command, it seemed to run well for about 5 seconds, printing internal state such as vision_layers: 12, vision_width: 768, and so on. Unfortunately, it then ended with the message above.

One of my colleagues guessed that the problem might be a mismatch in our GPU environment, in short a GPU problem, but I'm not sure how to diagnose it.

I am really interested in your paper and code, but I cannot get past this initial step. Would you please help me?

Thanks.

ArrowLuo commented 2 years ago

@celestialxevermore, thanks for your interest. Can you print your command here? I cannot locate your problem from the above information. I am also unsure what caused the subprocess.CalledProcessError.

celestialxevermore commented 2 years ago

Sure. The following line is my command.

python -m torch.distributed.launch --nproc_per_node=4 \
main_task_retrieval.py --do_train --num_thread_reader=0 \
--epochs=5 --batch_size=128 --n_display=50 \
--train_csv /home/key2317/CLIP4Clip/msrvtt_data/MSRVTT_train.9k.csv \
--val_csv /home/key2317/CLIP4Clip/msrvtt_data/MSRVTT_JSFUSION_test.csv \
--data_path /home/key2317/CLIP4Clip/msrvtt_data/MSRVTT_data.json \
--features_path /home/key2317/CLIP4Clip/msrvtt_data/MSRVTT_Videos \
--output_dir ckpts/ckpt_msrvtt_retrieval_looseType \
--lr 1e-4 --max_words 32 --max_frames 12 --batch_size_val 16 \
--datatype msrvtt --expand_msrvtt_sentences \
--feature_framerate 1 --coef_lr 1e-3 \
--freeze_layer_num 0 --slice_framepos 2 \
--loose_type --linear_patch 2d --sim_header meanP \
--pretrained_clip_name ViT-B/32

ArrowLuo commented 2 years ago

Hi @celestialxevermore, I do not think there is anything wrong with this command, but I do not understand why it causes the error, whose command contains '--local_rank=3', '--do_train'. Can you post your nvidia-smi output here?
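
For context, here is a minimal sketch of how an error of this shape can arise. It assumes the launcher acts as a parent that starts one child process per GPU, appends --local_rank=N to each child's command, and raises subprocess.CalledProcessError when a child exits with a non-zero status. CHILD_CODE and launch are hypothetical stand-ins written for illustration; they are not the project's code or PyTorch's implementation. The point is that the CalledProcessError only reports which child failed and its exit status, so the actual cause has to be read from that child's own traceback earlier in the log.

```python
import subprocess
import sys

# Hypothetical stand-in for a training script: only rank 3 crashes.
CHILD_CODE = (
    "import sys\n"
    "rank = int(sys.argv[1].split('=')[1])\n"
    "if rank == 3:\n"
    "    sys.stderr.write('RuntimeError: simulated failure on rank 3\\n')\n"
    "    sys.exit(1)\n"
)


def launch(nproc):
    """Start one child per rank; raise CalledProcessError for the first failure."""
    procs = []
    for rank in range(nproc):
        # The parent appends --local_rank=N, which is why it shows up in the error.
        cmd = [sys.executable, "-c", CHILD_CODE, f"--local_rank={rank}"]
        proc = subprocess.Popen(cmd, stderr=subprocess.PIPE, text=True)
        procs.append((cmd, proc))

    first_error = None
    for cmd, proc in procs:
        _, stderr = proc.communicate()  # wait for the child and collect its stderr
        if proc.returncode != 0 and first_error is None:
            first_error = subprocess.CalledProcessError(
                proc.returncode, cmd, stderr=stderr
            )
    if first_error is not None:
        raise first_error


if __name__ == "__main__":
    try:
        launch(4)
    except subprocess.CalledProcessError as err:
        # The exception names the failing command; the real cause is in stderr.
        print("failed command:", err.cmd[-1])
        print("child stderr:", err.stderr.strip())
```

In this sketch the exception alone says only "returned non-zero exit status 1"; the useful line is the child's stderr. For the real run, the equivalent step is to scroll up to the first Python traceback printed before the CalledProcessError.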

celestialxevermore commented 2 years ago

First of all, thank you very much for your immediate reply.

Thu Feb 24 17:19:12 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01    Driver Version: 465.19.01    CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:1A:00.0 Off |                  N/A |
| 33%   36C    P8     4W / 250W |   3833MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  On   | 00000000:1B:00.0 Off |                  N/A |
| 22%   32C    P8    20W / 250W |   1184MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce ...  On   | 00000000:3D:00.0 Off |                  N/A |
| 22%   28C    P8     2W / 250W |   2385MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce ...  On   | 00000000:3E:00.0 Off |                  N/A |
| 22%   37C    P8    11W / 250W |      3MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA GeForce ...  On   | 00000000:88:00.0 Off |                  N/A |
| 22%   28C    P8    22W / 250W |   5083MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA GeForce ...  On   | 00000000:89:00.0 Off |                  N/A |
| 22%   32C    P8    19W / 250W |      3MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA GeForce ...  On   | 00000000:B1:00.0 Off |                  N/A |
| 22%   33C    P8    18W / 250W |      3MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA GeForce ...  On   | 00000000:B2:00.0 Off |                  N/A |
| 22%   32C    P8    17W / 250W |   1202MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     56702      C   ...s/dialogue_new/bin/python     1915MiB |
|    0   N/A  N/A     67815      C   ...s/dialogue_new/bin/python     1915MiB |
|    1   N/A  N/A     33373      C   ...onda3/envs/gnn/bin/python     1181MiB |
|    2   N/A  N/A     39629      C   ...onda3/envs/gnn/bin/python     1181MiB |
|    2   N/A  N/A     58176      C   ...onda3/envs/gnn/bin/python     1201MiB |
|    4   N/A  N/A     20230      C   ...da3/envs/evcf2/bin/python     2539MiB |
|    4   N/A  N/A     46449      C   ...da3/envs/evcf2/bin/python     2539MiB |
|    7   N/A  N/A     60275      C   ...vs/pytorch_env/bin/python     1199MiB |
+-----------------------------------------------------------------------------+

ArrowLuo commented 2 years ago

Hi @celestialxevermore, can you add CUDA_VISIBLE_DEVICES=0,1,2,3 before your command and try again? Or have you already added it?

CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 ...
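
The nvidia-smi output earlier in this thread shows that GPUs 0, 1, and 2 already hold memory from other jobs, while GPUs 3, 5, and 6 are nearly idle. As a hedged sketch, the snippet below ranks the GPUs by used memory and builds a CUDA_VISIBLE_DEVICES value from the least-loaded ones. The used-memory figures are copied from that output, and the helper names pick_gpus and visible_devices are hypothetical. Whether device selection is the actual cause of the failure is not established in this thread.

```python
# Used memory in MiB per GPU index, copied from the nvidia-smi output in this thread.
USED_MIB = {0: 3833, 1: 1184, 2: 2385, 3: 3, 4: 5083, 5: 3, 6: 3, 7: 1202}


def pick_gpus(used_mib, count):
    """Return the `count` GPU indices with the least used memory, in index order."""
    # Sort by used memory first, then by index so ties are broken deterministically.
    ranked = sorted(used_mib, key=lambda idx: (used_mib[idx], idx))
    return sorted(ranked[:count])


def visible_devices(used_mib, count):
    """Format the chosen GPU indices as a CUDA_VISIBLE_DEVICES value."""
    return ",".join(str(idx) for idx in pick_gpus(used_mib, count))


if __name__ == "__main__":
    print("CUDA_VISIBLE_DEVICES=" + visible_devices(USED_MIB, 4))
```

For four processes this selects GPUs 1, 3, 5, and 6, so the value would be placed before the launch command in the same way as CUDA_VISIBLE_DEVICES=0,1,2,3 above, with --nproc_per_node matching the number of listed devices.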