Closed celestialxevermore closed 2 years ago
@celestialevermore, thanks for your interest. Can you post your command here? I cannot locate the problem from the information above. I am also unsure what caused the subprocess.CalledProcessError.
Sure. The following line is my command.
python -m torch.distributed.launch --nproc_per_node=4 \
main_task_retrieval.py --do_train --num_thread_reader=0 \
--epochs=5 --batch_size=128 --n_display=50 \
--train_csv /home/key2317/CLIP4Clip/msrvtt_data/MSRVTT_train.9k.csv \
--val_csv /home/key2317/CLIP4Clip/msrvtt_data/MSRVTT_JSFUSION_test.csv \
--data_path /home/key2317/CLIP4Clip/msrvtt_data/MSRVTT_data.json \
--features_path /home/key2317/CLIP4Clip/msrvtt_data/MSRVTT_Videos \
--output_dir ckpts/ckpt_msrvtt_retrieval_looseType \
--lr 1e-4 --max_words 32 --max_frames 12 --batch_size_val 16 \
--datatype msrvtt --expand_msrvtt_sentences \
--feature_framerate 1 --coef_lr 1e-3 \
--freeze_layer_num 0 --slice_framepos 2 \
--loose_type --linear_patch 2d --sim_header meanP \
--pretrained_clip_name ViT-B/32
Hi @celestialevermore, I do not think there is anything wrong with this command, but I do not see why it would cause the error ', '--local_rank=3', '--do_train', '-, which contains --local_rank=3. Can you post your nvidia-smi output here?
First, thank you very much for your quick reply. Here is my nvidia-smi output:
Thu Feb 24 17:19:12 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01 Driver Version: 465.19.01 CUDA Version: 11.3 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... On | 00000000:1A:00.0 Off | N/A |
| 33% 36C P8 4W / 250W | 3833MiB / 11019MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce ... On | 00000000:1B:00.0 Off | N/A |
| 22% 32C P8 20W / 250W | 1184MiB / 11019MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA GeForce ... On | 00000000:3D:00.0 Off | N/A |
| 22% 28C P8 2W / 250W | 2385MiB / 11019MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA GeForce ... On | 00000000:3E:00.0 Off | N/A |
| 22% 37C P8 11W / 250W | 3MiB / 11019MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 4 NVIDIA GeForce ... On | 00000000:88:00.0 Off | N/A |
| 22% 28C P8 22W / 250W | 5083MiB / 11019MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 5 NVIDIA GeForce ... On | 00000000:89:00.0 Off | N/A |
| 22% 32C P8 19W / 250W | 3MiB / 11019MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 6 NVIDIA GeForce ... On | 00000000:B1:00.0 Off | N/A |
| 22% 33C P8 18W / 250W | 3MiB / 11019MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 7 NVIDIA GeForce ... On | 00000000:B2:00.0 Off | N/A |
| 22% 32C P8 17W / 250W | 1202MiB / 11019MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 56702 C ...s/dialogue_new/bin/python 1915MiB |
| 0 N/A N/A 67815 C ...s/dialogue_new/bin/python 1915MiB |
| 1 N/A N/A 33373 C ...onda3/envs/gnn/bin/python 1181MiB |
| 2 N/A N/A 39629 C ...onda3/envs/gnn/bin/python 1181MiB |
| 2 N/A N/A 58176 C ...onda3/envs/gnn/bin/python 1201MiB |
| 4 N/A N/A 20230 C ...da3/envs/evcf2/bin/python 2539MiB |
| 4 N/A N/A 46449 C ...da3/envs/evcf2/bin/python 2539MiB |
| 7 N/A N/A 60275 C ...vs/pytorch_env/bin/python 1199MiB |
+-----------------------------------------------------------------------------+
Hi @celestialevermore, can you add CUDA_VISIBLE_DEVICES=0,1,2,3 before your command and try again? Or have you already added it?
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 ...
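Since the nvidia-smi output above shows that several GPUs are already partly occupied by other jobs, it may be worth choosing the device list from current memory usage rather than fixing it to 0,1,2,3. The following is only a rough sketch of that idea, not something from the repository: the helper names (parse_gpu_memory, least_used_gpus, query_gpu_memory) are made up for illustration, and the sample numbers are copied from the Memory-Usage column posted in this thread.

```python
import subprocess


def parse_gpu_memory(csv_text):
    """Parse 'index, memory.used' CSV rows into (index, used_mib) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, used = (field.strip() for field in line.split(","))
        rows.append((int(index), int(used)))
    return rows


def least_used_gpus(rows, count):
    """Return the `count` GPU indices with the least memory in use."""
    ranked = sorted(rows, key=lambda row: (row[1], row[0]))
    return [index for index, _ in ranked[:count]]


def query_gpu_memory():
    """Ask nvidia-smi for live usage (defined here but not called below)."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=index,memory.used",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    return parse_gpu_memory(out)


# Memory usage (MiB) copied from the nvidia-smi output in this thread.
SAMPLE = """0, 3833
1, 1184
2, 2385
3, 3
4, 5083
5, 3
6, 3
7, 1202
"""

chosen = least_used_gpus(parse_gpu_memory(SAMPLE), 4)
print("CUDA_VISIBLE_DEVICES=" + ",".join(str(i) for i in chosen))
```

With the sample data this prints CUDA_VISIBLE_DEVICES=3,5,6,1. On a live machine, query_gpu_memory() could replace parse_gpu_memory(SAMPLE). Note that inside the launched processes the selected devices are renumbered starting from 0, so --nproc_per_node=4 still matches.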
Dear authors, thank you very much for open-sourcing this work.
I'm a novice without much experience running frameworks like this, and today I have been stuck on the following error:
subprocess.CalledProcessError: Command '['/home/key2317/anaconda3/envs/CLIP4Clip/bin/python', '-u', 'main_task_retrieval.py', '--local_rank=3', '--do_train', '--num_thread_reader=0', '--epochs=5', '--batch_size=128', '--n_display=50', '--train_csv', '/home/key2317/CLIP4Clip/msrvtt_data/MSRVTT_train.9k.csv', '--val_csv', '/home/key2317/CLIP4Clip/msrvtt_data/MSRVTT_JSFUSION_test.csv', '--data_path', '/home/key2317/CLIP4Clip/msrvtt_data/MSRVTT_data.json', '--features_path', '/home/key2317/CLIP4Clip/msrvtt_data/MSRVTT_Videos', '--output_dir', 'ckpts/ckpt_msrvtt_retrieval_looseType', '--lr', '1e-4', '--max_words', '32', '--max_frames', '12', '--batch_size_val', '16', '--datatype', 'msrvtt', '--expand_msrvtt_sentences', '--feature_framerate', '1', '--coef_lr', '1e-3', '--freeze_layer_num', '0', '--slice_framepos', '2', '--loose_type', '--linear_patch', '2d', '--sim_header', 'meanP', '--pretrained_clip_name', 'ViT-B/32']' returned non-zero exit status 1.
After I entered the command, it seemed to run well for about 5 seconds, printing internal state such as vision_layers: 12, vision_width: 768, and so on. Unfortunately, it then ended with the message above.
One of my colleagues guessed that it might be a mismatch in our GPU environment, in short a GPU problem, but I'm not sure how to diagnose it.
I am really interested in your paper and code, but I cannot get past this initial step. Would you please help me?
Thanks.
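On the question of how to diagnose this: a subprocess.CalledProcessError only says that a child process exited with a non-zero status, so the launcher message by itself does not reveal the cause. The actual Python traceback from the failing worker is normally printed to stderr before the launcher gives up. The small sketch below illustrates that behaviour with a stand-in worker. It does not touch CLIP4Clip or PyTorch, and the failing command and the run_worker helper are hypothetical.

```python
import subprocess
import sys


def run_worker(args):
    """Run a child Python process and return (exit code, captured stderr)."""
    result = subprocess.run(
        [sys.executable, *args], capture_output=True, text=True
    )
    return result.returncode, result.stderr


# Stand-in for a training script that crashes shortly after start-up.
failing = ["-c", "raise RuntimeError('hypothetical failure inside the worker')"]

# The worker's own stderr carries the real traceback.
code, stderr = run_worker(failing)
print("exit code:", code)
print("last stderr line:", stderr.strip().splitlines()[-1])

# A wrapper that only checks the exit status reports far less detail.
try:
    subprocess.check_call([sys.executable, *failing], stderr=subprocess.DEVNULL)
except subprocess.CalledProcessError as err:
    wrapper_code = err.returncode
    print("wrapper saw:", type(err).__name__, "with status", wrapper_code)
```

For the real command, the equivalent step would be to scroll up past the CalledProcessError and look for the first traceback printed by one of the workers. Temporarily launching with --nproc_per_node=1 may also make that output easier to read, since only one process writes to the terminal.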