NVIDIA-Genomics-Research / AtacWorks

Deep learning based processing of Atac-seq data
https://clara-parabricks.github.io/AtacWorks/

Multi-gpu model training does not work #118

Closed avantikalal closed 4 years ago

avantikalal commented 4 years ago

I tried training a model with the --distributed flag on 8 GPUs, using the following command:

python /alal-AtacWorks/AtacWorks/main.py --train --train_files /alal-atac-data/dsc/h5/train.10 --val_files /alal-atac-data/dsc/h5/val.10 --model resnet --nblocks 5 --nblocks_cla 2 --width 51 --width_cla 51 --dil 8 --dil_cla 8 --nfilt 15 --nfilt_cla 15 --mse_weight 0.0005 --pearson_weight 1 --checkpoint_fname checkpoint.pth.tar --out_home /alal-atac-data/dsc/models --label resnet.10 --task both --eval_freq 1 --bs 64 --distributed --epochs 25 --lr 0.0002 --pad 5000

and got the following output:

INFO:2020-02-11 05:46:51,863:AtacWorks-main] Distributing to 8 GPUS
INFO:2020-02-11 05:47:03,557:AtacWorks-model_utils] Compiling model in DistributedDataParallel
INFO:2020-02-11 05:47:03,807:AtacWorks-model_utils] Compiling model in DistributedDataParallel
INFO:2020-02-11 05:47:03,991:AtacWorks-model_utils] Compiling model in DistributedDataParallel
INFO:2020-02-11 05:47:04,125:AtacWorks-model_utils] Compiling model in DistributedDataParallel
INFO:2020-02-11 05:47:04,164:AtacWorks-model_utils] Compiling model in DistributedDataParallel
INFO:2020-02-11 05:47:04,229:AtacWorks-model_utils] Compiling model in DistributedDataParallel
INFO:2020-02-11 05:47:04,231:AtacWorks-model_utils] Compiling model in DistributedDataParallel
INFO:2020-02-11 05:47:04,256:AtacWorks-model_utils] Compiling model in DistributedDataParallel
Traceback (most recent call last):
File "/alal-AtacWorks/AtacWorks/main.py", line 440, in <module>
main()
File "/alal-AtacWorks/AtacWorks/main.py", line 352, in main
args=(ngpus_per_node, args), join=True)
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
while not spawn_context.join():
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 118, in join
raise Exception(msg)
Exception:
-- Process 4 terminated with the following error:
Traceback (most recent call last):
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File "/alal-AtacWorks/AtacWorks/worker.py", line 177, in train_worker
os.mkdir(config_dir)
FileExistsError: [Errno 17] File exists: '/alal-atac-data/dsc/models/resnet.10_2020.02.11_05.46/configs'

When --distributed was replaced with --gpu 0, the same command worked fine and trained on 1 of 8 GPUs.
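The FileExistsError above looks like a race between the spawned workers: with torch.multiprocessing.spawn, every process runs train_worker and each one tries to create the same config directory, so whichever process loses the race hits os.mkdir on a path that already exists. A minimal sketch of a guard is below; the function name, the rank-0-only pattern, and the barrier placement are assumptions for illustration, not the actual AtacWorks fix.

import os
import torch.distributed as dist

def make_config_dir(config_dir, rank):
    # Hypothetical guard: only rank 0 creates the directory, and
    # exist_ok=True keeps the call idempotent if processes still race.
    if rank == 0:
        os.makedirs(config_dir, exist_ok=True)
    # Every worker waits here until the directory exists before
    # writing its config files into it.
    if dist.is_initialized():
        dist.barrier()

An even simpler variant is to replace the bare os.mkdir(config_dir) with os.makedirs(config_dir, exist_ok=True) in each worker, which tolerates the directory already having been created by a sibling process.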

ntadimeti commented 4 years ago

I'm not sure I follow this completely. I'm running the example/run.sh script with epochs=25 and --distributed on ToT dev-v0.2.0, and it runs just fine.

Were you running this on NGC? Does the example command run fine for you?

ntadimeti commented 4 years ago

Fixed.