NVlabs / few-shot-vid2vid

Pytorch implementation for few-shot photorealistic video-to-video translation.

Multi-GPU training hangs in Python threading #10

Closed MengXinChengXuYuan closed 4 years ago

MengXinChengXuYuan commented 4 years ago

Using the training configuration below:

```
--name face_no_warpref_64 --dataset_mode fewshot_face --adaptive_spade --gpu_ids 0,1,2,3,4,5 --batchSize 224 --nThreads 12 --tf_log --niter_single 3000 --loadSize 128 --fineSize 128 --gan_mode ls --continue_train --which_epoch latest
```

Training hangs intermittently in Python threading, sometimes at the very beginning of training, sometimes after the first epoch, and sometimes after several epochs:

```
Traceback (most recent call last):
  File "/home/xxx/image_augmentation/gan/few-shot-vid2vid-master/train.py", line 72, in <module>
    train()
  File "/home/xxx/image_augmentation/gan/few-shot-vid2vid-master/train.py", line 56, in train
    d_losses = model(data_list_t, mode='discriminator')
  File "/home/xxx/miniconda3/envs/py3.6_torch1.1.0/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/xxx/image_augmentation/gan/few-shot-vid2vid-master/models/models.py", line 86, in forward
    outputs = self.model(*inputs, **kwargs, dummy_bs=self.pad_bs)
  File "/home/liusiyao/miniconda3/envs/py3.6_torch1.1.0/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/xxx/miniconda3/envs/py3.6_torch1.1.0/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 152, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/xxx/miniconda3/envs/py3.6_torch1.1.0/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 162, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/xxx/miniconda3/envs/py3.6_torch1.1.0/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 75, in parallel_apply
    thread.join()
  File "/home/xxx/miniconda3/envs/py3.6_torch1.1.0/lib/python3.6/threading.py", line 1056, in join
    self._wait_for_tstate_lock()
  File "/home/xxx/miniconda3/envs/py3.6_torch1.1.0/lib/python3.6/threading.py", line 1072, in _wait_for_tstate_lock
    elif lock.acquire(block, timeout):
KeyboardInterrupt
```

I think this might be a PyTorch bug? I'm using PyTorch 1.1.0 instead of 1.3.0 (because I'm not permitted to update the GPU driver), on 6 GTX Titan V cards. Does anyone else have the same problem?
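For context, the traceback stops inside `torch.nn.parallel.parallel_apply`, which runs each replica's forward pass in a Python thread and then blocks in `thread.join()` until every thread finishes, so anything a replica gets stuck waiting on shows up as a hang at `join()`. Below is a rough, simplified sketch of that thread-and-join pattern (not the actual PyTorch source, just an illustration of where the wait happens):

```python
# Simplified illustration of the pattern used by torch.nn.parallel.parallel_apply:
# each replica's forward runs in a Python thread, and the main thread blocks in
# thread.join() until every replica finishes. If any replica deadlocks (e.g. a
# stuck CUDA call), the main thread hangs at join(), as in the traceback above.
import threading

def parallel_apply_sketch(modules, inputs):
    results = [None] * len(modules)

    def worker(i, module, inp):
        # In real DataParallel each worker runs its replica on its own GPU.
        results[i] = module(inp)

    threads = [threading.Thread(target=worker, args=(i, m, x))
               for i, (m, x) in enumerate(zip(modules, inputs))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # <- a hang here means some worker thread never returned
    return results

if __name__ == "__main__":
    # Toy callables standing in for per-GPU model replicas.
    modules = [lambda x: x * 2, lambda x: x + 1]
    print(parallel_apply_sketch(modules, [3, 4]))  # [6, 5]
```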

iloveOREO commented 4 years ago

Maybe your `--batchSize 224` is too large; the original batchSize is 60 with 8 GPUs. You could try a smaller batchSize.
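Rough arithmetic on the per-GPU load, assuming the batch is split evenly across the GPUs listed in `--gpu_ids`:

```python
# Per-GPU batch arithmetic (assumes an even split across GPUs).
ref_batch, ref_gpus = 60, 8      # reference setting mentioned above
your_batch, your_gpus = 224, 6   # configuration from this issue

print(ref_batch / ref_gpus)      # 7.5 samples per GPU
print(your_batch / your_gpus)    # ~37.3 samples per GPU, roughly 5x the reference load
```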

MengXinChengXuYuan commented 4 years ago

@iloveOREO Thank you for your kind help! However, I tried lowering the batch size and the problem still exists. It's so strange, I just can't figure out what could be wrong. And if I'm 'lucky' enough, the code sometimes runs without any problem :(

iloveOREO commented 4 years ago

@MengXinChengXuYuan You're welcome! I think the error may be raised by multi-processing, which can be influenced by the `--nThreads` parameter. I am using the `pytorch/pytorch:1.2-cuda10.0-cudnn7-devel` Docker image and it works fine for me. I ran the example with the provided pose dataset on a Tesla P100 using:

```
python train.py --name pose --dataset_mode fewshot_pose \
  --adaptive_spade --warp_ref --spade_combine --remove_face_labels --add_face_D \
  --niter_single 100 --niter 200 \
  --gpu_ids 0,1 --batchSize 4 --nThreads 2 --continue_train
```

Hope it can be helpful.
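One extra debugging idea (an assumption on my side: `--nThreads` ends up as the DataLoader's `num_workers`, as in the pix2pixHD-style loaders this repo builds on): setting it to 0 forces single-process data loading, which is a quick way to check whether the loader worker processes are involved in the hang. A minimal generic sketch:

```python
# Minimal sketch (assumption: opt.nThreads is what becomes num_workers).
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(32, 3), torch.randn(32, 1))

# num_workers > 0: each worker is a separate process; a stuck worker can make
# the main process appear to hang while waiting on a lock or join.
loader_mp = DataLoader(dataset, batch_size=8, num_workers=2)

# num_workers=0: all loading happens in the main process -- slower, but a
# useful way to rule the workers in or out when debugging a hang.
loader_sp = DataLoader(dataset, batch_size=8, num_workers=0)

for batch in loader_sp:
    pass  # training step would go here
```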

MengXinChengXuYuan commented 4 years ago

@iloveOREO I see! Thank you for your help :p I will try lowering the number of threads I use.