MengXinChengXuYuan closed this issue 4 years ago
Maybe your "--batchSize 224" is too large; the original batchSize is 60 with 8 GPUs. You could try a smaller batchSize.
@iloveOREO Thank you for your kind help! However, I tried lowering the batch size and the problem still exists. It's so strange, I just can't figure out what could be wrong. And if I'm 'lucky' enough, the code sometimes runs without any problem :(
@MengXinChengXuYuan You're welcome! I think the error may be raised by multi-processing, which is influenced by the "--nThreads" parameter. I am using the "pytorch/pytorch:1.2-cuda10.0-cudnn7-devel" docker image and it works fine for me. I ran the example with the given pose dataset on a Tesla P100 with:
python train.py --name pose --dataset_mode fewshot_pose \
  --adaptive_spade --warp_ref --spade_combine --remove_face_labels --add_face_D \
  --niter_single 100 --niter 200 \
  --gpu_ids 0,1 --batchSize 4 --nThreads 2 --continue_train
Hope it can be helpful.
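(For context: in this family of codebases, "--nThreads" is typically passed straight to the PyTorch DataLoader as num_workers. The snippet below is only a minimal sketch of that pattern, assuming an opt object with nThreads and batchSize fields; it shows how setting the worker count to 0 serializes loading, which is a common way to rule out worker-process deadlocks when training hangs.)
'''
import torch.utils.data

def create_dataloader(dataset, opt):
    # Minimal sketch: opt.nThreads is assumed to control the number of
    # DataLoader worker processes, as in pix2pixHD-style option files.
    # num_workers=0 loads batches in the main process instead of in
    # forked workers, which helps isolate data-loading deadlocks.
    return torch.utils.data.DataLoader(
        dataset,
        batch_size=opt.batchSize,
        shuffle=True,
        num_workers=int(opt.nThreads))
'''
If the hang disappears with "--nThreads 0", the data-loading workers are the likely culprit rather than the model itself.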
@iloveOREO I see! Thank you for your help :p I will try lowering the number of threads I use.
Using the training configuration below:
--name face_no_warpref_64 --dataset_mode fewshot_face --adaptive_spade --gpu_ids 0,1,2,3,4,5 --batchSize 224 --nThreads 12 --tf_log --niter_single 3000 --loadSize 128 --fineSize 128 --gan_mode ls --continue_train --which_epoch latest
It will intermittently hang in Python threading, whether at the very beginning of training, after the first epoch, or some time after several epochs:
'''
Traceback (most recent call last):
  File "/home/xxx/image_augmentation/gan/few-shot-vid2vid-master/train.py", line 72, in <module>
    train()
  File "/home/xxx/image_augmentation/gan/few-shot-vid2vid-master/train.py", line 56, in train
    d_losses = model(data_list_t, mode='discriminator')
  File "/home/xxx/miniconda3/envs/py3.6_torch1.1.0/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/xxx/image_augmentation/gan/few-shot-vid2vid-master/models/models.py", line 86, in forward
    outputs = self.model(*inputs, **kwargs, dummy_bs=self.pad_bs)
  File "/home/liusiyao/miniconda3/envs/py3.6_torch1.1.0/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/xxx/miniconda3/envs/py3.6_torch1.1.0/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 152, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/xxx/miniconda3/envs/py3.6_torch1.1.0/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 162, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/xxx/miniconda3/envs/py3.6_torch1.1.0/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 75, in parallel_apply
    thread.join()
  File "/home/xxx/miniconda3/envs/py3.6_torch1.1.0/lib/python3.6/threading.py", line 1056, in join
    self._wait_for_tstate_lock()
  File "/home/xxx/miniconda3/envs/py3.6_torch1.1.0/lib/python3.6/threading.py", line 1072, in _wait_for_tstate_lock
    elif lock.acquire(block, timeout):
KeyboardInterrupt
'''
I think this might be a PyTorch bug? I'm using PyTorch 1.1.0 (because I'm not permitted to update the GPU driver) instead of 1.3.0, on 6 GTX Titan V cards. Does anyone else have the same problem?
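(A minimal diagnostic sketch, not part of the original report: Python's built-in faulthandler module can dump the stack of every thread while the process appears stuck, which helps tell a DataParallel parallel_apply deadlock apart from a stalled DataLoader worker. The signal choice and timeout below are just illustrative assumptions.)
'''
import faulthandler
import signal

# Dump all thread stacks to stderr when the process receives SIGUSR1.
# While training appears hung, run `kill -USR1 <pid>` from another shell
# to see exactly which lock or join() each thread is blocked on.
faulthandler.register(signal.SIGUSR1, all_threads=True)

# Alternatively, schedule a stack dump every 600 seconds so the hang
# location is captured even without manual intervention; cancel it with
# faulthandler.cancel_dump_traceback_later() once training is confirmed
# to be progressing.
faulthandler.dump_traceback_later(timeout=600, repeat=True)
'''
If the dump shows worker threads stuck inside parallel_apply, the problem is on the DataParallel side rather than in the data-loading workers.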