NVlabs / stylegan3

Official PyTorch implementation of StyleGAN3

RuntimeError: Timeout waiting for key: 0/0 #196

Open premchedella opened 2 years ago

I am training StyleGAN3 on my own dataset.

The training command is:

python train.py --outdir=../output/output_15sep2022/ --cfg=stylegan3-t --data=../windows/large-512x512.zip --gpus=8 --batch=32 --gamma=8.2 --mirror=1
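As a sanity check on the command above, here is a minimal sketch of how the total batch divides across GPUs. It assumes `--batch-gpu` is not set and that the batch is split evenly across processes; the helper name `per_gpu_batch` is hypothetical and is not part of the StyleGAN3 code.

```python
def per_gpu_batch(total_batch: int, num_gpus: int) -> int:
    """Return the number of samples each GPU handles per step.

    Assumption: the total batch is split evenly and --batch-gpu is unset.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    if total_batch % num_gpus != 0:
        raise ValueError("total batch must be a multiple of the GPU count")
    return total_batch // num_gpus


# Values taken from the command line above: --batch=32 --gpus=8
print(per_gpu_batch(32, 8))  # prints 4
```

With these values each of the 8 processes handles 4 images per step, so the batch settings themselves look consistent.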

I am getting the following error:

Setting up PyTorch plugin "upfirdn2d_plugin"...
Traceback (most recent call last):
  File "/ibex/ai/home/chedelp/stylegan3/stylegan3/train.py", line 286, in <module>
    main() # pylint: disable=no-value-for-parameter
  File "/home/chedelp/miniconda3/envs/stylegan3/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/home/chedelp/miniconda3/envs/stylegan3/lib/python3.9/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/home/chedelp/miniconda3/envs/stylegan3/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/chedelp/miniconda3/envs/stylegan3/lib/python3.9/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/ibex/ai/home/chedelp/stylegan3/stylegan3/train.py", line 281, in main
    launch_training(c=c, desc=desc, outdir=opts.outdir, dry_run=opts.dry_run)
  File "/ibex/ai/home/chedelp/stylegan3/stylegan3/train.py", line 98, in launch_training
    torch.multiprocessing.spawn(fn=subprocess_fn, args=(c, temp_dir), nprocs=c.num_gpus)
  File "/home/chedelp/miniconda3/envs/stylegan3/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/chedelp/miniconda3/envs/stylegan3/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/home/chedelp/miniconda3/envs/stylegan3/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 5 terminated with the following error:
Traceback (most recent call last):
  File "/home/chedelp/miniconda3/envs/stylegan3/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/ibex/ai/home/chedelp/stylegan3/stylegan3/train.py", line 47, in subprocess_fn
    training_loop.training_loop(rank=rank, **c)
  File "/ibex/ai/home/chedelp/stylegan3/stylegan3/training/training_loop.py", line 188, in training_loop
    torch.distributed.broadcast(param, src=0)
  File "/home/chedelp/miniconda3/envs/stylegan3/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1090, in broadcast
    work = default_pg.broadcast([tensor], opts)
RuntimeError: Timeout waiting for key: 0/0
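To make the log easier to read, here is a small sketch that pulls out the failing process index, the innermost stack frame, and the final error line from output like the above. The function name `summarize_spawn_failure` and the regular expressions are my own assumptions about the log layout, not part of PyTorch or StyleGAN3.

```python
import os
import re


def summarize_spawn_failure(log: str):
    """Summarize a torch.multiprocessing.spawn failure log.

    Returns None when the "-- Process N terminated" marker is absent.
    """
    marker = re.search(r"-- Process (\d+) terminated with the following error:", log)
    if marker is None:
        return None

    # Only look at the child's traceback, which follows the marker line.
    child_log = log[marker.end():]
    frames = re.findall(r'File "([^"]+)", line (\d+), in (\S+)', child_log)
    errors = re.findall(
        r"^(\w+(?:\.\w+)*(?:Error|Exception): .*)$", child_log, flags=re.MULTILINE
    )

    last_frame = None
    if frames:
        path, line_no, func = frames[-1]
        last_frame = (os.path.basename(path), int(line_no), func)

    return {
        "process": int(marker.group(1)),
        "last_frame": last_frame,
        "error": errors[-1].strip() if errors else None,
    }


# Excerpt copied from the traceback above.
EXCERPT = '''-- Process 5 terminated with the following error:
Traceback (most recent call last):
  File "/ibex/ai/home/chedelp/stylegan3/stylegan3/training/training_loop.py", line 188, in training_loop
    torch.distributed.broadcast(param, src=0)
  File "/home/chedelp/miniconda3/envs/stylegan3/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1090, in broadcast
    work = default_pg.broadcast([tensor], opts)
RuntimeError: Timeout waiting for key: 0/0
'''

summary = summarize_spawn_failure(EXCERPT)
print(summary)
```

For this log, the summary points at process 5 timing out inside `torch.distributed.broadcast`, which is called from `training_loop.py` line 188.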

How can I resolve this error?