**Describe the bug**
I am trying to train StyleGAN2 on FFHQ at 512×512 resolution with standard settings. Several training runs completed fine earlier, but at some point every run started failing with the error below. It may not be related to the code itself, unless something is accumulating state somewhere.
**To Reproduce**
Standard training command.
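For reference, the invocation is along these lines (a sketch following the stylegan3 README's command format; the dataset path and hyperparameter values here are placeholders, not necessarily the ones from the failing run — `--gpus=4` just matches the four spawned processes in the traceback):

```shell
# Hypothetical reconstruction of the "standard" training command.
# Paths and hyperparameter values are placeholders.
python train.py --outdir=~/training-runs \
    --cfg=stylegan2 --data=~/datasets/ffhq-512x512.zip \
    --gpus=4 --batch=32 --gamma=8
```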
```
Constructing networks...
Traceback (most recent call last):
  File "train.py", line 286, in <module>
    main() # pylint: disable=no-value-for-parameter
  File "~/stylegan3/venv/lib/python3.8/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "~/stylegan3/venv/lib/python3.8/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "~/stylegan3/venv/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "~/stylegan3/venv/lib/python3.8/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "train.py", line 281, in main
    launch_training(c=c, desc=desc, outdir=opts.outdir, dry_run=opts.dry_run)
  File "train.py", line 98, in launch_training
    torch.multiprocessing.spawn(fn=subprocess_fn, args=(c, temp_dir), nprocs=c.num_gpus)
  File "~/stylegan3/venv/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "~/stylegan3/venv/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "~/stylegan3/venv/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 3 terminated with the following error:
Traceback (most recent call last):
  File "~/stylegan3/venv/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "~/stylegan3/train.py", line 47, in subprocess_fn
    training_loop.training_loop(rank=rank, **c)
  File "~/stylegan3/training/training_loop.py", line 152, in training_loop
    G = dnnlib.util.construct_class_by_name(**G_kwargs, **common_kwargs).train().requires_grad_(False).to(device) # subclass of torch.nn.Module
  File "~/stylegan3/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 989, in to
    return self._apply(convert)
  File "~/stylegan3/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 641, in _apply
    module._apply(fn)
  File "~/stylegan3/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 641, in _apply
    module._apply(fn)
  File "~/stylegan3/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 641, in _apply
    module._apply(fn)
  [Previous line repeated 1 more time]
  File "~/stylegan3/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 664, in _apply
    param_applied = fn(param)
  File "~/stylegan3/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 987, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
```
**Expected behavior**
Training should start.
**Desktop (please complete the following information):**
- OS: Linux, Ubuntu 22.04
- PyTorch version: 1.13.0
- CUDA toolkit version: 11.7
- GPU: NVIDIA A6000
- Docker: No
**Additional context**
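One thing I can still check on this machine: "CUDA-capable device(s) is/are busy or unavailable" is the error CUDA typically raises when a GPU is in `Exclusive_Process` compute mode, or when stale processes from an earlier run are still holding the devices (which would match training having worked before and then failing every time). A sketch of the checks, assuming `nvidia-smi` is available (the compute-mode reset needs root):

```shell
# 1. Compute mode: "Exclusive_Process" makes a GPU refuse a second CUDA
#    context even when it looks idle.
nvidia-smi --query-gpu=index,compute_mode --format=csv
# 2. Leftover compute processes from a previous run still holding the GPUs.
nvidia-smi --query-compute-apps=pid,used_memory --format=csv
# 3. Reset compute mode to the shared default (requires root).
sudo nvidia-smi -c DEFAULT
```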