NVlabs / stylegan3

Official PyTorch implementation of StyleGAN3

training crashed #9

Open thepowerfuldeez opened 2 years ago

thepowerfuldeez commented 2 years ago

After running for a while, training crashed unexpectedly. I've tried batch sizes 24 and 48 on 3 GPUs.


```
Evaluating metrics...
{"results": {"fid50k_full": 10.638274366895118}, "metric": "fid50k_full", "total_time": 491.9900858402252, "total_time_str": "8m 12s", "num_gpus": 3, "snapshot_pkl": "network-snapshot-000000.pkl", "timestamp": 1634026244.8660324}
tick 1     kimg 4.0      time 13m 32s      sec/tick 224.1   sec/kimg 55.93   maintenance 522.8  cpumem 4.30   gpumem 4.49   reserved 4.96   augment 0.037
tick 2     kimg 8.0      time 17m 16s      sec/tick 224.3   sec/kimg 55.97   maintenance 0.1    cpumem 4.27   gpumem 4.50   reserved 4.96   augment 0.075
tick 3     kimg 12.0     time 21m 07s      sec/tick 230.4   sec/kimg 57.50   maintenance 0.1    cpumem 4.29   gpumem 4.50   reserved 4.96   augment 0.113
tick 4     kimg 16.0     time 24m 55s      sec/tick 227.8   sec/kimg 56.83   maintenance 0.1    cpumem 4.28   gpumem 4.53   reserved 4.96   augment 0.151
tick 5     kimg 20.1     time 31m 16s      sec/tick 381.3   sec/kimg 95.15   maintenance 0.1    cpumem 4.30   gpumem 4.56   reserved 4.96   augment 0.180
tick 6     kimg 24.1     time 41m 39s      sec/tick 622.9   sec/kimg 155.42  maintenance 0.1    cpumem 4.29   gpumem 4.52   reserved 4.96   augment 0.214
tick 7     kimg 28.1     time 46m 16s      sec/tick 277.0   sec/kimg 69.11   maintenance 0.1    cpumem 4.30   gpumem 4.58   reserved 4.96   augment 0.234
tick 8     kimg 32.1     time 54m 38s      sec/tick 501.8   sec/kimg 125.21  maintenance 0.1    cpumem 4.27   gpumem 4.55   reserved 4.96   augment 0.247
Traceback (most recent call last):
  File "/home/george/work/stylegan3/train.py", line 286, in <module>
    main() # pylint: disable=no-value-for-parameter
  File "/home/george/anaconda3/envs/stylegan3/lib/python3.9/site-packages/click/core.py", line 1137, in __call__
    return self.main(*args, **kwargs)
  File "/home/george/anaconda3/envs/stylegan3/lib/python3.9/site-packages/click/core.py", line 1062, in main
    rv = self.invoke(ctx)
  File "/home/george/anaconda3/envs/stylegan3/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/george/anaconda3/envs/stylegan3/lib/python3.9/site-packages/click/core.py", line 763, in invoke
    return __callback(*args, **kwargs)
  File "/home/george/work/stylegan3/train.py", line 281, in main
    launch_training(c=c, desc=desc, outdir=opts.outdir, dry_run=opts.dry_run)
  File "/home/george/work/stylegan3/train.py", line 98, in launch_training
    torch.multiprocessing.spawn(fn=subprocess_fn, args=(c, temp_dir), nprocs=c.num_gpus)
  File "/home/george/anaconda3/envs/stylegan3/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/george/anaconda3/envs/stylegan3/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/home/george/anaconda3/envs/stylegan3/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 130, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 1 terminated with signal SIGKILL
(stylegan3) george@frog:~/work/stylegan3$ /home/george/anaconda3/envs/stylegan3/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 51 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
```
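
For anyone reading the log above, the per-tick lines can be pulled apart to see where the run slowed down before the kill. The following is only a rough sketch: the field names (`sec/tick`, `cpumem`, `gpumem`) come from the log itself, but the helpers `parse_ticks` and `slow_ticks` and the 1.5x threshold are hypothetical choices made for illustration, and flagging slow ticks does not by itself explain the SIGKILL.

```python
import re

# Tick lines copied from the log in this issue.
SAMPLE_LOG = """\
tick 1     kimg 4.0      time 13m 32s      sec/tick 224.1   sec/kimg 55.93   maintenance 522.8  cpumem 4.30   gpumem 4.49   reserved 4.96   augment 0.037
tick 2     kimg 8.0      time 17m 16s      sec/tick 224.3   sec/kimg 55.97   maintenance 0.1    cpumem 4.27   gpumem 4.50   reserved 4.96   augment 0.075
tick 3     kimg 12.0     time 21m 07s      sec/tick 230.4   sec/kimg 57.50   maintenance 0.1    cpumem 4.29   gpumem 4.50   reserved 4.96   augment 0.113
tick 4     kimg 16.0     time 24m 55s      sec/tick 227.8   sec/kimg 56.83   maintenance 0.1    cpumem 4.28   gpumem 4.53   reserved 4.96   augment 0.151
tick 5     kimg 20.1     time 31m 16s      sec/tick 381.3   sec/kimg 95.15   maintenance 0.1    cpumem 4.30   gpumem 4.56   reserved 4.96   augment 0.180
tick 6     kimg 24.1     time 41m 39s      sec/tick 622.9   sec/kimg 155.42  maintenance 0.1    cpumem 4.29   gpumem 4.52   reserved 4.96   augment 0.214
tick 7     kimg 28.1     time 46m 16s      sec/tick 277.0   sec/kimg 69.11   maintenance 0.1    cpumem 4.30   gpumem 4.58   reserved 4.96   augment 0.234
tick 8     kimg 32.1     time 54m 38s      sec/tick 501.8   sec/kimg 125.21  maintenance 0.1    cpumem 4.27   gpumem 4.55   reserved 4.96   augment 0.247
"""

# Matches one status line; the pattern assumes the column order shown above.
TICK_RE = re.compile(
    r"tick\s+(?P<tick>\d+)\s+kimg\s+(?P<kimg>[\d.]+).*?"
    r"sec/tick\s+(?P<sec_per_tick>[\d.]+).*?"
    r"cpumem\s+(?P<cpumem>[\d.]+)\s+gpumem\s+(?P<gpumem>[\d.]+)"
)


def parse_ticks(log_text):
    """Return one dict per tick line found in the log text (hypothetical helper)."""
    rows = []
    for line in log_text.splitlines():
        m = TICK_RE.search(line)
        if m:
            rows.append({
                "tick": int(m["tick"]),
                "kimg": float(m["kimg"]),
                "sec_per_tick": float(m["sec_per_tick"]),
                "cpumem": float(m["cpumem"]),
                "gpumem": float(m["gpumem"]),
            })
    return rows


def slow_ticks(rows, factor=1.5):
    """Return tick numbers whose sec/tick exceeds `factor` times the first tick's value."""
    if not rows:
        return []
    baseline = rows[0]["sec_per_tick"]
    return [r["tick"] for r in rows if r["sec_per_tick"] > factor * baseline]


rows = parse_ticks(SAMPLE_LOG)
print(slow_ticks(rows))  # [5, 6, 8]
```

On this log, ticks 5, 6, and 8 take well over 1.5 times as long as tick 1, while the reported `cpumem` and `gpumem` values stay almost flat.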

**Desktop (please complete the following information):**
 - OS: Ubuntu 20.04
 - PyTorch version: 1.9
 - CUDA toolkit version: 11.3
 - GPU: RTX 3090
 - Docker: No

manoadamro commented 2 years ago

I vaguely remember seeing this when I was setting up StyleGAN2 a while back. I'm pretty sure my fix was to change the Python version from 3.9 to 3.7 🤔

thepowerfuldeez commented 2 years ago

@manoadamro Hmm, that's unfortunate, because I set up the conda environment as described in README.md. Thanks, though.

manoadamro commented 2 years ago

@thepowerfuldeez You should still be able to use conda. I've just noticed that the README for this version suggests Python 3.8. You could try creating a new env with `conda create --name <env_name> python=3.8` and see if that helps?
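
After recreating the environment, it may be worth confirming which interpreter is actually active before restarting training. This is just a small sketch: `check_python` and `describe_env` are hypothetical helper names, the 3.8 target is the version mentioned above, and a matching version is not guaranteed to fix the crash.

```python
import sys


def check_python(expected=(3, 8), version_info=None):
    """Compare the major.minor Python version against `expected` (hypothetical helper)."""
    actual = tuple((version_info or sys.version_info)[:2])
    return actual == tuple(expected), actual


def describe_env():
    """Collect the versions listed in the issue; PyTorch fields are None if it is not installed."""
    info = {"python": "%d.%d" % tuple(sys.version_info[:2]), "torch": None, "cuda": None}
    try:
        import torch
        info["torch"] = torch.__version__
        info["cuda"] = torch.version.cuda  # None for CPU-only builds
    except ImportError:
        pass
    return info


ok, actual = check_python()
print("Python %d.%d" % actual, "matches 3.8" if ok else "differs from 3.8")
print(describe_env())
```

Run it inside the activated env (for example, after `conda activate <env_name>`) so that it reports the interpreter training will use.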

jannehellsten commented 2 years ago

@thepowerfuldeez did you have any luck?