facebookresearch / StyleNeRF

This is the open-source implementation of the ICLR 2022 paper "StyleNeRF: A Style-based 3D-Aware Generator for High-resolution Image Synthesis".

Training fails on multi-gpu setup #8

Closed: mike-athene closed this issue 2 years ago

mike-athene commented 2 years ago

Hello StyleNeRF folks, thank you so much for releasing the code!

I am trying to train the model on an 8xA6000 box, with no success so far.

python run_train.py outdir=/root/out data=/root/256.zip spec=paper256 model=stylenerf_ffhq

I have verified that training on a single A6000 GPU works, and I am using the provided configs.

I am running Ubuntu 20.04.3 LTS with PyTorch LTS (1.8.2) and CUDA 11.1 (which is necessary for A6000 support, AFAIK).

Here is the stack trace I am getting; let me know if I can provide any additional information:

Error executing job with overrides: ['outdir=/root/out', 'data=/root/P256.zip', 'spec=paper256', 'model=stylenerf_ffhq']
Traceback (most recent call last):
  File "run_train.py", line 396, in <module>
    main() # pylint: disable=no-value-for-parameter
  File "/usr/local/lib/python3.8/dist-packages/hydra/main.py", line 49, in decorated_main
    _run_hydra(
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 367, in _run_hydra
    run_and_report(
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 214, in run_and_report
    raise ex
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 211, in run_and_report
    return func()
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 368, in <lambda>
    lambda: hydra.run(
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/hydra.py", line 110, in run
    _ = ret.return_value
  File "/usr/local/lib/python3.8/dist-packages/hydra/core/utils.py", line 233, in return_value
    raise self._return_value
  File "/usr/local/lib/python3.8/dist-packages/hydra/core/utils.py", line 160, in run_job
    ret.return_value = task_function(task_cfg)
  File "run_train.py", line 378, in main
    torch.multiprocessing.spawn(fn=subprocess_fn, args=(args,), nprocs=args.num_gpus)
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 5 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/root/StyleNeRF/run_train.py", line 302, in subprocess_fn
    training_loop.training_loop(**args)
  File "/root/StyleNeRF/training/training_loop.py", line 221, in training_loop
    module = torch.nn.parallel.DistributedDataParallel(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/distributed.py", line 448, in __init__
    self._ddp_init_helper()
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/distributed.py", line 603, in _ddp_init_helper
    self.reducer = dist.Reducer(
RuntimeError: replicas[0][0] in this process with strides [60, 1, 1, 1] appears not to match strides of the same param in process 0.
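
For reference, the stride mismatch is about parameter memory layout rather than shapes: two tensors with identical shapes can report different strides, for example after a memory-format change. A standalone illustration (not taken from the repo):

import torch

# Same shape, different memory layout => different strides.
w = torch.randn(32, 60, 1, 1)
print(w.shape, w.stride())        # torch.Size([32, 60, 1, 1]) (60, 1, 1, 1)

w_cl = w.to(memory_format=torch.channels_last)
print(w_cl.shape, w_cl.stride())  # same shape, but strides are now (60, 1, 60, 60)

DDP's Reducer compares these layouts across ranks at construction time and raises the error above when they differ.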
srx960212 commented 2 years ago

Could you provide detailed instructions on how to run the code? I also have this problem when training on multiple GPUs. Thank you!

MultiPath commented 2 years ago

As described in the readme, the code should automatically use all available GPUs, so this looks strange. Could you also try PyTorch 1.7.1 if possible?

My config is as written in the readme:

Python 3.7, PyTorch 1.7.1, 8 NVIDIA Tesla V100 (32 GB) GPUs, CUDA 11.0
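
For reference, a quick sanity check of the runtime environment against that config (the expected values in the comments come from the readme; the device count is whatever your machine exposes):

import torch

print(torch.__version__)           # readme config expects 1.7.1
print(torch.version.cuda)          # readme config expects 11.0
print(torch.cuda.device_count())   # should be 8 on the V100 box described above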

miaopass commented 2 years ago

(Quoted mike-athene's original report and stack trace above.)

I had the same problem, and I solved it by commenting out the following code: https://github.com/facebookresearch/StyleNeRF/blob/03d3800500385fffeaa2df09fca649edb001b0bb/training/training_loop.py#L190-L195 I think these lines change some parameters, and that only happens in the model on rank 0.
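
A minimal sketch of that workaround, assuming the referenced lines resemble the StyleGAN2-ADA-style rank-0 summary pass this training loop derives from (the names G, D, rank, device, batch_gpu, and misc.print_module_summary are taken from that context and are assumptions here, not checked against this exact revision):

# Workaround sketch: skip the rank-0-only dry-run forward that appears to leave
# some parameters with different strides than on the other ranks, so that the
# DDP Reducer sees identical layouts on every process.
run_module_summary = False  # set True to restore the original behaviour

if run_module_summary and rank == 0:
    z = torch.empty([batch_gpu, G.z_dim], device=device)   # dummy latent codes
    c = torch.empty([batch_gpu, G.c_dim], device=device)   # dummy conditioning labels
    img = misc.print_module_summary(G, [z, c])              # dry-run forward on rank 0 only
    misc.print_module_summary(D, [img, c])

# DistributedDataParallel is then constructed as before (training_loop.py line 221).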

MultiPath commented 2 years ago

Interesting... it works fine on my machine. That part only prints the module summary.

mike-athene commented 2 years ago

Thank you, miaopass. I can confirm that this solved the issue on my side as well!

P.S. I can confirm that PyTorch LTS (1.8.2) and CUDA 11.1 with Python 3.8 is a working configuration. Also, I think you are actually using 3.8 rather than 3.7, because this import (https://github.com/facebookresearch/StyleNeRF/blob/main/run_train.py#L4) would not work in 3.7.

universome commented 2 years ago

I got the same issue, and commenting out the module-summary printing indeed fixed it somehow. JFYI: training on 4x A100 GPUs on FFHQ 256x256 gives 4.75 sec/img.

MultiPath commented 2 years ago

Please check https://github.com/facebookresearch/StyleNeRF/commit/7f5610a058f27fcc360c6b972181983d7df794cb