Closed: mike-athene closed this issue 2 years ago.
Can you provide detailed instructions on how to run the code, please? I also have this problem when training on multiple GPUs. Thank you~
As shown in the README, the code should automatically use all available GPUs (see the sketch below). Your error looks strange. Can you also try PyTorch 1.7.1 if possible?
My config is as written in the README:
Python 3.7, PyTorch 1.7.1, 8 NVIDIA GPUs (Tesla V100 32GB), CUDA 11.0
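For context, the launch pattern is roughly the following sketch: run_train.py spawns one worker process per GPU via torch.multiprocessing.spawn (as the stack trace below also shows). The assumption here is that the GPU count defaults to the number of visible CUDA devices; the `subprocess_fn` stub is only illustrative, not the repo's actual worker.

```python
import torch

def subprocess_fn(rank, args):
    # Illustrative stub: each spawned worker drives one GPU, indexed by rank.
    print(f"worker {rank} using cuda:{rank}")

if __name__ == "__main__":
    # Assumption: "all GPUs available" means one process per visible device.
    num_gpus = torch.cuda.device_count()
    assert num_gpus > 0, "no CUDA devices visible"
    torch.multiprocessing.spawn(fn=subprocess_fn, args=({},), nprocs=num_gpus)
```

To restrict training to a subset of GPUs, you can mask devices before launching, e.g. `CUDA_VISIBLE_DEVICES=0,1 python run_train.py ...`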
Hello StyleNeRF folks, thank you so much for releasing the code!
I am trying to train the model on an 8xA6000 box with no success so far.
python run_train.py outdir=/root/out data=/root/256.zip spec=paper256 model=stylenerf_ffhq
I have validated that a single A6000 GPU does work, and I've also used the provided configs.
I am running Ubuntu 20.04.3 LTS with PyTorch LTS (1.8.2) and CUDA 11.1 (which is necessary for A6000 support, AFAIK).
Here is the stack trace I am getting; let me know if I can provide any additional information:
```
Error executing job with overrides: ['outdir=/root/out', 'data=/root/P256.zip', 'spec=paper256', 'model=stylenerf_ffhq']
Traceback (most recent call last):
  File "run_train.py", line 396, in <module>
    main() # pylint: disable=no-value-for-parameter
  File "/usr/local/lib/python3.8/dist-packages/hydra/main.py", line 49, in decorated_main
    _run_hydra(
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 367, in _run_hydra
    run_and_report(
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 214, in run_and_report
    raise ex
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 211, in run_and_report
    return func()
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 368, in <lambda>
    lambda: hydra.run(
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/hydra.py", line 110, in run
    _ = ret.return_value
  File "/usr/local/lib/python3.8/dist-packages/hydra/core/utils.py", line 233, in return_value
    raise self._return_value
  File "/usr/local/lib/python3.8/dist-packages/hydra/core/utils.py", line 160, in run_job
    ret.return_value = task_function(task_cfg)
  File "run_train.py", line 378, in main
    torch.multiprocessing.spawn(fn=subprocess_fn, args=(args,), nprocs=args.num_gpus)
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 5 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/root/StyleNeRF/run_train.py", line 302, in subprocess_fn
    training_loop.training_loop(**args)
  File "/root/StyleNeRF/training/training_loop.py", line 221, in training_loop
    module = torch.nn.parallel.DistributedDataParallel(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/distributed.py", line 448, in __init__
    self._ddp_init_helper()
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/distributed.py", line 603, in _ddp_init_helper
    self.reducer = dist.Reducer(
RuntimeError: replicas[0][0] in this process with strides [60, 1, 1, 1] appears not to match strides of the same param in process 0.
```
I had the same problem, and I solved it by commenting out the following code. https://github.com/facebookresearch/StyleNeRF/blob/03d3800500385fffeaa2df09fca649edb001b0bb/training/training_loop.py#L190-L195 I think some parameters are changed by this code, and the change only happens in the model on rank 0.
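If you would rather keep the summary instead of deleting it, an alternative workaround might be to force every parameter back into a contiguous memory layout on all ranks just before the DistributedDataParallel wrap, so the strides agree again. This is an untested sketch, and `make_params_contiguous` is a hypothetical helper, not code from the repo:

```python
import torch

def make_params_contiguous(module: torch.nn.Module) -> None:
    # Hypothetical workaround: the RuntimeError says rank 0's parameter
    # strides differ from the other ranks'. Rewriting each parameter as a
    # contiguous tensor should make the layouts identical across ranks
    # before DistributedDataParallel builds its reducer.
    for param in module.parameters():
        if not param.data.is_contiguous():
            param.data = param.data.contiguous()
```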
Interesting... it works fine on my machine. That part only prints the module summary.
Thank you, miaopass! I can confirm that this solved the issue on my side as well.
P.S. I can confirm that PyTorch LTS (1.8.2) with CUDA 11.1 and Python 3.8 is a working configuration. Also, I think you are using 3.8 rather than 3.7, because this import (https://github.com/facebookresearch/StyleNeRF/blob/main/run_train.py#L4) would not work in 3.7.
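In case it helps anyone else, here is a small sanity-check snippet for the environment; the expected values in the comments are simply the configuration that worked for me:

```python
import sys
import torch

print(sys.version)                # expect 3.8.x
print(torch.__version__)          # expect 1.8.2 (LTS)
print(torch.version.cuda)         # expect 11.1
print(torch.cuda.device_count())  # should report all 8 GPUs
```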
I got the same issue. Commenting out the module summary printing indeed fixed it somehow... FYI: training on 4x A100 GPUs on FFHQ 256x256 gives 4.75 sec/img.