gafniguy / 4D-Facial-Avatars

Dynamic Neural Radiance Fields for Monocular 4D Facial Avatar Reconstruction
670 stars 67 forks

Could this project be trained on multi-gpus? #56

szh-bash closed this issue 1 year ago

szh-bash commented 1 year ago

The project doesn't seem to have been designed for training on multiple GPUs. When I simply wrap the models with nn.DataParallel in train_transformed_rays.py, as below, training fails with an error:

  model_coarse = nn.DataParallel(model_coarse)
  model_coarse.to(device)
  ...
  model_fine = nn.DataParallel(model_fine)
  model_fine.to(device)

  Done with data loading
  done loading data
  Available GPUs: 8
  loading GT background to condition on
  bg shape torch.Size([512, 512, 3]) should be torch.Size([512, 512, 3])
  initialized latent codes with shape 100 X 32
  computing boundix boxes probability maps
  Starting loop
    0%|          | 0/1000000 [00:26<?, ?it/s]
  Traceback (most recent call last):
    File "train_transformed_rays.py", line 613, in <module>
      main()
    File "train_transformed_rays.py", line 347, in main
      rgb_coarse, _, _, rgb_fine, _, _, weights = run_one_iter_of_nerf(
    File "/data/shenzhonghai/4D-Facial-Avatars-main/nerface_code/nerf-pytorch/nerf/train_utils.py", line 268, in run_one_iter_of_nerf
      pred = [
    File "/data/shenzhonghai/4D-Facial-Avatars-main/nerface_code/nerf-pytorch/nerf/train_utils.py", line 269, in <listcomp>
      predict_and_render_radiance(
    File "/data/shenzhonghai/4D-Facial-Avatars-main/nerface_code/nerf-pytorch/nerf/train_utils.py", line 100, in predict_and_render_radiance
      radiance_field = run_network(
    File "/data/shenzhonghai/4D-Facial-Avatars-main/nerface_code/nerf-pytorch/nerf/train_utils.py", line 36, in run_network
      pred = network_fn(batch, expressions, latent_code)
    File "/data/shenzhonghai/anaconda3/envs/4DFA/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
      result = self.forward(*input, **kwargs)
    File "/data/shenzhonghai/anaconda3/envs/4DFA/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 161, in forward
      outputs = self.parallel_apply(replicas, inputs, kwargs)
    File "/data/shenzhonghai/anaconda3/envs/4DFA/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 171, in parallel_apply
      return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
    File "/data/shenzhonghai/anaconda3/envs/4DFA/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
      output.reraise()
    File "/data/shenzhonghai/anaconda3/envs/4DFA/lib/python3.8/site-packages/torch/_utils.py", line 428, in reraise
      raise self.exc_type(msg)
  RuntimeError: Caught RuntimeError in replica 0 on device 0.
  Original Traceback (most recent call last):
    File "/data/shenzhonghai/anaconda3/envs/4DFA/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
      output = module(*input, **kwargs)
    File "/data/shenzhonghai/anaconda3/envs/4DFA/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
      result = self.forward(*input, **kwargs)
    File "/data/shenzhonghai/4D-Facial-Avatars-main/nerface_code/nerf-pytorch/nerf/models.py", line 248, in forward
      x = self.layers_xyz[i](x)
    File "/data/shenzhonghai/anaconda3/envs/4DFA/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
      result = self.forward(*input, **kwargs)
    File "/data/shenzhonghai/anaconda3/envs/4DFA/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 93, in forward
      return F.linear(input, self.weight, self.bias)
    File "/data/shenzhonghai/anaconda3/envs/4DFA/lib/python3.8/site-packages/torch/nn/functional.py", line 1690, in linear
      ret = torch.addmm(bias, input, weight.t())
  RuntimeError: mat1 dim 1 must match mat2 dim 0

szh-bash commented 1 year ago

I finally figured it out. The change is in:

nerf-pytorch/nerf/train_utils.py
  network_fn(batch, expression, latent_code)  ->  network_fn(embedded, expression.repeat(gpu_num), latent_code.repeat(gpu_num))

Now I get an 8x speedup on 8x 1080 Ti; it takes only 12 hours of training to reach 400k iterations.
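
For readers wondering why the `.repeat(gpu_num)` change helps: `nn.DataParallel` splits every tensor argument along dimension 0 before handing it to the replicas, so a single expression or latent vector shared by the whole batch gets cut into pieces, which would explain the shape mismatch in the `Linear` layer. The sketch below approximates that splitting with `torch.chunk` on the CPU. It is an illustration under assumptions, not the repository's code: `scatter_dim0` and all sizes except the latent width of 32 (taken from the log above) are hypothetical, and the expression and latent code are assumed to be 1-D tensors.

```python
import torch

# Hypothetical sizes for illustration; only LATENT_DIM = 32 appears in the log above.
NUM_GPUS = 8
NUM_POINTS = 2048
POINT_DIM = 63
EXPR_DIM = 76
LATENT_DIM = 32


def scatter_dim0(tensor, num_replicas):
    """Approximate how nn.DataParallel splits a tensor argument: chunk along dim 0."""
    return torch.chunk(tensor, num_replicas, dim=0)


batch = torch.randn(NUM_POINTS, POINT_DIM)  # batched inputs, safe to split by rows
expression = torch.randn(EXPR_DIM)          # one vector shared by the whole batch
latent_code = torch.randn(LATENT_DIM)       # one vector shared by the whole batch

# Without the fix, each replica sees only a slice of the shared vector.
naive_chunks = scatter_dim0(expression, NUM_GPUS)
print("naive slice per replica:", tuple(naive_chunks[0].shape))

# With the fix, tiling by the GPU count makes every chunk a complete copy.
expr_chunks = scatter_dim0(expression.repeat(NUM_GPUS), NUM_GPUS)
latent_chunks = scatter_dim0(latent_code.repeat(NUM_GPUS), NUM_GPUS)
batch_chunks = scatter_dim0(batch, NUM_GPUS)

print("expression per replica:", tuple(expr_chunks[0].shape))
print("latent code per replica:", tuple(latent_chunks[0].shape))
print("batch rows per replica:", batch_chunks[0].shape[0])
```

Note that this trick relies on the number of replicas actually used matching `gpu_num`; if the batch is too small to be split across all GPUs, the tiled vectors would no longer line up.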