Closed by szh-bash 1 year ago
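I finally figured it out. The change is in nerf-pytorch/nerf/train_utils.py:
network_fn(batch, expression, latent_code) -> network_fn(embedded, expression.repeat(gpu_num), latent_code.repeat(gpu_num))
nn.DataParallel splits every tensor argument along dim 0 across the GPUs, so the expression and latent code have to be repeated once per GPU for each replica to receive a full copy.
Now I get an 8x speedup on 8x 1080 Ti; training to 400k iterations takes only 12 hours.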
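Here is a minimal sketch of why the repeat helps. It assumes nn.DataParallel scatters each tensor argument into equal chunks along dim 0, and approximates that on CPU with torch.chunk so it runs without any GPU. The helper scatter_like_data_parallel and all sizes except the latent size of 32 (taken from the log) are hypothetical and chosen only for illustration.

```python
import torch

# Hypothetical sizes for illustration; only LATENT_DIM = 32 appears in the log.
NUM_GPUS = 8
RAYS_PER_BATCH = 2048
EMBED_DIM = 63
EXPR_DIM = 76
LATENT_DIM = 32


def scatter_like_data_parallel(tensor, num_replicas):
    """Approximate DataParallel's scatter on CPU: split along dim 0 into chunks."""
    return torch.chunk(tensor, num_replicas, dim=0)


embedded = torch.randn(RAYS_PER_BATCH, EMBED_DIM)  # one row per sample point
expression = torch.randn(EXPR_DIM)                 # 1-D, shared by the whole batch
latent_code = torch.randn(LATENT_DIM)              # 1-D, shared by the whole batch

# Without the fix: the shared 1-D vectors are cut into pieces, so each
# replica only sees a fragment of the expression and latent code.
broken_expr = scatter_like_data_parallel(expression, NUM_GPUS)
broken_latent = scatter_like_data_parallel(latent_code, NUM_GPUS)
print("broken expression chunk sizes:", [c.shape[0] for c in broken_expr])
print("broken latent chunk sizes:", [c.shape[0] for c in broken_latent])

# With the fix: tile each vector NUM_GPUS times first, so every chunk is
# a complete copy of the original vector.
fixed_expr = scatter_like_data_parallel(expression.repeat(NUM_GPUS), NUM_GPUS)
fixed_latent = scatter_like_data_parallel(latent_code.repeat(NUM_GPUS), NUM_GPUS)
print("fixed expression chunk sizes:", [c.shape[0] for c in fixed_expr])
print("fixed latent chunk sizes:", [c.shape[0] for c in fixed_latent])

# The ray batch itself is supposed to be split: each replica gets its share.
embedded_chunks = scatter_like_data_parallel(embedded, NUM_GPUS)
print("rows per replica:", [c.shape[0] for c in embedded_chunks])
```

If the model builds its per-point input by appending the expression and latent code to each embedded point (an assumption about models.py), a fragment instead of the full vector gives the first linear layer the wrong input width, which would match the "mat1 dim 1 must match mat2 dim 0" error.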
It doesn't seem to have been designed for training on multiple GPUs. When I simply add nn.DataParallel in train_transformed_rays.py, the following error appears:
Done with data loading
done loading data
Available GPUs: 8
loading GT background to condition on
bg shape torch.Size([512, 512, 3])
should be torch.Size([512, 512, 3])
initialized latent codes with shape 100 X 32
computing boundix boxes probability maps
Starting loop
0%| | 0/1000000 [00:26<?, ?it/s]
Traceback (most recent call last):
File "train_transformed_rays.py", line 613, in <module>
main()
File "train_transformed_rays.py", line 347, in main
rgb_coarse, _, _, rgb_fine, _, _, weights = run_one_iter_of_nerf(
File "/data/shenzhonghai/4D-Facial-Avatars-main/nerface_code/nerf-pytorch/nerf/train_utils.py", line 268, in run_one_iter_of_nerf
pred = [
File "/data/shenzhonghai/4D-Facial-Avatars-main/nerface_code/nerf-pytorch/nerf/train_utils.py", line 269, in <listcomp>
predict_and_render_radiance(
File "/data/shenzhonghai/4D-Facial-Avatars-main/nerface_code/nerf-pytorch/nerf/train_utils.py", line 100, in predict_and_render_radiance
radiance_field = run_network(
File "/data/shenzhonghai/4D-Facial-Avatars-main/nerface_code/nerf-pytorch/nerf/train_utils.py", line 36, in run_network
pred = network_fn(batch, expressions, latent_code)
File "/data/shenzhonghai/anaconda3/envs/4DFA/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/data/shenzhonghai/anaconda3/envs/4DFA/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 161, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/data/shenzhonghai/anaconda3/envs/4DFA/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 171, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/data/shenzhonghai/anaconda3/envs/4DFA/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
output.reraise()
File "/data/shenzhonghai/anaconda3/envs/4DFA/lib/python3.8/site-packages/torch/_utils.py", line 428, in reraise
raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
File "/data/shenzhonghai/anaconda3/envs/4DFA/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
output = module(*input, **kwargs)
File "/data/shenzhonghai/anaconda3/envs/4DFA/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/data/shenzhonghai/4D-Facial-Avatars-main/nerface_code/nerf-pytorch/nerf/models.py", line 248, in forward
x = self.layers_xyz[i](x)
File "/data/shenzhonghai/anaconda3/envs/4DFA/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/data/shenzhonghai/anaconda3/envs/4DFA/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 93, in forward
return F.linear(input, self.weight, self.bias)
File "/data/shenzhonghai/anaconda3/envs/4DFA/lib/python3.8/site-packages/torch/nn/functional.py", line 1690, in linear
ret = torch.addmm(bias, input, weight.t())
RuntimeError: mat1 dim 1 must match mat2 dim 0