Multiple GPU training support

sleeplessai commented 3 years ago

Does this model support multi-GPU training? If so, how do I enable it in the code?

frspring commented 3 years ago

I found commands related to multi-GPU training in train_mvs_nerf_pl.py, but they did not work.

zhuangdf commented 2 years ago

@sleeplessai @frspring It seems that enabling multi-GPU training such as DDP requires a few modifications:

1. Rename the prepare_data() method in class MVSSystem() to setup(). prepare_data() is only executed on rank 0, so the other ranks may fail with errors like "self object does not have attribute val_dataset".
2. Replace every .to(device) and .cuda() call with .type_as() so that the system scales to any number of GPUs.
3. PyTorch Lightning automatically moves any LightningModule attribute that has a .to() method to the target GPU. If you wrap everything in a dictionary, this does not happen, because a dictionary has no .to() method. For the automatic move to work, you may need to take the models out of the dictionary and assign them to separate attributes, for example:

    self.MVSNet = self.render_kwargs_train['network_mvs']
    self.network = self.render_kwargs_train['network_fn']
    self.network_fine = self.render_kwargs_train['network_fine']
    self.render_kwargs_train.pop('network_mvs')
    self.render_kwargs_train.pop('network_fn')
    self.render_kwargs_train.pop('network_fine')

   Then change the other relevant parts of the code accordingly.

After these three steps, train_mvs_nerf_pl.py works on my server (with 5 TITAN RTXs). Please refer to the following links for details:

https://pytorch-lightning.readthedocs.io/en/latest/advanced/multi_gpu.html
https://github.com/PyTorchLightning/pytorch-lightning/issues/2515

Hope this helps.
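The third step could be sketched roughly as below. This is only an illustration, not code from the mvsnerf repository: the helper name promote_networks and the stand-in values are hypothetical, while the dictionary keys and attribute names are the ones mentioned in the comment above. In the real MVSSystem, owner would be self and the values would be torch.nn.Module instances, which PyTorch registers as submodules when they are assigned as attributes.

```python
from types import SimpleNamespace

# Dictionary key -> attribute name, as suggested in the comment above.
NETWORK_KEYS = {
    "network_mvs": "MVSNet",
    "network_fn": "network",
    "network_fine": "network_fine",
}


def promote_networks(owner, render_kwargs, key_to_attr=NETWORK_KEYS):
    """Pop the listed networks out of render_kwargs and set them on owner.

    Hypothetical helper. Keys missing from the dictionary are skipped.
    Returns the attribute names that were set, in mapping order.
    """
    promoted = []
    for key, attr in key_to_attr.items():
        if key in render_kwargs:
            # pop() removes the entry, so the dictionary no longer holds it
            setattr(owner, attr, render_kwargs.pop(key))
            promoted.append(attr)
    return promoted


# Stand-in objects so the sketch runs without PyTorch installed.
system = SimpleNamespace()
render_kwargs_train = {
    "network_mvs": "mvs-net-placeholder",
    "network_fn": "coarse-net-placeholder",
    "network_fine": "fine-net-placeholder",
    "N_samples": 128,  # hypothetical non-network entry that stays in the dict
}

promoted = promote_networks(system, render_kwargs_train)
print(promoted)
print(sorted(render_kwargs_train))
```

Code that previously read the networks from render_kwargs_train would then need to take them from the attributes instead, which is the "change the other relevant parts accordingly" part of the step.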

c1enyang commented 2 years ago

@zhuangdf Great! Could you please share the multi-GPU mvsnerf code?

yifanjiang19 commented 1 year ago

@zhuangdf Hi, thanks for sharing this! May I ask why you need to call self.render_kwargs_train.pop('network_fn')? Thanks!

apchenstu / mvsnerf