ashawkey / stable-dreamfusion

Text-to-3D & Image-to-3D & Mesh Exportation with NeRF + Diffusion.
Apache License 2.0

Multi-card/multi-gpu training. #98

Closed Ailon-Island closed 1 year ago

Ailon-Island commented 1 year ago

Amazing work! I wonder how I could run the model in DistributedDataParallel mode, which is (seemingly) supported by the `Trainer` in `nerf/utils.py`. I hit an OOM while training on an RTX 2080 Ti with 11 GB of memory. Everything would work for me if I could train the model across multiple cards.

MathieuTuli commented 1 year ago

Have you tried running the code with a batch size of 1 first? I would start there, since that confirms whether your GPU setup can run the model at all. DistributedDataParallel is for distributed training where the data (batches) is parallelized across GPUs, so if you still get OOM at a batch size of 1, your GPU simply doesn't have enough memory for the model, which I believe is the case here. I don't believe this code base supports distributed *model* training, so if a single batch doesn't fit, you'll need bigger GPUs.

Update/Edit: Please also see https://github.com/ashawkey/stable-dreamfusion/issues/89#issuecomment-1320025579
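As a back-of-envelope illustration of why adding cards doesn't help here: DDP-style data parallelism replicates the *entire* model on every GPU, so per-GPU weight memory is unchanged no matter how many cards you add. A rough sketch (the parameter count is an approximate public figure for the Stable Diffusion v1 UNet, not measured from this repo):

```python
def per_gpu_weight_mem_gb(n_params, bytes_per_param=2, n_gpus=1):
    """Weight memory per GPU under data parallelism (DDP-style replication).

    Each GPU holds a full replica of the model, so n_gpus does not
    appear in the result -- extra cards only split the batch.
    """
    return n_params * bytes_per_param / 1024**3

sd_unet = 860e6  # Stable Diffusion v1 UNet, ~860M params (approximate)
print(f"{per_gpu_weight_mem_gb(sd_unet):.2f} GB per GPU for fp16 weights")
```

And that is weights alone; activations, optimizer state for the NeRF, and CUDA overhead come on top, which is how an 11 GB card runs out of memory even at batch size 1.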

Ailon-Island commented 1 year ago

OK, I was expecting the NeRF component to eat the most memory, but it is actually the diffusion model, so parallelizing rays across GPUs for the NeRF would not help much. I've also found my own solution: running the guidance model on a second GPU instead makes training fit, without compromising any quality, on my 2080 Ti setup.
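A minimal sketch of that workaround: keep the NeRF on one device and the guidance model on another, moving the rendered output between them before computing the loss. `TinyNeRF`-style stand-in modules are used below (they are not this repo's classes), and the code falls back to CPU when two GPUs are not available:

```python
import torch
import torch.nn as nn

# Pick devices: NeRF on the first GPU, guidance on the second if present.
nerf_device = "cuda:0" if torch.cuda.device_count() >= 1 else "cpu"
guidance_device = "cuda:1" if torch.cuda.device_count() >= 2 else nerf_device

nerf = nn.Linear(3, 3).to(nerf_device)          # stand-in for the NeRF
guidance = nn.Linear(3, 1).to(guidance_device)  # stand-in for SD guidance

x = torch.randn(4, 3, device=nerf_device)
rendered = nerf(x)

# Move the rendered output to the guidance device before scoring it;
# autograd tracks the .to() transfer, so gradients flow back across devices.
loss = guidance(rendered.to(guidance_device)).sum()
loss.backward()

print(nerf.weight.grad is not None)  # NeRF still receives gradients
```

The key point is that only the rendered image (and its gradient) crosses the PCIe bus each step, which is cheap compared with holding both models on one 11 GB card.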