Closed: Ailon-Island closed this issue 1 year ago
Have you tried running the code with a batch size of 1 first? I would start there, as this should confirm whether you'll be able to use your GPU setup at all. DistributedDataParallel is for distributed data-parallel training, where the data (batches) are split across GPUs; so if you still get OOM errors with a batch size of 1, your GPU simply doesn't have enough memory for the model, which I believe is the case here. I don't believe this code base supports distributed model-parallel training, so if you can't train with a batch size of 1, you'll need bigger GPUs.
Update/Edit: Please also see https://github.com/ashawkey/stable-dreamfusion/issues/89#issuecomment-1320025579
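A quick way to run the batch-size-1 check above is to execute a single training step and read PyTorch's peak-memory counter. This is a generic sketch (the model and step here are dummy stand-ins, not this repo's actual trainer or CLI):

```python
import torch


def report_peak_memory(step_fn):
    """Run one training step and report peak GPU memory, if CUDA is available."""
    if torch.cuda.is_available():
        torch.cuda.reset_peak_memory_stats()
    step_fn()
    if torch.cuda.is_available():
        peak_gib = torch.cuda.max_memory_allocated() / 1024**3
        print(f"peak GPU memory: {peak_gib:.2f} GiB")
    else:
        print("no CUDA device; nothing to measure")


# Dummy model and a single step with batch size 1, for illustration only.
model = torch.nn.Linear(512, 512)
report_peak_memory(lambda: model(torch.randn(1, 512)).sum().backward())
```

If the peak already approaches the card's 11 GB with batch size 1, a larger batch or data parallelism across identical cards cannot help.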
OK, I was expecting the NeRF component to consume the most memory, but it is actually the diffusion model, so parallelizing rays for the NeRF does not help much. I've also found my own solution: running the guidance model on another GPU makes it runnable on my 2080 Ti setup without compromising any quality.
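The split described above can be sketched as follows. This is a minimal illustration of the idea, not the repo's actual code: the tiny modules stand in for the NeRF and the diffusion guidance model, and the device names are assumptions (falling back to CPU when a second GPU is absent):

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the NeRF and the diffusion guidance model.
nerf = nn.Linear(3, 3)
guidance = nn.Linear(3, 1)

# NeRF on one device, guidance on another; CPU fallback for portability.
nerf_device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
guidance_device = (
    torch.device("cuda:1") if torch.cuda.device_count() > 1 else nerf_device
)
nerf.to(nerf_device)
guidance.to(guidance_device)

# One training step: render on the NeRF device, then move the rendered
# output to the guidance device before computing the guidance loss.
rays = torch.randn(8, 3, device=nerf_device)
rendered = nerf(rays)
loss = guidance(rendered.to(guidance_device)).mean()

# Autograd tracks the cross-device .to(), so gradients flow back to the NeRF.
loss.backward()
```

Because `.to()` is differentiable, no gradient plumbing is needed; each model simply lives on its own card, so neither has to fit alongside the other.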
Amazing work! I wonder how I could run the model in DistributedDataParallel mode, as is (seemingly) supported by nerf\utils\Trainer. I encountered an OOM while training on an RTX 2080 Ti with 11 GB of memory. Everything would work for me if I were able to train the model on multiple cards.