NVlabs / nvdiffrecmc

Official code for the NeurIPS 2022 paper "Shape, Light, and Material Decomposition from Images using Monte Carlo Rendering and Denoising".

Support for multiple GPUs #19

Open Selozhd opened 1 year ago

Selozhd commented 1 year ago

I am planning to run the model on multiple GPUs. However, looking at the way optimize_mesh() is written, it is not immediately clear how to implement it. In nvdiffrec, there used to be multi-GPU support implemented through a Trainer class. Is there any particular reason why you removed it?

jmunkberg commented 1 year ago

Hello,

We removed it from the public repo to make the code a bit easier to read and support. You can likely do something similar to the nvdiffrec mGPU setup.

Selozhd commented 1 year ago

Hello again,

I have a working first implementation, but I am running into some problems with GPU memory. For example, I still get CUDA out-of-memory errors when I increase the batch size by one, despite effectively having 4x the memory. This leads me to suspect that some state is shared between the GPUs. Maybe you have some insights to help me here?

I have also noticed that the DMTetGeometry and DLMesh implementations differ between nvdiffrec and nvdiffrecmc, even though they are algorithmically very similar; for example, they no longer inherit from torch's nn.Module. Is there a specific reason for this?

iraj465 commented 1 year ago

Hey, were you able to get the multi-GPU setup working?

Selozhd commented 1 year ago

Yeah, partially. I had to do a few hacky things in the data processing to get it to work, but in the end I could process a batch across multiple GPUs.

iraj465 commented 1 year ago

I tried DistributedDataParallel and partially porting the code to PyTorch Lightning, but I am getting a segfault even at pretty low resolutions and batch sizes. How did you resolve it? It would be nice to discuss further.

Selozhd commented 1 year ago

> I tried DistributedDataParallel and partially porting the code to PyTorch Lightning, but I am getting a segfault even at pretty low resolutions and batch sizes. How did you resolve it? It would be nice to discuss further.

I never got a segfault. How are you trying to implement the parallelism? I think you can only expect to divide the batches across the GPUs. Here is briefly what I have done:

VLadImirluren commented 1 year ago

@Selozhd Thanks! Could you share the code for reference? Best wishes!