Closed: Ir1d closed this issue 2 years ago
I think I figured out the OOM issue, but still confused by the shape of the data.
Their shapes are torch.Size([100, 20, 3]) and torch.Size([100, 40, 3]), respectively.
len(var["idx"]) is 50 in https://github.com/chenhsuanlin/bundle-adjusting-NeRF/blob/main/model/nerf.py#L201 , while it is 100 in https://github.com/chenhsuanlin/bundle-adjusting-NeRF/blob/main/model/nerf.py#L217
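For reference, this per-replica vs. full-batch difference matches how nn.DataParallel scatters inputs along dim 0 and gathers outputs back afterwards. A minimal sketch with a hypothetical module and shapes (assuming two visible GPUs), not the actual BARF graph:

```python
import torch
import torch.nn as nn

class ShapeProbe(nn.Module):
    """Toy stand-in for the NeRF graph: reports the per-replica batch it receives."""
    def forward(self, idx, rays):
        # DataParallel scatters dim 0 across GPUs before calling forward, so with
        # 2 GPUs a batch of 100 arrives here as 50 per replica.
        print(f"inside forward: idx={idx.shape[0]}, rays={tuple(rays.shape)}")
        return rays * 2.0

# Hypothetical tensors mirroring the shapes reported above.
idx = torch.arange(100)
rays = torch.randn(100, 40, 3)

model = nn.DataParallel(ShapeProbe()).cuda()   # assumes at least 2 visible GPUs
out = model(idx.cuda(), rays.cuda())
# Outputs are gathered back along dim 0 onto the default device, so code running
# after the wrapped forward sees the full batch of 100 again.
print("after gather:", tuple(out.shape))       # (100, 40, 3)
```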
Hi @Ir1d, the codebase doesn't support multi-GPU training, and I don't have much experience with (Distributed)DataParallel. It also seems that you've made significant changes in your fork, so unfortunately it will be hard for me to identify the issue.
Thank you! No worries.
Hi @Ir1d did you end up solving your problem and having a functioning version for multi GPU training? Thanks!
@andrearama, you can try my fork, but I don't think it works perfectly.
Hi @chenhsuanlin
Thank you for sharing this nice work. I'm just curious whether you happen to have multi-GPU training code on hand? I was trying to train BARF with multiple GPUs but got stuck on a weird OOM issue: GPU memory usage explodes to over 50 GB, while your original codebase takes less than 10 GB on blender/lego.
Here's the edit I made: https://github.com/Ir1d/bundle-adjusting-NeRF/commit/904228c3a243e939d96e5595f7073779f95b997a
The command to run:
CUDA_VISIBLE_DEVICES=0,1 python train.py --group=blender --model=barf --yaml=barf_blender --name=lego_baseline --data.scene=lego --gpu=1 --visdom! --batch_size=2
Do you know what might be leading to the OOM here? Thank you!
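In case it helps, the kind of wrapping involved is roughly the following (a simplified, hypothetical sketch with placeholder names, not the actual commit or the real BARF classes):

```python
import torch
import torch.nn as nn

class Renderer(nn.Module):
    """Placeholder for the per-ray rendering MLP (not the actual BARF graph)."""
    def __init__(self, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, hidden), nn.ReLU(), nn.Linear(hidden, 4))

    def forward(self, rays):          # rays: [B, N, 3]
        return self.mlp(rays)         # [B, N, 4], e.g. RGB + density

renderer = Renderer().cuda()
if torch.cuda.device_count() > 1:
    # Replicates the module on each GPU and scatters dim 0 of the inputs;
    # outputs are gathered back onto the default device (cuda:0), so that GPU
    # carries the gathered batch and the loss on top of its own replica.
    renderer = nn.DataParallel(renderer)

rays = torch.randn(2, 1024, 3, device="cuda")   # batch_size=2 images, 1024 rays each
out = renderer(rays)
print(tuple(out.shape))                          # (2, 1024, 4)
```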