Closed: Ir1d closed this issue 2 years ago
I think I figured out the OOM issue, but still confused by the shape of the data.
Their shapes are torch.Size([100, 20, 3]) and torch.Size([100, 40, 3]), respectively.
len(var["idx"]) is 50 in https://github.com/chenhsuanlin/bundle-adjusting-NeRF/blob/main/model/nerf.py#L201 , while it is 100 in https://github.com/chenhsuanlin/bundle-adjusting-NeRF/blob/main/model/nerf.py#L217
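For reference, this per-replica vs. full-batch difference matches how nn.DataParallel scatters inputs along dim 0 and gathers outputs back afterwards. A minimal sketch with a hypothetical module and shapes (assuming two visible GPUs), not the actual BARF graph:

```python
import torch
import torch.nn as nn

class ShapeProbe(nn.Module):
    """Toy stand-in for the NeRF graph: reports the per-replica batch it receives."""
    def forward(self, idx, rays):
        # DataParallel scatters dim 0 across GPUs before calling forward, so with
        # 2 GPUs a batch of 100 arrives here as 50 per replica.
        print(f"inside forward: idx={idx.shape[0]}, rays={tuple(rays.shape)}")
        return rays * 2.0

# Hypothetical tensors mirroring the shapes reported above.
idx = torch.arange(100)
rays = torch.randn(100, 40, 3)

model = nn.DataParallel(ShapeProbe()).cuda()   # assumes at least 2 visible GPUs
out = model(idx.cuda(), rays.cuda())
# Outputs are gathered back along dim 0 onto the default device, so code running
# after the wrapped forward sees the full batch of 100 again.
print("after gather:", tuple(out.shape))       # (100, 40, 3)
```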
Hi @Ir1d, the codebase doesn't support multi-GPU training, and I don't have much experience with (Distributed)DataParallel. It also seems that you've made significant changes in your fork, so unfortunately it will be hard for me to identify the issue.
Thank you! No worries.
Hi @Ir1d did you end up solving your problem and having a functioning version for multi GPU training? Thanks!
@andrearama, you can try my fork, but I don't think it works perfectly.
Hi @chenhsuanlin
Thank you for sharing this nice work. I'm just curious whether you happen to have multi-GPU training code on hand? I was trying to train BARF with multiple GPUs but got stuck on a weird OOM issue: GPU memory usage explodes to over 50 GB, while your original codebase takes less than 10 GB on blender/lego.
Here's the edit I made: https://github.com/Ir1d/bundle-adjusting-NeRF/commit/904228c3a243e939d96e5595f7073779f95b997a
The command to run:
CUDA_VISIBLE_DEVICES=0,1 python train.py --group=blender --model=barf --yaml=barf_blender --name=lego_baseline --data.scene=lego --gpu=1 --visdom! --batch_size=2
Do you know what might be leading to the OOM here? Thank you!
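In case it helps, the kind of wrapping involved is roughly the following (a simplified, hypothetical sketch with placeholder names, not the actual commit or the real BARF classes):

```python
import torch
import torch.nn as nn

class Renderer(nn.Module):
    """Placeholder for the per-ray rendering MLP (not the actual BARF graph)."""
    def __init__(self, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, hidden), nn.ReLU(), nn.Linear(hidden, 4))

    def forward(self, rays):          # rays: [B, N, 3]
        return self.mlp(rays)         # [B, N, 4], e.g. RGB + density

renderer = Renderer().cuda()
if torch.cuda.device_count() > 1:
    # Replicates the module on each GPU and scatters dim 0 of the inputs;
    # outputs are gathered back onto the default device (cuda:0), so that GPU
    # carries the gathered batch and the loss on top of its own replica.
    renderer = nn.DataParallel(renderer)

rays = torch.randn(2, 1024, 3, device="cuda")   # batch_size=2 images, 1024 rays each
out = renderer(rays)
print(tuple(out.shape))                          # (2, 1024, 4)
```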