alexklwong / calibrated-backprojection-network

PyTorch Implementation of Unsupervised Depth Completion with Calibrated Backprojection Layers (ORAL, ICCV 2021)

Uneven GPU memory caused by multi-GPU training #5

Closed lqzhao closed 2 years ago

lqzhao commented 2 years ago

Hi Alex, thanks for your nice work. I'm facing a problem of uneven GPU memory usage when training the model with multiple GPUs: much more memory is used on GPU#0 than on the others. I think the main reason is that DataParallel only computes the losses on GPU#0. Could you give some advice on how to balance the GPU memory? Thanks in advance.
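
For reference, the pattern I mean can be reproduced with a small standalone script like the one below (illustrative only, not the code from this repo; the tiny model, tensor shapes, and names are made up for the example):

```python
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 1, kernel_size=3, padding=1)

    def forward(self, image):
        return self.conv(image)

if torch.cuda.device_count() > 1:
    # replicas of the model live on every visible GPU
    model = nn.DataParallel(TinyNet().cuda())
    optimizer = torch.optim.Adam(model.parameters())

    # the batch is scattered across GPUs for the forward pass ...
    image = torch.rand(24, 3, 320, 768, device='cuda:0')
    target = torch.rand(24, 1, 320, 768, device='cuda:0')
    output = model(image)

    # ... but the outputs are gathered back onto cuda:0, so the loss and its
    # autograd graph (plus parameters and optimizer state) all sit on GPU#0
    loss = nn.functional.l1_loss(output, target)
    loss.backward()
    optimizer.step()

    for i in range(torch.cuda.device_count()):
        print('GPU', i, torch.cuda.memory_allocated(i), 'bytes allocated')
```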

alexklwong commented 2 years ago

Strange, I never had that problem before. What sort of GPUs are you using? What about batch size? Is the imbalance causing an issue?

It could be related to this: https://discuss.pytorch.org/t/dataparallel-imbalanced-memory-usage/22551/9

But moving the loss computation into the nn.Module doesn't quite make logical sense (since the loss requires multiple images and their relative pose to compute) and would also add overhead.

lqzhao commented 2 years ago

I use four 2080 Ti GPUs with a batch size of 24. The imbalance is non-negligible when using a large batch size. Please see the screenshot. (screenshot of per-GPU memory usage)

alexklwong commented 2 years ago

If you want to use a smaller batch size, I think you can also do something like this in your bash:

export CUDA_VISIBLE_DEVICES=0,1

and it should allocate the batches onto the second GPU as well.

lqzhao commented 2 years ago

Thanks, you mean like this?

export CUDA_VISIBLE_DEVICES=0,1; bash bash/kitti/train_kbnet_kitti.sh

I tried this, but it didn't work...

lqzhao commented 2 years ago

I separated the loss computation into its own nn.Module and wrapped it in DataParallel. When I train with the same settings, the GPU memory usage looks like this: (screenshot of per-GPU memory usage) The imbalance seems slightly alleviated, but not by much.
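
Roughly, what I did looks like the sketch below (simplified; I'm using an L1 stand-in here, while the real unsupervised loss also needs the neighbouring images and relative poses):

```python
import torch
import torch.nn as nn

class LossModule(nn.Module):
    """Computes the loss inside forward() so DataParallel can scatter it."""
    def forward(self, output_depth, target_depth):
        loss = nn.functional.l1_loss(output_depth, target_depth)
        return loss.unsqueeze(0)   # shape (1,) so per-GPU losses can be gathered

if torch.cuda.device_count() > 1:
    loss_fn = nn.DataParallel(LossModule().cuda())

    # stand-ins for the network prediction and its target
    output_depth = torch.rand(24, 1, 320, 768, device='cuda:0', requires_grad=True)
    target_depth = torch.rand(24, 1, 320, 768, device='cuda:0')

    # each GPU computes the loss on its shard; the scalars are gathered on cuda:0
    loss = loss_fn(output_depth, target_depth).mean()
    loss.backward()
```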

alexklwong commented 2 years ago

I did some digging. I think this is the nature of PyTorch: DataParallel replicates whatever it wraps across the GPUs, but the first GPU is still the "master", so it also has to hold the optimizer state, the parameters, and any operation that is not parallelized.

alexklwong commented 2 years ago

> Thanks, you mean like this? export CUDA_VISIBLE_DEVICES=0,1; bash bash/kitti/train_kbnet_kitti.sh I tried this, but it didn't work...

As for the above, I think you'll need to replace the export statement in the bash file https://github.com/alexklwong/calibrated-backprojection-network/blob/master/bash/kitti/train_kbnet_kitti.sh#L3

lqzhao commented 2 years ago

> I did some digging. I think this is the nature of PyTorch: DataParallel replicates whatever it wraps across the GPUs, but the first GPU is still the "master", so it also has to hold the optimizer state, the parameters, and any operation that is not parallelized.

Thanks for your reply. I also found that DataParallel causes very low training efficiency due to cross-GPU communication. So I use 2 GPUs to balance efficiency and batch size, and I can easily reproduce the results of your paper. Thanks again.

alexklwong commented 2 years ago

Great, thanks. Closing this issue.