Closed · lqzhao closed this issue 2 years ago
Strange, I never had that problem before. What sort of GPUs are you using? What about batch size? Is the imbalance causing an issue?
It could be related to this: https://discuss.pytorch.org/t/dataparallel-imbalanced-memory-usage/22551/9
But then moving the loss computation into the nn.Module doesn't quite make logical sense (since the loss requires multiple images and their relative pose to compute) and also adds overhead.
I use four 2080Ti GPUs with a batch size of 24. It's a non-negligible issue when using a large batch size. Please see this.
I think you can also do something like this in your bash script:
export CUDA_VISIBLE_DEVICES=0,1
if you want to use a smaller batch size; it should then allocate the batch across the second GPU as well.
Thanks, you mean like this? export CUDA_VISIBLE_DEVICES=0,1; bash bash/kitti/train_kbnet_kitti.sh
I tried this but it didn't work...
I separated the loss computation into its own Module and wrapped it in DataParallel. When I train with the same settings, the GPU memory usage is like this: it seems to be slightly alleviated, but not by much.
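For reference, a minimal CPU-runnable sketch of that pattern (the `LossModule` name and the L1 loss here are illustrative placeholders, not this repository's actual loss):

```python
import torch
import torch.nn as nn

class LossModule(nn.Module):
    """Hypothetical wrapper: computing the loss inside a Module lets
    DataParallel scatter the work across replicas instead of doing it
    all on the first GPU."""
    def forward(self, pred, target):
        # Return a per-sample loss; each replica then emits only a
        # small tensor, which is gathered on the master device.
        return torch.abs(pred - target).mean(dim=(1, 2, 3))

loss_fn = nn.DataParallel(LossModule())
pred = torch.randn(8, 3, 64, 64)
target = torch.randn(8, 3, 64, 64)
per_sample = loss_fn(pred, target)  # one loss value per sample
loss = per_sample.mean()            # reduce to a scalar for backward()
```

On a CPU-only machine DataParallel simply falls through to the wrapped module, so the sketch runs anywhere; with multiple GPUs the per-sample losses are computed on each replica before being gathered.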
I did some digging. I think this is the nature of PyTorch, where it replicates anything wrapped in DataParallel across GPUs. But the first GPU is still the "master", so it needs to hold the optimizer state, the parameters, and any operation that is not parallelized.
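A small sketch of what that implies in practice (model and shapes are made up for illustration):

```python
import torch
import torch.nn as nn

# DataParallel replicates the module across visible GPUs on each forward
# pass, but the parameters, the optimizer state, and anything outside
# the module stay on the first (master) device.
model = nn.DataParallel(nn.Linear(16, 4))
optimizer = torch.optim.Adam(model.parameters())  # state follows the params

x = torch.randn(32, 16)
if torch.cuda.is_available():
    model, x = model.cuda(), x.cuda()  # parameters land on cuda:0

out = model(x)      # input scattered to replicas, outputs gathered on device 0
loss = out.mean()   # any op not inside the module runs on the master device
loss.backward()
optimizer.step()
```

This is why moving more computation (like the loss) inside the wrapped module can shift some memory off the master GPU, but the optimizer and parameter copies keep it the heaviest device.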
As for the above, I think you'll need to replace the export statement in the bash file https://github.com/alexklwong/calibrated-backprojection-network/blob/master/bash/kitti/train_kbnet_kitti.sh#L3
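If I read the linked script correctly, the change would be along these lines (exact placement depends on the file):

```shell
# In bash/kitti/train_kbnet_kitti.sh, replace the existing export so
# only the desired GPUs are visible to the process; PyTorch will then
# split each batch across GPUs 0 and 1.
export CUDA_VISIBLE_DEVICES=0,1
```

Setting the variable inside the script takes precedence over whatever was exported in the calling shell, which is why prefixing the outer command alone may not work.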
Thanks for your reply. I also found that DataParallel causes very low training efficiency due to cross-GPU communication, so I use 2 GPUs to balance efficiency against batch size. I can easily reproduce the results of your paper. Thanks again.
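For anyone hitting the same efficiency wall: the usual alternative in PyTorch is DistributedDataParallel, which runs one process per GPU and avoids DataParallel's per-iteration replication and master-GPU bottleneck. A single-process CPU sketch (a real job would launch via torchrun, one process per GPU; addresses and port here are placeholder values):

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# torchrun would normally set these; we fake a world of size 1 on CPU.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = DDP(nn.Linear(16, 4))   # each process owns one full replica
out = model(torch.randn(8, 16))
out.mean().backward()           # gradients are all-reduced across processes
dist.destroy_process_group()
```

Because every process holds its own replica and only gradients are synchronized, memory usage is symmetric across GPUs, unlike the DataParallel behavior discussed above.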
Great, thanks, closing this issue
Hi Alex, thanks for your nice work. I'm facing the problem of uneven GPU memory usage when training the model with multiple GPUs: it costs much more memory on GPU#0 than on the others. I think the main reason is that DataParallel can only compute losses on GPU#0. Could you give some advice on balancing the GPU memory? Thanks in advance.