The GPU memory usage seems really high. Last night I set the batch size to 24, also with three GPUs, and the memory usage was only about 10 GB.
@hustzlb The GPU usage during training looks normal to me. The images are my own, with a resolution of 2500x1800, and I resize them to 1080x800 for training. But the GPU usage during evaluation seems abnormal, and I don't know where the problem is. With the same settings and a batch size of 6, Mask R-CNN never exceeds the memory during evaluation, so I am quite confused. May I ask whether the 10 GB you mentioned is the total across the three cards or per card? If convenient, I would also like to know your dataset's image size and your GPU usage during training and evaluation. Thanks!
My GPU is a 2080Ti, which should have 11 GB of memory. My images are 2048x2048, and my dataset has fewer than 200 images. I was just experimenting to see how this model runs, and I haven't really figured it out yet.
After training finishes, there is also a warning like this. Does anyone know what it means?
Hi all, if you encounter out-of-memory errors, please first disable MODEL.CONDINST.MAX_PROPOSALS by adding MODEL.CONDINST.MAX_PROPOSALS -1 to your command line, and then use MODEL.CONDINST.TOPK_PROPOSALS_PER_IM by adding MODEL.CONDINST.TOPK_PROPOSALS_PER_IM 64 to the command line. For example:

OMP_NUM_THREADS=1 python tools/train_net.py \
    --config-file configs/CondInst/MS_R_50_1x.yaml \
    --num-gpus 8 \
    OUTPUT_DIR training_dir/CondInst_MS_R_50_1x \
    MODEL.CONDINST.MAX_PROPOSALS -1 \
    MODEL.CONDINST.TOPK_PROPOSALS_PER_IM 64

If you still have the errors, please reduce MODEL.CONDINST.TOPK_PROPOSALS_PER_IM.
Thank you very much for your detailed solutions! @tianzhi0549 Unfortunately, even when MODEL.CONDINST.TOPK_PROPOSALS_PER_IM is set to 8, the CUDA OOM still appears as soon as the model is evaluated. I notice that every time this error appears, the last two lines of the traceback are the same. Reading the code, I found that the function at line 324 of https://github.com/aim-uofa/AdelaiDet/blob/master/adet/modeling/condinst/condinst.py is similar to, but different from, the one in detectron2 at https://github.com/facebookresearch/detectron2/blob/master/detectron2/modeling/postprocessing.py. The original resolution of my test images is 4000x3000, resized to 1080x800. I suspect this function and the resolution may be the reason the CUDA OOM appears, but I am not sure. I would appreciate a brief explanation of the difference between the two functions, and whether I am thinking in the right direction.
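(For scale, a back-of-envelope estimate of the memory needed just to hold float32 masks upsampled to the original 4000x3000 resolution; the instance count below is an illustrative assumption, not taken from the model output.)

num_masks = 100                              # assumed number of kept instances
h, w = 3000, 4000                            # original image resolution from the post
bytes_total = num_masks * h * w * 4          # 4 bytes per float32 element
print(f"{bytes_total / 1024**3:.1f} GiB")    # ~4.5 GiB for the upsampled masks alone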
@jiafeier Yes, you are right. I suggest you move this operation to CPU instead, or you can use the function retry_if_cuda_oom in Detectron2.
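A minimal sketch of what that wrapping could look like, assuming the memory-hungry step is a bilinear mask resize; the resize_masks helper and its tensor shapes below are illustrative stand-ins for the interpolation in condinst.py, not the repo's exact code:

import torch
import torch.nn.functional as F
from detectron2.utils.memory import retry_if_cuda_oom

def resize_masks(masks, out_h, out_w):
    # masks: (N, 1, h, w) mask logits; stand-in for the postprocess interpolation
    return F.interpolate(masks, size=(out_h, out_w),
                         mode="bilinear", align_corners=False)

device = "cuda" if torch.cuda.is_available() else "cpu"
masks = torch.rand(50, 1, 270, 200, device=device)
# retry_if_cuda_oom retries once after torch.cuda.empty_cache(); if the call
# still hits CUDA OOM, it moves the tensor arguments to CPU and runs there.
full_res = retry_if_cuda_oom(resize_masks)(masks, 3000, 4000)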
Thank you for your patient explanations! @tianzhi0549
After using retry_if_cuda_oom, training runs normally.
However, I have one last question about this issue: sometimes, in the evaluation stage, this error appears.
RuntimeError: non-empty 3D or 4D (batch mode) tensor expected for input, but got: [ torch.cuda.FloatTensor{0,1,270,200} ]
I found a similar issue, #284, where you suggest skipping these images. Can you give some details about how to skip them, since this error interrupts the training process? Thank you!
@jiafeier This error occurs because the first dimension of the tensor is 0; in other words, the model does not make any mask predictions for this image. So you can just check whether the first dimension is 0 and, if so, skip the subsequent code with an if statement.
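A minimal sketch of such a guard, assuming the failing call is F.interpolate on the predicted masks; safe_resize and its names are illustrative, not the exact code in condinst.py:

import torch
import torch.nn.functional as F

def safe_resize(pred_masks, out_h, out_w):
    # pred_masks: (N, 1, h, w); N == 0 when no instances are predicted
    if pred_masks.size(0) == 0:
        # F.interpolate raises "non-empty 3D or 4D (batch mode) tensor
        # expected" on an empty batch, so return an empty result instead.
        return pred_masks.new_zeros((0, 1, out_h, out_w))
    return F.interpolate(pred_masks, size=(out_h, out_w),
                         mode="bilinear", align_corners=False)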
Thank you very much for your help with my issue. @tianzhi0549 I will try as you suggested.
I am using CondInst for an instance segmentation task. The batch size is 3 with 3 GPUs, and I evaluate the model after a certain number of iterations. In the training stage, this is the GPU memory usage, which I think is normal.
However, in the evaluation stage, the GPU memory usage is quite abnormal, and "CUDA out of memory" soon appears.
I also train a Mask R-CNN model with a batch size of 6, and its training goes normally. I would appreciate any suggestions about this: is this normal, or what can I do to solve the problem? Thank you very much! @tianzhi0549