The GPU memory usage seems really high. Last night I set the batch size to 24, also with three GPUs, and the memory usage was only about 10 GB.
@hustzlb The GPU usage during training looks normal to me. The images are my own, with a resolution of 2500x1800, and I resize them to 1080x800 for training. But the GPU usage during evaluation seems abnormal, and I don't know where the problem is. With the same settings and a batch size of 6, Mask R-CNN never exceeds the memory during evaluation, so I am quite confused. May I ask whether the 10 GB you mentioned is the total across the three cards or per card? If convenient, I would also like to know your dataset's image size and your GPU usage during training and evaluation. Thanks!
My GPU is a 2080Ti, which should have 11 GB of memory. My images are 2048x2048, and my dataset has fewer than 200 images. I was just experimenting to see how this model runs, and I haven't really figured it out yet.
After training finishes, there is also a warning like this. Does anyone know what it means?
Hi all, if you encounter out-of-memory errors, please first disable MODEL.CONDINST.MAX_PROPOSALS by adding MODEL.CONDINST.MAX_PROPOSALS -1 to your command line, and then use MODEL.CONDINST.TOPK_PROPOSALS_PER_IM by adding MODEL.CONDINST.TOPK_PROPOSALS_PER_IM 64 to the command line. For example:

OMP_NUM_THREADS=1 python tools/train_net.py \
    --config-file configs/CondInst/MS_R_50_1x.yaml \
    --num-gpus 8 \
    OUTPUT_DIR training_dir/CondInst_MS_R_50_1x \
    MODEL.CONDINST.MAX_PROPOSALS -1 \
    MODEL.CONDINST.TOPK_PROPOSALS_PER_IM 64

If you still have the errors, please reduce MODEL.CONDINST.TOPK_PROPOSALS_PER_IM.
Thank you very much for your detailed solutions! @tianzhi0549 Unfortunately, even when MODEL.CONDINST.TOPK_PROPOSALS_PER_IM is set to 8, the CUDA OOM still appears as soon as the model is evaluated. I notice that every time this error appears, the last two lines of the traceback are the same. Reading the code, I found that the function at line 324 of https://github.com/aim-uofa/AdelaiDet/blob/master/adet/modeling/condinst/condinst.py is similar to, but different from, the one in detectron2 at https://github.com/facebookresearch/detectron2/blob/master/detectron2/modeling/postprocessing.py. The original resolution of my test images is 4000x3000, resized to 1080x800. I suspect this function and the resolution may be the reason the CUDA OOM appears, but I am not sure. I would appreciate a brief explanation of the difference between the two functions, and whether I am thinking in the right direction.
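(For scale, a back-of-envelope estimate of the memory needed just to hold float32 masks upsampled to the original 4000x3000 resolution; the instance count below is an illustrative assumption, not taken from the model output.)

num_masks = 100                              # assumed number of kept instances
h, w = 3000, 4000                            # original image resolution from the post
bytes_total = num_masks * h * w * 4          # 4 bytes per float32 element
print(f"{bytes_total / 1024**3:.1f} GiB")    # ~4.5 GiB for the upsampled masks alone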
@jiafeier Yes, you are right. I suggest you move this operation to CPU instead, or you can use the function retry_if_cuda_oom in Detectron2.
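A minimal sketch of what that wrapping could look like, assuming the memory-hungry step is a bilinear mask resize; the resize_masks helper and its tensor shapes below are illustrative stand-ins for the interpolation in condinst.py, not the repo's exact code:

import torch
import torch.nn.functional as F
from detectron2.utils.memory import retry_if_cuda_oom

def resize_masks(masks, out_h, out_w):
    # masks: (N, 1, h, w) mask logits; stand-in for the postprocess interpolation
    return F.interpolate(masks, size=(out_h, out_w),
                         mode="bilinear", align_corners=False)

device = "cuda" if torch.cuda.is_available() else "cpu"
masks = torch.rand(50, 1, 270, 200, device=device)
# retry_if_cuda_oom retries once after torch.cuda.empty_cache(); if the call
# still hits CUDA OOM, it moves the tensor arguments to CPU and runs there.
full_res = retry_if_cuda_oom(resize_masks)(masks, 3000, 4000)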
Thank you for your patient explanations! @tianzhi0549
After using retry_if_cuda_oom, training runs normally.
However, I have one last question about this issue: sometimes, in the evaluation stage, this error appears.
RuntimeError: non-empty 3D or 4D (batch mode) tensor expected for input, but got: [ torch.cuda.FloatTensor{0,1,270,200} ]
I found a similar issue, #284, where you suggest skipping these images. Can you give some details about how to skip them, since this error interrupts the training process? Thank you!
@jiafeier This error occurs because the first dimension of the tensor is 0; in other words, the model does not make any mask predictions for this image. So you can just check whether the first dimension is 0 and, if so, skip the subsequent code with an if statement.
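A minimal sketch of such a guard, assuming the failing call is F.interpolate on the predicted masks; safe_resize and its names are illustrative, not the exact code in condinst.py:

import torch
import torch.nn.functional as F

def safe_resize(pred_masks, out_h, out_w):
    # pred_masks: (N, 1, h, w); N == 0 when no instances are predicted
    if pred_masks.size(0) == 0:
        # F.interpolate raises "non-empty 3D or 4D (batch mode) tensor
        # expected" on an empty batch, so return an empty result instead.
        return pred_masks.new_zeros((0, 1, out_h, out_w))
    return F.interpolate(pred_masks, size=(out_h, out_w),
                         mode="bilinear", align_corners=False)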
Thank you very much for your help with my issue. @tianzhi0549 I will try as you suggested.
I am using CondInst for an instance segmentation task. The batch size is 3 with 3 GPUs, and I evaluate the model after a certain number of iterations. In the training stage, this is the GPU memory usage, which I think is normal.
However, in the evaluation stage, the GPU memory usage is quite abnormal, and "CUDA out of memory" soon appears.
I also train a Mask R-CNN model with a batch size of 6, and its training goes normally. I would appreciate any suggestions about this: is this normal, or what can I do to solve the problem? Thank you very much! @tianzhi0549