YOLO V4 runs into OOM errors far too easily. Is there a fix? Multi-GPU training gives no speedup at all

songyuanmingqing commented 4 years ago

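With the Keras version of YOLO V3, I can set the batch size to 128×8 before unfreezing (3 layers) and 32×8 after unfreezing (252 layers), where 8 is the number of GPUs, and both train without problems. With the Keras version of YOLO V3, I can only set 16×8 before unfreezing and 1×8 after unfreezing; anything larger raises an error. Is there any way to solve this? If the batch size cannot be set fairly large, training a model on 500,000 images takes several months, which is completely unacceptable. The exact error: (tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[32,256,52,52] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc)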
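For scale, the size of the single tensor named in the error message can be worked out from its shape. The sketch below assumes `type float` means 32-bit floats (4 bytes per element); `tensor_bytes` is a hypothetical helper, not part of the repository. The tensor itself is modest, so the failure likely reflects memory already held by other allocations (for example, activations kept for backpropagation) rather than this one request.

```python
from math import prod


def tensor_bytes(shape, bytes_per_element=4):
    """Memory needed for a dense tensor; 4 bytes/element assumes float32."""
    return prod(shape) * bytes_per_element


# Shape copied from the ResourceExhaustedError message above.
shape = (32, 256, 52, 52)
size = tensor_bytes(shape)
size_mib = size / 2**20

print(f"{size} bytes = {size_mib:.1f} MiB")
```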

robisen1 commented 4 years ago

This usually means that the data you are supplying is overwhelming the GPU. Can you tell us what GPU you have, what batch settings you are using, whether this happens during training (I assume so), and any other relevant information?

songyuanmingqing commented 4 years ago

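Thank you very much. I am using NVIDIA V100 GPUs. There was a mistake in the third line of my question above: I wrote YOLO V3 where I meant YOLO V4. With the V4 Keras version, my batch size settings on a single GPU are freeze 32, unfreeze 8; anything above that raises OOM. With 8 GPUs they are freeze 16×8, unfreeze 1×8; anything above that raises an error. This is with input_shape = (608,608). With (416,416), unfreeze can be set to 16 or 2×8. When I trained YOLOV3 on 8 GPUs with (416,416), I used freeze 128×8, unfreeze 32×8. Because the batch size could be set so large, training was very fast, more than 10 times faster than YOLOV4.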
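The single-GPU unfrozen limits quoted here (8 at 608×608, 16 at 416×416) are roughly what one would expect if activation memory grows with the number of input pixels. The sketch below checks that under a stated assumption: memory per sample is proportional to height × width, and fixed costs such as weights and optimizer state are ignored. `estimated_max_batch` is a hypothetical helper for illustration only.

```python
def pixels(input_shape):
    """Number of pixels in an (height, width) input."""
    height, width = input_shape
    return height * width


def estimated_max_batch(ref_batch, ref_shape, new_shape):
    """Scale a known batch limit to another input size.

    Assumes per-sample memory is proportional to pixel count.
    """
    return int(ref_batch * pixels(ref_shape) / pixels(new_shape))


ratio = pixels((608, 608)) / pixels((416, 416))
estimate = estimated_max_batch(8, (608, 608), (416, 416))

print(f"pixel ratio: {ratio:.3f}")
print(f"estimated unfrozen batch at 416x416: {estimate}")
```

The estimate lands close to the reported limit of 16, which suggests the smaller input size accounts for most of the extra headroom.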

robisen1 commented 4 years ago

Try changing your unfrozen-layers batch size to just 4 and your frozen one to just 2, and let me know what happens. For some reason this implementation of YOLO v4 consumes a lot of memory in training.

songyuanmingqing commented 4 years ago

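I have tried many times. With frozen set to just 3, the batch size can go up to 32 on one GPU and 4×8 on 8 GPUs; anything larger raises OOM. After unfreezing, one GPU must use 8 or less, and 8 GPUs must use 1×8 or less. My biggest problem right now is that YOLOV3 finishes a training job in just over 20 hours because the batch size can be set very large, whereas with YOLOV4 the batch size after unfreezing can only be 8, so training takes a very long time. One epoch over 400,000 samples takes 10 hours, and I usually need to train for 50 epochs, so it takes 2 months to finish training one model.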
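The figures in this comment can be turned into per-step numbers with simple arithmetic. The sketch below takes the stated values at face value (400,000 samples, a global batch size of 8 after unfreezing, 10 hours per epoch, 50 epochs) and assumes every epoch costs the same; the variable names are illustrative only.

```python
import math


def steps_per_epoch(num_samples, global_batch_size):
    """Number of optimizer steps needed to see every sample once."""
    return math.ceil(num_samples / global_batch_size)


num_samples = 400_000   # stated dataset size
global_batch = 8        # stated unfrozen limit (8 on one GPU, or 1 x 8 GPUs)
hours_per_epoch = 10    # stated epoch time
epochs = 50             # stated number of epochs

steps = steps_per_epoch(num_samples, global_batch)
seconds_per_step = hours_per_epoch * 3600 / steps
total_hours = hours_per_epoch * epochs
total_days = total_hours / 24

print(f"{steps} steps/epoch, {seconds_per_step:.2f} s/step")
print(f"{total_hours} hours total, about {total_days:.1f} days")
```

Because the step count is inversely proportional to the global batch size, the batch limit directly sets how many steps each epoch needs. These numbers cover only the unfrozen stage at a constant epoch time, so they do not by themselves account for the full two-month estimate in the post.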

robisen1 commented 4 years ago

I understand. I too am confused about why train.py works like this. It's confusing. I am starting to look for a TensorFlow YOLO v4 implementation to see how it performs. Also... I wish I had your GPUs! :-)

robisen1 commented 4 years ago

I am also surprised the author of the code does not respond. He must be busy.

Ma-Dan / keras-yolo4

YOLO V4 runs into OOM errors far too easily. Is there a fix? Multi-GPU training gives no speedup at all #23