AlexeyAB / darknet

YOLOv4 / Scaled-YOLOv4 / YOLO - Neural Networks for Object Detection (Windows and Linux version of Darknet)
http://pjreddie.com/darknet/

GPU memory required for training Yolo-v3 #636

Open wureka opened 6 years ago

wureka commented 6 years ago

I just trained YOLOv3 for 7 classes. My hardware is a Jetson TX2. I followed the instructions to change classes and filters for each of the 3 [yolo] layers and the 3 [convolutional] layers before each [yolo] layer, as below:

[convolutional]
filters=36

[yolo]
classes=7

I also replaced the anchors of the 3 [yolo] layers with the result of ./darknet detector calc_anchors ../cfg/ai.640x384.data -num_of_clusters 9 -width 640 -height 384

batch=32
subdivisions=32
width=640
height=384
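As a cross-check on the cfg values above: the filters= count of each [convolutional] layer directly before a [yolo] layer must equal (classes + 5) * <anchors per layer>, where the anchors per layer are those listed in mask= (3 by default). A minimal sketch in C (the helper name is ours, not darknet's):

```c
#include <assert.h>

/* filters = (classes + 5) * anchors_per_layer, where 5 covers the box
 * coordinates (x, y, w, h) plus the objectness score, and
 * anchors_per_layer is the number of indices listed in mask=. */
int yolo_filters(int classes, int anchors_per_layer) {
    return (classes + 5) * anchors_per_layer;
}
```

For 7 classes and 3 masked anchors per layer this gives (7 + 5) * 3 = 36, matching the filters=36 above; for the 80-class COCO cfg it gives the familiar 255.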

The training log was as follows:

Region 94 Avg IOU: 0.190856, Class: 0.831693, Obj: 0.542294, No Obj: 0.559379, .5R: 0.000000, .75R: 0.000000,  count: 1
Region 106 Avg IOU: nan, Class: nan, Obj: nan, No Obj: 0.475702, .5R: nan, .75R: nan,  count: 0
Region 82 Avg IOU: nan, Class: nan, Obj: nan, No Obj: 0.487782, .5R: nan, .75R: nan,  count: 0
Region 94 Avg IOU: 0.126236, Class: 0.129014, Obj: 0.656850, No Obj: 0.560546, .5R: 0.000000, .75R: 0.000000,  count: 1
Region 106 Avg IOU: nan, Class: nan, Obj: nan, No Obj: 0.473931, .5R: nan, .75R: nan,  count: 0
Region 82 Avg IOU: 0.155751, Class: 0.728109, Obj: 0.567061, No Obj: 0.489448, .5R: 0.000000, .75R: 0.000000,  count: 1
Region 94 Avg IOU: nan, Class: nan, Obj: nan, No Obj: 0.560938, .5R: nan, .75R: nan,  count: 0
Region 106 Avg IOU: nan, Class: nan, Obj: nan, No Obj: 0.471869, .5R: nan, .75R: nan,  count: 0
Region 82 Avg IOU: nan, Class: nan, Obj: nan, No Obj: 0.489748, .5R: nan, .75R: nan,  count: 0
Region 94 Avg IOU: nan, Class: nan, Obj: nan, No Obj: 0.561954, .5R: nan, .75R: nan,  count: 0
Region 106 Avg IOU: 0.145725, Class: 0.555193, Obj: 0.632314, No Obj: 0.472896, .5R: 0.000000, .75R: 0.000000,  count: 1
Region 82 Avg IOU: nan, Class: nan, Obj: nan, No Obj: 0.488246, .5R: nan, .75R: nan,  count: 0
Region 94 Avg IOU: 0.392975, Class: 0.400324, Obj: 0.533230, No Obj: 0.560875, .5R: 0.000000, .75R: 0.000000,  count: 1
Region 106 Avg IOU: nan, Class: nan, Obj: nan, No Obj: 0.472750, .5R: nan, .75R: nan,  count: 0
Region 82 Avg IOU: 0.428216, Class: 0.651273, Obj: 0.224361, No Obj: 0.488955, .5R: 0.000000, .75R: 0.000000,  count: 1
Region 94 Avg IOU: nan, Class: nan, Obj: nan, No Obj: 0.562515, .5R: nan, .75R: nan,  count: 0
Region 106 Avg IOU: nan, Class: nan, Obj: nan, No Obj: 0.472054, .5R: nan, .75R: nan,  count: 0
Region 82 Avg IOU: 0.093910, Class: 0.450770, Obj: 0.548225, No Obj: 0.489352, .5R: 0.000000, .75R: 0.000000,  count: 1
Region 94 Avg IOU: nan, Class: nan, Obj: nan, No Obj: 0.558836, .5R: nan, .75R: nan,  count: 0
Region 106 Avg IOU: 0.314969, Class: 0.234830, Obj: 0.538750, No Obj: 0.473070, .5R: 0.000000, .75R: 0.000000,  count: 1

 2: 3126.717285, 3125.974609 avg, 0.000000 rate, 31.244499 seconds, 64 images

And from the Jetson TX2's log, I found that YOLOv3 uses over 4 GB of GPU memory:

RAM 5846/7851MB (lfb 86x4MB) cpu [0%@345,0%@345,0%@345,1%@345,0%@345,45%@345]
RAM 5846/7851MB (lfb 86x4MB) cpu [21%@806,0%@345,0%@345,6%@806,0%@806,9%@806]
RAM 5846/7851MB (lfb 86x4MB) cpu [0%@345,0%@345,0%@345,0%@345,25%@345,0%@345]
RAM 5846/7851MB (lfb 86x4MB) cpu [1%@345,0%@345,0%@345,0%@345,34%@345,0%@345]
RAM 5846/7851MB (lfb 86x4MB) cpu [7%@345,0%@345,0%@345,33%@345,0%@345,6%@345]
RAM 5846/7851MB (lfb 86x4MB) cpu [0%@345,0%@345,0%@345,0%@345,0%@345,35%@345]
RAM 5846/7851MB (lfb 86x4MB) cpu [0%@345,0%@345,0%@345,0%@345,0%@345,49%@345]

My questions are:

  1. From the training log, why are some Avg IOU values nan?
  2. According to your instructions, YOLOv3 requires 4 GB of GPU RAM. Why does it take over 4 GB in my environment? Is there any configuration I may have missed? Thanks.
AlexeyAB commented 6 years ago
  1. nan occurs when there are no labels (ground-truth boxes) for the anchors of this [yolo] layer (the anchors= entries selected by mask= in the cfg file for the current one of the three [yolo] layers).

  2. nan occurs when there are no labels for this image at all (a negative sample).

Only if nan appears in the avg loss for several dozen consecutive iterations has the training gone wrong. Otherwise, the training is going well.


Yolo runs the following loop over each ground-truth label:

  1. Yolo looks for the most suitable anchor for the current ground-truth box (out of all 9 anchors): https://github.com/AlexeyAB/darknet/blob/5c1e8e3f48343d8944af1195e21f6f3b53ed848e/src/yolo_layer.c#L243-L245
  2. Then Yolo checks whether this best anchor is among those specified in mask= for this [yolo] layer (there are 3 [yolo] layers in total): https://github.com/AlexeyAB/darknet/blob/5c1e8e3f48343d8944af1195e21f6f3b53ed848e/src/yolo_layer.c#L249
  3. If yes, it accumulates these indicators, i.e. computes the sum and divides it by the number of ground-truth boxes matched to anchors of this [yolo] layer.

Conclusion: if the best anchor does not belong to this [yolo] layer, the count of ground-truth boxes matched to this layer is zero (count=0), and the division by zero produces nan:

Region 82 Avg IOU: 0.093910, Class: 0.450770, Obj: 0.548225, No Obj: 0.489352, .5R: 0.000000, .75R: 0.000000,  count: 1
Region 94 Avg IOU: nan, Class: nan, Obj: nan, No Obj: 0.558836, .5R: nan, .75R: nan,  count: 0
Region 106 Avg IOU: 0.314969, Class: 0.234830, Obj: 0.538750, No Obj: 0.473070, .5R: 0.000000, .75R: 0.000000,  count: 1

The No Obj indicator, however, is calculated in all cases, which is why it is never nan in the log.
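The matching and averaging logic described above can be sketched as follows (a simplified illustration only, not darknet's actual code; the helper names are ours):

```c
#include <assert.h>
#include <math.h>

/* A ground truth contributes to a [yolo] layer only if its best anchor
 * (out of all 9) is listed in that layer's mask=. */
int in_mask(int best_anchor, const int *mask, int n) {
    for (int i = 0; i < n; ++i)
        if (mask[i] == best_anchor) return 1;
    return 0;
}

/* Simplified sketch of how a [yolo] layer averages its metrics.
 * If no ground-truth box was matched to this layer's anchors
 * (count == 0), the division 0.0f / 0.0f yields nan --
 * exactly the nan values that appear in the training log. */
float average_iou(const float *matched_ious, int count) {
    float sum = 0.0f;
    for (int i = 0; i < count; ++i) sum += matched_ious[i];
    return sum / (float)count;   /* nan when count == 0 */
}
```

So a printed `Avg IOU: nan ... count: 0` simply means no ground truth in that batch matched this layer's anchors, not that training is broken.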

AlexeyAB commented 6 years ago

According to your instructions, YOLOv3 requires 4 GB of GPU RAM. Why does it take over 4 GB in my environment? Is there any configuration I may have missed?

Yolo v3 requires 4 GB of GPU RAM + ~2 GB of CPU RAM, about 6 GB in total.

But the Jetson TX2 has only one type of memory, LPDDR4, which is shared between the CPU and GPU, so the total of ~6 GB all comes from the same pool: https://devtalk.nvidia.com/default/topic/1002349/jetson-tx2/jetson-tx2-gpu-memory-/

wureka commented 6 years ago

@AlexeyAB Thank you for your answer. Regarding the nan cases you mentioned above:

  1. nan occurs when there are no labels (ground-truth boxes) for the anchors of this [yolo] layer (the anchors= entries selected by mask= in the cfg file for the current one of the three [yolo] layers).
  2. nan occurs when there are no labels for this image at all (a negative sample).

With the same image dataset, there was no nan when I trained YOLOv2. So can I say that my image dataset and related files such as labels have no problems and can be trained with YOLOv3?

AlexeyAB commented 6 years ago

So can I say that my image dataset and related files such as labels have no problems and can be trained with YOLOv3?

Yes, if you can train for about 6000 iterations and get a good mAP.

anguoyang commented 6 years ago

@wureka according to @AlexeyAB's instructions, the width and height should be multiples of 32. Did training really work without problems on your dataset? My training images are 640x480, but I used 416x416 in the cfg file. I am not sure which is better (the original width and height, or another multiple of 32).

wureka commented 6 years ago

@anguoyang all images in my dataset are 640x384, which is also a multiple of 32, and I changed the width and height in the cfg to 640x384 as well as the classes, filters, and anchors. So I think there should be no problem.
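The multiple-of-32 constraint comes from the backbone downsampling the input by a factor of 32. A small sketch of a helper (hypothetical, not part of darknet) that rounds a dimension up to the nearest valid value:

```c
#include <assert.h>

/* Darknet-53 downsamples its input by a factor of 32, so width= and
 * height= in the cfg must be multiples of 32. This hypothetical helper
 * rounds a dimension up to the nearest multiple of 32. */
int round_up_to_32(int dim) {
    return ((dim + 31) / 32) * 32;
}
```

Note that 480 is itself a multiple of 32 (15 x 32), so both 640x480 and 416x416 are valid network sizes; the choice between them is a speed/accuracy trade-off rather than a correctness issue.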

TaihuLight commented 6 years ago

When I train YOLOv3 on VOC2007+2012 with all other parameters identical, I see the following:

- Training with batch=128 and subdivisions=32 takes up 6906 MiB of GPU RAM.
- Training with batch=256 and subdivisions=32 takes up only 3876 MiB of GPU RAM, less than the setting above. Why?

@AlexeyAB @wureka @anguoyang What is wrong?
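For context on these numbers: darknet transfers batch / subdivisions images to the GPU at a time, so GPU memory use normally scales with this mini-batch size rather than with batch alone. A minimal sketch:

```c
#include <assert.h>

/* Darknet processes batch / subdivisions images on the GPU at once;
 * GPU memory use grows with this mini-batch size, not with batch alone. */
int mini_batch(int batch, int subdivisions) {
    return batch / subdivisions;
}
```

With batch=256 and subdivisions=32 the mini-batch is 8 images instead of 4, which would ordinarily need more GPU memory, not less; the lower figure observed here is explained by the cuDNN fallback described in the following comment.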

 7: 597.089417, 597.692444 avg, 0.000000 rate, 13.913467 seconds, 1792 images
Loaded: 0.000018 seconds
Region 82 Avg IOU: 0.464548, Class: 0.635690, Obj: 0.374416, No Obj: 0.521148, .5R: 0.500000, .75R: 0.000000,  count: 6
Region 94 Avg IOU: 0.392822, Class: 0.381941, Obj: 0.574848, No Obj: 0.554483, .5R: 0.272727, .75R: 0.000000,  count: 11
Region 106 Avg IOU: 0.367993, Class: 0.437119, Obj: 0.463624, No Obj: 0.506273, .5R: 0.250000, .75R: 0.000000,  count: 4
Region 82 Avg IOU: 0.325769, Class: 0.561967, Obj: 0.457437, No Obj: 0.519700, .5R: 0.250000, .75R: 0.000000,  count: 4
Region 94 Avg IOU: 0.217476, Class: 0.340199, Obj: 0.570449, No Obj: 0.554392, .5R: 0.000000, .75R: 0.000000,  count: 2
Region 106 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.506313, .5R: -nan, .75R: -nan,  count: 0
Region 82 Avg IOU: 0.337817, Class: 0.508326, Obj: 0.555578, No Obj: 0.520447, .5R: 0.166667, .75R: 0.000000,  count: 6
Region 94 Avg IOU: 0.379297, Class: 0.529548, Obj: 0.584318, No Obj: 0.554746, .5R: 0.000000, .75R: 0.000000,  count: 3
Region 106 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.506114, .5R: -nan, .75R: -nan,  count: 0
AlexeyAB commented 6 years ago

@TaihuLight

Do you train with random=1? Did you see the message CUDNN-slow during training?

If there is not enough GPU memory, a slower cuDNN algorithm that does not require extra workspace on the GPU will be used instead.
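This fallback policy can be illustrated with a sketch (purely illustrative; the struct, algorithm names, and workspace sizes are hypothetical, not darknet's or cuDNN's actual API):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical illustration of the fallback described above: prefer the
 * fast convolution algorithm if its extra workspace fits in free GPU
 * memory, otherwise fall back to a zero-workspace (slower) algorithm. */
typedef struct {
    const char *name;
    size_t workspace_bytes;   /* extra GPU memory the algorithm needs */
} conv_algo;

conv_algo pick_algo(size_t free_bytes) {
    conv_algo fast = { "fast-with-workspace", (size_t)512 * 1024 * 1024 };
    conv_algo slow = { "zero-workspace (CUDNN-slow)", 0 };
    return (fast.workspace_bytes <= free_bytes) ? fast : slow;
}
```

This is why a larger mini-batch can paradoxically show lower measured GPU memory: once the fast algorithm's workspace no longer fits, the zero-workspace path is taken and the workspace allocation disappears, at the cost of speed.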

tigerdhl commented 6 years ago

@AlexeyAB thank you for the analysis. I have two questions:

  1. Is the following training output normal? I tested the model and the results are not good:

 901: 76.827377, 80.377640 avg, 0.000000 rate, 2.812252 seconds, 57664 images
Loaded: 0.000042 seconds
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.231459, .5R: -nan, .75R: -nan, count: 0
Region 94 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.161805, .5R: -nan, .75R: -nan, count: 0
Region 106 Avg IOU: 0.095552, Class: 0.450664, Obj: 0.045965, No Obj: 0.066019, .5R: 0.000000, .75R: 0.000000, count: 15
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.226805, .5R: -nan, .75R: -nan, count: 0
Region 94 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.158227, .5R: -nan, .75R: -nan, count: 0
Region 106 Avg IOU: 0.074553, Class: 0.405912, Obj: 0.055198, No Obj: 0.065306, .5R: 0.055556, .75R: 0.000000, count: 18
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.225061, .5R: -nan, .75R: -nan, count: 0
Region 94 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.158069, .5R: -nan, .75R: -nan, count: 0
Region 106 Avg IOU: 0.134379, Class: 0.397500, Obj: 0.044243, No Obj: 0.064785, .5R: 0.052632, .75R: 0.000000, count: 19
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.229917, .5R: -nan, .75R: -nan, count: 0
Region 94 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.161682, .5R: -nan, .75R: -nan, count: 0
Region 106 Avg IOU: 0.137914, Class: 0.389384, Obj: 0.029701, No Obj: 0.066656, .5R: 0.041667, .75R: 0.000000, count: 24
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.243717, .5R: -nan, .75R: -nan, count: 0
Region 94 Avg IOU: 0.000000, Class: 0.639533, Obj: 0.245265, No Obj: 0.169110, .5R: 0.000000, .75R: 0.000000, count: 1
Region 106 Avg IOU: 0.103062, Class: 0.269671, Obj: 0.046545, No Obj: 0.069866, .5R: 0.047619, .75R: 0.000000, count: 21
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.224353, .5R: -nan, .75R: -nan, count: 0
Region 94 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.156570, .5R: -nan, .75R: -nan, count: 0
Region 106 Avg IOU: 0.050303, Class: 0.395995, Obj: 0.066860, No Obj: 0.064108, .5R: 0.000000, .75R: 0.000000, count: 14
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.257895, .5R: -nan, .75R: -nan, count: 0
Region 94 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.172058, .5R: -nan, .75R: -nan, count: 0
Region 106 Avg IOU: 0.095085, Class: 0.452181, Obj: 0.050280, No Obj: 0.068527, .5R: 0.000000, .75R: 0.000000, count: 14
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.232374, .5R: -nan, .75R: -nan, count: 0
Region 94 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.159680, .5R: -nan, .75R: -nan, count: 0
Region 106 Avg IOU: 0.066052, Class: 0.403133, Obj: 0.071710, No Obj: 0.065342, .5R: 0.000000, .75R: 0.000000, count: 14
 902: 78.717804, 80.211655 avg, 0.000000 rate, 3.113250 seconds, 57728 images
Loaded: 0.000060 seconds
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.227541, .5R: -nan, .75R: -nan, count: 0
Region 94 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.157643, .5R: -nan, .75R: -nan, count: 0
Region 106 Avg IOU: 0.172302, Class: 0.317470, Obj: 0.042698, No Obj: 0.064021, .5R: 0.105263, .75R: 0.000000, count: 19
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.224071, .5R: -nan, .75R: -nan, count: 0
Region 94 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.154533, .5R: -nan, .75R: -nan, count: 0
Region 106 Avg IOU: 0.096896, Class: 0.497491, Obj: 0.070076, No Obj: 0.063530, .5R: 0.000000, .75R: 0.000000, count: 16
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.225646, .5R: -nan, .75R: -nan, count: 0
Region 94 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.156641, .5R: -nan, .75R: -nan, count: 0
Region 106 Avg IOU: 0.159819, Class: 0.372564, Obj: 0.041905, No Obj: 0.065208, .5R: 0.000000, .75R: 0.000000, count: 23
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.227719, .5R: -nan, .75R: -nan, count: 0
Region 94 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.158139, .5R: -nan, .75R: -nan, count: 0
Region 106 Avg IOU: 0.154154, Class: 0.412679, Obj: 0.054324, No Obj: 0.065964, .5R: 0.150000, .75R: 0.000000, count: 20
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.225663, .5R: -nan, .75R: -nan, count: 0
Region 94 Avg IOU: 0.048717, Class: 0.399206, Obj: 0.119354, No Obj: 0.154785, .5R: 0.000000, .75R: 0.000000, count: 3
Region 106 Avg IOU: 0.085906, Class: 0.380880, Obj: 0.041322, No Obj: 0.063836, .5R: 0.000000, .75R: 0.000000, count: 16
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.227509, .5R: -nan, .75R: -nan, count: 0
Region 94 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.155948, .5R: -nan, .75R: -nan, count: 0
Region 106 Avg IOU: 0.085052, Class: 0.486943, Obj: 0.049302, No Obj: 0.065788, .5R: 0.052632, .75R: 0.052632, count: 19
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.224697, .5R: -nan, .75R: -nan, count: 0
Region 94 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.154308, .5R: -nan, .75R: -nan, count: 0
Region 106 Avg IOU: 0.086668, Class: 0.358643, Obj: 0.081979, No Obj: 0.067230, .5R: 0.083333, .75R: 0.000000, count: 12
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.228932, .5R: -nan, .75R: -nan, count: 0
Region 94 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.156854, .5R: -nan, .75R: -nan, count: 0
Region 106 Avg IOU: 0.064096, Class: 0.496937, Obj: 0.096645, No Obj: 0.064083, .5R: 0.000000, .75R: 0.000000, count: 15
 903: 75.638824, 79.754372 avg, 0.000000 rate, 3.062053 seconds, 57792 images

  2. Can I set the mask= in all three [yolo] layers to 0,1,2,3,4,5,6,7,8, rather than just 0,1,2 / 3,4,5 / 6,7,8?

AlexeyAB commented 6 years ago

@tigerdhl

  1. This is normal.
  2. Yes, you can. But there is no guarantee that it will increase accuracy.
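For reference, the default YOLOv3 cfg partitions the 9 anchors across the three [yolo] layers (mask=6,7,8 / 3,4,5 / 0,1,2). A small hypothetical helper (not part of darknet) that checks whether a set of masks forms such a partition:

```c
#include <assert.h>

/* Checks that the mask= entries of the three [yolo] layers together use
 * each of the 9 anchors exactly once (the default YOLOv3 partition is
 * 6,7,8 / 3,4,5 / 0,1,2). Repeating the same anchors in every layer,
 * as asked above, is accepted by darknet but is not a partition. */
int masks_partition_anchors(const int masks[3][3]) {
    int seen[9] = {0};
    for (int layer = 0; layer < 3; ++layer)
        for (int i = 0; i < 3; ++i)
            seen[masks[layer][i]]++;
    for (int a = 0; a < 9; ++a)
        if (seen[a] != 1) return 0;
    return 1;
}
```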
viethoang303 commented 2 years ago

@AlexeyAB thank you for this analysis. So how do I fix this? Is it simply a matter of training for more epochs? Maybe I'm wrong.