wureka opened this issue 6 years ago
`nan` occurs when there are no labels (ground-truth boxes) for the anchors of this [yolo]-layer (the anchors specified by `mask=` in the cfg-file for the current one of the three yolo-layers).

`nan` also occurs when there are no labels for this image at all (negative samples).

Only if `nan` occurs for the avg loss over several dozen consecutive iterations has the training gone wrong. Otherwise, the training is going well.
Yolo loops over each ground-truth label, finds the best anchor for it, and checks whether that anchor belongs to the `mask=` of this [yolo]-layer (in total there are 3 yolo-layers): https://github.com/AlexeyAB/darknet/blob/5c1e8e3f48343d8944af1195e21f6f3b53ed848e/src/yolo_layer.c#L249

In the log line for each [yolo]-layer:
- `Region 82` - index of the current [yolo]-layer (in the default yolov3.cfg there are 3 yolo layers: 82, 94, 106)
- `Avg IOU` - average intersection over union between detected objects and the ground truth from the label txt-file
- `Class:` - average of the probabilities of correctly classified objects
- `Obj:` - average objectness T0 (the probability that there is an object in this box/anchor)
- `.5R:` - average rate of true positives with IoU > 0.5
- `.75R:` - average rate of true positives with IoU > 0.75

Conclusion: if the best anchor isn't suitable for this [yolo]-layer, then the count of ground-truth labels suitable for this layer is zero (`count=0`), and there will be a division by zero:
- `Avg IOU` = nan
- `Class` = nan
- `Obj` = nan
```
Region 82 Avg IOU: 0.093910, Class: 0.450770, Obj: 0.548225, No Obj: 0.489352, .5R: 0.000000, .75R: 0.000000, count: 1
Region 94 Avg IOU: nan, Class: nan, Obj: nan, No Obj: 0.558836, .5R: nan, .75R: nan, count: 0
Region 106 Avg IOU: 0.314969, Class: 0.234830, Obj: 0.538750, No Obj: 0.473070, .5R: 0.000000, .75R: 0.000000, count: 1
```
But this indicator is calculated in all cases:
- `No Obj:` - average of the probabilities over all objects (both correctly and incorrectly classified)

According to your instructions, YOLOv3 requires 4 GB of GPU RAM. But why does it take over 4 GB in my environment? Is there any configuration I might have missed?
YOLOv3 requires 4 GB GPU RAM + ~2 GB CPU RAM, i.e. about 6 GB RAM in total.
The Jetson TX2 has only one kind of memory, LPDDR4, which is shared between the CPU and GPU, so the whole 6 GB is spent from the same pool: https://devtalk.nvidia.com/default/topic/1002349/jetson-tx2/jetson-tx2-gpu-memory-/
@AlexeyAB Thank you for your answer. Regarding the part about nan you mentioned above: with the same image dataset, no nan appeared when I trained with YOLOv2. So, could I say that my image dataset and related files such as labels have no problems and can be trained with YOLOv3?
> So, could I say that my image dataset and related files such as labels have no problems and can be trained with YOLOv3?

Yes, if you can train for about 6000 iterations and get a good mAP.
@wureka According to @AlexeyAB's instructions, the width and height should be a multiple of 32. Is it really no problem when you train on your dataset? My training images are 640x480, but I used 416x416 in the cfg file. I am not sure which is better (the original width/height or another multiple of 32).
@anguoyang All images in my dataset are 640x384, which is also a multiple of 32, and I also changed the width and height in the cfg to 640x384 and updated classes, filters, and anchors. So I think they should be no problem.
When I train YOLOv3 on VOC2007+2012 and change only the following parameters:
- `batch=128 subdivisions=32` - training takes 6906 MiB GPU RAM
- `batch=256 subdivisions=32` - training takes only 3876 MiB GPU RAM, lower than the setting above. Why?
@AlexeyAB @wureka @anguoyang What is wrong?
```
7: 597.089417, 597.692444 avg, 0.000000 rate, 13.913467 seconds, 1792 images
Loaded: 0.000018 seconds
Region 82 Avg IOU: 0.464548, Class: 0.635690, Obj: 0.374416, No Obj: 0.521148, .5R: 0.500000, .75R: 0.000000, count: 6
Region 94 Avg IOU: 0.392822, Class: 0.381941, Obj: 0.574848, No Obj: 0.554483, .5R: 0.272727, .75R: 0.000000, count: 11
Region 106 Avg IOU: 0.367993, Class: 0.437119, Obj: 0.463624, No Obj: 0.506273, .5R: 0.250000, .75R: 0.000000, count: 4
Region 82 Avg IOU: 0.325769, Class: 0.561967, Obj: 0.457437, No Obj: 0.519700, .5R: 0.250000, .75R: 0.000000, count: 4
Region 94 Avg IOU: 0.217476, Class: 0.340199, Obj: 0.570449, No Obj: 0.554392, .5R: 0.000000, .75R: 0.000000, count: 2
Region 106 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.506313, .5R: -nan, .75R: -nan, count: 0
Region 82 Avg IOU: 0.337817, Class: 0.508326, Obj: 0.555578, No Obj: 0.520447, .5R: 0.166667, .75R: 0.000000, count: 6
Region 94 Avg IOU: 0.379297, Class: 0.529548, Obj: 0.584318, No Obj: 0.554746, .5R: 0.000000, .75R: 0.000000, count: 3
Region 106 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.506114, .5R: -nan, .75R: -nan, count: 0
```
@TaihuLight
Do you train with `random=1`?
Did you see the message `CUDNN-slow` during training?
If there is not enough GPU memory, then a slower cuDNN algorithm that doesn't require extra space on the GPU will be used.
@AlexeyAB Thank you for the analysis. I have two questions:

1. Is it normal that I get the following training output? When I test the model, the results are not good.

```
901: 76.827377, 80.377640 avg, 0.000000 rate, 2.812252 seconds, 57664 images
Loaded: 0.000042 seconds
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.231459, .5R: -nan, .75R: -nan, count: 0
Region 94 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.161805, .5R: -nan, .75R: -nan, count: 0
Region 106 Avg IOU: 0.095552, Class: 0.450664, Obj: 0.045965, No Obj: 0.066019, .5R: 0.000000, .75R: 0.000000, count: 15
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.226805, .5R: -nan, .75R: -nan, count: 0
Region 94 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.158227, .5R: -nan, .75R: -nan, count: 0
Region 106 Avg IOU: 0.074553, Class: 0.405912, Obj: 0.055198, No Obj: 0.065306, .5R: 0.055556, .75R: 0.000000, count: 18
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.225061, .5R: -nan, .75R: -nan, count: 0
Region 94 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.158069, .5R: -nan, .75R: -nan, count: 0
Region 106 Avg IOU: 0.134379, Class: 0.397500, Obj: 0.044243, No Obj: 0.064785, .5R: 0.052632, .75R: 0.000000, count: 19
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.229917, .5R: -nan, .75R: -nan, count: 0
Region 94 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.161682, .5R: -nan, .75R: -nan, count: 0
Region 106 Avg IOU: 0.137914, Class: 0.389384, Obj: 0.029701, No Obj: 0.066656, .5R: 0.041667, .75R: 0.000000, count: 24
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.243717, .5R: -nan, .75R: -nan, count: 0
Region 94 Avg IOU: 0.000000, Class: 0.639533, Obj: 0.245265, No Obj: 0.169110, .5R: 0.000000, .75R: 0.000000, count: 1
Region 106 Avg IOU: 0.103062, Class: 0.269671, Obj: 0.046545, No Obj: 0.069866, .5R: 0.047619, .75R: 0.000000, count: 21
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.224353, .5R: -nan, .75R: -nan, count: 0
Region 94 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.156570, .5R: -nan, .75R: -nan, count: 0
Region 106 Avg IOU: 0.050303, Class: 0.395995, Obj: 0.066860, No Obj: 0.064108, .5R: 0.000000, .75R: 0.000000, count: 14
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.257895, .5R: -nan, .75R: -nan, count: 0
Region 94 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.172058, .5R: -nan, .75R: -nan, count: 0
Region 106 Avg IOU: 0.095085, Class: 0.452181, Obj: 0.050280, No Obj: 0.068527, .5R: 0.000000, .75R: 0.000000, count: 14
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.232374, .5R: -nan, .75R: -nan, count: 0
Region 94 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.159680, .5R: -nan, .75R: -nan, count: 0
Region 106 Avg IOU: 0.066052, Class: 0.403133, Obj: 0.071710, No Obj: 0.065342, .5R: 0.000000, .75R: 0.000000, count: 14
902: 78.717804, 80.211655 avg, 0.000000 rate, 3.113250 seconds, 57728 images
Loaded: 0.000060 seconds
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.227541, .5R: -nan, .75R: -nan, count: 0
Region 94 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.157643, .5R: -nan, .75R: -nan, count: 0
Region 106 Avg IOU: 0.172302, Class: 0.317470, Obj: 0.042698, No Obj: 0.064021, .5R: 0.105263, .75R: 0.000000, count: 19
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.224071, .5R: -nan, .75R: -nan, count: 0
Region 94 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.154533, .5R: -nan, .75R: -nan, count: 0
Region 106 Avg IOU: 0.096896, Class: 0.497491, Obj: 0.070076, No Obj: 0.063530, .5R: 0.000000, .75R: 0.000000, count: 16
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.225646, .5R: -nan, .75R: -nan, count: 0
Region 94 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.156641, .5R: -nan, .75R: -nan, count: 0
Region 106 Avg IOU: 0.159819, Class: 0.372564, Obj: 0.041905, No Obj: 0.065208, .5R: 0.000000, .75R: 0.000000, count: 23
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.227719, .5R: -nan, .75R: -nan, count: 0
Region 94 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.158139, .5R: -nan, .75R: -nan, count: 0
Region 106 Avg IOU: 0.154154, Class: 0.412679, Obj: 0.054324, No Obj: 0.065964, .5R: 0.150000, .75R: 0.000000, count: 20
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.225663, .5R: -nan, .75R: -nan, count: 0
Region 94 Avg IOU: 0.048717, Class: 0.399206, Obj: 0.119354, No Obj: 0.154785, .5R: 0.000000, .75R: 0.000000, count: 3
Region 106 Avg IOU: 0.085906, Class: 0.380880, Obj: 0.041322, No Obj: 0.063836, .5R: 0.000000, .75R: 0.000000, count: 16
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.227509, .5R: -nan, .75R: -nan, count: 0
Region 94 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.155948, .5R: -nan, .75R: -nan, count: 0
Region 106 Avg IOU: 0.085052, Class: 0.486943, Obj: 0.049302, No Obj: 0.065788, .5R: 0.052632, .75R: 0.052632, count: 19
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.224697, .5R: -nan, .75R: -nan, count: 0
Region 94 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.154308, .5R: -nan, .75R: -nan, count: 0
Region 106 Avg IOU: 0.086668, Class: 0.358643, Obj: 0.081979, No Obj: 0.067230, .5R: 0.083333, .75R: 0.000000, count: 12
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.228932, .5R: -nan, .75R: -nan, count: 0
Region 94 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.156854, .5R: -nan, .75R: -nan, count: 0
Region 106 Avg IOU: 0.064096, Class: 0.496937, Obj: 0.096645, No Obj: 0.064083, .5R: 0.000000, .75R: 0.000000, count: 15
903: 75.638824, 79.754372 avg, 0.000000 rate, 3.062053 seconds, 57792 images
```
2. Can I set the `mask` in all three [yolo] layers to 0,1,2,3,4,5,6,7,8, instead of just 0,1,2 / 3,4,5 / 6,7,8?
@tigerdhl
@AlexeyAB Thank you for this analysis. So how do I fix this? Is it simply a matter of training for more epochs? Maybe I'm wrong.
I just trained YOLOv3 for 7 classes. My hardware is a Jetson TX2. I followed the instructions to change classes and filters for each of the 3 [yolo]-layers and the 3 [convolutional] layers before each [yolo] layer, as below.
I also replaced the anchors of the 3 [yolo]-layers with the result of:

```
./darknet detector calc_anchors ../cfg/ai.640x384.data -num_of_clusters 9 -width 640 -height 384
```
The training log is below:
And from the Jetson TX2's log, I found that YOLOv3 uses over 4 GB of GPU memory.
My questions are: