AlexeyAB / darknet

YOLOv4 / Scaled-YOLOv4 / YOLO - Neural Networks for Object Detection (Windows and Linux version of Darknet)
http://pjreddie.com/darknet/

After longer training, the model no longer detects a previously learned class?! #4832

Open flowzen1337 opened 4 years ago

flowzen1337 commented 4 years ago

Hi,

My problem: after longer training, the model no longer detects a previously learned class.

Maybe @AlexeyAB has a hint for me again as to why this happens?

I'm currently training 39 classes with yolov3.cfg on a 4x GPU (GeForce GTX 1080 Ti) system, where I've changed the values of batch, subdivisions, width, height, classes, burn_in, flip, max_batches, learning_rate, steps, filters, and anchors (./darknet detector calc_anchors data/soccer_Jan2020.data -num_of_clusters 18 -width 608 -height 608), following the how-to provided by @AlexeyAB.

I've added the anchors to the cfg file, and also set filters, steps, and num=18 in each of the 3 [yolo] layers.

Dataset: I use 1,500 to 11,000 samples for each of the 39 classes. Bottles have 11,248 samples, cables have 1,792 samples. 80% for training, 20% for testing.
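
The 80/20 split described above can be produced with a small script; a minimal sketch, assuming one image path per line in the darknet train/test list files (the file names here are hypothetical):

```python
import random

def split_dataset(image_paths, train_frac=0.8, seed=42):
    """Shuffle the image list and split it into train/test parts."""
    paths = list(image_paths)
    random.Random(seed).shuffle(paths)
    cut = int(len(paths) * train_frac)
    return paths[:cut], paths[cut:]

# Example: 100 hypothetical frame paths, split 80% train / 20% test.
images = [f"data/obj/frame_{i:05d}.jpg" for i in range(100)]
train, test = split_dataset(images)
print(len(train), len(test))  # 80 20
```

The two lists would then be written to the train.txt and valid.txt files referenced by the .data file.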

At iteration 18,000 the model seems to detect the cable: thrash0

At the next iteration it doesn't recognize the cable anymore: thrash2

At iteration 38,000 it seems to detect the bottles: thrash1

But later on it doesn't detect the bottles or the cables anymore.

I don't know why that is; does anyone have suggestions for me? The "*_best.weights" file also doesn't seem to know the cable and bottle classes. Do I need to wait more iterations and cross my fingers that all my classes will be detected by the final .weights file, or is something wrong with my training?

My training chart looks very promising:

training_1581327011_chart

Here is my current training *.cfg file:

[net]
batch=64
subdivisions=32
width=608
height=608
channels=3
momentum=0.9
decay=0.0005
angle=0
saturation = 1.5
exposure = 1.5
hue=.1
flip=0
learning_rate=0.00025
burn_in=4000
max_batches = 312000
policy=steps
steps=249600,280800
scales=.1,.1
...
[yolo]
mask = 12,13,14,15,16,17
anchors = 8,12, 13,16, 8,38, 20,26, 15,56, 32,41, 24,84, 72,66, 151,34, 16,350, 94,114, 149,78, 61,336, 153,140, 258,101, 221,182, 388,168, 364,405
classes=39
num=18
jitter=.3
ignore_thresh = .7
truth_thresh = 1
random=1
...
[yolo]
mask = 6,7,8,9,10,11
anchors = 8,12, 13,16, 8,38, 20,26, 15,56, 32,41, 24,84, 72,66, 151,34, 16,350, 94,114, 149,78, 61,336, 153,140, 258,101, 221,182, 388,168, 364,405
classes=39
num=18
jitter=.3
ignore_thresh = .7
truth_thresh = 1
random=1
...
[yolo]
mask = 0,1,2,3,4,5
anchors = 8,12, 13,16, 8,38, 20,26, 15,56, 32,41, 24,84, 72,66, 151,34, 16,350, 94,114, 149,78, 61,336, 153,140, 258,101, 221,182, 388,168, 364,405
classes=39
num=18
jitter=.3
ignore_thresh = .7
truth_thresh = 1
random=1

AlexeyAB commented 4 years ago

You should train while the mAP increases, at least ~80,000-100,000 iterations, and then check the results.

If it doesn't help, then try to set

[yolo]
counters_per_class = 100, 50, 200, ... 

with 39 values, the number of objects of each class, to try to solve the class-imbalance issue. And train again.
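
Those per-class counts can be collected automatically from YOLO-format label files (one `class_id x y w h` line per object); a minimal sketch, assuming the .txt labels sit in one directory (the directory layout is an assumption):

```python
from collections import Counter
from pathlib import Path

def count_objects_per_class(label_dir, num_classes):
    """Count annotated objects per class across YOLO .txt label files."""
    counts = Counter()
    for txt in Path(label_dir).glob("*.txt"):
        for line in txt.read_text().splitlines():
            if line.strip():
                counts[int(line.split()[0])] += 1
    # One value per class id, in order, ready for counters_per_class=
    return [counts.get(c, 0) for c in range(num_classes)]
```

The returned list, joined with commas, is what goes after counters_per_class= in each [yolo] layer.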

flowzen1337 commented 4 years ago

Hey @AlexeyAB ,

I've added counters_per_class to each of the 3 [yolo] layers in the cfg file, with the number of objects I have for each class in training. My cfg looks like the one posted in the starting post, just with counters_per_class added:

[yolo]
counters_per_class = 10808, 1155, 10575, 2665, 1103, 8392, 8272, 2129, 1515, 4578, 4252, 456, 3759, 4494, 4153, 4531, 4558, 3998, 3756, 3655, 3347, 3735, 2705, 1240, 3068, 3375, 3048, 2393, 4298, 3945, 4065, 3653, 3768, 2897, 1509, 3387, 3758, 3382, 2692

[yolo]
counters_per_class = 10808, 1155, 10575, 2665, 1103, 8392, 8272, 2129, 1515, 4578, 4252, 456, 3759, 4494, 4153, 4531, 4558, 3998, 3756, 3655, 3347, 3735, 2705, 1240, 3068, 3375, 3048, 2393, 4298, 3945, 4065, 3653, 3768, 2897, 1509, 3387, 3758, 3382, 2692

[yolo]
counters_per_class = 10808, 1155, 10575, 2665, 1103, 8392, 8272, 2129, 1515, 4578, 4252, 456, 3759, 4494, 4153, 4531, 4558, 3998, 3756, 3655, 3347, 3735, 2705, 1240, 3068, 3375, 3048, 2393, 4298, 3945, 4065, 3653, 3768, 2897, 1509, 3387, 3758, 3382, 2692

My training chart now looks like this (I've now trained for 200,000 iterations over ~11 days): new_chart

And I still have the same problem described in my initial post :-(

I've also checked all my annotations for classes 0, 1, 2, 3, 4, and 7, where I have the problems, and every annotation looks fine (checking every single annotation manually took quite a while).

(class 0 = 10808 annotations, class 1 = 1155 annotations, class 2 = 10575 annotations, class 3 = 2665 annotations, class 4 = 1103 annotations, class 7 = 2129 annotations).

Maybe you have some kind of hint for me as to why this happens? Every other class is detected fine, and I just don't understand why these objects aren't. Hope you can help me somehow :(

AlexeyAB commented 4 years ago

Is this mAP=97% on the chart for Validation dataset?

Is your image in the post from Validation dataset?

What mAP do you get on Training and Validation dataset?

Did you try to detect by using -thresh 0.15 at the end of the command?

flowzen1337 commented 4 years ago

@AlexeyAB

Is this mAP=97% on the chart for Validation dataset?

The 97% mAP is from the chart.png generated by the training process:

./darknet detector train "data/mymodel.data" "cfg/mymodel.cfg" "/srv/storage/training/YOLO/mymodel/mymodel_last.weights" darknet53.conv.74 -dont_show -map -gpus 0,1,2,3

Is your image in the post from Validation dataset?

The image in the post is generated by a slightly modified "./darknet_video.py" (mostly the cvDrawBoxes function) provided with your package, using the mymodel_last.weights file from the current training. I use cv2.imwrite to save the image, with the boxes drawn on it, when a specific frame is hit in the video. The images/frames of the video I run YOLO on are not part of my training/testing dataset; the annotations were made from frames extracted from other videos.

What mAP do you get on Training and Validation dataset?

How can I get the mAP for the training/validation dataset? With this command:

./darknet detector map data/mymodel.data cfg/mymodel.cfg /srv/storage/training/YOLO/mymodel/mymodel_last.weights

I get the following values for the classes (the detection problems I have are with class_ids 0, 1, 2, 3, 4, and 7; the other classes are detected perfectly):

classes_multipliers: 1.0, 9.4, 1.0, 4.1, 9.8, 1.3, 1.3, 5.1, 7.1, 2.4, 2.5, 23.7, 2.9, 2.4, 2.6, 2.4, 2.4, 2.7, 2.9, 3.0, 3.2, 2.9, 4.0, 8.7, 3.5, 3.2, 3.5, 4.5, 2.5, 2.7, 2.7, 3.0, 2.9, 3.7, 7.2, 3.2, 2.9, 3.2, 4.0
[yolo] params: iou loss: mse (2), iou_norm: 0.75, cls_norm: 1.00, scale_x_y: 1.00
Total BFLOPS 140.769
Allocate additional workspace_size = 52.43 MB
Loading weights from /srv/storage/training/YOLO/mymodel/mymodel_last.weights... seen 64, trained: 12891 K-images (201 Kilo-batches_64)
Done! Loaded 107 layers from weights-file
calculation mAP (mean average precision)...
15156
detections_count = 196580, unique_truth_count = 119254
class_id = 0, name = a, ap = 44.62% (TP = 1910, FP = 233)
class_id = 1, name = b, ap = 97.69% (TP = 921, FP = 37)
class_id = 2, name = c, ap = 93.08% (TP = 8076, FP = 802)
class_id = 3, name = d, ap = 96.74% (TP = 2116, FP = 400)
class_id = 4, name = e, ap = 97.25% (TP = 865, FP = 66)
class_id = 5, name = f, ap = 94.24% (TP = 6457, FP = 425)
class_id = 6, name = g, ap = 99.37% (TP = 6562, FP = 348)
class_id = 7, name = h, ap = 95.91% (TP = 1548, FP = 52)
class_id = 8, name = i, ap = 85.70% (TP = 962, FP = 41)
class_id = 9, name = j, ap = 97.86% (TP = 3598, FP = 381)
class_id = 10, name = k, ap = 97.87% (TP = 3341, FP = 234)
class_id = 11, name = l, ap = 98.29% (TP = 355, FP = 298)
class_id = 12, name = m, ap = 99.34% (TP = 2981, FP = 499)
class_id = 13, name = n, ap = 98.59% (TP = 3545, FP = 263)
class_id = 14, name = o, ap = 98.36% (TP = 3274, FP = 569)
class_id = 15, name = p, ap = 99.53% (TP = 3597, FP = 318)
class_id = 16, name = q, ap = 99.48% (TP = 3634, FP = 236)
class_id = 17, name = r, ap = 99.45% (TP = 3171, FP = 222)
class_id = 18, name = s, ap = 98.36% (TP = 2947, FP = 253)
class_id = 19, name = t, ap = 99.05% (TP = 2868, FP = 229)
class_id = 20, name = u, ap = 98.78% (TP = 2627, FP = 168)
class_id = 21, name = v, ap = 98.75% (TP = 2952, FP = 266)
class_id = 22, name = w, ap = 98.17% (TP = 2111, FP = 477)
class_id = 23, name = x, ap = 99.26% (TP = 967, FP = 85)
class_id = 24, name = y, ap = 98.59% (TP = 2409, FP = 124)
class_id = 25, name = z, ap = 99.11% (TP = 2660, FP = 120)
class_id = 26, name = aa, ap = 99.17% (TP = 2409, FP = 222)
class_id = 27, name = ab, ap = 99.15% (TP = 1877, FP = 167)
class_id = 28, name = ac, ap = 99.30% (TP = 3395, FP = 200)
class_id = 29, name = ad, ap = 98.45% (TP = 3130, FP = 249)
class_id = 30, name = ae, ap = 98.83% (TP = 3206, FP = 220)
class_id = 31, name = af, ap = 98.94% (TP = 2907, FP = 141)
class_id = 32, name = ag, ap = 98.65% (TP = 3014, FP = 217)
class_id = 33, name = ah, ap = 98.10% (TP = 2289, FP = 426)
class_id = 34, name = ai, ap = 99.38% (TP = 1205, FP = 114)
class_id = 35, name = aj, ap = 98.52% (TP = 2668, FP = 128)
class_id = 36, name = ak, ap = 98.69% (TP = 2969, FP = 147)
class_id = 37, name = al, ap = 98.63% (TP = 2680, FP = 247)
class_id = 38, name = am, ap = 98.83% (TP = 2137, FP = 156)
for conf_thresh = 0.25, precision = 0.92, recall = 0.93, F1-score = 0.92
for conf_thresh = 0.25, TP = 110340, FP = 9780, FN = 8914, average IoU = 74.95 %
IoU threshold = 50 %, used Area-Under-Curve for each unique Recall
mean average precision (mAP@0.50) = 0.966184, or 96.62 %
Total Detection Time: 1338 Seconds
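
Per-class lines in the darknet detector map output can also be parsed to spot underperforming classes automatically; a small sketch (the 90% threshold is an arbitrary choice, not anything darknet defines):

```python
import re

# Matches lines like: class_id = 0, name = a, ap = 44.62% (TP = 1910, FP = 233)
AP_LINE = re.compile(r"class_id = (\d+), name = ([^,]+), ap = ([\d.]+)%")

def low_ap_classes(map_output, threshold=90.0):
    """Return (class_id, name, ap) tuples whose AP falls below threshold."""
    return [(int(cid), name, float(ap))
            for cid, name, ap in AP_LINE.findall(map_output)
            if float(ap) < threshold]

sample = ("class_id = 0, name = a, ap = 44.62% (TP = 1910, FP = 233)\n"
          "class_id = 1, name = b, ap = 97.69% (TP = 921, FP = 37)\n")
print(low_ap_classes(sample))  # [(0, 'a', 44.62)]
```

On the output above this immediately flags class 0 (ap = 44.62%) against the ~97-99% of the other classes.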

Did you try to detect by using -thresh 0.15 at the end of the command?

I've changed the thresh value in "./darknet_video.py" to 0.05:

detections = darknet.detect_image(netMain, metaMain, darknet_image, thresh=0.05)

It now detects 2 of the missing objects in the frame, but still not all I expect (2 detected out of ~22 visible objects, with a trained dataset of ~10,000 objects for that class, is not much :( ).

Do you think it makes sense to train only the 6 classes I have problems with into a new, separate model, just to check whether the problems still appear, and to rule out a problem caused by the number of classes and/or the different annotation sizes of the trained classes?
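
Building such a reduced dataset means filtering the label files down to those class ids and remapping them to 0..5; a sketch under the assumption that labels are YOLO-format .txt files (directory names are hypothetical):

```python
from pathlib import Path

KEEP = [0, 1, 2, 3, 4, 7]                           # the six problematic classes
REMAP = {old: new for new, old in enumerate(KEEP)}  # e.g. old id 7 -> new id 5

def filter_labels(src_dir, dst_dir):
    """Copy YOLO label files, keeping only KEEP classes with remapped ids."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    for txt in Path(src_dir).glob("*.txt"):
        kept = []
        for line in txt.read_text().splitlines():
            parts = line.split()
            if parts and int(parts[0]) in REMAP:
                kept.append(" ".join([str(REMAP[int(parts[0])])] + parts[1:]))
        if kept:  # skip images that contain none of the kept classes
            (dst / txt.name).write_text("\n".join(kept) + "\n")
```

The cfg for the reduced model would then use classes=6 and a matching 6-entry names file.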

Thanks in advance, much appreciated man!

AlexeyAB commented 4 years ago
  1. Show obj.data file.

  2. Show images from validation dataset with bad detection results.

  3. If you get good results on the Validation/Training dataset but bad results on other images, then your training dataset isn't representative; use more training images: https://github.com/AlexeyAB/darknet#how-to-improve-object-detection

For each object you want to detect, there must be at least 1 similar object in the training dataset with about the same shape, side of the object, relative size, angle of rotation, tilt, and illumination. It is therefore desirable that your training dataset include images with objects at different scales, rotations, lightings, from different sides, and on different backgrounds. You should preferably have 2000 different images or more for each class, and you should train for 2000*classes iterations or more.

flowzen1337 commented 4 years ago

Thanks @AlexeyAB !

I guess I found my problem with your help :-)

It seems the issue is the number of testing/training images for the problematic classes in my dataset. So what I did was:

As a first step, I added more annotations for the problematic classes (~10,000 annotations from many different images), so my problematic class 0 now has ~20,000 annotations instead of ~10,000 before.

I also made sure to annotate only a handful of objects per image (even where I could have annotated more), so that the annotation count per image grows slowly; I mainly wanted to increase the number of different images for that class, to get more different hues/lightings/backgrounds.

The second step was to change my testing/training split: for class 0, I changed the number of images I use to 35% testing / 65% training. Before, it was 20% testing / 80% training for every class.

I guess the problem was that I had a high number of annotations but a small number of different images for that class (images = 2,660, annotations = 19,507). The other classes, which worked fine, looked fine in this respect; there seems to be a relationship between the number of images and the number of annotations.

When you have a high number of annotations but a low number of different images, it seems to lead to the problem I had. All the other classes, where I had a high number of annotations and also a medium/high number of different images, didn't have that problem.
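
This images-vs-annotations imbalance can be checked per class before training; a sketch that flags classes whose annotations are concentrated in few images (the 5.0 ratio cutoff and the second class's image count are assumptions for illustration; the class 0 numbers are from the dataset above):

```python
def flag_dense_classes(stats, max_ratio=5.0):
    """stats: {class_name: (num_images, num_annotations)}.
    Flag classes averaging more than max_ratio annotations per image."""
    return {name: anns / imgs
            for name, (imgs, anns) in stats.items()
            if anns / imgs > max_ratio}

# Class 0 had 2660 images with 19507 annotations (~7.3 per image).
stats = {"class_0": (2660, 19507), "class_5": (6000, 8392)}
print({k: round(v, 2) for k, v in flag_dense_classes(stats).items()})  # {'class_0': 7.33}
```

A flagged class is a candidate for adding annotations from more, different images rather than annotating more objects per image.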

So my previous dataset was:

Changed that now to:

Here you can see an overview of my full dataset and the testing/training amounts:

dataset

I've now restarted training from zero, changed my anchors back to 9, and modified my *.cfg accordingly. It looks good so far; my "class 0" is being detected more and more with each iteration :-)

Thanks so far, I'll keep you updated if I change anything else. Maybe it would be good to add the info I provided (if you don't prove me wrong) to the FAQ section "how to improve object detection".

Cheers, I'll keep you updated ;-)

AlexeyAB commented 4 years ago

Also i paid attention that i annotate just a handfull of objects for each image

https://github.com/AlexeyAB/darknet#how-to-improve-object-detection

check that every object you want to detect is labeled in your dataset; no object in your dataset should be left without a label. Most training issues come from wrong labels in the dataset (labels produced by some conversion script, marked with a third-party tool, ...). Always check your dataset by using: https://github.com/AlexeyAB/Yolo_mark

Only if you are an expert in neural detection networks: recalculate the anchors for your dataset for the width and height from the cfg file: darknet.exe detector calc_anchors data/obj.data -num_of_clusters 9 -width 416 -height 416, then set the same 9 anchors in each of the 3 [yolo] layers in your cfg file. But you should change the indexes of the anchors in masks= for each [yolo] layer, so that the 1st [yolo] layer has anchors larger than 60x60, the 2nd larger than 30x30, and the 3rd the remaining ones. Also change filters=(classes + 5)*3 before each [yolo] layer. If many of the calculated anchors do not fit under the appropriate layers, then just try using all the default anchors.
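
The mask assignment rule quoted above (largest anchors to the 1st [yolo] layer, then the 30x30 band, then the remainder) can be sketched as follows; the 60x60 and 30x30 thresholds are taken from the comment, and the anchors are assumed sorted ascending as calc_anchors emits them:

```python
def assign_masks(anchors):
    """Split (w, h) anchor pairs into three mask groups by size:
    larger than 60x60, larger than 30x30, and the remainder."""
    big    = [i for i, (w, h) in enumerate(anchors) if w > 60 and h > 60]
    medium = [i for i, (w, h) in enumerate(anchors)
              if i not in big and w > 30 and h > 30]
    small  = [i for i in range(len(anchors)) if i not in big + medium]
    return big, medium, small   # masks for 1st, 2nd, 3rd [yolo] layer

# Default yolov3 anchors as (w, h) pairs:
anchors = [(10, 13), (16, 30), (33, 23), (30, 61), (62, 45), (59, 119),
           (116, 90), (156, 198), (373, 326)]
print(assign_masks(anchors))
```

Note the strict > comparison is a literal reading of the rule; borderline anchors like (30, 61) fall through to the small group, so in practice one may simply split the 9 sorted anchors into three equal groups.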

tianfengyijiu commented 4 years ago

@AlexeyAB Hi, Alexey! I am training on my custom dataset using yolov4, but I get a high loss: image I want to get a lower loss and higher mAP; why is the loss so much higher than on the COCO dataset?

flowzen1337 commented 4 years ago


I guess after a long, long, long (have I said long already?) time I found my issue...

When I train a new model with only the single class (0) that I had problems with, it works brilliantly and detects the object perfectly. When training the big model with all my 142 classes, class 0 has the problem that it gets lost during the training process.

As far as I've found out, among my 142 classes there are 3 classes that are very, very similar; only very small details differ, and I guess these details are too small for the network to determine which of the 3 similar classes my problem class 0 belongs to. Since it can't decide during training whether an object belongs to class 0, 1, or 2, it ignores it and continues training on the other classes...
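
Merging such near-identical classes comes down to remapping class ids in the YOLO label files before retraining; a minimal sketch (the mapping and file handling are illustrative, not flowzen1337's actual tooling):

```python
from pathlib import Path

def merge_classes(label_dir, merged):
    """Rewrite YOLO label files in place, mapping each id in `merged`
    to the first id of the group and shifting later ids down."""
    group = sorted(merged)                  # e.g. [0, 1, 2] -> all become 0
    def remap(cid):
        if cid in group:
            return group[0]
        # Shift remaining ids down by the number of removed ids below them.
        return cid - sum(1 for g in group[1:] if g < cid)
    for txt in Path(label_dir).glob("*.txt"):
        lines = []
        for line in txt.read_text().splitlines():
            parts = line.split()
            if parts:
                lines.append(" ".join([str(remap(int(parts[0])))] + parts[1:]))
        txt.write_text("\n".join(lines) + "\n")
```

Remember to shrink classes= in each [yolo] layer, the filters= before them, and the .names/.data files to match the reduced class count.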

So all in all i guess I've to merge class 0,1,2 to one single class as they are way to identical for yolo training to distinguish between them. I'll re-train with this assumption and come back to you if it finally solved my problem (for the distinguishing from on big class to 3 subclasses afterwards i guess i have to use some other techniques like color detection and seperate the classes with that kind of technique again)