AlexeyAB / darknet

YOLOv4 / Scaled-YOLOv4 / YOLO - Neural Networks for Object Detection (Windows and Linux version of Darknet )
http://pjreddie.com/darknet/

Larger Class Size Custom Training from scratch #7090

Closed github-jeff closed 3 years ago

github-jeff commented 3 years ago

The training is happily chugging along, but I was curious about when I would start to see some learning progress. I was looking around for an equation or estimate of the number of iterations needed before positive mAP appears. Is there an equation to estimate the number of iterations needed given a dataset and number of classes? If more values are needed to get an estimate, assume all data to train the model is defined or could be defined easily enough.

In this particular example, I am training on 257 classes, each with no fewer than 1,000 bounding boxes, and several classes exceed 10,000 boxes. The total number of training images is 32,000, usually with many bounding boxes per image (only 1 class is consistently overlapping). When I was training in a similar situation, but with 237 classes and no overlapping, I began to see learning progress around 20K iterations. As of right now, I have crossed the 50K iteration mark with no hint of any learning yet. Should I scrap this model and try something smaller?

Lastly, I did most of the training for the 237-class model on an AWS V100 machine. This time I am using two smaller NVIDIA M2000 cards with 4 GB each. It is chugging along well enough at ~40 seconds per iteration, as opposed to the AWS machine, which was closer to 23 seconds per iteration. The length of time is OK since I'm not really paying for GPU time, but I'm curious whether learning is likely to never happen or whether I just need to wait this out some more.

Regards with thanks, Jeff.

stephanecharette commented 3 years ago

I cannot answer your questions as I really don't know. But since you are training with a large number of images and many iterations, I thought it might make sense to point this out if you want to shave some time off the training: https://www.ccoderun.ca/programming/darknet_faq/#time_to_train

github-jeff commented 3 years ago

I am currently training on a default-ish form of the yolov4 cfg. Below are the salient points. The subdivisions are higher than with the V100 AWS machine given the lower amount of GPU RAM (4 GB) available on my local machine. The training has been running for about 10 days, with the full run slated to finish in about 2 months unless there is no further progress, in which case I will end it manually.

As I mentioned above, the training is progressing at approximately 10K iterations every two days, or ~40 s per iteration, using two mid-level GPU cards. Is that about right for a 985x720 image at 64/40 (batch/subdivisions), or is that alarmingly slow? We are coming up on 60K iterations at this point, and mAP is still flat zero, with the loss oscillating between 118 and 144.

One of the key points in your link (thank you, btw) is a reference to setting all the dataset images to the correct width x height prior to training; in this case, 416x416. This was a question I had during the research phase. If I downsample to 416x416 do I lose pixel resolution that could affect image recognition after training, or does that not really matter? During training it appears to scale up to 608 and back down to 384. I think this is set in the cfg file somewhere, but I am not sure where. Should I set all dataset images to 608 or 384? Also, the original images are not square; they are rectangular, specifically 985x720. Does the non-square aspect ratio matter? If I make them square does that present a problem?

Lastly, if I do resize the dataset, can I do this mid-training or would that adversely affect results? If mid-training is of no real concern, is it better to resize and force a rectangular image square, set the long edge to 416 making the image 416x304, or break the original image up into 416x416 square sections, keeping the higher resolution but feeding yolov4 smaller pieces?
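
In case it is useful to anyone following along, the pre-resize pass I have in mind looks roughly like the sketch below. This is not a darknet tool, just a Pillow-based one-off; the directory names and the .jpg extension are assumptions, and as far as I understand darknet will still stretch whatever it loads to the cfg width x height, so the point is mainly to avoid decoding and resizing large images over and over.

# Hypothetical pre-resize pass: stretch every image to the network size and
# copy the YOLO label files unchanged (their coordinates are normalized, so a
# uniform stretch does not invalidate them).
from pathlib import Path
import shutil
from PIL import Image   # assumption: Pillow is installed

NET_W, NET_H = 416, 416          # match width= / height= in the .cfg

src = Path("images")             # hypothetical source directory
dst = Path("images_416")
dst.mkdir(exist_ok=True)

for img_path in src.glob("*.jpg"):               # extension is an assumption
    with Image.open(img_path) as im:
        im.resize((NET_W, NET_H)).save(dst / img_path.name)
    label = img_path.with_suffix(".txt")
    if label.exists():
        shutil.copy(label, dst / label.name)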

Training

batch=64
subdivisions=40
width=416
height=416
channels=3
momentum=0.949
decay=0.0005
angle=0
saturation = 1.5
exposure = 1.5
hue=.1
learning_rate=0.001
burn_in=1000
max_batches = 514000
policy=steps
steps=411200,462600
scales=.1,.1

github-jeff commented 3 years ago

Update-1: I dug into the details on the reduced resolution and it might have been the problem. The rectangular 985x720 images were just too pixelated at 416x416 to be of any use detection-wise. That might be why the AWS machine was able to train the first model at 608x608 (probably a better resolution if I had optimized). My local machine was only running 416x416.

I slowly brought the resolution down in the cfg file from the original resolution, keeping as close to the aspect ratio as possible, in increments of 32. I was able to run 544x416 (width x height) with a 64/34 batch/subdivisions ratio. The rewritten_bbox moved up to 3.6% from ~0.30% right away, and the iteration time decreased to 25 seconds from ~40 seconds per iteration. If I recall correctly, AWS was around 1.3% at ~23 seconds.

Note: I also adjusted the anchors accordingly.

So I have four questions:

1) Why can't I train on full-resolution images? There does not seem to be any combination of batch/subdivisions that allows full resolution. For example, if I can run 544x416 @ 64/32, then why can't I run 992x736 @ 32/16 or even 32/32? What am I missing here? This should be straightforward resource allocation, but the math (or my cryptic understanding of it) is not working out; see the rough scaling sketch after this list. YOLOv4 also does not print the memory requirement before the out-of-memory error, which makes it hard to guess at the next cfg setting. Hopefully there is a better way.

2) Is a higher rewritten_bbox % a good thing? The model is oscillating between 3.7% and 3.9%.

3) The learning rate in my cfg is 0.001, but it appears as 0.002 in the output. Does this change automatically? Is a higher learning rate better?

4) Should I start over now that I have changed the training image resolution?
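
For question 1, the back-of-the-envelope rule I have been assuming (only a heuristic, not darknet's actual allocator) is that activation memory scales roughly with width x height x mini-batch, where mini-batch = batch / subdivisions, on top of a fixed cost for the weights. Calibrated against the 544x416 @ 64/32 case above purely for extrapolation:

# Rough scaling heuristic, not darknet's real memory accounting: activations
# scale with width * height * mini_batch, where mini_batch = batch // subdivisions.
def mini_batch(batch, subdivisions):
    return max(1, batch // subdivisions)

def relative_memory(w, h, batch, subdivisions, ref=(544, 416, 64, 32)):
    rw, rh, rb, rs = ref
    return (w * h * mini_batch(batch, subdivisions)) / (rw * rh * mini_batch(rb, rs))

# Even with a mini-batch of 1, a single 992x736 image carries ~3.2x the pixels
# of a 544x416 image, so it still needs ~1.6x the activations of the
# 544x416 @ 64/32 case (which processes 2 images per step):
print(relative_memory(992, 736, 32, 32))   # ~1.61
print(relative_memory(544, 416, 64, 32))   # 1.0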

github-jeff commented 3 years ago

Update-2: I resumed with the new resolution and updated anchors from the previous best weights (~50K iterations). It ran for another 6K iterations or so, but I received a very strange mAP result. @stephanecharette should I start over? See below:

(next mAP calculation at 57783 iterations) 
 Last accuracy mAP@0.5 = 897215575757946880.00 %, best = 897215575757946880.00 % 
 56044: 194.560196, 186.603851 avg loss, 0.002000 rate, 28.873265 seconds, 3810992 images, 1877.503294 hours left

The previous mAP for this model was a flat 0 after nearly 65K iterations, before I started trying something else. rewritten_bbox is holding in the 3.7-3.9% range after 6K iterations.

AlexeyAB commented 3 years ago

There was a bug. I resolved it.

github-jeff commented 3 years ago

Great. TY. Is this something where I can pause training, patch, and resume, or should I recompile and restart the training? The OS is Ubuntu.

Also what was the bug?

AlexeyAB commented 3 years ago

scale_xy was applied to objectness instead of x,y: https://github.com/AlexeyAB/darknet/commit/b25c2c6cbdef3a849fd1f17eddfb5aa1387d868d#diff-a191a7d286ab1bacf527ae4b5edfbad6951b06a4d80685393577af64eb8e8a8fR1192

It's better to start training from the beginning.

github-jeff commented 3 years ago

Update-3: There was a handy reference chart for memory usage here. Within that reference, one can glean a few key points which are detailed in several papers but do not necessarily translate for new modelers.

In my overly simplistic summation, it basically comes down to: more layers = more GPU memory demand. See the layer summation chart here.

As such, it follows that 64/64 (batch/subdivisions) is the lowest memory allocation possible for a given number of layers in the cfg file and resolution (width x height).

Also, it does not matter how many video cards you have, as GPU memory is not a shared pool across cards. More GPU cards increase the number of concurrent iterations; a single iteration must still fit on one video card.

I was attempting to train on yolov4-custom.cfg, which is 162 layers. In the minimum configuration (64/64) at 416x416 resolution you need 4.2 GB of GPU memory. That works for my hardware, but when I increased to 608x608 I needed 6.9 GB. I do not have that, so the training, as expected, errors out with an out-of-memory condition.

Heading over to the handy layer column in the second link, I simply went down the list with an ever-decreasing layer count until I got to yolo-tiny, which is 38 layers instead of 162. By stepping down the configuration from 64/64, I was able to train with the full image resolution of 992x736 in a 64/16 configuration. Since I have two GPUs my iterations, as expected, count by 2, and each cycle takes about 12 seconds.

Once I found the optimal memory configuration, I went back and adjusted the anchors using the following command:

./darknet detector calc_anchors data/obj.data -num_of_clusters 9 -width 992 -height 736

The updated anchors reduced the rewritten_bbox from 3.9% to 2.7%.

The goal in optimization appears to be determining the fewest layers necessary to train the highest-confidence model that achieves the specific goal. I know this sounds obvious now, but it was not very clear before.

What is still fuzzy to me is:

  1. How does image resolution vs. layers play into a trained model? For example, is higher image resolution better, or are more layers better? The answer is of course both (and it probably depends on the goal/dataset), but what if you cannot have both, and even if you could, do you really need it? I have not yet figured out how to specify the layers in the cfg file. If someone has a good tutorial, please share it.

  2. Should rewritten_bbox % be as low as possible, or ideally zero? Is there a threshold below which further reductions do not really affect the results? i.e., anything less than 10% is great, less than 5% is whipped cream, less than 1% is the cherry, but all you really care about is the ice cream, so anything less than 10% will be just fine.

AlexeyAB commented 3 years ago
  1. The optimal number of layers and resolution depend on the dataset. The smaller the objects, the higher the resolution required. The larger the objects, the more layers required. There is an article on choosing the optimal number of layers, filters and resolution for the MS COCO dataset: https://arxiv.org/pdf/1911.09070.pdf

  2. It depends on what accuracy and speed you want. To reduce rewritten_bbox %, just increase the resolution and/or move some masks from [yolo] layers with low resolution to [yolo] layers with higher resolution, and train. Also, iou_thresh=1 may reduce rewritten_bbox %.

github-jeff commented 3 years ago

Thank you, that makes perfect sense. Anybody following this issue should read that article.

From a practical point of view it's unclear what constitutes a large or small object given an input image. Is small anything that occupies less than, say, 5% of an image by area? What would the minimum resolution need to be for an object occupying X% of a given input image to be trained with mAP north of 85%, based on pixel area? I see that at some point the importance of resolution falls away in favor of additional layers, but how many layers, given an object size and resolution, remains elusive. I'm sure I could write a little something to crawl through a labeled dataset and figure out an optimal recommendation for boxes per class by size, etc., based on available system resources.
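
The label-crawling script I have in mind would be something like the sketch below: just a rough summary of how much of the image each class's boxes occupy, to get a feel for "small" vs. "large" objects. The label directory is an assumption.

# Hedged sketch: per class, report how many boxes there are and the median
# fraction of the image they cover, using the normalized YOLO label format
# (class cx cy w h). The directory layout is assumed.
import glob
from collections import defaultdict

areas = defaultdict(list)   # class id -> list of relative box areas (0..1)
for txt in glob.glob("data/obj/*.txt"):
    with open(txt) as f:
        for line in f:
            parts = line.split()
            if len(parts) != 5:
                continue
            cls, _, _, w, h = parts
            areas[int(cls)].append(float(w) * float(h))

for cls in sorted(areas):
    a = sorted(areas[cls])
    median = a[len(a) // 2]
    print(f"class {cls:3d}: boxes={len(a):6d}  median area = {median:.2%} of image")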

Maybe a hardware led method of going about object training is more in research than application.

github-jeff commented 3 years ago

Update-4: nan question

(next mAP calculation at 12452 iterations) 
 11524: 47.876884, 48.406509 avg loss, 0.005220 rate, 21.247497 seconds, 1475072 images, 978.731018 hours left
Loaded: 0.000052 seconds
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 30 Avg (IOU: 0.000000), count: 157, class_loss = 77.000000, iou_loss = nan, total_loss = nan 
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 30 Avg (IOU: 0.000000), count: 164, class_loss = 79.499992, iou_loss = nan, total_loss = nan 
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 37 Avg (IOU: 0.000000), count: 49, class_loss = 24.500000, iou_loss = 0.000015, total_loss = 24.500015 
 total_bbox = 18482997, rewritten_bbox = 2.318639 %

Is

iou_loss = nan, total_loss = nan

normal after 12K iterations, or do I have a problem that needs addressing?

AlexeyAB commented 3 years ago

Do you get Nan for each iteration or only for some of these?

github-jeff commented 3 years ago

I would say all start-of-iteration headers and most sub-steps have something with nan in them.

(next mAP calculation at 12452 iterations) 
 11650: 48.986256, 48.644478 avg loss, 0.005220 rate, 11.557267 seconds, 1491200 images, 978.636895 hours left
Loaded: 0.000059 seconds
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 30 Avg (IOU: 0.000000), count: 119, class_loss = 56.999996, iou_loss = nan, total_loss = nan 
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 30 Avg (IOU: 0.000000), count: 148, class_loss = 71.999992, iou_loss = nan, total_loss = nan 
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 37 Avg (IOU: 0.000000), count: 40, class_loss = 20.000000, iou_loss = 0.000013, total_loss = 20.000013 
 total_bbox = 18685040, rewritten_bbox = 2.318604 % 
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 37 Avg (IOU: 0.000000), count: 51, class_loss = 25.000000, iou_loss = 0.000015, total_loss = 25.000015 
 total_bbox = 18696438, rewritten_bbox = 2.322539 % 
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 30 Avg (IOU: 0.000000), count: 164, class_loss = 80.500000, iou_loss = nan, total_loss = nan 
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 37 Avg (IOU: 0.000000), count: 74, class_loss = 37.000004, iou_loss = 0.000023, total_loss = 37.000027 
 total_bbox = 18685278, rewritten_bbox = 2.318579 % 
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 30 Avg (IOU: 0.000000), count: 153, class_loss = 72.500000, iou_loss = nan, total_loss = nan 
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 37 Avg (IOU: 0.000000), count: 65, class_loss = 31.999998, iou_loss = 0.000021, total_loss = 32.000019 
 total_bbox = 18696656, rewritten_bbox = 2.322560 % 
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 30 Avg (IOU: 0.000000), count: 147, class_loss = 70.500008, iou_loss = nan, total_loss = nan 
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 37 Avg (IOU: 0.000000), count: 88, class_loss = 44.000004, iou_loss = 0.000019, total_loss = 44.000023 
 total_bbox = 18685513, rewritten_bbox = 2.318583 % 
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 30 Avg (IOU: 0.000000), count: 129, class_loss = 61.500000, iou_loss = nan, total_loss = nan 
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 37 Avg (IOU: 0.000000), count: 65, class_loss = 32.500000, iou_loss = 0.000015, total_loss = 32.500015 
 total_bbox = 18696850, rewritten_bbox = 2.322562 % 
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 30 Avg (IOU: 0.000000), count: 148, class_loss = 71.500008, iou_loss = nan, total_loss = nan 
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 37 Avg (IOU: 0.000000), count: 58, class_loss = 28.500000, iou_loss = 0.000021, total_loss = 28.500021 
 total_bbox = 18685719, rewritten_bbox = 2.318583 % 
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 30 Avg (IOU: 0.000000), count: 136, class_loss = 65.000000, iou_loss = nan, total_loss = nan 
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 37 Avg (IOU: 0.000000), count: 47, class_loss = 23.500002, iou_loss = 0.000013, total_loss = 23.500015 
 total_bbox = 18697033, rewritten_bbox = 2.322572 % 
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 30 Avg (IOU: 0.000000), count: 146, class_loss = 69.999992, iou_loss = nan, total_loss = nan 
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 37 Avg (IOU: 0.000000), count: 66, class_loss = 33.000000, iou_loss = 0.000023, total_loss = 33.000023 
 total_bbox = 18685931, rewritten_bbox = 2.318584 % 
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 30 Avg (IOU: 0.000000), count: 129, class_loss = 63.499996, iou_loss = nan, total_loss = nan 
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 37 Avg (IOU: 0.000000), count: 39, class_loss = 19.000002, iou_loss = 0.000008, total_loss = 19.000010 
 total_bbox = 18697201, rewritten_bbox = 2.322567 % 
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 30 Avg (IOU: 0.000000), count: 143, class_loss = 68.500008, iou_loss = nan, total_loss = nan 
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 37 Avg (IOU: 0.000000), count: 54, class_loss = 26.999998, iou_loss = 0.000019, total_loss = 27.000017 
 total_bbox = 18686128, rewritten_bbox = 2.318576 % 
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 30 Avg (IOU: 0.000000), count: 124, class_loss = 59.500004, iou_loss = nan, total_loss = nan 
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 37 Avg (IOU: 0.000000), count: 43, class_loss = 21.500000, iou_loss = 0.000010, total_loss = 21.500010 
 total_bbox = 18697368, rewritten_bbox = 2.322573 % 
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 30 Avg (IOU: 0.000000), count: 145, class_loss = 71.999992, iou_loss = nan, total_loss = nan 
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 37 Avg (IOU: 0.000000), count: 41, class_loss = 20.500002, iou_loss = 0.000013, total_loss = 20.500015 
 total_bbox = 18686314, rewritten_bbox = 2.318552 % 
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 30 Avg (IOU: 0.000000), count: 142, class_loss = 66.000008, iou_loss = nan, total_loss = nan 
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 37 Avg (IOU: 0.000000), count: 62, class_loss = 30.499998, iou_loss = 0.000021, total_loss = 30.500019 
 total_bbox = 18697572, rewritten_bbox = 2.322606 % 
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 30 Avg (IOU: 0.000000), count: 154, class_loss = 70.999992, iou_loss = nan, total_loss = nan 
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 37 Avg (IOU: 0.000000), count: 103, class_loss = 50.500000, iou_loss = 0.000034, total_loss = 50.500034 
 total_bbox = 18686571, rewritten_bbox = 2.318590 % 
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 30 Avg (IOU: 0.000000), count: 141, class_loss = 69.500000, iou_loss = nan, total_loss = nan 
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 37 Avg (IOU: 0.000000), count: 60, class_loss = 29.500000, iou_loss = 0.000021, total_loss = 29.500021 
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 30 Avg (IOU: 0.000000), count: 136, class_loss = 63.499996, iou_loss = nan, total_loss = nan 
 total_bbox = 18697773, rewritten_bbox = 2.322598 % 
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 37 Avg (IOU: 0.000000), count: 47, class_loss = 23.500002, iou_loss = 0.000010, total_loss = 23.500011 
 total_bbox = 18686754, rewritten_bbox = 2.318589 % 
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 30 Avg (IOU: 0.000000), count: 126, class_loss = 56.000004, iou_loss = nan, total_loss = nan 
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 30 Avg (IOU: 0.000000), count: 161, class_loss = 77.500000, iou_loss = nan, total_loss = nan 
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 37 Avg (IOU: 0.000000), count: 73, class_loss = 36.500000, iou_loss = 0.000023, total_loss = 36.500023 
 total_bbox = 18697972, rewritten_bbox = 2.322637 % 
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 37 Avg (IOU: 0.000000), count: 64, class_loss = 31.999998, iou_loss = 0.000021, total_loss = 32.000019 
 total_bbox = 18686979, rewritten_bbox = 2.318582 % 

github-jeff commented 3 years ago

Update-5: Still chugging along but without much change. Still getting nans, and there has not been much progress on mAP or on loss in general. It's likely still early, as it's only at 25K iterations, but I'm starting to get concerned. I may need to optimize the layers in a truly custom cfg.

 (next mAP calculation at 24904 iterations) 
 24444: 48.392509, 48.727810 avg loss, 0.005220 rate, 12.469223 seconds, 3128832 images, 786.661219 hours left
Loaded: 0.000039 seconds
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 30 Avg (IOU: 0.000000), count: 162, class_loss = 76.000008, iou_loss = nan, total_loss = nan 
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 30 Avg (IOU: 0.000000), count: 123, class_loss = 60.499996, iou_loss = nan, total_loss = nan 
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 37 Avg (IOU: 0.000000), count: 83, class_loss = 41.500000, iou_loss = 0.000019, total_loss = 41.500019 
 total_bbox = 39138945, rewritten_bbox = 2.316580 % 

I am also getting quite a few data/labels entries in bad.list:

data/labels/32_7.png
data/labels/33_7.png
data/labels/34_7.png
...
data/labels/124_7.png
data/labels/125_7.png
data/labels/126_7.png

They do not exist in obj.names or in any of the *.txt files (train, valid, or test). However, they do exist in ../darknet/build/darknet/x64/data/labels, but that directory is not referenced in the cfg, the command, or any user-editable configuration file that I know about.

Digging deeper, the reference appears to come from ../src/image.c around line 267:

image **load_alphabet()
{
    /* Loads the glyph images used to draw label text on detection output:
       ASCII characters 32..126, each in 8 sizes (j = 0..7). */
    int i, j;
    const int nsize = 8;
    image** alphabets = (image**)xcalloc(nsize, sizeof(image*));
    for(j = 0; j < nsize; ++j){
        alphabets[j] = (image*)xcalloc(128, sizeof(image));
        for(i = 32; i < 127; ++i){
            char buff[256];
            /* The path is relative to the current working directory, which
               appears to be why missing data/labels/*.png files end up in bad.list. */
            sprintf(buff, "data/labels/%d_%d.png", i, j);
            alphabets[j][i] = load_image_color(buff, 0, 0);
        }
    }
    return alphabets;
}

github-jeff commented 3 years ago

Update-6: I think I am going to stop this model. It failed to progress after 45K iterations, and the nans still have not gone away. I will likely need more layers. Any guidance on the cfg would be quite helpful. Also, if there is a bug fix coming for this, please let me know so I can wait before trying the next cfg version. In the meantime I'll close the ticket.

(next mAP calculation at 46412 iterations) 
 Last accuracy mAP@0.5 = 0.00 %, best = 0.00 % 
 45288: 49.212818, 48.699146 avg loss, 0.005220 rate, 11.795647 seconds, 5796864 images, 895.675891 hours left
Loaded: 0.000066 seconds
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 30 Avg (IOU: 0.000000), count: 165, class_loss = 79.499992, iou_loss = nan, total_loss = nan 
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 30 Avg (IOU: 0.000000), count: 144, class_loss = 70.500008, iou_loss = nan, total_loss = nan 
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 37 Avg (IOU: 0.000000), count: 63, class_loss = 31.000000, iou_loss = 0.000021, total_loss = 31.000021 
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 37 Avg (IOU: 0.000000), count: 90, class_loss = 44.500000, iou_loss = 0.000019, total_loss = 44.500019 
 total_bbox = 72515647, rewritten_bbox = 2.318375 % 
 total_bbox = 72483836, rewritten_bbox = 2.314346 % 
github-jeff commented 3 years ago

Update-7: I was able to run csresnext50-panet-spp-original-optimal.cfg at a resolution of 608x448 with 64/40 batch/subdivisions and 257 classes on a 4 GB video card (2x). The iterations are slower at ~57 seconds, run in pairs, rewritten_bbox is quite low at ~0.2%, and I am not getting any nan values, even at the start.

(next mAP calculation at 1811 iterations) 
 4: 9925.780273, 9964.247070 avg loss, 0.000000 rate, 58.372452 seconds, 320 images, 4047.617760 hours left
Loaded: 0.000057 seconds
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 115 Avg (IOU: 0.297346), count: 193, class_loss = 22210.791016, iou_loss = 975.906250, total_loss = 23186.697266 
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 115 Avg (IOU: 0.294278), count: 192, class_loss = 22418.203125, iou_loss = 818.537109, total_loss = 23236.740234 
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 126 Avg (IOU: 0.317007), count: 109, class_loss = 9270.649414, iou_loss = 96.716797, total_loss = 9367.366211 
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 126 Avg (IOU: 0.304318), count: 154, class_loss = 12313.989258, iou_loss = 135.184570, total_loss = 12449.173828 
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 137 Avg (IOU: 0.314579), count: 20, class_loss = 1906.412109, iou_loss = 1.984253, total_loss = 1908.396362 
 total_bbox = 20858, rewritten_bbox = 0.287659 % 

With many more layers (138), I am hopeful this configuration will learn. I will update when there is progress.

github-jeff commented 3 years ago

Update-8: I started getting nan on total_loss again, and the learning rate seems to have paused at 0.005220, which was the same value as with tiny in the previous attempt. Odd coincidence, or is that a typical step in the process? Otherwise, the average loss has been steadily decreasing.

(next mAP calculation at 1811 iterations) 
 1808: 144.769989, 155.402222 avg loss, 0.005220 rate, 33.628986 seconds, 144640 images, 2991.589904 hours left
Loaded: 0.000071 seconds
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 115 Avg (IOU: 0.000000), count: 219, class_loss = 438.000000, iou_loss = 0.000092, total_loss = 438.000092 
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 126 Avg (IOU: 0.000000), count: 76, class_loss = 152.000000, iou_loss = nan, total_loss = nan 
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 115 Avg (IOU: 0.000000), count: 182, class_loss = 364.000031, iou_loss = 0.000092, total_loss = 364.000122 
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 137 Avg (IOU: 0.000000), count: 7, class_loss = 14.000001, iou_loss = 0.000088, total_loss = 14.000089 
 total_bbox = 7934224, rewritten_bbox = 0.188109 % 
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 126 Avg (IOU: 0.000000), count: 93, class_loss = 184.000000, iou_loss = nan, total_loss = nan 
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 137 Avg (IOU: 0.000000), count: 11, class_loss = 22.000000, iou_loss = 0.000212, total_loss = 22.000212 
 total_bbox = 7905297, rewritten_bbox = 0.184307 % 
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 115 Avg (IOU: 0.000000), count: 148, class_loss = 296.000000, iou_loss = 0.000153, total_loss = 296.000153 
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 126 Avg (IOU: 0.000000), count: 74, class_loss = 148.000015, iou_loss = nan, total_loss = nan 
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 137 Avg (IOU: 0.000000), count: 7, class_loss = 14.000001, iou_loss = 0.000107, total_loss = 14.000108 
 total_bbox = 7934453, rewritten_bbox = 0.188104 % 
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 115 Avg (IOU: 0.000000), count: 177, class_loss = 354.000000, iou_loss = 0.000153, total_loss = 354.000153 
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 126 Avg (IOU: 0.000000), count: 72, class_loss = 144.000000, iou_loss = nan, total_loss = nan 
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 137 Avg (IOU: 0.000000), count: 7, class_loss = 14.000001, iou_loss = 0.000019, total_loss = 14.000020 
 total_bbox = 7905553, rewritten_bbox = 0.184301 % 

I think it's a problem with the second yolo layer. See the lines with count 20 below: loss is reported in the third position but not in the second position, and most are not reporting loss in the second position.

v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 115 Avg (IOU: 0.000000), count: 130, class_loss = 260.000000, iou_loss = 0.000061, total_loss = 260.000061 
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 126 Avg (IOU: 0.000000), count: 103, class_loss = 206.000000, iou_loss = nan, total_loss = nan 
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 137 Avg (IOU: 0.000000), count: 20, class_loss = 40.000000, iou_loss = 0.000153, total_loss = 40.000153 

vs

v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 115 Avg (IOU: 0.000000), count: 112, class_loss = 224.000015, iou_loss = 0.000076, total_loss = 224.000092 
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 126 Avg (IOU: 0.000000), count: 20, class_loss = 40.000000, iou_loss = nan, total_loss = nan 
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 137 Avg (IOU: 0.000000), count: 3, class_loss = 6.000000, iou_loss = 0.000091, total_loss = 6.000092 
github-jeff commented 3 years ago

Update-9: It's still early, only 3.3K iterations, but the learning rate has stopped progressing at 0.005220, which I find curious given that the tiny cfg model's rate stopped at exactly the same value. See Update-6 above.

(next mAP calculation at 3623 iterations) 
 Last accuracy mAP@0.5 = 0.00 %, best = 0.00 % 
 3308: 177.661652, 156.024826 avg loss, 0.005220 rate, 74.013698 seconds, 264640 images, 3388.211679 hours left
Loaded: 0.000068 seconds
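
For my own notes on the rate question: as far as I understand darknet's policy=steps, the printed rate ramps up during burn_in and then simply sits at the configured learning_rate until the steps= milestones scale it down, so a plateau by itself is expected rather than a sign that learning has stalled. I still cannot account for the exact 0.005220 value (multi-GPU training may adjust it), and the power=4 burn-in exponent below is my assumption, so treat this as a sketch only.

# Sketch of the "steps" learning-rate policy as I understand it; power=4 and
# the absence of any multi-GPU adjustment are assumptions on my part.
def current_rate(iteration, learning_rate, burn_in, steps, scales, power=4):
    if iteration < burn_in:
        return learning_rate * (iteration / burn_in) ** power
    rate = learning_rate
    for step, scale in zip(steps, scales):
        if iteration >= step:
            rate *= scale
    return rate

# With learning_rate=0.002 and burn_in=2000, the rate climbs until iteration
# 2000 and then stays flat until the steps= milestones:
for it in (500, 1000, 2000, 10000):
    print(it, current_rate(it, 0.002, 2000, [411200, 462600], [0.1, 0.1]))
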
AlexeyAB commented 3 years ago

Do you use the latest version of Darknet?

github-jeff commented 3 years ago

I updated via git clone on 12/12. Is that the latest? Also, the learning rate is not progressing beyond 0.005220.

AlexeyAB commented 3 years ago

There was a bug fix on Dec 12, 2020. I don't know whether you are using it or not.

Can you show your chart.png files?


Do you train with flag -map ?

What command do you use for training?


(next mAP calculation at 24904 iterations) 24444: 48.392509, 48.727810 avg loss,

What mAP do you get? Can it detect anything?


64/40 batch/subdivision

batch/subdivisions should be an integer value.


I was attempting to train on yolov4-custom.cfg. Which is 162 layers. In the minimum configuration (64/64) in a 416x416 resolution you need 4.2gb of gpu memory. This scenario works for my hardware but when I increased to 608x608 I needed 6.9gb. I do not have that, so the training, as expected, errors with an out of memory condition.

Replace random= with resize=1.5 for each of the 3 [yolo] layers in the cfg-file.


I would say all start of iteration headers, and most sub-steps have something with nan in it.

It seems there is no problem: https://github.com/AlexeyAB/darknet#how-to-train-to-detect-your-custom-objects

Note: If during training you see nan values for avg (loss) field - then training goes wrong, but if nan is in some other lines - then training goes well.

github-jeff commented 3 years ago

(attached chart: chart_csresnext50-panet-spp-obj-v1)

I did not keep the other charts.

Ubuntu 20.04 ./darknet detector train data/obj.data cfg/csresnext50-panet-spp-obj-v1.cfg -dont_show -map -gpus 0,1

There is only one random entry in the cfg file: random=1 at line 1036.

When I trained at 608x608 on yolov4-custom.cfg it worked fine, up to about 75% mAP. It started to converge around 20K cycles, if memory serves. In practice I was getting north of 90% confidence pretty much all the time. This was with just over 200 classes. I added more, but I am trying to do this locally instead of paying 4 billion per minute at AWS.

AlexeyAB commented 3 years ago

Try to download the latest darknet version and recompile. It seems you are using an old version.

Then show chart.png after about 20 000 iterations.

github-jeff commented 3 years ago

Ok, I'll kill this and give it another go. Your fork or pjreddie? I have been using your fork up to this point.

AlexeyAB commented 3 years ago

My fork.

Also, what cfg-file do you use currently?

Try to use this:

And what training command do you use?

github-jeff commented 3 years ago

I wiped out the old files and started from scratch via a git clone of your fork. I'll give csp a try. The training command will be as follows. I also tried tiny and tiny_3L (3L errored out oddly). I also trained with yolov4-custom on AWS, which worked fine, but it's too GPU-intensive for me to run locally at a high enough resolution for training.

./darknet detector train data/obj.data cfg/yolov4-csp.cfg -dont_show -map -gpus 0,1

AlexeyAB commented 3 years ago

Set burn_in=2000 learning_rate=0.002 since you train on 2 GPUs.

Use this pre-trained file https://github.com/AlexeyAB/darknet/releases/download/darknet_yolo_v4_pre/yolov4-csp.conv.142

And this training command: ./darknet detector train data/obj.data cfg/yolov4-csp.cfg yolov4-csp.conv.142 -dont_show -map -gpus 0,1

github-jeff commented 3 years ago

OK on the settings. For the pre-trained file: the objects I am attempting to detect are vastly different from those in the training file. I think I read somewhere that it's best to start from scratch if there are unlikely to be any matches. Was that an incorrect assumption?

For subdivisions you mentioned an integer. The memory-usage tables I have seen all use subdivisions divisible by 8. Granted, none have trials run at subdivisions=40; the next step down is always 32. But at 32 I run out of memory, so I tried 40 and that was OK. Is there no difference between 40 and 64?

I am also making these changes:

#batch=64
#subdivisions=8
#width=512
#height=512
batch=64
subdivisions=40
width=576
height=416
#learning_rate=0.001
#burn_in=1000
learning_rate=0.002
burn_in=2000

in three places (3X)

#filters=255
filters=786
activation=logistic

[yolo]
mask = 3,4,5
anchors = 12, 16, 19, 36, 40, 28, 36, 75, 76, 55, 72, 146, 142, 110, 192, 243, 459, 401
#classes=80
classes=257
#random=1
random=1.5

It's running.

stephanecharette commented 3 years ago

I also tried tiny and tiny_3L (3L errored out oddly) I also trained with yolov4-custom on AWS which worked fine but its too gpu intensive for me to run locally in a high enough resolution for training.

If yolov4-custom was too much for you, then yolov4-csp won't work either. The next step down from yolov4 is yolov4-tiny-3l. The next step down beyond that is yolov4-tiny.

See this part of the table from https://www.ccoderun.ca/darkmark/Configuration.html#ConfigNew:

(screenshot of the configuration table)

[...] the detection objects I am attempting to train are vastly different than those in the training file. I think I read somewhere that its best to start from scratch if there is unlikely to be any matches.

I hope what you mean is that the objects you want to both train and detect are vastly different from the 80-class MSCOCO in the pre-trained weights. In which case, you are correct that it serves little purpose to use pre-trained weights. You are better off starting from scratch and ignoring the first couple thousand iterations.

AlexeyAB commented 3 years ago

I think I read somewhere that its best to start from scratch if there is unlikely to be any matches. Was that an incorrect assumption?

This is a very vague concept of similar or not similar. Therefore, I would recommend using pre-trained weights almost always.

github-jeff commented 3 years ago

csp is currently running; see above for the cfg modifications. Honestly, I think when I was trying to get yolov4-custom to work locally I was mucking around with values I did not understand. My concern was that I could get yolov4-custom to run, but only at 416x416. I did not try rectangular, and it definitely did not learn. In any case, csp seems more intense, and I'm running it at higher than 416x416, so maybe?

At what point do I need to update anchors? I changed resolution in the cfg. Is that when I should also redo the anchors?

For images, I am attempting to run an object detector for surface defects of cosmetic products: scratches, scuffs, etc. as they come down the manufacturing line. Would pre-trained weights help with that task? It seems like COCO was for objects like people, dogs, bikes, etc.

stephanecharette commented 3 years ago

For images I am attempting to run an object detector for surface defects of cosmetic products. Scratches, scuffs, etc. as they come down the manufacturing line. Would pre-trained help with that task? It seems like coco was for objects like people, dog, bike, etc.

It sounds like you do the same kind of work as I do, Jeff. I also deal with things like conveyor belts and objects moving past a camera where I need to detect various things like missing parts, extra parts, wrong rotation, breaks, cracks, painting defects, etc. I've built dozens of neural networks this way, all with yolov3-tiny, yolov4-tiny, and yolov4-tiny-3l, and in each case, I've never once had to use the pre-trained weights.

github-jeff commented 3 years ago

Certainly sounds like it. I'm also hoping to train a robot to find clasps/buttons for case cycle testing. I certainly jumped into the deep end on this and am just sorting out the right methodology. Thank you for helping me out. I decided to skip the pre-trained weights for this go-around. So far so good, but I will clearly need to bump up the GPU hardware quite significantly. Please PM me if there is interest in participating in paid projects. Right now we are just trying all this out unofficially, but it's come up a few times now.

stephanecharette commented 3 years ago

Please PM me if there is interest in participating in paid projects.

Your profile has no contact information on it! See https://www.ccoderun.ca/ml/ for my work. Also see the Darknet/YOLO FAQ that I maintain: https://www.ccoderun.ca/programming/darknet_faq/

github-jeff commented 3 years ago

Thanks! I will reach out after the holidays as everything is pretty much shutting down now.

For this project, I hit a snag last night. I had to increase subdivisions to 64, as the first mAP calculation at 2K cycles put me over budget on resources and killed the training. The increase in subdivisions from 40 to 64 increased the cycle time from ~50 seconds to ~90 seconds. However, as you can see below, going from 40 to 64 dropped resource usage by ~1 GB. After the next mAP I will try 48 and 56 to see if I can shave off a few seconds per iteration. Also, no nan yet.

(screenshot of GPU memory usage)

stephanecharette commented 3 years ago

Doesn't subdivisions have to evenly divide the batch size? So if your batch size is 64, then the valid numbers that may be used for subdivisions would be 1, 2, 4, 8, 16, 32, and 64? You mention using values like 48 or 56, but...? I've never fully understood this. @AlexeyAB can you comment on what kind of values can be used for subdivisions?

After the next mAP I will try 48 and 56 to see if I can shave off a few seconds per iteration.

If you are interested in reducing the time required, make sure you read this: https://www.ccoderun.ca/programming/darknet_faq/#time_to_train

AlexeyAB commented 3 years ago

Just try to set batch=64 subdivisions=40 and look at the actual values. batch=64 subdivisions=40 is the same as batch=40 subdivisions=40; that's why training is faster. (screenshot)
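
In other words, a minimal sketch of the arithmetic as described above (the integer division is an inference from the screenshot, not something stated explicitly):

# Sketch: the per-step mini-batch appears to come from integer division, so a
# batch that is not a multiple of subdivisions is effectively rounded down.
def effective_config(batch, subdivisions):
    mini_batch = batch // subdivisions           # images processed per step
    effective_batch = mini_batch * subdivisions  # images per weight update
    return mini_batch, effective_batch

print(effective_config(64, 40))  # (1, 40) -> behaves like batch=40 subdivisions=40
print(effective_config(64, 32))  # (2, 64) -> the full batch of 64 is kept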

github-jeff commented 3 years ago

From the wiki, we learn the following. The question is two-fold: what is an anchor, and when do anchors need to be updated in the cfg file? For this model, the resolution was changed to rectangular from what the cfg file originally specified. My hunch is that the anchors should be updated accordingly, but what they do is a bit opaque.

mask = 3,4,5 - indexes of anchors which are used in this [yolo]-layer

anchors = 10,13, 16,30, 33,23, 30,61, 62,45, 59,119, 116,90, 156,198, 373,326 - initial sizes of bounding boxes that will be adjusted

num=9 - total number of anchors
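
My current (possibly wrong) mental model: the anchors are prior box sizes expressed in network-input pixels, and calc_anchors clusters the labeled box sizes at the configured width x height, which is why changing the resolution makes recalculating them sensible. The sketch below shows roughly what that clustering does; darknet uses an IoU-based k-means, so plain k-means over (w, h) is a simplification, and scikit-learn plus the label path are assumptions.

# Rough approximation of calc_anchors: cluster labeled box sizes (scaled to the
# network input size) and use the cluster centers, sorted by area, as anchors.
import glob
import numpy as np
from sklearn.cluster import KMeans   # assumption: scikit-learn is available

NET_W, NET_H, NUM_ANCHORS = 576, 416, 9

sizes = []
for txt in glob.glob("data/obj/*.txt"):          # hypothetical label directory
    with open(txt) as f:
        for line in f:
            parts = line.split()
            if len(parts) != 5:
                continue
            _, _, _, w, h = map(float, parts)    # class cx cy w h (normalized)
            sizes.append((w * NET_W, h * NET_H))

km = KMeans(n_clusters=NUM_ANCHORS, n_init=10).fit(np.array(sizes))
anchors = sorted(km.cluster_centers_.round().astype(int).tolist(),
                 key=lambda wh: wh[0] * wh[1])
print("anchors =", ", ".join(f"{w},{h}" for w, h in anchors))
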
github-jeff commented 3 years ago

It appears that during an mAP calculation the resource usage is actually lower; on my equipment, ~500 MB less per GPU card. Perhaps there is an initial startup overage which I did not capture the first time around and which killed the training. In any case this is happily moving along, and we will just see what happens. So far no nan.

(next mAP calculation at 3451 iterations) 
 2649: 3244.609375, 3629.767578 avg loss, 0.004000 rate, 93.576549 seconds, 339072 images, 4160.888061 hours left
Loaded: 0.000055 seconds
v3 (iou loss, Normalizer: (iou: 0.05, obj: 4.00, cls: 0.50) Region 144 Avg (IOU: 0.132707), count: 176, class_loss = 11399.327148, iou_loss = -8518.097656, total_loss = 2881.229248 
v3 (iou loss, Normalizer: (iou: 0.05, obj: 4.00, cls: 0.50) Region 144 Avg (IOU: 0.121238), count: 202, class_loss = 13232.225586, iou_loss = -9893.580078, total_loss = 3338.645020 
v3 (iou loss, Normalizer: (iou: 0.05, obj: 1.00, cls: 0.50) Region 159 Avg (IOU: 0.102635), count: 121, class_loss = 42.777855, iou_loss = 2.791077, total_loss = 45.568932 
v3 (iou loss, Normalizer: (iou: 0.05, obj: 0.40, cls: 0.50) Region 174 Avg (IOU: 0.052645), count: 28, class_loss = 2.884825, iou_loss = 4.333503, total_loss = 7.218328 
 total_bbox = 11289773, rewritten_bbox = 0.271130 % 
github-jeff commented 3 years ago

Is there an out-file option? Something simple like reporting the average loss to a text file. It would be useful in the early stages to see whether any progress occurs outside of the graph.
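
I am not aware of a built-in flag for this (there may well be one), so for now I am redirecting the console output to a file, e.g. appending 2>&1 | tee train.log to the training command, and parsing it afterwards. A rough parser for the per-iteration header lines shown in these updates:

# Pull iteration, loss, avg loss and rate out of a saved darknet training log
# (train.log is assumed to be a capture of the console output) into a CSV.
import csv
import re

pattern = re.compile(r"^\s*(\d+):\s*(\S+),\s*(\S+) avg loss,\s*([\d.]+) rate")

with open("train.log") as log, open("loss.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["iteration", "loss", "avg_loss", "rate"])
    for line in log:
        m = pattern.match(line)
        if m:
            writer.writerow(m.groups())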

Also, a very strange and quite large mAP. I am also not entirely sure I have seen a negative iou_loss before. In broad strokes, there is no nan yet and total_loss has been steadily falling, so I'm going to let it go and see how this develops, but if you have any insight into what is happening here it would be very much appreciated.

(next mAP calculation at 5715 iterations) 
 Last accuracy mAP@0.5 = 4100135990291454421362255331328.00 %, best = 4100135990291454421362255331328.00 % 
 4979: 3442.454102, 3624.685547 avg loss, 0.004000 rate, 33.254574 seconds, 637312 images, 4058.260618 hours left
Loaded: 0.000074 seconds
v3 (iou loss, Normalizer: (iou: 0.05, obj: 4.00, cls: 0.50) Region 144 Avg (IOU: 0.087474), count: 126, class_loss = 8132.995117, iou_loss = -6096.312500, total_loss = 2036.682861 
v3 (iou loss, Normalizer: (iou: 0.05, obj: 1.00, cls: 0.50) Region 159 Avg (IOU: 0.034224), count: 38, class_loss = 14.877279, iou_loss = 0.060805, total_loss = 14.938085 
v3 (iou loss, Normalizer: (iou: 0.05, obj: 0.40, cls: 0.50) Region 174 Avg (IOU: 0.314343), count: 3, class_loss = 0.104673, iou_loss = 0.199073, total_loss = 0.303746 
 total_bbox = 29096912, rewritten_bbox = 0.266918 % 
github-jeff commented 3 years ago

The learning rate seems to have halted at 0.004, so I took the opportunity to update the anchors to see if that helps things.


./darknet detector calc_anchors data/obj.data -num_of_clusters 9 -width 576 -height 416
 CUDA-version: 10010 (11010), cuDNN: 7.6.5, GPU count: 2  
 OpenCV version: 4.5.0

 num_of_clusters = 9, width = 576, height = 416 
 read labels from 18114 images 
 loaded      image: 18114    box: 1104954
 all loaded. 

 calculating k-means++ ...

 iterations = 4 

counters_per_class = 1920, 989, 931, 1825, 908, 917, 1797, 858, 939, 1813, 879, 934, 1830, 938, 892, 1858, 913, 945, 1934, 989, 945, 1833, 948, 885, 1832, 936, 896, 1925, 977, 948, 1835, 908, 927, 1823, 902, 921, 1988, 1032, 956, 1887, 924, 963, 1771, 845, 926, 1905, 999, 906, 1850, 906, 944, 1854, 948, 906, 1848, 915, 933, 1953, 1030, 923, 1872, 956, 916, 1770, 887, 883, 1873, 932, 941, 1959, 981, 978, 1883, 957, 926, 1941, 985, 956, 1847, 950, 897, 1822, 904, 918, 1846, 904, 942, 1874, 905, 969, 1849, 893, 956, 1809, 901, 908, 1795, 886, 909, 1893, 949, 944, 1776, 896, 880, 1774, 890, 884, 1901, 947, 954, 1885, 931, 954, 1890, 979, 911, 1853, 956, 897, 1812, 933, 879, 1873, 973, 900, 1829, 912, 917, 1858, 924, 934, 1752, 846, 906, 1800, 876, 924, 1882, 945, 937, 1846, 912, 934, 1914, 1025, 889, 1854, 912, 942, 1824, 876, 948, 1844, 886, 958, 1835, 889, 946, 1812, 855, 957, 1862, 921, 941, 1968, 992, 976, 1873, 955, 918, 1957, 1048, 909, 1818, 881, 937, 1846, 938, 908, 1855, 894, 961, 1916, 990, 926, 1874, 933, 941, 1852, 927, 925, 1870, 910, 960, 1801, 911, 890, 1806, 881, 925, 1916, 987, 929, 1781, 880, 901, 1810, 865, 945, 1874, 889, 985, 1872, 934, 938, 1878, 962, 916, 1853, 903, 950, 1876, 937, 939, 1831, 941, 890, 1913, 990, 923, 1882, 890, 992, 12076, 6038, 144912, 18114, 18114, 0, 108684, 18114, 36228, 18114, 97908, 107492, 109931, 1158, 1260, 1215, 1266, 1105, 1150, 1194, 1181, 1192, 108684


 avg IoU = 91.13 % 

Saving anchors to the file: anchors.txt 
anchors =  15, 14,  11, 25,  11, 26,  70,  9,  35, 18,  53, 23,  82, 18,  35, 49, 409,180

(screenshot)

Resources spiked back up to north of 4 GB, so I am a little concerned this will error out on mAP again. We will see soon enough.

Other notable changes: rewritten_bbox jumped from ~0.25% to ~2.9%, and average loss dropped by about 1000. So that is something. Still trying to get a feel for this, and clearly not there yet.

Also, resuming from _last.weights reset the iterations back to about 1100, so I just restarted from scratch. Let's see how this adjustment goes.

github-jeff commented 3 years ago

Still chugging along. It did not crash after the 2K mAP calculation. Other updates: rewritten_bbox increased to ~5.25%, but average loss is about 1000 less than the previous model without updated anchors (~2700). There was also a detection during the mAP calculation at 0.23%. Yes, that is quite small, but it is also something that has not happened yet during local processing. So I am continuing, as all indications other than rewritten_bbox appear to be trending in a positive direction.

github-jeff commented 3 years ago

Is avg loss the average of each v3 output, i.e. total_loss, or is it the average of class_loss? If it's total_loss then something is up, as it should be less than 2000 but it is reporting closer to 3000. If it's class_loss then it is likely too low even at 3000. Perhaps some sort of weighted average? How is this calculated?
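
From my (unverified) reading of detector.c, the first number on the header line is the current iteration's loss and "avg loss" is a running exponential moving average of it, rather than an average of the per-[yolo]-layer class_loss or total_loss values, which might explain why the number lags the current loss. Worth double-checking against the current source; a sketch of that update:

# How I read the avg loss bookkeeping in detector.c (worth verifying): a 0.9/0.1
# exponential moving average of the per-iteration loss, seeded on the first step.
def update_avg_loss(avg_loss, loss):
    if avg_loss < 0:              # first iteration: seed the average
        return loss
    return avg_loss * 0.9 + loss * 0.1

avg = -1.0
for loss in (3244.6, 3100.2, 2950.7, 2833.1):
    avg = update_avg_loss(avg, loss)
    print(f"loss={loss:.1f}  avg_loss={avg:.1f}")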

 (next mAP calculation at 3132 iterations) 
 Last accuracy mAP@0.5 = 0.00 %, best = 0.00 % 
 2622: 1579.310303, 2917.290771 avg loss, 0.004000 rate, 127.155775 seconds, 335616 images, 4389.243433 hours left
Loaded: 0.000071 seconds
v3 (iou loss, Normalizer: (iou: 0.05, obj: 4.00, cls: 0.50) Region 144 Avg (IOU: 0.136669), count: 65, class_loss = 3229.023438, iou_loss = -2417.570557, total_loss = 811.452942 
v3 (iou loss, Normalizer: (iou: 0.05, obj: 1.00, cls: 0.50) Region 159 Avg (IOU: 0.106887), count: 157, class_loss = 12.806120, iou_loss = 4.809479, total_loss = 17.615599 
v3 (iou loss, Normalizer: (iou: 0.05, obj: 0.40, cls: 0.50) Region 174 Avg (IOU: 0.064179), count: 99, class_loss = 4.600789, iou_loss = 7.256286, total_loss = 11.857075 
 total_bbox = 36922453, rewritten_bbox = 5.331482 % 
github-jeff commented 3 years ago

After 5.5K iterations: I think we started learning here. mAP is at 0.39%, no nan, the learning rate is still at 0.004 (it's still early), and average loss is just over ~2.5K but steadily falling. I'm a little concerned about rewritten_bbox being ~5.3% and holding, but all signs still point in the right direction for now. Let's continue and see how this turns out.

(next mAP calculation at 6528 iterations) 
 Last accuracy mAP@0.5 = 0.39 %, best = 0.39 % 
 5488: 2778.578125, 2729.678223 avg loss, 0.004000 rate, 29.238770 seconds, 702464 images, 4484.422736 hours left
Loaded: 0.000059 seconds
v3 (iou loss, Normalizer: (iou: 0.05, obj: 4.00, cls: 0.50) Region 144 Avg (IOU: 0.095368), count: 129, class_loss = 6301.260254, iou_loss = -4724.479492, total_loss = 1576.781006 
v3 (iou loss, Normalizer: (iou: 0.05, obj: 1.00, cls: 0.50) Region 159 Avg (IOU: 0.052948), count: 109, class_loss = 9.830446, iou_loss = 0.450158, total_loss = 10.280604 
v3 (iou loss, Normalizer: (iou: 0.05, obj: 0.40, cls: 0.50) Region 174 Avg (IOU: 0.070994), count: 57, class_loss = 4.640491, iou_loss = 7.454715, total_loss = 12.095206 
 total_bbox = 76966922, rewritten_bbox = 5.403668 % 
github-jeff commented 3 years ago

After 9.5K iterations: Still looking good. mAP is holding at about 0.40%, but we are getting detections on quite a few classes at this point. Yes, they are mostly just below 50% confidence, but they are largely accurate. The bounding boxes are a little big, but all signs point to successful training at this point. Thank you for the help; we can safely close the ticket.