Low mAP with large dataset

berserker commented 5 years ago

I'm training a network with only 1 class on a large dataset, but I'm getting very low mAP.

This is my actual config:

A recent version of this repo https://github.com/AlexeyAB/darknet, GPU=1, CUDA10
Nvidia RTX 2080Ti 11gb for train (I'm actually on Windows)
Default yolov3.cfg configuration with only the required changes for 1 class as described here, attachment
A fairly large training dataset (~472000 images) and validation dataset (~50000 images). I generate the images as output from rendering software, so I can produce virtually any number of training/validation images. Here are some samples:
- Train sample 1:
- Train sample 1 with tagged regions (blue lines):
- Train sample 2:
- Train sample 2 with tagged regions (blue lines):
As you can see from the samples above, the target of the network is to identify those "vertical" lines (orientation within +/- 1°; slight deformation/distortion of the lines can also happen, and other training samples cover those cases). The lines are generally 1px wide (2px or 3px at most).
I'm currently using the default anchors; anyway, this is the calc anchors output and graph:

Now the problem: after ~250000 iterations I'm only getting ~27% mAP, as you can see in the following chart:

Questions:

1. Each train/validation image is 1024x1024px, and at detection time we will have the same input resolution. Could detection be affected by the network rescaling to 416x416? As far as I can see the details are still visible at that resolution. Here is a rescaled sample:
2. Is it a good idea in this scenario to enable random=1? (I'm currently using it, since the configuration is the "plain" yolov3.)
3. Is yolov3.cfg the best option in my case? Would you suggest another configuration that fits better (e.g. yolov3_5l.cfg, yolov3-spp.cfg)?
4. Would you suggest increasing the network resolution to improve precision/mAP? To what value? Please note that I need a good frame rate in the detection phase on mid-range hardware (i.e.: nvidia 2060ti). A 1024x1024 resolution would give very bad fps... is the "tiny" version at 1024x1024 more suitable in this case?
5. I'm getting lots of nans while training. Is this related to the fact that the region widths are very small (generally 1px)? I suspect that rescaling to 416x416 is the problem in this case, right? Here is a sample of the training output:

Thanks for your help!

AlexeyAB commented 5 years ago

@berserker Hi,

1. If your objects have width=1 pixel on a 1024x1024 image, then you should train the yolov3-tiny.cfg model with width=1024 height=1024. (Otherwise your objects will be removed/smoothed during resizing to 416x416.)
2. If all your Training/Validation/Test images have the same size, then you can train with random=0.
3. You should use width=1024 height=1024, so maybe yolov3-tiny.cfg or yolov3-tiny-3l.cfg.
4. Yes, you must increase the network resolution to 1024x1024. I also suggest using the default anchors, but with the first value of each anchor set to 1: anchors = 1,14, 1,27, 1,58, 1,82, 1,169, 1,319 instead of https://github.com/AlexeyAB/darknet/blob/7a854302efb7adba80d5e8a747ad5e5ec384a226/cfg/yolov3-tiny.cfg#L134
5. If the nan is not in the loss, then don't pay attention to it.
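
To make point 1 concrete, here is a minimal sketch of the arithmetic. It assumes a plain uniform rescale from the 1024x1024 image to the network input size; the function name `scaled_width` is hypothetical and not part of darknet.

```python
def scaled_width(object_px: float, image_size: int, network_size: int) -> float:
    """Width of an object, in network-input pixels, after a uniform resize."""
    return object_px * network_size / image_size


# Compare how wide a 1px, 2px and 3px line is at a few network resolutions.
for net in (416, 608, 1024):
    for width in (1, 2, 3):
        shrunk = scaled_width(width, image_size=1024, network_size=net)
        print(f"{width}px line at network size {net}: {shrunk:.2f}px")
```

At 416x416 a 1px line covers less than half of an input pixel, so it is blended with its neighbours during the resize; at 1024x1024 it keeps its full width.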

Also, if yolov3-tiny.cfg with width=1024 height=1024 turns out to be faster than you need, then you can trade that speed for accuracy. Just for example, use these 6 layers:

[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=512
size=3
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=512
size=3
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=512
size=3
stride=1
pad=1
activation=leaky

instead of these 2 layers: https://github.com/AlexeyAB/darknet/blob/7a854302efb7adba80d5e8a747ad5e5ec384a226/cfg/yolov3-tiny.cfg#L107-L121
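
As a rough way to see what this swap costs, the sketch below counts convolution weights for the 2-layer and the 6-layer variant. It assumes the layer feeding this block outputs 1024 channels (as in the stock yolov3-tiny.cfg; check your own cfg) and that the 2 replaced layers are one 256x1x1 / 512x3x3 pair. Biases and batch-norm parameters are ignored, and `conv_weights` is a hypothetical helper, not a darknet function.

```python
def conv_weights(in_channels, layers):
    """Count convolution weights for a stack of (filters, kernel_size) layers."""
    total = 0
    for filters, size in layers:
        total += in_channels * filters * size * size
        in_channels = filters  # output of this layer feeds the next one
    return total


PAIR = [(256, 1), (512, 3)]  # one 1x1 + 3x3 pair, as in the cfg snippet above
IN_CHANNELS = 1024           # assumption: channels produced by the previous layer

two_layers = conv_weights(IN_CHANNELS, PAIR)
six_layers = conv_weights(IN_CHANNELS, PAIR * 3)
print(two_layers, six_layers, round(six_layers / two_layers, 2))
```

Under these assumptions the 6-layer block has a bit under 3x the weights of the 2-layer block, which gives a feel for the extra compute spent in that part of the network.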

berserker commented 5 years ago

Many thanks @AlexeyAB for your support!

Please help me clarify a few more doubts:

1. If your objects have width=1 pixel on 1024x1024 image, then you should train the `yolov3-tiny.cfg` model with `width=1024 height=1024`. (otherwise your objects will be removed/smoothed during resizing to 416x416.)

Can you point me to a doc/post that could help me understand more deeply the pros/cons of yolov3.cfg versus yolov3-tiny.cfg (considering the same resolution, 1024x1024)? I'm interested in particular in a comparison of detection confidence, accuracy, performance, etc... Important: for now we have a network with only one class, but in the future we plan to extend the model with 2 or 3 more classes that should have more "traditional" dimensions (i.e. 30x30, 40x50, etc...). What kind of approach do you suggest in this case, considering that we must always support this first class (1 to 3 pixels wide)? I see 3 options for the future plan, please correct me if I'm wrong:

- Same yolov3-tiny with all of your advice, even with new classes that will have different "shapes" (unlikely)
- Hybrid approach: yolov3-tiny (maybe with some more tweaked params) or another configuration (i.e.: yolov3-spp?) but with only 1 network
- 2 networks: 1 that will manage this "particular" class (with all of your suggestions) and another, completely separate trained network that will handle the 2/3 new classes (this will have a huge impact on performance, I know; I'm only asking to understand if it's the suggested approach).

3. You should use `width=1024 height=1024` so may be yolov3-tiny.cfg or yolov3-tiny-3l.cfg

What's the difference of yolov3-tiny.cfg over yolov3-tiny-3l.cfg?

4. Yes, you must increase network resolution to 1024x1024. I also suggest you to use default anchors, but set first value of each anchor to the 1
   `anchors = 1,14,  1,27,  1,58,  1,82,  1,169,  1,319`

Thanks, I'll try it for the first "release" with 1 class only support. I have the same doubts for the future release with the support of 2/3 more classes anyway: how does this suggestion fit? Should I go back to the default anchors then?

Also if you have to high speed with yolov3-tiny.cfg width=1024 height=1024, then you can sell you speed to the accuracy, just for example use 6 layers:

Is this a sort of 2x yolov3-tiny-3l.cfg implementation? I really need to understand more deeply this sry...

AlexeyAB commented 5 years ago

@berserker

I have the same doubts for the future release with the support of 2/3 more classes anyway: how does this suggestion fit? Should I go back to the default anchors then?

If you want to train a model for 2 or 3 more classes, then you should recalculate the anchors: https://github.com/AlexeyAB/darknet#how-to-improve-object-detection

And you must train your model from the beginning for all classes: https://github.com/AlexeyAB/darknet#how-to-train-to-detect-your-custom-objects

I think it is better to use one model, yolov3-tiny.cfg 1024x1024, for all classes.

Can you point me to a doc/post that could help me understand more deeply the pros/cons of yolov3.cfg versus yolov3-tiny.cfg (considering the same resolution, 1024x1024)? I'm interested in particular in a comparison of detection confidence, accuracy, performance, etc...

Is this a sort of 2x yolov3-tiny-3l.cfg implementation? I really need to understand more deeply this sry...

What do you mean? You must use 1024x1024 network resolution in any case if your objects have size 1xN on 1024x1024 images. Otherwise, if you use 416x416 network resolution, your network will not see small or thin objects like your lines.

What's the difference of yolov3-tiny.cfg over yolov3-tiny-3l.cfg?

yolov3-tiny-3l.cfg just has 3 yolo-layers instead of the 2 yolo-layers in yolov3-tiny.cfg. It allows yolo to detect smaller objects.
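
A small sketch of why an extra yolo-layer helps with small objects. It assumes the yolo-layers in yolov3-tiny.cfg predict at strides 32 and 16 and that yolov3-tiny-3l.cfg adds one at stride 8; these strides are an assumption to verify against the cfg files, and `grid_cells` is a hypothetical helper.

```python
def grid_cells(network_size, strides):
    """Map each stride to the number of grid cells per side at that yolo-layer."""
    return {stride: network_size // stride for stride in strides}


TINY_STRIDES = (32, 16)        # assumption: 2 yolo-layers
TINY_3L_STRIDES = (32, 16, 8)  # assumption: 3 yolo-layers

print("tiny   :", grid_cells(1024, TINY_STRIDES))
print("tiny-3l:", grid_cells(1024, TINY_3L_STRIDES))
```

With a 1024x1024 input, the extra layer would predict on a 128x128 grid where each cell covers 8x8 input pixels, which is a closer match for objects that are only a few pixels wide.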

berserker commented 5 years ago

I think it is better to use one model, yolov3-tiny.cfg 1024x1024, for all classes.

Thanks, I'll have a try with that!.

Can you point me to a doc/post that could help me understand more deeply the pros/cons of yolov3.cfg versus yolov3-tiny.cfg (considering the same resolution, 1024x1024)? I'm interested in particular in a comparison of detection confidence, accuracy, performance, etc...

Is this a sort of 2x yolov3-tiny-3l.cfg implementation? I really need to understand more deeply this sry...

What do you mean? You must use 1024x1024 network resolution in any case if your objects have size 1xN on 1024x1024 images. Otherwise, if you use 416x416 network resolution, your network will not see small or thin objects like your lines.

I mean, in particular, the detailed differences between the configurations (default, tiny, spp, etc...) in terms of confidence, accuracy, performance and so on. The request wasn't related to my specific case; I was only asking for general advice on how to pick the best configuration for a given task.

What's the difference of yolov3-tiny.cfg over yolov3-tiny-3l.cfg?

yolov3-tiny-3l.cfg just has 3 yolo-layers instead of the 2 yolo-layers in yolov3-tiny.cfg. It allows yolo to detect smaller objects.

Thanks, so I think that yolov3-tiny-3l.cfg is more suitable in my case because my objects are 1px wide, right?

AlexeyAB commented 5 years ago

Thanks, so I think that yolov3-tiny-3l.cfg is more suitable in my case because my objects are 1px wide, right?

Your objects have size 1x14 - 1x319, so 1x14 is a small object, but 1x319 is both a small (in width) and a big (in height) object.

Maybe yolov3-tiny-3l.cfg will work better for 1x14 objects than yolov3-tiny and have the same accuracy for 1x319.

So try to use yolov3-tiny-3l.cfg width=1024 height=1024
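
For anyone scripting these cfg changes, here is a minimal sketch that rewrites `key=value` lines in a cfg held as a string. The helper `set_cfg_value` and the `SAMPLE` text are hypothetical illustrations, not part of darknet; the helper replaces every occurrence of the key, so check the result before training.

```python
import re


def set_cfg_value(cfg_text, key, value):
    """Replace the value on every `key=...` line of a darknet-style cfg string."""
    pattern = re.compile(rf"^([ \t]*{re.escape(key)}[ \t]*=).*$", flags=re.MULTILINE)
    # A function replacement avoids ambiguity between group references and digits.
    return pattern.sub(lambda match: f"{match.group(1)}{value}", cfg_text)


# Tiny made-up cfg fragment, only for demonstration.
SAMPLE = """[net]
width=416
height=416

[yolo]
random=1
"""

patched = SAMPLE
for key, value in (("width", 1024), ("height", 1024), ("random", 0)):
    patched = set_cfg_value(patched, key, value)
print(patched)
```

The same helper could be used for the `anchors` line, passing the anchor list suggested earlier in the thread as the value.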

berserker commented 5 years ago

Your objects have size 1x14 - 1x319, so 1x14 is a small object, but 1x319 is both a small (in width) and a big (in height) object.

Maybe yolov3-tiny-3l.cfg will work better for 1x14 objects than yolov3-tiny and have the same accuracy for 1x319.

So try to use yolov3-tiny-3l.cfg width=1024 height=1024

Thanks again @AlexeyAB for your support, you were really very kind!

AlexeyAB / darknet

Low mAP with large dataset #2593