AlexeyAB / darknet

YOLOv4 / Scaled-YOLOv4 / YOLO - Neural Networks for Object Detection (Windows and Linux version of Darknet )
http://pjreddie.com/darknet/

Questions about YOLOv4-tiny transfer learning #6756

Closed KyryloAntoshyn closed 4 years ago

KyryloAntoshyn commented 4 years ago

Hi, @AlexeyAB! I've performed my first "test" transfer learning on 3 classes: Person, Human leg and Human foot from the Google Open Images Dataset. I followed your guidelines in the README. They are pretty straightforward; you did excellent work, thanks!

My dataset has 2000 random training and 400 random validation images for class "Person". I also have training and validation images for classes "Human leg" and "Human foot", but fewer than 2000 and 400 respectively (this is because I downloaded images with the flags IsTruncated=0, IsGroupOf=0, IsDepiction=0 and IsInside=0, which helps to get more accurate images but reduces the number of suitable images).

I have some questions about training:

  1. After training, there is a _yolov4-tiny-OIDv4-Person-HumanLeg-HumanFootbest.weights file inside the backup directory. I ran the "map" command to calculate the mAP value and got the following results: Person – 53.18 %, Human leg – 34.25 % and Human foot – 1.50 %. These results are shown in the image below. [image] What formula do you use for the mAP calculation? When I try the formula mAP = TP / (TP + FP), I get a different result: 270 / (270 + 183) = ~0.6 for class Person. Answer: @bavo96 answered. Actually, the area under the precision-recall curve is calculated, as described here: https://medium.com/@jonathan_hui/map-mean-average-precision-for-object-detection-45c121a31173.

  2. My training result is actually not good: I have only 53.18 % mAP for class Person (as shown in the image above), and my model detects persons worse than the model pre-trained on the COCO dataset (confidences are lower, bounding boxes fit the objects less tightly). Is this due to the small number of training images, or could there be many other reasons: a lack of "negative" images, or training/validation images that differ in shape, side of object, relative size, angle of rotation, tilt, illumination? Answer: I probably need to check my dataset with Yolo_mark; it likely doesn't meet all of Alexey's recommendations.

  3. I want to train the YOLOv4-tiny model for pedestrian detection by combining COCO, BDD and Open Images. How many images do I need in order to train the model successfully: 2000, 10000 or more? Do I need to use images from the COCO dataset even though the model is pre-trained on this dataset? Answer: I think the minimum is 2000 (Alexey said: "...you should preferably have 2000 different images for each class or more") and I need to focus on the accuracy and variety of the pictures. Moreover, I suppose it doesn't matter whether I use the COCO dataset and keep only the "Person" class, because COCO is a very powerful dataset that contains objects in different contexts.

  4. As I understand it, I can use the model pre-trained on COCO to detect only the "Person" class, but in this case I need to filter the detected objects in code. I can't adjust the yolov4-tiny.cfg file (change the number of classes and filters in the [yolo] layers), because it would no longer match the trained model. Is this correct? Answer: yes, I need to filter the "Person" class in code, because we can't change the neural network structure.

  5. How does the quality of training differ between subdivisions = 16, 32 or 64? yolov4-tiny-custom.cfg has subdivisions=1. Do I need to change it to 16 as stated in the README "How to train" section? Answer: I've found Alexey's past answer: "Subdivisions almost doesn't affect on accuracy. It affects only on speed." https://github.com/AlexeyAB/darknet/issues/2268.

  6. An "Out of memory" error can occur when the subdivisions value is small, because the GPU doesn't have enough memory to process that many mini_batch samples at once, right? Answer: @stephanecharette answered.

  7. The repo contains the yolov4-tiny-3l.cfg model. This model detects small objects better than yolov4-tiny.cfg because it has 3 [yolo] layers, right? Answer: @stephanecharette answered.

  8. On the graph I can see a "C" value. Is it contrastive loss? [image] Answer: @stephanecharette answered.

Thank you in advance!
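
For question 4, filtering the detections in code might look like the minimal sketch below. The detection format (a list of `(class_name, confidence, box)` tuples), the function name `keep_classes` and the `min_confidence` parameter are all hypothetical and chosen for illustration; they are not Darknet's API, so adapt them to whatever your wrapper actually returns.

```python
def keep_classes(detections, wanted=("person",), min_confidence=0.25):
    """Return only detections whose class name is in `wanted`.

    `detections` is assumed (hypothetically) to be a list of
    (class_name, confidence, box) tuples; box is (x, y, w, h).
    """
    wanted = set(wanted)
    return [
        (name, confidence, box)
        for name, confidence, box in detections
        if name in wanted and confidence >= min_confidence
    ]


# Made-up detections, as a COCO-trained model might report them.
detections = [
    ("person", 0.91, (120, 80, 60, 170)),
    ("car", 0.88, (300, 150, 200, 110)),
    ("person", 0.12, (40, 60, 30, 90)),   # dropped: below min_confidence
    ("dog", 0.55, (210, 220, 80, 60)),
]

people = keep_classes(detections)
print(people)
```

The network itself still predicts all 80 COCO classes; the filter only discards the results you are not interested in.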

stephanecharette commented 4 years ago

(I don't have an answer for 1-5.)

  6. An "Out of memory" error can occur when the subdivisions value is small, because the GPU doesn't have enough memory to process that many mini_batch samples at once, right?

Yes, see: https://www.ccoderun.ca/programming/2020-09-25_Darknet_FAQ/#cuda_out_of_memory
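
The arithmetic behind this can be sketched as follows: the number of images processed on the GPU in one pass is the batch size divided by subdivisions. The value `batch = 64` is an assumption for illustration; use whatever your .cfg file actually sets. The helper name `mini_batch_size` is hypothetical.

```python
def mini_batch_size(batch, subdivisions):
    """Images processed on the GPU in a single pass (integer division)."""
    return batch // subdivisions


batch = 64  # assumed value, for illustration only
sizes = {s: mini_batch_size(batch, s) for s in (1, 16, 32, 64)}

for subdivisions, size in sizes.items():
    print(f"subdivisions={subdivisions:>2} -> {size} image(s) per GPU pass")
```

With these numbers, subdivisions=1 puts all 64 images on the GPU at once, while subdivisions=16 puts only 4 there at a time, which is why raising subdivisions is the usual first response to an out-of-memory error.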

  7. The repo contains the yolov4-tiny-3l.cfg model. This model detects small objects better than yolov4-tiny.cfg because it has 3 [yolo] layers, right?

Yes.

  8. On the graph I can see a "C" value. Is it contrastive loss?

Yes, see: #6290 which links to #6004.

bavo96 commented 4 years ago
  1. What formula do you use for mAP calculation? When I try to use mAP = TP / (TP + FP) formula I get different results: 270 / (270 + 183) = ~0.6 for class Person.

The formula you've shown is actually precision, not mean Average Precision (mAP). See Wikipedia for more information: https://en.wikipedia.org/wiki/Precision_and_recall [image] You can check out this link to learn what mAP is: https://medium.com/@jonathan_hui/map-mean-average-precision-for-object-detection-45c121a31173.
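
To make the difference concrete, here is a minimal sketch that computes both quantities for one class. It uses the all-point interpolated area under the precision-recall curve, which is one common definition of AP; it is not Darknet's actual code, and Darknet's exact interpolation may differ. The function names and the toy detections are made up for illustration.

```python
def precision(tp, fp):
    """Plain precision at one fixed confidence threshold."""
    return tp / (tp + fp)


def average_precision(scored_detections, num_ground_truth):
    """Area under the precision-recall curve for one class.

    scored_detections: list of (confidence, is_true_positive) pairs.
    num_ground_truth: number of labelled objects of this class.
    """
    # Rank detections from most to least confident.
    ranked = sorted(scored_detections, key=lambda d: d[0], reverse=True)

    tp = fp = 0
    recalls, precisions = [], []
    for _, is_tp in ranked:
        if is_tp:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_ground_truth)
        precisions.append(tp / (tp + fp))

    # Replace each precision by the best precision at any higher recall.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    # Sum the rectangles under the resulting step curve.
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


# Toy data: 5 detections, 4 labelled objects.
toy = [(0.9, True), (0.8, False), (0.7, True), (0.6, True), (0.5, False)]
ap = average_precision(toy, num_ground_truth=4)
p = precision(tp=3, fp=2)
print(f"precision = {p:.3f}, AP = {ap:.3f}")

# The numbers from the question: this is precision, not AP.
print(f"270 / (270 + 183) = {precision(270, 183):.3f}")
```

mAP is then the mean of the per-class AP values, so it depends on how detections are ranked by confidence, not only on the final TP and FP counts.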

KyryloAntoshyn commented 4 years ago

Thank you, guys @stephanecharette @bavo96!

KyryloAntoshyn commented 4 years ago

@stephanecharette hi! I have one question. How many images do I need in order to train my model (person detection with YOLOv4-tiny)? Is there a minimum number of images?

stephanecharette commented 4 years ago

Minimum? Just 1. But your network will only recognize (more-or-less) that 1 image.

The more variation you want to recognize, the more images you'll need. If you want to recognize different people on the street for example, then you normally will be talking about thousands of images.

KyryloAntoshyn commented 4 years ago

Thanks, I need to have a lot of variations!