WongKinYiu / yolov7

Implementation of paper - YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors
GNU General Public License v3.0

What is different in P5 model and P6 model? #141

Open Barry-Chen-yup opened 2 years ago

Barry-Chen-yup commented 2 years ago

The training code is separated into train.py and train_aux.py. I do not know what the difference is or how to use them.

WongKinYiu commented 2 years ago

P5 models output P3, P4, and P5 predictions. P6 models output P3, P4, P5, and P6 predictions.

train.py is used to train Detect, IDetect, and IBin heads. train_aux.py is used to train the IAuxDetect head.
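
For reference, the two scripts are invoked roughly as in the repo README (dataset paths, batch sizes, and run names below are just the README defaults; adjust them for your own setup):

python train.py --workers 8 --device 0 --batch-size 32 --data data/coco.yaml --img 640 640 --cfg cfg/training/yolov7.yaml --weights '' --name yolov7 --hyp data/hyp.scratch.p5.yaml

python train_aux.py --workers 8 --device 0 --batch-size 16 --data data/coco.yaml --img 1280 1280 --cfg cfg/training/yolov7-w6.yaml --weights '' --name yolov7-w6 --hyp data/hyp.scratch.p6.yaml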

JarvisKevin commented 2 years ago

And I found that hyp.scratch.p6.yaml is the same as hyp.scratch.p5.yaml.

pathikg commented 2 years ago

P5 models output P3, P4, and P5 predictions. P6 models output P3, P4, P5, and P6 predictions.

train.py is used to train Detect, IDetect, and IBin heads. train_aux.py is used to train the IAuxDetect head.

Can you please tell me what P3, P4, P5, and P6 are? And where can I read more about such terms in the context of yolov7?

dadin852 commented 1 year ago

train.py is for img-size 640; train_aux.py is for img-size 1280.

mazatov commented 1 year ago

Is the image size really the only difference? Because I can run train.py with 1280 no problem and it works.

pathikg commented 1 year ago

Is the image size really the only difference? Because I can run train.py with 1280 no problem and it works.

I don't think so. I was looking into the same question, and I found the following:

A one-stage detector like YOLO has the following stages: [diagram of backbone, neck, and head omitted]

All object detectors take an image as input and compress features down through a convolutional neural network backbone. In image classification, these backbones are the end of the network and predictions can be made from them. In object detection, multiple bounding boxes need to be drawn around objects along with classification, so the feature layers of the convolutional backbone need to be mixed and combined with one another. The combination of backbone feature layers happens in the neck, and detection happens in the head. It is also useful to split object detectors into two categories: one-stage detectors and two-stage detectors. Two-stage detectors decouple the tasks of object localization and classification for each bounding box. One-stage detectors make the predictions for object localization and classification at the same time. YOLO is a one-stage detector, hence, You Only Look Once.
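
To make that backbone → neck → head split concrete, here is a toy sketch (hand-written for illustration, not yolov7's actual modules): the head is just a convolution that predicts box coordinates, objectness, and class scores for every anchor at every grid cell in a single forward pass.

import torch
import torch.nn as nn

class TinyOneStage(nn.Module):
    """Toy one-stage detector: backbone -> neck -> dense prediction head."""
    def __init__(self, num_classes=80, num_anchors=3):
        super().__init__()
        self.backbone = nn.Sequential(              # 640x640 input -> 80x80 features (stride 8)
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.SiLU(),
        )
        self.neck = nn.Conv2d(128, 128, 1)          # stand-in for the FPN/PAN feature mixing
        # per anchor and per cell: 4 box coords + 1 objectness + num_classes scores
        self.head = nn.Conv2d(128, num_anchors * (5 + num_classes), 1)

    def forward(self, x):
        return self.head(self.neck(self.backbone(x)))

out = TinyOneStage()(torch.zeros(1, 3, 640, 640))
print(out.shape)  # torch.Size([1, 255, 80, 80]): one 255-long prediction vector per grid cell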

Let's take a look at the neck, e.g. BiFPN [diagram omitted]: it combines features coming from different feature levels, and those levels are marked as P4, P5, P6, etc. Higher levels such as P6 come from more heavily downsampled feature maps with larger receptive fields, so they are used to detect larger objects, while lower levels such as P3 and P4 keep finer spatial resolution and are used for smaller objects.
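
To put numbers on that, here is a small sketch (the strides below are the conventional pyramid values P3=8, P4=16, P5=32, P6=64, assumed for illustration rather than read out of the yolov7 code) showing how coarse each prediction grid is:

# Relation between pyramid level, stride, and prediction grid size.
# Strides are the conventional FPN values, assumed here for illustration.
STRIDES = {"P3": 8, "P4": 16, "P5": 32, "P6": 64}

def grid_sizes(img_size):
    """Side length of the prediction grid at each pyramid level."""
    return {level: img_size // s for level, s in STRIDES.items()}

for img in (640, 1280):   # P5 models typically train at 640, P6 models at 1280
    print(img, {level: f"{n}x{n}" for level, n in grid_sizes(img).items()})
# 640  -> P3 80x80,   P4 40x40, P5 20x20, P6 10x10
# 1280 -> P3 160x160, P4 80x80, P5 40x40, P6 20x20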

refs: https://blog.roboflow.com/a-thorough-breakdown-of-yolov4/ and https://towardsdatascience.com/review-fpn-feature-pyramid-network-object-detection-262fc7482610

Please correct me if I am wrong :)

mazatov commented 1 year ago

@pathikg, that makes sense to me. If you look at cfg\training\yolov7.yaml you can see that several layers are marked as P1, P2, P3, P4, and P5. Those must be the layers the features are extracted from. The same goes for the other config files; for example, yolov7-w6.yaml has one feature level marked as P6.

Funnily enough, I got the best result when I was training using train.py but by accident used yolov7-w6_training.pt weights to start with. Not sure what happened there...

This is the backbone of the basic yolov7 (from cfg/training/yolov7.yaml):

backbone:
  # [from, number, module, args]
  [[-1, 1, Conv, [32, 3, 1]],  # 0

   [-1, 1, Conv, [64, 3, 2]],  # 1-P1/2      
   [-1, 1, Conv, [64, 3, 1]],

   [-1, 1, Conv, [128, 3, 2]],  # 3-P2/4  
   [-1, 1, Conv, [64, 1, 1]],
   [-2, 1, Conv, [64, 1, 1]],
   [-1, 1, Conv, [64, 3, 1]],
   [-1, 1, Conv, [64, 3, 1]],
   [-1, 1, Conv, [64, 3, 1]],
   [-1, 1, Conv, [64, 3, 1]],
   [[-1, -3, -5, -6], 1, Concat, [1]],
   [-1, 1, Conv, [256, 1, 1]],  # 11

   [-1, 1, MP, []],
   [-1, 1, Conv, [128, 1, 1]],
   [-3, 1, Conv, [128, 1, 1]],
   [-1, 1, Conv, [128, 3, 2]],
   [[-1, -3], 1, Concat, [1]],  # 16-P3/8  
   [-1, 1, Conv, [128, 1, 1]],
   [-2, 1, Conv, [128, 1, 1]],
   [-1, 1, Conv, [128, 3, 1]],
   [-1, 1, Conv, [128, 3, 1]],
   [-1, 1, Conv, [128, 3, 1]],
   [-1, 1, Conv, [128, 3, 1]],
   [[-1, -3, -5, -6], 1, Concat, [1]],
   [-1, 1, Conv, [512, 1, 1]],  # 24

   [-1, 1, MP, []],
   [-1, 1, Conv, [256, 1, 1]],
   [-3, 1, Conv, [256, 1, 1]],
   [-1, 1, Conv, [256, 3, 2]],
   [[-1, -3], 1, Concat, [1]],  # 29-P4/16  
   [-1, 1, Conv, [256, 1, 1]],
   [-2, 1, Conv, [256, 1, 1]],
   [-1, 1, Conv, [256, 3, 1]],
   [-1, 1, Conv, [256, 3, 1]],
   [-1, 1, Conv, [256, 3, 1]],
   [-1, 1, Conv, [256, 3, 1]],
   [[-1, -3, -5, -6], 1, Concat, [1]],
   [-1, 1, Conv, [1024, 1, 1]],  # 37

   [-1, 1, MP, []],
   [-1, 1, Conv, [512, 1, 1]],
   [-3, 1, Conv, [512, 1, 1]],
   [-1, 1, Conv, [512, 3, 2]],
   [[-1, -3], 1, Concat, [1]],  # 42-P5/32  
   [-1, 1, Conv, [256, 1, 1]],
   [-2, 1, Conv, [256, 1, 1]],
   [-1, 1, Conv, [256, 3, 1]],
   [-1, 1, Conv, [256, 3, 1]],
   [-1, 1, Conv, [256, 3, 1]],
   [-1, 1, Conv, [256, 3, 1]],
   [[-1, -3, -5, -6], 1, Concat, [1]],
   [-1, 1, Conv, [1024, 1, 1]],  # 50
  ]
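
As a rough way to see where those P markers come from (the list of downsampling steps below is condensed by hand from the yaml comments above, not parsed from the file): each stride-2 Conv, and each MP plus stride-2 Conv block, halves the spatial resolution, which is exactly what the P1/2, P2/4, P3/8, P4/16, P5/32 annotations record.

# Cumulative stride at each "P" marker of the backbone above (illustrative only).
downsamples = [
    ("Conv s2",      "P1"),   # layer 1  -> P1/2
    ("Conv s2",      "P2"),   # layer 3  -> P2/4
    ("MP + Conv s2", "P3"),   # layer 16 -> P3/8
    ("MP + Conv s2", "P4"),   # layer 29 -> P4/16
    ("MP + Conv s2", "P5"),   # layer 42 -> P5/32
]

stride = 1
for module, level in downsamples:
    stride *= 2                   # each step halves height and width
    side = 640 // stride          # feature-map side for a 640x640 input
    print(f"{level}: stride {stride:2d} -> {side}x{side}  ({module})")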