AlexeyAB / darknet

YOLOv4 / Scaled-YOLOv4 / YOLO - Neural Networks for Object Detection (Windows and Linux version of Darknet )
http://pjreddie.com/darknet/

About max-pooling layer in Yolov3-SPP.cfg #4475

Open May-forever opened 4 years ago

May-forever commented 4 years ago

Hi @AlexeyAB ,

While training on my custom dataset with Yolov3-SPP.cfg, I noticed something strange.

In layer 78 of Yolov3-SPP.cfg (i.e., the first max-pooling layer of SPP), the input feature map is 19 x 19 x 512. After the 5 x 5 / 1 max-pooling operation, the output size is still 19 x 19 x 512.

However, if the input feature map has size W x H x C (width x height x channels), and the max-pooling parameters are F (filter size) and S (stride), the output size should be:

W_out = [(W - F)/S] + 1 and H_out = [(H - F)/S] + 1

On this basis, the output size of layer 78 of Yolov3-SPP.cfg should be:

W_out = [(19 - 5)/1] + 1 = 15 and H_out = [(19 - 5)/1] + 1 = 15, i.e., it should be 15 x 15 x 512.

Could you please help me understand why the actual output size of layer 78 differs from this?

Looking forward to hearing from you, thanks a lot in advance.

Below are the details of layers 78 to 83 of Yolov3-SPP.cfg:

    78 max    5 x 5 / 1     19 x 19 x 512 -> 19 x 19 x 512  0.005 BF
    79 route  77
    80 max    9 x 9 / 1     19 x 19 x 512 -> 19 x 19 x 512  0.015 BF
    81 route  77
    82 max    13 x 13 / 1   19 x 19 x 512 -> 19 x 19 x 512  0.031 BF
    83 route  82 80 78 77

AlexeyAB commented 4 years ago

> However, if the input feature map has size W x H x C (width x height x channels), and the max-pooling parameters are F (filter size) and S (stride), the output size should be:
>
> W_out = [(W - F)/S] + 1 and H_out = [(H - F)/S] + 1
>
> On this basis, the output size of layer 78 of Yolov3-SPP.cfg should be:
>
> W_out = [(19 - 5)/1] + 1 = 15 and H_out = [(19 - 5)/1] + 1 = 15, i.e., it should be 15 x 15 x 512.

Why do you think so? The output size should be computed as in maxpool_layer.c: https://github.com/AlexeyAB/darknet/blob/c7e3ba3ed41e9fd114390263f9dd1657b71f676c/src/maxpool_layer.c#L78-L80

May-forever commented 4 years ago

Hi @AlexeyAB ,

Thank you very much for your reply.

If l.out_w = (w + padding - size) / stride_x + 1, where 'size' is the pooling kernel size, then the value of padding should always be padding = size - 1,

i.e., W = 19 = [(19 + 4 - 5)/1] + 1.

Am I right?

Looking forward to hearing from you, thank you very much.

AlexeyAB commented 4 years ago

> the value of padding should always be: padding = size - 1.

Yes, so

l.out_w = (w + padding - size) / stride_x + 1 = (w + size - 1 - size) / 1 + 1 = (w-1) + 1 = w

So output_w == input_w
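
A minimal Python sketch of the arithmetic above, for anyone who wants to check the three SPP layers. The function name `maxpool_out_size` is made up for this illustration and is not a darknet function; it only mirrors the formula quoted in this thread, with `padding` defaulting to `size - 1` as discussed. Python's `//` matches C integer division here because all the values are non-negative.

```python
def maxpool_out_size(w, size, stride=1, padding=None):
    """Output width using the formula quoted in this thread:
    out_w = (w + padding - size) / stride + 1 (integer division)."""
    if padding is None:
        # Assumption taken from the discussion: padding = size - 1.
        padding = size - 1
    return (w + padding - size) // stride + 1


if __name__ == "__main__":
    for size in (5, 9, 13):
        padded = maxpool_out_size(19, size)                 # padding = size - 1
        unpadded = maxpool_out_size(19, size, padding=0)    # no padding at all
        print(f"max {size}x{size}/1: padded -> {padded}, unpadded -> {unpadded}")
```

With `padding = size - 1` every layer keeps the 19-wide input at 19, while `padding=0` reproduces the shrinking formula from the original question.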

May-forever commented 4 years ago

Thank you very much for your help

AlexeyAB commented 4 years ago

Also read about 2 types of Padding: SAME and VALID https://stackoverflow.com/questions/37674306/what-is-the-difference-between-same-and-valid-padding-in-tf-nn-max-pool-of-t
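
As a rough sketch of the difference between the two modes, the output-size rules usually given for TensorFlow-style SAME and VALID padding can be written as below. The helper names `out_size_valid` and `out_size_same` are hypothetical and exist only for this example; they are not TensorFlow or darknet functions.

```python
import math


def out_size_valid(w, size, stride):
    """VALID: no padding, the window must fit entirely inside the input."""
    return math.ceil((w - size + 1) / stride)


def out_size_same(w, stride):
    """SAME: the input is padded so the output depends only on the stride."""
    return math.ceil(w / stride)


if __name__ == "__main__":
    w, size = 19, 5
    for stride in (1, 2):
        print(
            f"stride {stride}: VALID -> {out_size_valid(w, size, stride)}, "
            f"SAME -> {out_size_same(w, stride)}"
        )
```

For a 19-wide input, a 5 x 5 window and stride 1, VALID shrinks the map to 15 while SAME keeps it at 19, which is the behaviour seen in the SPP max-pooling layers.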

May-forever commented 4 years ago

Ok, thank you very much.

Olivia-V commented 4 years ago

Hi, at the bottom of page 4 of 'SlimYOLOv3: Narrower, Faster and Better for Real-Time UAV Applications', the authors claim that 'SPP module is able to extract multiscale deep features with different receptive fields and fuse them by concatenating them in the channel dimension of feature maps.' However, when I train yolov3-spp, every max-pooling layer in the SPP structure produces an output of the same size, which I take to mean that the receptive fields of all the max-pooling outputs in SPP are the same size. So I am confused by the phrase 'with different receptive fields'. Could you please give me some guidance? Thanks!

AlexeyAB commented 4 years ago

@Olivia-V The receptive field of each output of maxpool 2 x 2 is 2x2 pixels. The receptive field of each output of maxpool 7 x 7 is 7x7 pixels.

Olivia-V commented 4 years ago

Hi @AlexeyAB, thanks a lot. However, take layer 80 of yolov3-spp as an example: its input is 19 x 19 x 512, and its output is still 19 x 19 x 512. As I understand it, because each point in the output feature map corresponds to one point in the input feature map, the receptive field should be 1x1 pixels. But according to your explanation, the receptive field should be 9 x 9. That would mean the padding pixels are not included in the calculation of the receptive field.

Am I right or wrong? If I am wrong, please point it out. Thanks in advance, and merry Christmas. :)

    78 max    5 x 5 / 1     19 x 19 x 512 -> 19 x 19 x 512  0.005 BF
    79 route  77
    80 max    9 x 9 / 1     19 x 19 x 512 -> 19 x 19 x 512  0.015 BF
    81 route  77
    82 max    13 x 13 / 1   19 x 19 x 512 -> 19 x 19 x 512  0.031 BF
    83 route  82 80 78 77

AlexeyAB commented 4 years ago

@Olivia-V

> According to my understanding, because each point in the output feature map is corresponding to each point in the input feature map,

Why do you think so? What does 5x5 mean in max 5x5? What does 5x5 mean in conv 5x5? If, during training, 24 of the weights become 0 and the one remaining weight equals 0.5, then what is the receptive field of dw-conv 5x5? What is the receptive field of conv 5x5? Are you sure that all 25 weights are non-zero?
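
One way to see this numerically is the small pure-Python sketch below. It places a single non-zero value in the middle of a 19 x 19 map, applies stride-1 max pooling that keeps the map size, and counts how many output cells pick up that value. The helpers `maxpool_same` and `count_affected` are hypothetical names written for this illustration, and ignoring out-of-bounds cells at the border is an assumption of the sketch, not a statement about darknet's implementation.

```python
def maxpool_same(grid, size):
    """Stride-1 max pooling that keeps the map size.

    Each output cell takes the max over a size x size window centred on it;
    window cells that fall outside the map are simply ignored.
    """
    h, w = len(grid), len(grid[0])
    r = size // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            out[y][x] = max(
                grid[yy][xx]
                for yy in range(max(0, y - r), min(h, y + r + 1))
                for xx in range(max(0, x - r), min(w, x + r + 1))
            )
    return out


def count_affected(size, n=19):
    """Count the output cells that see a single 1.0 placed at the map centre."""
    grid = [[0.0] * n for _ in range(n)]
    grid[n // 2][n // 2] = 1.0
    out = maxpool_same(grid, size)
    return sum(v == 1.0 for row in out for v in row)


if __name__ == "__main__":
    for size in (5, 9, 13):
        print(f"max {size}x{size}/1: {count_affected(size)} output cells see the centre value")
```

The output map stays 19 x 19 in every case, yet one input cell reaches 25, 81 and 169 output cells for the 5, 9 and 13 kernels. Equivalently, each output cell looks at a window of that many input cells, so the three branches do have different receptive fields.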

Olivia-V commented 4 years ago

Oh, yes! You are right. Thank you very much for your help.