AlexeyAB / darknet

YOLOv4 / Scaled-YOLOv4 / YOLO - Neural Networks for Object Detection (Windows and Linux version of Darknet)
http://pjreddie.com/darknet/

Can I train YOLOv4-P6 with different width and height? #8076

Open mjack3 opened 3 years ago

mjack3 commented 3 years ago

I am using YOLOv4-P6.cfg and the pre-trained weights yolov4-p6.conv.289 for training. I have just one class, so following the step-by-step instructions I changed classes=1, filters=24, width=1152 and height=800.
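For reference, the filters value goes in the [convolutional] layer right before each [yolo] layer and follows filters = (classes + 5) * (number of entries in that layer's mask=): 4 box coordinates, 1 objectness score and one score per class for every anchor. A minimal sketch of the arithmetic, assuming each [yolo] layer in the P6 cfg lists 4 masks (consistent with filters=24 for one class); check the mask= lines in your own cfg:

```python
def yolo_filters(classes: int, num_masks: int) -> int:
    """Filters for the [convolutional] layer just before a [yolo] layer.

    Each anchor in the mask predicts 4 box coordinates, 1 objectness score
    and one score per class.
    """
    return (classes + 5) * num_masks

# Assumption: 4 anchor masks per [yolo] layer in the P6 cfg.
print(yolo_filters(classes=1, num_masks=4))   # 24
# The classic yolov4.cfg uses 3 masks per [yolo] layer.
print(yolo_filters(classes=80, num_masks=3))  # 255
```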

Then I ran this command:

./darknet detector train data/obj_iii.data cfg/scale/yolov4-p6_III.cfg pre-trained/yolov4-p6.conv.289

and I got this error:

 206 route  205 203 The width and height of the input layers are different. 
                               ->    0 x   0 x   0 
 207 Darknet error location: ./src/parser.c, parse_convolutional, line #208
Layer before convolutional layer must output image.: Success
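The route error can be reproduced with plain arithmetic. As far as I can tell from Darknet's convolutional layer code, the output height is (h + 2*padding - size) / stride + 1 with integer division, and pad=1 in a cfg means padding = size/2. The sketch below assumes the P6 backbone downsamples with six 3x3, stride-2, pad=1 convolutions, and that a route then concatenates the deepest map (upsampled x2) with the map from the level above; the actual layer layout of yolov4-p6.cfg is not reproduced here:

```python
def conv_out(h: int, size: int = 3, stride: int = 2, pad: int = 1) -> int:
    """Darknet-style conv output size; pad=1 means padding = size // 2."""
    padding = size // 2 if pad else 0
    return (h + 2 * padding - size) // stride + 1

def pyramid(h: int, levels: int = 6) -> list[int]:
    """Feature-map sizes after each of `levels` stride-2 downsamples (assumed 6 for P6)."""
    sizes = []
    for _ in range(levels):
        h = conv_out(h)
        sizes.append(h)
    return sizes

for dim in (1152, 800):
    sizes = pyramid(dim)
    deepest, previous = sizes[-1], sizes[-2]
    # An x2 upsample of the deepest map must match the previous level to be concatenated.
    ok = deepest * 2 == previous
    print(dim, sizes, "OK" if ok else f"mismatch: {deepest * 2} vs {previous}")
```

With these assumptions, 1152 gives [576, 288, 144, 72, 36, 18] and lines up, while 800 gives [400, 200, 100, 50, 25, 13]: the 13-pixel map upsamples to 26, which cannot be concatenated with the 25-pixel map, matching "The width and height of the input layers are different".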

However, I can train yolov4-p5 with a different width and height.

lsd1994 commented 3 years ago

Should the width and height be multiples of 64? I'm not sure.

KalanaRatnayake commented 3 years ago

I think the network size needs to be a multiple of 32. But when I tried low resolutions with p5, p6 and p7, not all of them worked. It seems to me that halves and quarters of the default resolution work, e.g. for p6 (1280x1280), 640x640 and 320x320 worked but 416x416 didn't. I couldn't complete the test because my GPU didn't have enough memory to train the models; I was barely able to load the network.
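These observations are consistent with a multiple-of-64 rule for P6 rather than multiple-of-32: 640 and 320 are multiples of 64, while 416 is only a multiple of 32. A quick check, assuming P6's deepest output has a stride of 2^6 = 64:

```python
P6_STRIDE = 64  # assumption: six stride-2 downsamples, i.e. 2**6

def fits_stride(size: int, stride: int) -> bool:
    """True if `size` divides evenly through every stride-2 level down to `stride`."""
    return size % stride == 0

for size in (1280, 640, 416, 320):
    print(size, "x32:", fits_stride(size, 32), "x64:", fits_stride(size, P6_STRIDE))
```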

KalanaRatnayake commented 3 years ago

And it seems they need to be equal: width=height. So in your case you would have to try 1280x1280 and feed it the 1152x800 image. Internally, the image is padded so that it becomes 1280x1280. I read this somewhere but can't remember exactly where; check the main README.md.
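The padding described here is usually called letterboxing: the image is scaled by a single factor so it fits inside the network input without distorting its aspect ratio, and the leftover area is filled. A sketch of that arithmetic for a 1152x800 image and a 1280x1280 network, illustrating the generic technique only; whether and how Darknet letterboxes for a given cfg is not confirmed here:

```python
def letterbox(img_w: int, img_h: int, net_w: int, net_h: int) -> tuple[int, int, int, int]:
    """Fit the image inside net_w x net_h keeping its aspect ratio.

    Returns (scaled width, scaled height, total horizontal padding, total vertical padding).
    """
    scale = min(net_w / img_w, net_h / img_h)
    new_w, new_h = round(img_w * scale), round(img_h * scale)
    return new_w, new_h, net_w - new_w, net_h - new_h

print(letterbox(1152, 800, 1280, 1280))
```

The width fills the input exactly and roughly 390 rows of padding are added vertically, so a large part of every 1280x1280 input would carry no image content.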

lsd1994 commented 3 years ago

In YOLOv4 the width and height must be multiples of 32 because the network downsamples 5 times. So you should check how many times this model downsamples, and then you can pick a valid resolution.

mjack3 commented 3 years ago

@lsd1994 And what is the relation between 32 and the number of downsamples?

lsd1994 commented 3 years ago

Each downsample halves the width and height, and 2^5 = 32, so 32 pixels in the original image become 1 pixel in the last feature map after 5 downsamples.
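Applying the same rule to the original 1152x800 setting may explain why P5 trained but P6 did not. A minimal sketch, assuming P5, P6 and P7 downsample 5, 6 and 7 times respectively (as their names suggest), so each needs a multiple of 32, 64 and 128:

```python
def required_multiple(num_downsamples: int) -> int:
    # Each stride-2 downsample halves the map, so n of them need a factor of 2**n.
    return 2 ** num_downsamples

def valid_size(width: int, height: int, num_downsamples: int) -> bool:
    m = required_multiple(num_downsamples)
    return width % m == 0 and height % m == 0

# Assumption: downsample counts per model.
for name, n in (("p5", 5), ("p6", 6), ("p7", 7)):
    print(name, required_multiple(n), valid_size(1152, 800, n))
```

Here 1152 is a multiple of 32, 64 and 128, but 800 is only a multiple of 32 (800 / 64 = 12.5), so 1152x800 passes for p5 and fails for p6 and p7.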

Micky-123 commented 2 years ago

I got the same issue while trying to retrain yolov4-tiny at 640x360. I then changed 360 to 384 (to make sure it is a multiple of 32) and it worked.
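The fix used here, rounding a dimension up to the next valid multiple, can be written as a small helper. A sketch using integer ceiling division; the multiple is 32 for yolov4-tiny as in this comment, and 64 is shown for P6 under the assumption from earlier in the thread that P6 needs multiples of 64:

```python
def round_up(size: int, multiple: int = 32) -> int:
    """Smallest multiple of `multiple` that is >= size (ceiling division)."""
    return -(-size // multiple) * multiple

print(round_up(640), round_up(360))  # 640 384
print(round_up(800, 64))             # 832
```

Rounding down instead (e.g. 800 to 768) also yields a valid size; rounding up keeps slightly more resolution at a small cost in memory.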