YOLOv4 Leaky training diverges

willbattel commented 4 years ago

I'm trying to train YOLOv4 and YOLOv4 Leaky on a custom dataset. YOLOv4 is working, but YOLOv4 Leaky diverges during training (avg loss becomes nan). I pulled the cfg file from the Model Zoo and changed its parameters to match the values in my YOLOv4 cfg file. The only difference between the two files is the mish/leaky activations specified in the Model Zoo files.

./darknet detector train data/custom/data.txt cfg/yolov4-leaky-custom.cfg yolov4.conv.137 -map -dont_show
 CUDA-version: 10000 (10020), cuDNN: 7.4.2, CUDNN_HALF=1, GPU count: 1  
 CUDNN_HALF=1 
 OpenCV version: 3.2.0
 Prepare additional network for mAP calculation...
 0 : compute_capability = 700, cudnn_half = 1, GPU: Tesla V100-SXM2-16GB 
net.optimized_memory = 0 
mini_batch = 1, batch = 64, time_steps = 1, train = 0

[net]
# Testing
#batch=1
#subdivisions=1
# Training
batch=64
subdivisions=64
width=576
height=1024
channels=3
momentum=0.949
decay=0.0005
angle=0
saturation = 1.5
exposure = 1.5
hue=.1

learning_rate=0.00261
burn_in=1000
max_batches = 25000
policy=steps
steps=20000,22500
scales=.1,.1

mosaic=1

Compare with YOLOv4 cfg - https://www.diffchecker.com/8ulGrJk7
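The claim that the two cfg files differ only in their activation lines can also be checked locally instead of through an online diff. The following is a minimal sketch, assuming both files are plain darknet-style `key=value` text; the helper names `changed_lines` and `only_activations_differ` and the file paths in the usage comment are hypothetical, not part of darknet.

```python
import difflib


def changed_lines(cfg_a, cfg_b):
    """Return (removed, added) lines between two cfg texts.

    Blank lines and whitespace inside a line are ignored, so
    'saturation = 1.5' and 'saturation=1.5' compare as equal.
    """
    def normalize(text):
        return ["".join(line.split()) for line in text.splitlines() if line.strip()]

    removed, added = [], []
    for line in difflib.ndiff(normalize(cfg_a), normalize(cfg_b)):
        if line.startswith("- "):
            removed.append(line[2:])
        elif line.startswith("+ "):
            added.append(line[2:])
    return removed, added


def only_activations_differ(cfg_a, cfg_b):
    """True when every differing line is an 'activation=' line."""
    removed, added = changed_lines(cfg_a, cfg_b)
    return all(line.startswith("activation=") for line in removed + added)


# Usage (hypothetical paths):
# with open("cfg/yolov4-custom.cfg") as f_a, open("cfg/yolov4-leaky-custom.cfg") as f_b:
#     print(only_activations_differ(f_a.read(), f_b.read()))
```

If this returns `False`, printing the result of `changed_lines` shows which non-activation settings differ between the two files.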

I had two ideas about why this might be happening:

1. Is darknet trying to train the alpha value in the activations? In other words, is darknet using LReLU or PReLU? If darknet is trying to train the activation layers (PReLU), then the base weights in yolov4.conv.137 won't work. I didn't see any other weights to use for YOLOv4 Leaky, so I used the same ones as for YOLOv4.
2. Is the learning rate too high for Leaky? Both cfg files use the same learning rate, but maybe that value isn't appropriate for YOLOv4 Leaky. Have you tested the learning rate on v4 Leaky? I read your YOLOv4 paper and I'm wondering whether the proper learning rate was found for YOLOv4 Leaky or only for YOLOv4.

Otherwise, I'm not sure why v4 Leaky is failing. Normal v4 is training perfectly fine and is configured nearly identically.
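To make the LReLU question concrete, here is a small sketch of the two activations side by side. It assumes a leaky ReLU with a fixed negative slope of 0.1 (the constant `LEAKY_SLOPE` is an illustrative name, and the value is an assumption about darknet's `leaky` activation, not something confirmed in this thread) and the standard definition mish(x) = x * tanh(softplus(x)).

```python
import math

# Assumption: fixed (non-trainable) negative slope for the leaky activation.
LEAKY_SLOPE = 0.1


def leaky(x):
    """Leaky ReLU with a fixed slope: no learnable parameter involved."""
    return x if x > 0 else LEAKY_SLOPE * x


def mish(x):
    """mish(x) = x * tanh(softplus(x)), with softplus(x) = ln(1 + e^x)."""
    # For large x, softplus(x) is numerically equal to x; this avoids overflow in exp.
    softplus = x if x > 20 else math.log1p(math.exp(x))
    return x * math.tanh(softplus)


if __name__ == "__main__":
    for x in (-100.0, -2.0, 0.0, 2.0):
        print(f"x={x:8.1f}  leaky={leaky(x):10.4f}  mish={mish(x):10.4f}")
```

Under these assumptions, neither function carries a trainable parameter, so a weights file holding only convolution and normalization parameters loads for either cfg. The outputs do differ for negative inputs, though: mish is bounded below while leaky grows linearly, so weights pretrained with one activation may see a different activation scale under the other.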

willbattel commented 4 years ago

I lowered the LR from the default 0.00261 to 0.00200 and so far it seems to be running nominally. I'll keep watching it.
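For reference, the effective learning rate over the run can be sketched from the cfg values above. This is a rough model, not darknet's code: it assumes a polynomial burn-in with exponent 4 (the `power` default here is an assumption) followed by the `steps` policy, where the rate is multiplied by each entry of `scales` once the matching step is reached. The function name `learning_rate` is hypothetical.

```python
def learning_rate(iteration, base_lr=0.00261, burn_in=1000,
                  steps=(20000, 22500), scales=(0.1, 0.1), power=4.0):
    """Approximate learning rate at a given iteration for policy=steps."""
    if iteration < burn_in:
        # Warm-up: ramp from 0 to base_lr over the first burn_in iterations.
        return base_lr * (iteration / burn_in) ** power
    lr = base_lr
    for step, scale in zip(steps, scales):
        if iteration >= step:
            lr *= scale
    return lr


if __name__ == "__main__":
    for it in (0, 500, 1000, 20000, 22500):
        print(it, learning_rate(it), learning_rate(it, base_lr=0.002))
```

In this model, lowering `base_lr` from 0.00261 to 0.002 scales the whole curve by the same factor, including the peak reached at the end of burn-in. If the loss turns to nan near iteration 1000, that peak is a plausible place to look.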

lovepan1 commented 3 years ago

Hello, I have the same problem. I used yolov4-leaky.cfg and yolov4.weights to train the model, and after some iterations the loss becomes nan. I lowered the LR from the default 0.00261 to 0.00100, but it made no difference. Can you help solve this problem? @willbattel

willbattel commented 3 years ago

Mine worked with LR of 0.002, so I'm not sure what the problem is in your case. I would try on other models, like normal v4, and see if you have the same issue or if it's only with v4 Leaky.

pupu2014 commented 3 years ago

Maybe you need to use cd53paspp-gamma_final.weights as the pretrained weights, because of the difference between leaky and swish.

AlexeyAB / darknet

YOLOv4 Leaky training diverges #6599