training wasn't converging with visdrone2019 dataset

gameliee commented 5 years ago

I've tried to train the v3-tiny model with the visdrone2019 dataset. It doesn't seem to be converging so far. Could you kindly give me some advice? Thanks a lot.

Data: The objects in this dataset are quite small. When calculating anchors for a network size of 416x416, the result was anchors = 3, 6, 6, 13, 14, 11, 13, 27, 28, 31, 49, 64

What I've done: recalculated the anchors, verified that the annotations are correct, changed saturation = 1.8, exposure = 1.8, jitter = .8, and changed the learning rate a bit.

The chart: [image: chart]

The console output:

(next mAP calculation at 1808 iterations)
1592: nan, nan avg loss, 0.001000 rate, 0.903414 seconds, 101888 images
Loaded: 0.000038 seconds
Region 16 Avg IOU: nan, Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000, count: 42
Region 23 Avg IOU: nan, Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000, count: 104
OpenCV can't augment image: 480 x 480
Region 16 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.000000, .5R: -nan, .75R: -nan, count: 0
Region 23 Avg IOU: nan, Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000, count: 83
OpenCV can't augment image: 480 x 480
OpenCV can't augment image: 480 x 480
Region 16 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.000000, .5R: -nan, .75R: -nan, count: 0
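When the loss turns into nan, it helps to know at which iteration that first happened. Here is a minimal Python sketch that scans a saved training log for the first summary line whose loss is nan. The function name first_nan_iteration and the first sample line are made up for illustration; only the second sample line comes from the output above.

```python
import re

# Matches the per-iteration summary, e.g.
# "1592: nan, nan avg loss, 0.001000 rate, ..."
SUMMARY = re.compile(r"(\d+):\s*(\S+),\s*(\S+) avg loss")


def first_nan_iteration(log_lines):
    """Return the first iteration whose loss or avg loss is NaN, else None."""
    for line in log_lines:
        m = SUMMARY.search(line)
        if not m:
            continue  # Region / OpenCV lines are skipped
        iteration = int(m.group(1))
        loss, avg_loss = m.group(2).lower(), m.group(3).lower()
        if "nan" in loss or "nan" in avg_loss:
            return iteration
    return None


sample = [
    # Hypothetical healthy line, invented for this example:
    "1591: 2.514, 2.630 avg loss, 0.001000 rate, 0.91 seconds, 101824 images",
    # Line taken from the console output above:
    "1592: nan, nan avg loss, 0.001000 rate, 0.903414 seconds, 101888 images",
]
print(first_nan_iteration(sample))
```

If you redirect the training output to a file, you could pass open("train.log") (an assumed file name) instead of the sample list.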

Here is the config:

[net]
# Testing
# batch=1
# subdivisions=1
# Training
batch=64
subdivisions=16
width=416
height=416
channels=3
momentum=0.9
decay=0.0005
angle=0
saturation = 1.8
exposure = 1.8
hue=.1

learning_rate=0.01
burn_in=1000
max_batches = 100000
policy=steps
steps=1000,40000,80000
scales=.1,.1,.1

[convolutional]
batch_normalize=1
filters=16
size=3
stride=1
pad=1
activation=leaky

[maxpool]
size=2
stride=2

[convolutional]
batch_normalize=1
filters=32
size=3
stride=1
pad=1
activation=leaky

[maxpool]
size=2
stride=2

[convolutional]
batch_normalize=1
filters=64
size=3
stride=1
pad=1
activation=leaky

[maxpool]
size=2
stride=2

[convolutional]
batch_normalize=1
filters=128
size=3
stride=1
pad=1
activation=leaky

[maxpool]
size=2
stride=2

[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=leaky

[maxpool]
size=2
stride=2

[convolutional]
batch_normalize=1
filters=512
size=3
stride=1
pad=1
activation=leaky

[maxpool]
size=2
stride=1

[convolutional]
batch_normalize=1
filters=1024
size=3
stride=1
pad=1
activation=leaky

###########

[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=512
size=3
stride=1
pad=1
activation=leaky

[convolutional]
size=1
stride=1
pad=1
filters=51
activation=linear

[yolo]
mask = 3,4,5
anchors = 3,  6,   6, 13,  14, 11,  13, 27,  28, 31,  49, 64
classes=12
num=6
jitter=.8
ignore_thresh = .7
truth_thresh = 1
random=1

[route]
layers = -4

[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky

[upsample]
stride=2
# stride=2

[route]
layers = -1, 8
# layers = -1, 8

[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=leaky

[convolutional]
size=1
stride=1
pad=1
filters=51
activation=linear

[yolo]
mask = 0,1,2
anchors = 3,  6,   6, 13,  14, 11,  13, 27,  28, 31,  49, 64
classes=12
num=6
jitter=.8
ignore_thresh = .7
truth_thresh = 1
random=1
max=200
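To see what learning rate this config actually produces, here is a rough Python sketch of a steps policy with burn-in, as I understand darknet applies it. The function name current_rate is made up, and the burn-in exponent power=4 is an assumption (it is not set in the config above). With these values the rate ramps up to almost 0.01 by the end of burn_in and then drops by 10x at step 1000, which agrees with the "0.001000 rate" printed at iteration 1592.

```python
def current_rate(iteration, learning_rate, burn_in, steps, scales, power=4.0):
    """Sketch of a 'steps' schedule with polynomial burn-in (power is assumed)."""
    if iteration < burn_in:
        # Warm-up: ramp from 0 towards learning_rate
        return learning_rate * (iteration / burn_in) ** power
    rate = learning_rate
    for step, scale in zip(steps, scales):
        if step > iteration:
            return rate
        rate *= scale  # each step already reached multiplies the rate
    return rate


# Values copied from the [net] section above
cfg = dict(learning_rate=0.01, burn_in=1000,
           steps=[1000, 40000, 80000], scales=[0.1, 0.1, 0.1])

for it in (500, 999, 1000, 1592, 40000):
    print(it, f"{current_rate(it, **cfg):.6f}")
```

As far as I recall, the stock yolov3-tiny.cfg uses learning_rate=0.001, so the peak here is about ten times higher; that may be worth ruling out as a cause of the nan loss.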

AlexeyAB commented 5 years ago

@ntd94 Hi,

OpenCV can't augment image: 480 x 480 OpenCV can't augment image: 480 x 480

It means that some of your images are broken.

Can you show the contents of the files bad.list and bad_label.list?

First, try training with the default model https://raw.githubusercontent.com/AlexeyAB/darknet/master/cfg/yolov3-tiny_3l.cfg with the default params.
Also, run training with the -show_imgs flag. Do you see correct labels on the images?
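Besides looking at the images, the label files can be checked with a script. This is a minimal Python sketch, assuming the usual darknet label format of one object per line, `<class> <x_center> <y_center> <width> <height>`, with coordinates normalized to the range 0..1. The function name check_label_lines is made up, and num_classes=12 is taken from classes=12 in the config above.

```python
def check_label_lines(lines, num_classes):
    """Return (line_number, problem) pairs for one darknet-style label file."""
    problems = []
    for n, line in enumerate(lines, start=1):
        parts = line.split()
        if not parts:
            continue  # blank lines are ignored
        if len(parts) != 5:
            problems.append((n, "expected 5 fields"))
            continue
        try:
            cls = int(parts[0])
            x, y, w, h = map(float, parts[1:])
        except ValueError:
            problems.append((n, "non-integer class or non-numeric field"))
            continue
        if not 0 <= cls < num_classes:
            problems.append((n, f"class id {cls} out of range"))
        if not (0.0 <= x <= 1.0 and 0.0 <= y <= 1.0):
            problems.append((n, "box center outside 0..1"))
        if not (0.0 < w <= 1.0 and 0.0 < h <= 1.0):
            problems.append((n, "box size not in (0, 1]"))
    return problems


# Made-up label lines for illustration
example = ["0 0.5 0.5 0.02 0.03", "12 0.5 0.5 0.1 0.1", "3 0.2 0.2 0 0.1"]
for line_no, problem in check_label_lines(example, num_classes=12):
    print(line_no, problem)
```

To check a whole dataset, you could loop over the .txt files and call the function with open(path).read().splitlines().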

gameliee commented 5 years ago

Hi @AlexeyAB, thanks for your reply. I've done the tasks you recommended.

I checked the bad_label.list file and deleted all the corresponding images. There was no bad.list file.
I trained with the yolov3-tiny_3l config. After 4 days of training, the result still doesn't look promising.
I trained with the -show_imgs flag, and the console got stuck here

What should I do now?

AlexeyAB commented 5 years ago

@ntd94

It seems that you should train with a higher resolution.

Or try to train yolov3-tiny_3l.cfg with width=1024 height=1024
Or yolov3-spp.cfg with width=832 height=832 random=0
Or, if you will only use this repository, it is better to train the new models: https://github.com/AlexeyAB/darknet/issues/3114#issuecomment-494148968
- for example https://github.com/AlexeyAB/darknet/files/3253820/yolo_v3_spp_pan_scale.cfg.txt; just try to set the highest possible resolution (a multiple of 32)
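When changing width and height, the anchors in the cfg are in pixels of the network input, so anchors computed for 416x416 no longer fit. Here is a small Python sketch that rounds a resolution down to a multiple of 32 and rescales the 416x416 anchors from the first post linearly. Both function names are made up, and linear scaling is only a rough estimate; recalculating the anchors on the dataset for the new size should be more reliable.

```python
def round_down_to_multiple(value, base=32):
    """Largest multiple of `base` that is not greater than `value`."""
    return (value // base) * base


def scale_anchors(anchors, from_size, to_size):
    """Linearly rescale (w, h) anchors from one square input size to another."""
    factor = to_size / from_size
    return [(round(w * factor), round(h * factor)) for w, h in anchors]


# Anchors from the first post, computed for a 416x416 network
anchors_416 = [(3, 6), (6, 13), (14, 11), (13, 27), (28, 31), (49, 64)]

size = round_down_to_multiple(1024)  # 1024 is already a multiple of 32
print(size, scale_anchors(anchors_416, 416, size))
```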

gameliee commented 5 years ago

@AlexeyAB

I've tried to train yolov3-tiny_3l.cfg with width = 1024 height = 1024 and the results are: (there are 2 images because I got a segmentation fault once)
Currently, I'm referring to this repo (nvidia trt yolo) to implement the inference phase, so I'm not sure whether I can implement the -spp.cfg or the LSTM ones. I want to implement a module that does object detection on videos at 30 FPS on a Jetson TX2. Could you give me any advice? It would be appreciated. Thank you in advance.

AlexeyAB commented 5 years ago

@ntd94

With TRT you currently can't use the PAN, Trident, or LSTM networks.

Try to use the default SPP model https://github.com/AlexeyAB/darknet/blob/master/cfg/yolov3-spp.cfg and https://pjreddie.com/media/files/yolov3-spp.weights with https://github.com/NVIDIA-AI-IOT/deepstream_reference_apps/blob/master/yolo/README.md. Does it work successfully?

AlexeyAB / darknet

training wasn't converging with visdrone2019 dataset #3267