deimsdeutsch opened this issue 6 years ago
Since Yolov3 has been trained on MSCOCO, most of the people marked in that dataset are either close or visible to the naked eye. If a dataset includes people marked quite far away, with the same annotation quality as the people marked close up, will the object detection work?
What do you mean by "equal quality"?
If this is not a dataset issue, then why are scaling algorithms like this producing better results? https://github.com/peiyunh/tiny
Better than what? I don't know what AP we can get by training Yolo v3 on FDDB and WIDER FACE.
As far as I can see, they used ResNet-101 as the backbone, with a detector at 3 scales. Yolo v3 uses a feature pyramid network (with residual connections) as the backbone, also with a detector at 3 scales.
I scrolled through the article but did not find what subsampling value they used for each of the three detectors; as far as I can see, they used 3 resolution scales: https://github.com/peiyunh/tiny/blob/745a49cb38182a79f17d5090257914669d8e8089/tiny_face_detector.m#L85-L86
Are we missing a few image-scaling functions in the training?
What do you mean?
In Yolo v3, the optimal object sizes (and optimal anchors) are:
yolo Region 82 - from 32x32 to ~886x886 pixels: https://github.com/AlexeyAB/darknet/blob/4403e71b330b42d3cda1e0721fb645cf41bac14f/cfg/yolov3.cfg#L607
yolo Region 94 - from 16x16 to ~518x518 pixels (up to ~886x886 pixels): https://github.com/AlexeyAB/darknet/blob/4403e71b330b42d3cda1e0721fb645cf41bac14f/cfg/yolov3.cfg#L693
yolo Region 106 - from 8x8 to ~206x206 pixels (up to ~886x886 pixels): https://github.com/AlexeyAB/darknet/blob/4403e71b330b42d3cda1e0721fb645cf41bac14f/cfg/yolov3.cfg#L780
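To make the scale relationship concrete, here is a minimal illustration (my sketch, assuming the usual Yolo v3 subsampling strides of 32, 16 and 8 for Regions 82, 94 and 106): each [yolo] layer predicts on a grid of network_size/stride cells, and the stride is roughly the smallest object size that grid can represent well.

```c
/* My illustration, not darknet source: grid resolution per [yolo] layer.
 * Assumes a square network input and the standard Yolo v3 strides. */
#include <stdio.h>

int main(void)
{
    int net = 1024;               /* e.g. the width=height suggested below */
    int strides[] = {32, 16, 8};  /* yolo Regions 82, 94, 106              */
    for (int i = 0; i < 3; i++)
        printf("stride %2d -> %3dx%-3d grid, smallest optimal objects ~%2dx%2d px\n",
               strides[i], net / strides[i], net / strides[i],
               strides[i], strides[i]);
    return 0;
}
```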
You can change
this line: https://github.com/AlexeyAB/darknet/blob/6390a5a2ab61a0bdf6f1a9a6b4a739c16b36e0d7/cfg/yolov3.cfg#L720
to this: layers = -1, 4
and this line: https://github.com/AlexeyAB/darknet/blob/6390a5a2ab61a0bdf6f1a9a6b4a739c16b36e0d7/cfg/yolov3.cfg#L717
to this: stride=8
so that yolo Region 106 will detect from 2x2 to ~20x20 pixels (up to ~886x886 pixels); see the snippet below.
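For reference, after both edits the small-object branch of the cfg should look like this (it matches the modified cfg posted further down; the # comments are mine):

```
[upsample]
# was stride=2: with stride=8 the features are upsampled 8x, down to subsampling stride 2
stride=8

[route]
# was layers = -1, 61: now concatenates with layer 4, an early high-resolution feature map
layers = -1, 4
```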
Then set width=1024 height=1024 in the cfg-file, and set random=1 for each of the [yolo] layers.
Change this line: https://github.com/AlexeyAB/darknet/blob/6390a5a2ab61a0bdf6f1a9a6b4a739c16b36e0d7/src/detector.c#L132
to float random_val = rand_scale(2);
or float random_val = rand_scale(4);
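For context, rand_scale() is defined in darknet's src/utils.c; a simplified sketch is below. It returns a factor in [1/s, s], so rand_scale(2) lets the random=1 multi-scale training vary the network resolution between roughly half and double the cfg width/height:

```c
/* Simplified sketch of darknet's rand_scale() from src/utils.c:
 * returns a random resize factor in [1/s, s]. */
#include <stdlib.h>

static float rand_uniform(float lo, float hi)
{
    return lo + (hi - lo) * ((float)rand() / (float)RAND_MAX);
}

float rand_scale(float s)
{
    float scale = rand_uniform(1, s);   /* uniform factor in [1, s]  */
    if (rand() % 2) return scale;       /* half the time: scale up   */
    return 1.0f / scale;                /* otherwise: scale down     */
}
```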
And train for about 1,000,000 iterations on FDDB and WIDER FACE with only 1 class (face).
Then you can try to get AP for face by using ./darknet detector map ...
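For example (the file names here are hypothetical placeholders for your own data/cfg/weights):
./darknet detector map data/face.data cfg/yolov3-face.cfg backup/yolov3-face_final.weights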
@AlexeyAB
Thanks for the info. Will this type of configuration increase the number of false positives?
[net]
# Testing
#batch=1
#subdivisions=1
# Training
batch=64
subdivisions=4
width=1024
height=1024
channels=3
momentum=0.9
decay=0.0005
angle=0
saturation = 1.5
exposure = 1.5
hue=.1
learning_rate=0.001
burn_in=1000
max_batches = 500200
policy=steps
steps=400000,450000
scales=.1,.1
[convolutional]
batch_normalize=1
filters=32
size=3
stride=1
pad=1
activation=leaky
# Downsample
[convolutional]
batch_normalize=1
filters=64
size=3
stride=2
pad=1
activation=leaky
[convolutional]
batch_normalize=1
filters=32
size=1
stride=1
pad=1
activation=leaky
[convolutional]
batch_normalize=1
filters=64
size=3
stride=1
pad=1
activation=leaky
[shortcut]
from=-3
activation=linear
# Downsample
[convolutional]
batch_normalize=1
filters=128
size=3
stride=2
pad=1
activation=leaky
[convolutional]
batch_normalize=1
filters=64
size=1
stride=1
pad=1
activation=leaky
[convolutional]
batch_normalize=1
filters=128
size=3
stride=1
pad=1
activation=leaky
[shortcut]
from=-3
activation=linear
[convolutional]
batch_normalize=1
filters=64
size=1
stride=1
pad=1
activation=leaky
[convolutional]
batch_normalize=1
filters=128
size=3
stride=1
pad=1
activation=leaky
[shortcut]
from=-3
activation=linear
# Downsample
[convolutional]
batch_normalize=1
filters=256
size=3
stride=2
pad=1
activation=leaky
[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky
[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=leaky
[shortcut]
from=-3
activation=linear
[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky
[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=leaky
[shortcut]
from=-3
activation=linear
[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky
[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=leaky
[shortcut]
from=-3
activation=linear
[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky
[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=leaky
[shortcut]
from=-3
activation=linear
[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky
[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=leaky
[shortcut]
from=-3
activation=linear
[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky
[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=leaky
[shortcut]
from=-3
activation=linear
[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky
[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=leaky
[shortcut]
from=-3
activation=linear
[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky
[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=leaky
[shortcut]
from=-3
activation=linear
# Downsample
[convolutional]
batch_normalize=1
filters=512
size=3
stride=2
pad=1
activation=leaky
[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky
[convolutional]
batch_normalize=1
filters=512
size=3
stride=1
pad=1
activation=leaky
[shortcut]
from=-3
activation=linear
[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky
[convolutional]
batch_normalize=1
filters=512
size=3
stride=1
pad=1
activation=leaky
[shortcut]
from=-3
activation=linear
[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky
[convolutional]
batch_normalize=1
filters=512
size=3
stride=1
pad=1
activation=leaky
[shortcut]
from=-3
activation=linear
[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky
[convolutional]
batch_normalize=1
filters=512
size=3
stride=1
pad=1
activation=leaky
[shortcut]
from=-3
activation=linear
[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky
[convolutional]
batch_normalize=1
filters=512
size=3
stride=1
pad=1
activation=leaky
[shortcut]
from=-3
activation=linear
[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky
[convolutional]
batch_normalize=1
filters=512
size=3
stride=1
pad=1
activation=leaky
[shortcut]
from=-3
activation=linear
[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky
[convolutional]
batch_normalize=1
filters=512
size=3
stride=1
pad=1
activation=leaky
[shortcut]
from=-3
activation=linear
[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky
[convolutional]
batch_normalize=1
filters=512
size=3
stride=1
pad=1
activation=leaky
[shortcut]
from=-3
activation=linear
# Downsample
[convolutional]
batch_normalize=1
filters=1024
size=3
stride=2
pad=1
activation=leaky
[convolutional]
batch_normalize=1
filters=512
size=1
stride=1
pad=1
activation=leaky
[convolutional]
batch_normalize=1
filters=1024
size=3
stride=1
pad=1
activation=leaky
[shortcut]
from=-3
activation=linear
[convolutional]
batch_normalize=1
filters=512
size=1
stride=1
pad=1
activation=leaky
[convolutional]
batch_normalize=1
filters=1024
size=3
stride=1
pad=1
activation=leaky
[shortcut]
from=-3
activation=linear
[convolutional]
batch_normalize=1
filters=512
size=1
stride=1
pad=1
activation=leaky
[convolutional]
batch_normalize=1
filters=1024
size=3
stride=1
pad=1
activation=leaky
[shortcut]
from=-3
activation=linear
[convolutional]
batch_normalize=1
filters=512
size=1
stride=1
pad=1
activation=leaky
[convolutional]
batch_normalize=1
filters=1024
size=3
stride=1
pad=1
activation=leaky
[shortcut]
from=-3
activation=linear
######################
[convolutional]
batch_normalize=1
filters=512
size=1
stride=1
pad=1
activation=leaky
[convolutional]
batch_normalize=1
size=3
stride=1
pad=1
filters=1024
activation=leaky
[convolutional]
batch_normalize=1
filters=512
size=1
stride=1
pad=1
activation=leaky
[convolutional]
batch_normalize=1
size=3
stride=1
pad=1
filters=1024
activation=leaky
[convolutional]
batch_normalize=1
filters=512
size=1
stride=1
pad=1
activation=leaky
[convolutional]
batch_normalize=1
size=3
stride=1
pad=1
filters=1024
activation=leaky
[convolutional]
size=1
stride=1
pad=1
filters=18
activation=linear
[yolo]
mask = 6,7,8
anchors = 10,13, 16,30, 33,23, 30,61, 62,45, 59,119, 116,90, 156,198, 373,326
classes=1
num=9
jitter=.3
ignore_thresh = .7
truth_thresh = 1
random=1
[route]
layers = -4
[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky
[upsample]
stride=2
[route]
layers = -1, 61
[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky
[convolutional]
batch_normalize=1
size=3
stride=1
pad=1
filters=512
activation=leaky
[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky
[convolutional]
batch_normalize=1
size=3
stride=1
pad=1
filters=512
activation=leaky
[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky
[convolutional]
batch_normalize=1
size=3
stride=1
pad=1
filters=512
activation=leaky
[convolutional]
size=1
stride=1
pad=1
filters=18
activation=linear
[yolo]
mask = 3,4,5
anchors = 10,13, 16,30, 33,23, 30,61, 62,45, 59,119, 116,90, 156,198, 373,326
classes=1
num=9
jitter=.3
ignore_thresh = .7
truth_thresh = 1
random=1
[route]
layers = -4
[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky
[upsample]
stride=8
[route]
layers = -1, 4
[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky
[convolutional]
batch_normalize=1
size=3
stride=1
pad=1
filters=256
activation=leaky
[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky
[convolutional]
batch_normalize=1
size=3
stride=1
pad=1
filters=256
activation=leaky
[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky
[convolutional]
batch_normalize=1
size=3
stride=1
pad=1
filters=256
activation=leaky
[convolutional]
size=1
stride=1
pad=1
filters=18
activation=linear
[yolo]
mask = 0,1,2
anchors = 10,13, 16,30, 33,23, 30,61, 62,45, 59,119, 116,90, 156,198, 373,326
classes=1
num=9
jitter=.3
ignore_thresh = .7
truth_thresh = 1
random=1
Thanks for the info. Will this type of configuration increase the number of false positives?
Usually yes, it increases True Positives and False Positives.
@AlexeyAB
[root@localhost darknet]# ./darknet detector calc_anchors data/facedetection/face.data -num_of_clusters 9 -width 1024 -height 1024
num_of_clusters = 9, width = 1024, height = 1024
read labels from 126747 images
loaded image: 126747 box: 312094
all loaded.
calculating k-means++ ...
i = 41222, box_w = 0, box_h = 0, anchor_w = 1139964740, anchor_h = 1138366106, iou = -11.000000
i = 95110, box_w = 0, box_h = 0, anchor_w = -829826680, anchor_h = 96, iou = -26.000000
i = 194408, box_w = 0, box_h = 0, anchor_w = -829826680, anchor_h = 88, iou = -2.000000
avg IoU = 57.68 %
Saving anchors to the file: anchors.txt
anchors = 15.6696,27.8704, 69.0891,114.4718, 213.6109,267.2336, 365.7014,360.9078, 303.5794,528.7938, 484.9630,489.6827, 436.1766,743.6798, 605.8227,617.4738, 729.0613,799.2632
Can you please explain more about these values and the anchors generated for the dataset?
Thanks.
i = 41222, box_w = 0, box_h = 0, anchor_w = 1139964740, anchor_h = 1138366106, iou = -11.000000
i = 95110, box_w = 0, box_h = 0, anchor_w = -829826680, anchor_h = 96, iou = -26.000000
i = 194408, box_w = 0, box_h = 0, anchor_w = -829826680, anchor_h = 88, iou = -2.000000
You have at least 3 labels with width=0 and height=0 in your dataset (the huge anchor_w/anchor_h numbers above are likely garbage values produced by these zero-size boxes): https://github.com/AlexeyAB/darknet/blob/6390a5a2ab61a0bdf6f1a9a6b4a739c16b36e0d7/src/detector.c#L999-L1002
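To locate the broken boxes, a quick hypothetical checker (my sketch, not part of darknet) can scan YOLO-format label files and report any box whose width or height is not strictly positive:

```c
/* Hypothetical helper, not part of darknet: pass .txt label files on the
 * command line; prints every box with w <= 0 or h <= 0, i.e. the labels
 * that break calc_anchors. */
#include <stdio.h>

int main(int argc, char **argv)
{
    for (int a = 1; a < argc; a++) {
        FILE *f = fopen(argv[a], "r");
        if (!f) { fprintf(stderr, "cannot open %s\n", argv[a]); continue; }
        int id;
        float x, y, w, h;
        while (fscanf(f, "%d %f %f %f %f", &id, &x, &y, &w, &h) == 5) {
            if (w <= 0.0f || h <= 0.0f)
                printf("%s: class %d has w=%g h=%g\n", argv[a], id, w, h);
        }
        fclose(f);
    }
    return 0;
}
```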
Can you please explain more about these values and the anchors generated for the dataset?
Anchors are the initial sizes of objects. Yolo v3 selects the anchor that most closely corresponds to the current object, and then just adjusts this anchor to the size of the object, instead of predicting the absolute size of the object.
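To illustrate (my sketch based on the box equations in the Yolo v3 paper, not the verbatim darknet source): the network predicts offsets (tx, ty, tw, th) relative to the grid cell (cx, cy) and the chosen anchor (pw, ph), so the anchor is only a starting size that gets rescaled:

```c
/* Sketch of Yolo v3 box decoding (per the paper): the anchor (pw, ph)
 * is adjusted by exp(tw), exp(th) rather than the network predicting
 * absolute sizes. Compile with -lm. */
#include <math.h>

typedef struct { float x, y, w, h; } box;

static float sigmoidf(float v) { return 1.0f / (1.0f + expf(-v)); }

box decode_box(float tx, float ty, float tw, float th,
               float cx, float cy, float pw, float ph)
{
    box b;
    b.x = cx + sigmoidf(tx);  /* center x within the grid     */
    b.y = cy + sigmoidf(ty);  /* center y within the grid     */
    b.w = pw * expf(tw);      /* anchor width scaled by e^tw  */
    b.h = ph * expf(th);      /* anchor height scaled by e^th */
    return b;
}
```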