deimsdeutsch opened this issue 6 years ago
Since Yolov3 has been trained on MSCOCO, most of the people marked in that dataset are either close or visible to the naked eye. If a dataset includes people marked quite far away, with the same annotation quality as the people marked close up, will the object detection work?
What do you mean by "equal quality"?
If this is not a dataset issue, then why are scaling algorithms like this producing better results? https://github.com/peiyunh/tiny
Better than what? I don't know what AP we can get by training Yolo v3 on FDDB and WIDER FACE.
As far as I can see, they used ResNet-101 as the backbone, with a detector at 3 scales. Yolo v3 uses a feature pyramid network (with residual connections) as the backbone, also with a detector at 3 scales.
I scrolled through the article but did not find what subsampling value they used for each of the three detectors; as far as I can see, they used 3 resolution scales: https://github.com/peiyunh/tiny/blob/745a49cb38182a79f17d5090257914669d8e8089/tiny_face_detector.m#L85-L86
Are we missing a few image-scaling functions in the training?
What do you mean?
In Yolo v3, the optimal object sizes (and optimal anchors) are:
yolo Region 82 - from 32x32 to ~886x886 pixels: https://github.com/AlexeyAB/darknet/blob/4403e71b330b42d3cda1e0721fb645cf41bac14f/cfg/yolov3.cfg#L607
yolo Region 94 - from 16x16 to ~518x518 pixels (up to ~886x886 pixels): https://github.com/AlexeyAB/darknet/blob/4403e71b330b42d3cda1e0721fb645cf41bac14f/cfg/yolov3.cfg#L693
yolo Region 106 - from 8x8 to ~206x206 pixels (up to ~886x886 pixels): https://github.com/AlexeyAB/darknet/blob/4403e71b330b42d3cda1e0721fb645cf41bac14f/cfg/yolov3.cfg#L780
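To make the scale relationship concrete, here is a minimal illustration (my sketch, assuming the usual Yolo v3 subsampling strides of 32, 16 and 8 for Regions 82, 94 and 106): each [yolo] layer predicts on a grid of network_size/stride cells, and the stride is roughly the smallest object size that grid can represent well.

```c
/* My illustration, not darknet source: grid resolution per [yolo] layer.
 * Assumes a square network input and the standard Yolo v3 strides. */
#include <stdio.h>

int main(void)
{
    int net = 1024;               /* e.g. the width=height suggested below */
    int strides[] = {32, 16, 8};  /* yolo Regions 82, 94, 106              */
    for (int i = 0; i < 3; i++)
        printf("stride %2d -> %3dx%-3d grid, smallest optimal objects ~%2dx%2d px\n",
               strides[i], net / strides[i], net / strides[i],
               strides[i], strides[i]);
    return 0;
}
```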
You can change
this line: https://github.com/AlexeyAB/darknet/blob/6390a5a2ab61a0bdf6f1a9a6b4a739c16b36e0d7/cfg/yolov3.cfg#L720
to this: layers = -1, 4
and this line: https://github.com/AlexeyAB/darknet/blob/6390a5a2ab61a0bdf6f1a9a6b4a739c16b36e0d7/cfg/yolov3.cfg#L717
to this: stride=8
so that yolo Region 106 will detect from 2x2 to ~20x20 pixels (up to ~886x886 pixels); see the snippet below.
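For reference, after both edits the small-object branch of the cfg should look like this (it matches the modified cfg posted further down; the # comments are mine):

```
[upsample]
# was stride=2: with stride=8 the features are upsampled 8x, down to subsampling stride 2
stride=8

[route]
# was layers = -1, 61: now concatenates with layer 4, an early high-resolution feature map
layers = -1, 4
```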
Then set width=1024 height=1024 in the cfg-file, and set random=1 for each of the [yolo] layers.
Change this line: https://github.com/AlexeyAB/darknet/blob/6390a5a2ab61a0bdf6f1a9a6b4a739c16b36e0d7/src/detector.c#L132
to float random_val = rand_scale(2);
or float random_val = rand_scale(4);
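For context, rand_scale() is defined in darknet's src/utils.c; a simplified sketch is below. It returns a factor in [1/s, s], so rand_scale(2) lets the random=1 multi-scale training vary the network resolution between roughly half and double the cfg width/height:

```c
/* Simplified sketch of darknet's rand_scale() from src/utils.c:
 * returns a random resize factor in [1/s, s]. */
#include <stdlib.h>

static float rand_uniform(float lo, float hi)
{
    return lo + (hi - lo) * ((float)rand() / (float)RAND_MAX);
}

float rand_scale(float s)
{
    float scale = rand_uniform(1, s);   /* uniform factor in [1, s]  */
    if (rand() % 2) return scale;       /* half the time: scale up   */
    return 1.0f / scale;                /* otherwise: scale down     */
}
```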
And train for about 1,000,000 iterations on FDDB and WIDER FACE with only 1 class (face).
Then you can try to get AP for face by using ./darknet detector map ...
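For example (the file names here are hypothetical placeholders for your own data/cfg/weights):
./darknet detector map data/face.data cfg/yolov3-face.cfg backup/yolov3-face_final.weights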
@AlexeyAB
Thanks for the info. Will this type of configuration increase the number of false positives?
[net]
# Testing
#batch=1
#subdivisions=1
# Training
batch=64
subdivisions=4
width=1024
height=1024
channels=3
momentum=0.9
decay=0.0005
angle=0
saturation = 1.5
exposure = 1.5
hue=.1
learning_rate=0.001
burn_in=1000
max_batches = 500200
policy=steps
steps=400000,450000
scales=.1,.1
[convolutional]
batch_normalize=1
filters=32
size=3
stride=1
pad=1
activation=leaky
# Downsample
[convolutional]
batch_normalize=1
filters=64
size=3
stride=2
pad=1
activation=leaky
[convolutional]
batch_normalize=1
filters=32
size=1
stride=1
pad=1
activation=leaky
[convolutional]
batch_normalize=1
filters=64
size=3
stride=1
pad=1
activation=leaky
[shortcut]
from=-3
activation=linear
# Downsample
[convolutional]
batch_normalize=1
filters=128
size=3
stride=2
pad=1
activation=leaky
[convolutional]
batch_normalize=1
filters=64
size=1
stride=1
pad=1
activation=leaky
[convolutional]
batch_normalize=1
filters=128
size=3
stride=1
pad=1
activation=leaky
[shortcut]
from=-3
activation=linear
[convolutional]
batch_normalize=1
filters=64
size=1
stride=1
pad=1
activation=leaky
[convolutional]
batch_normalize=1
filters=128
size=3
stride=1
pad=1
activation=leaky
[shortcut]
from=-3
activation=linear
# Downsample
[convolutional]
batch_normalize=1
filters=256
size=3
stride=2
pad=1
activation=leaky
[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky
[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=leaky
[shortcut]
from=-3
activation=linear
[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky
[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=leaky
[shortcut]
from=-3
activation=linear
[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky
[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=leaky
[shortcut]
from=-3
activation=linear
[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky
[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=leaky
[shortcut]
from=-3
activation=linear
[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky
[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=leaky
[shortcut]
from=-3
activation=linear
[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky
[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=leaky
[shortcut]
from=-3
activation=linear
[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky
[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=leaky
[shortcut]
from=-3
activation=linear
[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky
[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=leaky
[shortcut]
from=-3
activation=linear
# Downsample
[convolutional]
batch_normalize=1
filters=512
size=3
stride=2
pad=1
activation=leaky
[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky
[convolutional]
batch_normalize=1
filters=512
size=3
stride=1
pad=1
activation=leaky
[shortcut]
from=-3
activation=linear
[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky
[convolutional]
batch_normalize=1
filters=512
size=3
stride=1
pad=1
activation=leaky
[shortcut]
from=-3
activation=linear
[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky
[convolutional]
batch_normalize=1
filters=512
size=3
stride=1
pad=1
activation=leaky
[shortcut]
from=-3
activation=linear
[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky
[convolutional]
batch_normalize=1
filters=512
size=3
stride=1
pad=1
activation=leaky
[shortcut]
from=-3
activation=linear
[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky
[convolutional]
batch_normalize=1
filters=512
size=3
stride=1
pad=1
activation=leaky
[shortcut]
from=-3
activation=linear
[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky
[convolutional]
batch_normalize=1
filters=512
size=3
stride=1
pad=1
activation=leaky
[shortcut]
from=-3
activation=linear
[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky
[convolutional]
batch_normalize=1
filters=512
size=3
stride=1
pad=1
activation=leaky
[shortcut]
from=-3
activation=linear
[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky
[convolutional]
batch_normalize=1
filters=512
size=3
stride=1
pad=1
activation=leaky
[shortcut]
from=-3
activation=linear
# Downsample
[convolutional]
batch_normalize=1
filters=1024
size=3
stride=2
pad=1
activation=leaky
[convolutional]
batch_normalize=1
filters=512
size=1
stride=1
pad=1
activation=leaky
[convolutional]
batch_normalize=1
filters=1024
size=3
stride=1
pad=1
activation=leaky
[shortcut]
from=-3
activation=linear
[convolutional]
batch_normalize=1
filters=512
size=1
stride=1
pad=1
activation=leaky
[convolutional]
batch_normalize=1
filters=1024
size=3
stride=1
pad=1
activation=leaky
[shortcut]
from=-3
activation=linear
[convolutional]
batch_normalize=1
filters=512
size=1
stride=1
pad=1
activation=leaky
[convolutional]
batch_normalize=1
filters=1024
size=3
stride=1
pad=1
activation=leaky
[shortcut]
from=-3
activation=linear
[convolutional]
batch_normalize=1
filters=512
size=1
stride=1
pad=1
activation=leaky
[convolutional]
batch_normalize=1
filters=1024
size=3
stride=1
pad=1
activation=leaky
[shortcut]
from=-3
activation=linear
######################
[convolutional]
batch_normalize=1
filters=512
size=1
stride=1
pad=1
activation=leaky
[convolutional]
batch_normalize=1
size=3
stride=1
pad=1
filters=1024
activation=leaky
[convolutional]
batch_normalize=1
filters=512
size=1
stride=1
pad=1
activation=leaky
[convolutional]
batch_normalize=1
size=3
stride=1
pad=1
filters=1024
activation=leaky
[convolutional]
batch_normalize=1
filters=512
size=1
stride=1
pad=1
activation=leaky
[convolutional]
batch_normalize=1
size=3
stride=1
pad=1
filters=1024
activation=leaky
[convolutional]
size=1
stride=1
pad=1
filters=18
activation=linear
[yolo]
mask = 6,7,8
anchors = 10,13, 16,30, 33,23, 30,61, 62,45, 59,119, 116,90, 156,198, 373,326
classes=1
num=9
jitter=.3
ignore_thresh = .7
truth_thresh = 1
random=1
[route]
layers = -4
[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky
[upsample]
stride=2
[route]
layers = -1, 61
[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky
[convolutional]
batch_normalize=1
size=3
stride=1
pad=1
filters=512
activation=leaky
[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky
[convolutional]
batch_normalize=1
size=3
stride=1
pad=1
filters=512
activation=leaky
[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky
[convolutional]
batch_normalize=1
size=3
stride=1
pad=1
filters=512
activation=leaky
[convolutional]
size=1
stride=1
pad=1
filters=18
activation=linear
[yolo]
mask = 3,4,5
anchors = 10,13, 16,30, 33,23, 30,61, 62,45, 59,119, 116,90, 156,198, 373,326
classes=1
num=9
jitter=.3
ignore_thresh = .7
truth_thresh = 1
random=1
[route]
layers = -4
[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky
[upsample]
stride=8
[route]
layers = -1, 4
[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky
[convolutional]
batch_normalize=1
size=3
stride=1
pad=1
filters=256
activation=leaky
[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky
[convolutional]
batch_normalize=1
size=3
stride=1
pad=1
filters=256
activation=leaky
[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky
[convolutional]
batch_normalize=1
size=3
stride=1
pad=1
filters=256
activation=leaky
[convolutional]
size=1
stride=1
pad=1
filters=18
activation=linear
[yolo]
mask = 0,1,2
anchors = 10,13, 16,30, 33,23, 30,61, 62,45, 59,119, 116,90, 156,198, 373,326
classes=1
num=9
jitter=.3
ignore_thresh = .7
truth_thresh = 1
random=1
Thanks for the info. Will this type of configuration increase the number of false positives?
Usually yes, it increases True Positives and False Positives.
@AlexeyAB
[root@localhost darknet]# ./darknet detector calc_anchors data/facedetection/face.data -num_of_clusters 9 -width 1024 -height 1024
num_of_clusters = 9, width = 1024, height = 1024
read labels from 126747 images
loaded image: 126747 box: 312094
all loaded.
calculating k-means++ ...
i = 41222, box_w = 0, box_h = 0, anchor_w = 1139964740, anchor_h = 1138366106, iou = -11.000000
i = 95110, box_w = 0, box_h = 0, anchor_w = -829826680, anchor_h = 96, iou = -26.000000
i = 194408, box_w = 0, box_h = 0, anchor_w = -829826680, anchor_h = 88, iou = -2.000000
avg IoU = 57.68 %
Saving anchors to the file: anchors.txt
anchors = 15.6696,27.8704, 69.0891,114.4718, 213.6109,267.2336, 365.7014,360.9078, 303.5794,528.7938, 484.9630,489.6827, 436.1766,743.6798, 605.8227,617.4738, 729.0613,799.2632
Can you please explain more about these values and the anchors generated for the dataset?
Thanks.
i = 41222, box_w = 0, box_h = 0, anchor_w = 1139964740, anchor_h = 1138366106, iou = -11.000000
i = 95110, box_w = 0, box_h = 0, anchor_w = -829826680, anchor_h = 96, iou = -26.000000
i = 194408, box_w = 0, box_h = 0, anchor_w = -829826680, anchor_h = 88, iou = -2.000000
You have at least 3 labels with width=0 and height=0 in your dataset (the huge anchor_w/anchor_h numbers above are likely garbage values produced by these zero-size boxes): https://github.com/AlexeyAB/darknet/blob/6390a5a2ab61a0bdf6f1a9a6b4a739c16b36e0d7/src/detector.c#L999-L1002
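To locate the broken boxes, a quick hypothetical checker (my sketch, not part of darknet) can scan YOLO-format label files and report any box whose width or height is not strictly positive:

```c
/* Hypothetical helper, not part of darknet: pass .txt label files on the
 * command line; prints every box with w <= 0 or h <= 0, i.e. the labels
 * that break calc_anchors. */
#include <stdio.h>

int main(int argc, char **argv)
{
    for (int a = 1; a < argc; a++) {
        FILE *f = fopen(argv[a], "r");
        if (!f) { fprintf(stderr, "cannot open %s\n", argv[a]); continue; }
        int id;
        float x, y, w, h;
        while (fscanf(f, "%d %f %f %f %f", &id, &x, &y, &w, &h) == 5) {
            if (w <= 0.0f || h <= 0.0f)
                printf("%s: class %d has w=%g h=%g\n", argv[a], id, w, h);
        }
        fclose(f);
    }
    return 0;
}
```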
Can you please explain more about these values and the anchors generated for the dataset?
Anchors are the initial sizes of objects. Yolo v3 selects the anchor that most closely corresponds to the current object, and then just adjusts this anchor to the size of the object, instead of predicting the absolute size of the object.
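To illustrate (my sketch based on the box equations in the Yolo v3 paper, not the verbatim darknet source): the network predicts offsets (tx, ty, tw, th) relative to the grid cell (cx, cy) and the chosen anchor (pw, ph), so the anchor is only a starting size that gets rescaled:

```c
/* Sketch of Yolo v3 box decoding (per the paper): the anchor (pw, ph)
 * is adjusted by exp(tw), exp(th) rather than the network predicting
 * absolute sizes. Compile with -lm. */
#include <math.h>

typedef struct { float x, y, w, h; } box;

static float sigmoidf(float v) { return 1.0f / (1.0f + expf(-v)); }

box decode_box(float tx, float ty, float tw, float th,
               float cx, float cy, float pw, float ph)
{
    box b;
    b.x = cx + sigmoidf(tx);  /* center x within the grid     */
    b.y = cy + sigmoidf(ty);  /* center y within the grid     */
    b.w = pw * expf(tw);      /* anchor width scaled by e^tw  */
    b.h = ph * expf(th);      /* anchor height scaled by e^th */
    return b;
}
```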