AlexeyAB / darknet

YOLOv4 / Scaled-YOLOv4 / YOLO - Neural Networks for Object Detection (Windows and Linux version of Darknet)
http://pjreddie.com/darknet/

Negative and Positive Samples in Same Sample #1400

Open silvernine209 opened 6 years ago

silvernine209 commented 6 years ago

If I have "pictureA.jpg" with dog(class 0) and person(class 1), I will have "pictureA.txt" with something like :

0 0.716797 0.395833 0.216406 0.147222 1 0.687109 0.379167 0.255469 0.158333

Now, can I include cat (class 2) without any box coordinates, to let the training know that it is a negative sample (class)? Something like this:

0 0.716797 0.395833 0.216406 0.147222 1 0.687109 0.379167 0.255469 0.158333 2

I'm trying to improve training by following the recommendation below, but I'm not sure how best to execute it.

desirable that your training dataset include images with non-labeled objects that you do not want to detect - negative samples without bounded box (empty .txt files) - use as many images of negative samples as there are images with objects

AlexeyAB commented 6 years ago

Now, can I include cat (class 2) without any box coordinates, to let the training know that it is a negative sample (class)? Something like this:

0 0.716797 0.395833 0.216406 0.147222 1 0.687109 0.379167 0.255469 0.158333 2

You shouldn't do it.

Negative samples are images without any objects, each with an empty label txt-file.
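As an illustration (not part of the original reply), a minimal sketch in Python that creates the empty label .txt files for a folder of negative images; the negatives/ directory name is an assumption:

import pathlib

# Every image that contains no objects to detect gets an empty .txt label file,
# so darknet treats it as a negative sample during training.
negatives_dir = pathlib.Path("negatives")   # hypothetical folder of negative images
for img in negatives_dir.glob("*.jpg"):
    label = img.with_suffix(".txt")
    if not label.exists():
        label.touch()   # empty file = no labeled objects in this image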

silvernine209 commented 6 years ago

@AlexeyAB Thank you for the prompt clarification!

I was looking into ways to improve and speed up training, since I have 1.6 million images and 500 classes, which would take about 500 * 2,000 = 1,000,000 iterations for decent performance.

I was initially doing transfer learning, but I estimated about a month of training on my GTX 1070, so I switched to fine-tuning. I'm currently at 30,000 iterations, and below is the general progress:

I obtained new anchors as you recommended, and the annotation boxes, which were drawn by professional annotators, are correct, since they were provided by Google for a Kaggle competition.

Given the above, does the progress look normal to you for this size of dataset and number of classes? All parameters in the .cfg are at defaults for the fine-tuning process. Do you recommend any tweaks (maybe to the learning rate) to speed up the process? I will be satisfied even with mAP around 30%.

Thank you for your time.

AlexeyAB commented 6 years ago

I was looking into ways to improve and speed training ...

I switched to fine tuning.

From 20,000 to 30,000 iterations, IoU is ~20% and mAP increased from ~2% to ~3%.

silvernine209 commented 6 years ago
  1. For my case, fine-tuning seems to be doing great so far:

[screenshot: training progress chart]

  1. I added stopbackward=1 at line 548, above the ####### marker, as recommended.
  2. I'm using yolov3.cfg.
  3. I will double check using Yolo_mark, but I think all annotations were converted correctly. All annotation bboxes were given as x_min, y_min, x_max, and y_max, with the zero coordinate at the bottom-left of the image. I obtained x_center, y_center, width, and height as: x_center = x_min + width/2, y_center = y_min + height/2, width = abs(x_max - x_min), height = abs(y_max - y_min) (a conversion sketch follows this list).
  4. I'm currently at 23,100 iterations of fine-tuning, at 23.22% IoU and 5.34% mAP.
  5. New anchors = 19,26, 46,82, 66,173, 137,94, 94,301, 180,210, 353,184, 212,355, 381,374
  6. random=1 was used. The resolution has not been changed yet for detection; I will do so at the end of training. Anchors were recalculated. I wasn't able to verify that every object is labeled, since the dataset has 1.6 million images, but Google did a very good job annotating as much as possible. Nothing else has been done.
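A minimal sketch of the conversion described in item 3, assuming corner coordinates already normalized to 0..1 and a top-left origin (which is what darknet label files expect); the function name and example values are illustrative:

def to_yolo_box(x_min, y_min, x_max, y_max):
    # Corner coordinates are assumed normalized to 0..1 with the origin at the TOP-left.
    width = abs(x_max - x_min)
    height = abs(y_max - y_min)
    x_center = x_min + width / 2
    y_center = y_min + height / 2
    return x_center, y_center, width, height

# Example: write one object per line of the darknet label file.
class_id, box = 0, (0.60, 0.32, 0.82, 0.47)   # illustrative values
x, y, w, h = to_yolo_box(*box)
print(f"{class_id} {x:.6f} {y:.6f} {w:.6f} {h:.6f}")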

Here is the .cfg file if you would like to take a look:

[net]
# Testing
#batch=1
#subdivisions=1
# Training
batch=64
subdivisions=32
width=416
height=416
channels=3
momentum=0.9
decay=0.0005
angle=0
saturation = 1.5
exposure = 1.5
hue=.1

learning_rate=0.001
burn_in=1000
max_batches = 100000
policy=steps
steps=400000,450000
scales=.1,.1

[convolutional]
batch_normalize=1
filters=32
size=3
stride=1
pad=1
activation=leaky

# Downsample

[convolutional]
batch_normalize=1
filters=64
size=3
stride=2
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=32
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=64
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

# Downsample

[convolutional]
batch_normalize=1
filters=128
size=3
stride=2
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=64
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=128
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

[convolutional]
batch_normalize=1
filters=64
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=128
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

# Downsample

[convolutional]
batch_normalize=1
filters=256
size=3
stride=2
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

# Downsample

[convolutional]
batch_normalize=1
filters=512
size=3
stride=2
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=512
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=512
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=512
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=512
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=512
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=512
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=512
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=512
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

# Downsample

[convolutional]
batch_normalize=1
filters=1024
size=3
stride=2
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=512
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=1024
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

[convolutional]
batch_normalize=1
filters=512
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=1024
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

[convolutional]
batch_normalize=1
filters=512
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=1024
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

[convolutional]
batch_normalize=1
filters=512
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=1024
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear
stopbackward=1
######################

[convolutional]
batch_normalize=1
filters=512
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
size=3
stride=1
pad=1
filters=1024
activation=leaky

[convolutional]
batch_normalize=1
filters=512
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
size=3
stride=1
pad=1
filters=1024
activation=leaky

[convolutional]
batch_normalize=1
filters=512
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
size=3
stride=1
pad=1
filters=1024
activation=leaky

[convolutional]
size=1
stride=1
pad=1
filters=1515
activation=linear

[yolo]
mask = 6,7,8
anchors = 19,26,  46,82,  66,173,  137,94,  94,301,  180,210,  353,184,  212,355,  381,374
classes=500
num=9
jitter=.3
ignore_thresh = .7
truth_thresh = 1
random=1

[route]
layers = -4

[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky

[upsample]
stride=2

[route]
layers = -1, 61

[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
size=3
stride=1
pad=1
filters=512
activation=leaky

[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
size=3
stride=1
pad=1
filters=512
activation=leaky

[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
size=3
stride=1
pad=1
filters=512
activation=leaky

[convolutional]
size=1
stride=1
pad=1
filters=1515
activation=linear

[yolo]
mask = 3,4,5
anchors = 19,26,  46,82,  66,173,  137,94,  94,301,  180,210,  353,184,  212,355,  381,374
classes=500
num=9
jitter=.3
ignore_thresh = .7
truth_thresh = 1
random=1

[route]
layers = -4

[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky

[upsample]
stride=2

[route]
layers = -1, 36

[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
size=3
stride=1
pad=1
filters=256
activation=leaky

[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
size=3
stride=1
pad=1
filters=256
activation=leaky

[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
size=3
stride=1
pad=1
filters=256
activation=leaky

[convolutional]
size=1
stride=1
pad=1
filters=1515
activation=linear

[yolo]
mask = 0,1,2
anchors = 19,26,  46,82,  66,173,  137,94,  94,301,  180,210,  353,184,  212,355,  381,374
classes=500
num=9
jitter=.3
ignore_thresh = .7
truth_thresh = 1
random=1
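One sanity check on the cfg above: the filters value of the convolutional layer directly before each [yolo] layer must equal (classes + 5) multiplied by the number of anchor masks per layer, which is why the three final layers all use filters=1515:

# filters before each [yolo] layer = (classes + 5) * masks per layer
classes = 500
masks_per_layer = 3          # each [yolo] layer uses 3 of the 9 anchors
assert (classes + 5) * masks_per_layer == 1515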

silvernine209 commented 6 years ago

@AlexeyAB I think I found what I was doing wrong. What a dummy! For y_center, I did y_min + height/2 on a bottom-left zero coordinate system. To correct this, I just need to do y_center = 1 - (y_min + height/2). I wasted days! Thank you so much, Alexey, for helping me realize this. I'm sure training will go a lot better after I fix y_center in all files.

AlexeyAB commented 6 years ago

I will double check using Yolo_mark, but I think all annotations were converted correctly. All annotation bboxes were given as x_min, y_min, x_max, and y_max, with the zero coordinate at the bottom-left of the image. I obtained x_center, y_center, width, and height as: x_center = x_min + width/2, y_center = y_min + height/2, width = abs(x_max - x_min), height = abs(y_max - y_min).

@AlexeyAB I think I found what I was doing wrong. What a dummy! For y_center, I did y_min + height/2 on a bottom-left zero coordinate system. To correct this, I just need to do y_center = 1 - (y_min + height/2). I wasted days! Thank you so much, Alexey, for helping me realize this. I'm sure training will go a lot better after I fix y_center in all files.

Do you mean that in this Google dataset the coords [x_min, y_min] are the bottom-left corner instead of the top-left corner?

Did you check y_center = y_min + height/2 versus y_center = 1 - (y_min + height/2) by using Yolo_mark?

silvernine209 commented 6 years ago

Yes, in the Google dataset, the coords [x_min, y_min] are the bottom-left corner instead of the top-left corner.

I compiled Yolo_mark but couldn't get it to work. However, using the airplane and bird examples in the Yolo_mark repo, I verified that all of my y_center values were wrong, since I had used y_center = y_min + height/2, and indeed y_center = 1 - (y_min + height/2) is correct.
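A one-line sketch of that correction, assuming normalized coordinates measured from a bottom-left origin (the function name is illustrative):

def y_center_top_origin(y_min, height):
    # y_min and height are normalized to 0..1 and measured from the bottom-left origin;
    # darknet expects y_center measured from the top of the image, hence the flip.
    return 1.0 - (y_min + height / 2)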

I'm updating all the text files for the training and validation images, and will try to get some results tonight. I will report on the progress, but I expect a night-and-day difference compared to:

[screenshot: earlier training progress chart]

lp-094 commented 5 years ago

Should the txt file contain nothing (i.e., be completely empty)? Thank you.