AlexeyAB / darknet

YOLOv4 / Scaled-YOLOv4 / YOLO - Neural Networks for Object Detection (Windows and Linux version of Darknet )
http://pjreddie.com/darknet/

Raspberry Pi YOLO Training #289

Closed WTeichert closed 6 years ago

WTeichert commented 6 years ago

Greetings everyone,

I am in the middle of my student research project. For it I am creating an object detection and classification system that fits on a Pi. I am using YAD2K running on the Pi, because it has lower computational demands. I plan to train my network on VOC with different training cfgs.

I am asking you for any advice, tips, or tricks I can use.

So far I will change:

I also have a few questions: What does activation: leaky or linear do? saturation/exposure are always the same - what do they do?

Thank you for all inspiration! :)

AlexeyAB commented 6 years ago

Hi,

height and width (to a minimum of 224), possible? What should I do with the anchors, divide them by 2? Do I have to add "resize_network(nets + i, nets[i].w, nets[i].h);" in detector.c lines 40-41?

Change those lines to these, for resolution ~224x224:

```
int dim = (rand() % 5 + 5) * 32;
if (get_current_batch(net)+100 > net.max_batches) dim = 224;
```
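To see which resolutions those two lines actually produce, here is a small Python sketch (the function name is mine) of the same logic:

```python
import random

def random_dim(current_batch, max_batches):
    """Mimic darknet's multi-scale training pick for ~224x224:
    a width/height from {160, 192, 224, 256, 288},
    fixed to 224 for the last ~100 iterations."""
    dim = (random.randrange(5) + 5) * 32  # (rand() % 5 + 5) * 32
    if current_batch + 100 > max_batches:
        dim = 224
    return dim

# early in training: any of the 5 multiples of 32 around 224
print(sorted({random_dim(0, 45000) for _ in range(1000)}))
# near the end of training: always 224
print(random_dim(44950, 45000))  # 224
```

So the network still trains at varying resolutions, just centred on 224 instead of the default 416.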

What does activation: leaky or linear do?
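A minimal Python sketch of what these two activations compute, assuming darknet's usual leaky slope of 0.1 (function bodies are illustrative, not darknet's C code):

```python
def linear(x):
    # linear: output equals input (identity); used for the final
    # detection layer so raw coordinates/scores are not squashed
    return x

def leaky(x):
    # leaky ReLU: positives pass through, negatives are scaled by 0.1,
    # so units keep a small gradient instead of going completely dead
    return x if x > 0 else 0.1 * x

print(leaky(2.0), leaky(-2.0), linear(-2.0))  # 2.0 -0.2 -2.0
```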


saturation/exposure are always the same, what do they do?

saturation, exposure, and hue are ranges for random changes to the colours of images during training (data augmentation parameters), in terms of HSV: https://en.wikipedia.org/wiki/HSL_and_HSV The larger the value, the more invariant the neural network becomes to changes in the lighting and colour of the objects. More: https://github.com/AlexeyAB/darknet/issues/279#issuecomment-347002399
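As an illustration only (not darknet's actual code), a Python sketch of this kind of HSV jitter on a single RGB pixel, with saturation/exposure factors drawn from [1/1.5, 1.5] and the hue shifted by up to ±0.1:

```python
import colorsys
import random

def rand_scale(s):
    """Pick a factor in [1/s, s], darknet-augmentation style:
    half the time scale up, half the time scale down."""
    scale = random.uniform(1.0, s)
    return scale if random.random() < 0.5 else 1.0 / scale

def jitter_hsv(r, g, b, saturation=1.5, exposure=1.5, hue=0.1):
    """Randomly shift hue and scale saturation/value of one RGB pixel
    (all channels in [0, 1]); clamp back into the valid range."""
    h, s, v = colorsys.rgb_to_hsv(r, g, b)
    h = (h + random.uniform(-hue, hue)) % 1.0   # hue wraps around
    s = min(s * rand_scale(saturation), 1.0)
    v = min(v * rand_scale(exposure), 1.0)
    return colorsys.hsv_to_rgb(h, s, v)

print(jitter_hsv(0.2, 0.6, 0.4))  # a slightly recoloured pixel
```

The larger the cfg values, the wider these random ranges, and the more colour variation the network sees during training.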

WTeichert commented 6 years ago

Thank you a lot!

Of course I watched the training first. There I came to the point that Darknet19 448x448 should be used. You've written that "This model performs significantly better but is slower since the whole image is larger.". Since I need to speed up and tighten the whole algorithm, I want to use darknet19 in its basic configuration.

Now my question: where can I get these darknet19.conv.xx for training? Maybe I can use yolo-voc-tiny.weights as my base, like backup training? (But my cfg changed in a few lines.)

And one more question: I got access to a computation centre where I have 4 CPUs and 2 GPUs I can use. As I've read on your page, YOLOv2 is not made for multi-CPU. Have there been any changes? Is it helpful that TensorFlow is configured for multi-CPU?

AlexeyAB commented 6 years ago
WTeichert commented 6 years ago

Thank you again ^^ This partial training sounds interesting! I take a trained weight and use it as my pre-trained base? Where does the 13 come from (= number of layers - last "class" layer)? Can I use my own cfg, or do I need to use tiny-yolo-voc.cfg? Wouldn't I have problems when they are different?
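For reference, a cut-down weights file like this is usually produced with this repo's partial command, where the number at the end is the layer at which the weights are cut (the filenames here are just examples):

```
darknet.exe partial cfg/tiny-yolo-voc.cfg tiny-yolo-voc.weights tiny-yolo-voc.conv.13 13
```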

A little misunderstanding:

I am looking for darknet19 not trained on 448x448, i.e. the previous version of it! I want to use it all, so 4 CPUs + 2 GPUs for training.

For detection I have to look further to get the best out of a Raspberry Pi.

AlexeyAB commented 6 years ago
WTeichert commented 6 years ago

Hey, first of all, thank you for your time. I am now done with the training runs (had some other stuff to do), but it didn't work out like I thought.

I tried to train on Pascal VOC and followed your instructions; it all went fine. Not sure if it matters: I chose the pre-trained model darknet19_448.conv.23 instead of darknet53.conv.74 (I think this was changed by you?). You can see my cfg1 below. After 45,000 iterations it mostly detects chairs, no matter whether the object is a person or a dog or whatever. For cfg4 I just changed width+height to 608 and multiplied the anchors by 4 -> there are no detections at all, and IOU and recall are 0 when I try to validate the weights.

Did I miss something, or is it just a network conflict, in that the parameters don't fit the dataset? An overview of all the cfgs I trained:

cfg_overview.pdf

```
[net]
batch=64
subdivisions=64
width=224
height=224
channels=3
momentum=0.9
decay=0.0005
angle=0
saturation = 1.5
exposure = 1.5
hue=.1

learning_rate=0.001
max_batches = 45000
policy=steps
steps=100,25000,35000
scales=.1,.1,.1

[convolutional]
batch_normalize=1
filters=16
size=3
stride=1
pad=1
activation=leaky

[maxpool]
size=2
stride=2

[convolutional]
batch_normalize=1
filters=32
size=3
stride=1
pad=1
activation=leaky

[maxpool]
size=2
stride=2

[convolutional]
batch_normalize=1
filters=64
size=3
stride=1
pad=1
activation=leaky

[maxpool]
size=2
stride=2

[convolutional]
batch_normalize=1
filters=128
size=3
stride=1
pad=1
activation=leaky

[maxpool]
size=2
stride=2

[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=leaky

[maxpool]
size=2
stride=2

[convolutional]
batch_normalize=1
filters=512
size=3
stride=1
pad=1
activation=leaky

[maxpool]
size=2
stride=1

[convolutional]
batch_normalize=1
filters=1024
size=3
stride=1
pad=1
activation=leaky

###########

[convolutional]
batch_normalize=1
size=3
stride=1
pad=1
filters=1024
activation=leaky

[convolutional]
size=1
stride=1
pad=1
filters=125
activation=linear

[region]
anchors = 0.54,0.60, 1.71,2.2, 3.32,5.69, 4.71,2.55, 8.31,5.26
bias_match=1
classes=20
coords=4
num=5
softmax=1
jitter=.2
rescore=1

object_scale=5
noobject_scale=1
class_scale=1
coord_scale=1

absolute=1
thresh = .6
random=0
```
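One consistency check worth noting on a cfg like this: for a Yolo v2 [region] head, the filters= of the last convolutional layer must equal num * (classes + coords + 1). A small Python sketch (values taken from the cfg above; the helper name is mine):

```python
def region_filters(classes, num, coords=4):
    # last-layer filter count for a Yolo v2 [region] head:
    # per anchor box: 4 box coords + 1 objectness + one score per class
    return num * (classes + coords + 1)

print(region_filters(classes=20, num=5))  # 125, matching the cfg above
print(region_filters(classes=11, num=5))  # 80, for an 11-class VOC subset
```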

AlexeyAB commented 6 years ago

@WTeichert

  1. darknet53.conv.74 should be used only if your cfg-file is based on yolov3.cfg (only for Yolo v3). But if your cfg-file is based on tiny-yolo-voc.cfg, yolov2-tiny-voc.cfg, yolo-voc.2.0.cfg, or yolov2-voc.cfg (Yolo v2), then you should use darknet19_448.conv.23

  2. On what cfg-file did you base your cfg-file?

  3. Did you try to train yolo on the CPU?

  4. Did you get any good results, or are the results of all the training runs bad?

  5. How many iterations did you train?

WTeichert commented 6 years ago
  1. Then I was right
  2. Based on tiny-yolo-voc.cfg
  3. Trained on GPU; the CPU was way too slow (50 min for 1k iterations)
  4. No, all are trash
  5. 45k-60k, see max_batches. Also, I trained on Windows
AlexeyAB commented 6 years ago
WTeichert commented 6 years ago

For the next days I have no more access to this data, so further data can be sent on Monday.

AlexeyAB commented 6 years ago

Use this line (with your files: data, cfg, weights): darknet.exe detector map data/obj.data cfg/yolo_obj.cfg yolo-obj.weights

And run it.

Also, what command do you use for training?

WTeichert commented 6 years ago

I've tried both ways: first with my data and cfg, and second with your tiny-voc and voc. It didn't work.

The command for training is written in train.cmd: darknet.exe detector train data/cfg1.data cfg/cfg1.cfg darknet19_448.conv.23

AlexeyAB commented 6 years ago

@WTeichert This command: darknet.exe detector map data/cfg1.data cfg/cfg1.cfg cfg1_40000.weights can't give this error, because there is nothing from Python in it.

What error does this command give?

WTeichert commented 6 years ago

Ahh, a little misunderstanding. I tried using map, but nothing happened (as far as I can see), so I chose calc_mAP_voc_py.cmd and changed line 8 to my files and line 9 to my VOC dir.

The error was at voc_eval_py3.py, line 157.

Could it be that I chose the learning rate too small, so the network doesn't learn anything new from the new input size? Or could the change in anchors have led to this mistake?

AlexeyAB commented 6 years ago

Attach a screenshot of the "nothing happened" that happens after this command (you should wait, sometimes 10 minutes, while mAP is calculated): darknet.exe detector map data/cfg1.data cfg/cfg1.cfg cfg1_40000.weights

I don't know if there is any mistake. I can't say anything without mAP. What repo did you use for training?

WTeichert commented 6 years ago

"Nothing happens" means nothing I can see directly. I am not sure which repo I am using, but since the introduction says "If you use another GitHub repository, then use darknet.exe detector recall... instead of darknet.exe detector map", I tried both; map did not give me any visual output - the cmd just ended and the console was waiting for the next command; nothing happened (that is what I meant; there is no screenshot of it). So I chose recall to check IOU and recall, and I got a maximum IOU of 28%. I cloned this GitHub repo we're talking on and followed the instructions for Windows, so I thought it should be this repo... When I use darknet.exe detector valid, it creates a lot of blank class files in results. With yolo-voc, by contrast, they were full of entries and detections. That's all I can say about mAP.

The problem is that I am not into C programming, only a little Python, so I understand detector.c - the compilation and the functions in C - only at a very high level.

I am not at the office these days, so I can try once more on Tuesday, but I don't think it will change the results.

WTeichert commented 6 years ago

(image) The first command was with recall instead of map; the second doesn't give any result as far as I can see.

(image) The first command was with valid instead of recall and created the files in results.

AlexeyAB commented 6 years ago

@WTeichert Try to update your code from this repo.

WTeichert commented 6 years ago

Done, but same error. Was something changed in the detector? I cannot update the darknet version, since my MSVS license expired.

AlexeyAB commented 6 years ago

You should recompile the code in MSVS after your repo is updated. Yes, Yolo v3 was added, plus fused batch_norm (+7% speedup), anchor calculation and mAP, AVX on CPU (+20% speedup), and many other things... You can install the free MSVS2015 Community that I use: https://go.microsoft.com/fwlink/?LinkId=532606&clcid=0x409

WTeichert commented 6 years ago

Finally map works! It needed a few tries with CUDA 9.1, 9.0, 8.0 and their cuDNN libraries because this error occurred (image: errorcuda81). I solved it by creating a new repo instead of updating.

(image) That was the cfg which only gave me chairs as output.

(image) That was the cfg with 0 IOU and recall and no detections.

The difference between them: width/height was 224 in the first cfg and 608 in the second.

WTeichert commented 6 years ago

I just checked again the difference between my cfg and tiny-yolo-voc. I changed anchors, width, and height, deleted comments, and there are these lines: steps=-1,100,20000,30000 scales=.1,10,.1,.1 Instead I chose: steps=100,25000,35000 scales=.1,.1,.1

because I did not understand the -1.

AlexeyAB commented 6 years ago

@WTeichert It's a bad mAP result. Check your dataset using Yolo_mark. And use these lines:

learning_rate=0.0001
max_batches = 45000
policy=steps
steps=100,25000,35000
scales=10,.1,.1
WTeichert commented 6 years ago

Ah, found it, but I used the Pascal VOC dataset; do I need to mark bounding boxes?

@AlexeyAB I've done the check. The labels are not correctly assigned; e.g. persons are chairs, cats are boats. This should be connected to the voc.names list, am I right?

But the bounding boxes are all right!

And that doesn't explain why detection doesn't work. Should I train a new network with these lines? learning_rate=0.0001 max_batches = 45000 policy=steps steps=100,25000,35000 scales=10,.1,.1

That would be sad... not to know why it doesn't work and to just try again...

WTeichert commented 6 years ago

(image) These are the results of the yolov2-tiny-voc weights... there should be an error somewhere else. How does mAP depend on the names list? Could the order of names cause this error?

If I compare the labels folder of the VOC labels with the voc.names file, there are differences: 5 and 8 should be dog and person, while in voc.names those are bus and chair.

But when I validate tiny-voc with some example pictures, it is pretty good.
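A small Python sketch of why the order matters: the label files store only class indices, so the names file is just a lookup table. If the indices were written in one order and voc.names lists names in another, an index silently changes meaning (the class lists here are shortened and rearranged purely for illustration):

```python
# order used when the label files were generated (e.g. voc_label.py's class list)
train_order = ["aeroplane", "bicycle", "bird", "boat", "bottle",
               "dog", "cat", "horse", "person"]
# a differently ordered names file used at detection time
names_order = ["aeroplane", "bicycle", "bird", "boat", "bottle",
               "bus", "car", "chair", "dog"]

idx = train_order.index("dog")  # the label files say "5" for dog
print(names_order[idx])         # displayed as "bus" - wrong name,
                                # even though the boxes themselves are correct
```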

AlexeyAB commented 6 years ago
WTeichert commented 6 years ago

@AlexeyAB

Btw., why do I get completely different IOU and recall with the commands ... detector map ... and ... detector ... recall? (image)

WTeichert commented 6 years ago

@AlexeyAB I tried to train with the changed learning rate and the anchors set back to standard. But the result is again 0 detections, and mAP is 0 too. Did you use the Windows method to train tiny-yolo-voc.cfg? Is your version different from the repo? How many iterations did you train? Which command did you use for training?

AlexeyAB commented 6 years ago

@WTeichert I have trained many models on both Windows and Linux using this repo. It works fine.

WTeichert commented 6 years ago

@AlexeyAB I found a mistake in my train.txt. Now I get 56.21% mAP for yolov2-tiny-voc.

I tried to train with 11 classes of VOC, so I shortened the class list in voc_label.py and voc.names. I also set the number of classes to 11 in voc.data and the cfg, and the last filters to 80. Again I get 0 mAP.

What was your average loss in training? I am always around 0.5, which seems pretty high.

AlexeyAB commented 6 years ago

@WTeichert About ~0.5

WTeichert commented 6 years ago

Ok, I found the problems. It was some mess in the voc.data and label .txt files.

But I am still wondering why the cfg of tiny-yolo-voc starts steps with -1. If you can explain that to me, I won't ask anything anymore :D

AlexeyAB commented 6 years ago

A step of -1 means that the 1st scale 0.1 is applied immediately. It was left in just for some experiments.

This: https://github.com/AlexeyAB/darknet/blob/5e3dcb6f34868e341466b57b13ad63f86b337250/cfg/yolov2-tiny-voc.cfg#L18-L22 is the same as a reduced learning_rate with the 1st step/scale removed:

learning_rate=0.0001
max_batches = 40200
policy=steps
steps=100,20000,30000
scales=10,.1,.1

Because net.steps[0] = -1 is never greater than batch_num (i.e. -1 > 0 is false), the loop does not return early and the 1st scale is applied immediately: https://github.com/AlexeyAB/darknet/blob/5e3dcb6f34868e341466b57b13ad63f86b337250/src/network.c#L94-L101
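The whole policy=steps logic fits in a few lines; a Python sketch with the same semantics as that loop in network.c (names are mine):

```python
def current_rate(base_rate, steps, scales, batch_num):
    """policy=steps: multiply the base rate by every scale whose
    step has already been reached; stop at the first future step."""
    rate = base_rate
    for step, scale in zip(steps, scales):
        if step > batch_num:
            return rate
        rate *= scale
    return rate

# tiny-yolo-voc schedule: steps=-1,100,20000,30000  scales=.1,10,.1,.1
sched = lambda b: current_rate(0.001, [-1, 100, 20000, 30000],
                               [0.1, 10, 0.1, 0.1], b)
print(sched(0))      # 0.0001 - the -1 step fires immediately
print(sched(500))    # 0.001  - back up after batch 100
print(sched(25000))  # 0.0001 - first decay step
print(sched(35000))  # ~1e-05 - second decay step
```

Which is why starting with learning_rate=0.0001 and scales=10,.1,.1 gives exactly the same schedule without the -1 step.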

WTeichert commented 6 years ago

Thank you so much for your help! My research is done and it all went well.

Changing the number of filters per layer and reducing the resolution brings the best performance out of a Pi! Models, classes, and greyscale are not that easy to manipulate; it depends on the dataset. random should be enabled - the performance decrease is minimal.

AlexeyAB commented 6 years ago

@WTeichert Can you attach your resulting cfg-file?

WTeichert commented 6 years ago

Sorry, I lost the originals in a system reset and have only the converted h5 files...

Those were the results I found. Performance on the Pi increased from 4 s to 1 s per picture with a pretty good mAP. The main focus of my work was on frame rate. I hope this information can help.

Here is the h5 file with the changed number of layers, based on COCO: COCOh5.zip

Here is the h5 file with the changed number of filters per layer, based on COCO: COCOh5_2.zip