MyVanitar closed this issue 7 years ago
@VanitarNordic Hi,
You can add `printf("%d, %d, %d, %d \n", left, right, top, bot);` here: https://github.com/AlexeyAB/darknet/blob/master/src/image.c#L219

or also add:

```c
int x_center = b.x * im.w;
int y_center = b.y * im.h;
int width    = b.w * im.w;
int height   = b.h * im.h;
```
A training guide is still in progress: https://groups.google.com/d/msg/darknet/0ksFU91emmc/QMEO0HnHAgAJ
Thank you very much.
Do you know how we can add live video camera support instead of an image as input? You mentioned a camera installed on a network (accessible by IP), but I mean host-connected cameras such as an internal webcam, USB3, and similar.
@VanitarNordic
Yes, for WebCamera number 0 you can use: `darknet.exe detector demo data/voc.data yolo-voc.cfg yolo-voc.weights -c 0`
@VanitarNordic
How can I train the Yolo2 for my own desired objects?
Now you can train Yolo v2 by using following instructions: https://github.com/AlexeyAB/darknet#how-to-train-pascal-voc-data
Original for Linux: http://pjreddie.com/darknet/yolo/#train-voc
Thank you, gentleman.
I read that briefly, but as I understand it, it is about regenerating the training data file based on VOC. What if we have our own selection of 1000 discrete image files (which contain variations of a desired object among other objects) and decide to train Yolo v2 with these?
I mean training with our own image files from scratch.
@VanitarNordic
To train for your 2 objects:

1. Copy `yolo-voc.cfg` to `yolo-obj.cfg` and change the line `classes=20` to `classes=2`.
2. Create file `obj.names` with the 2 object names, each on a new line.
3. Create file `train.txt` with the filenames of your images, each on a new line.
4. Create file `obj.data` containing:
```
classes = 2
train = train.txt
valid = test.txt
names = obj.names
backup = backup/
```
5. For each image, create a file with the same name but the `.txt` extension, and put into it one line per object on that image: `<object-class> <x> <y> <width> <height>`, where the values are floats relative to the width and height of the image (attention: x, y are the centers of the rectangle). For example, for img1.jpg you create img1.txt containing:
```
1 0.716797 0.395833 0.216406 0.147222
0 0.687109 0.379167 0.255469 0.158333
1 0.420312 0.395833 0.140625 0.166667
```
6. Download the pre-trained weights for the convolutional layers (76 MB): http://pjreddie.com/media/files/darknet19_448.conv.23 and put the file in the directory build\darknet\x64.
7. Run training: `darknet.exe detector train obj.data yolo-obj.cfg darknet19_448.conv.23`
Thank you again Alexey.
I have some more questions:
1) In step 1 you mentioned: "Copy yolo-voc.cfg to yolo-obj.cfg and ...". Do you mean replacing the "yolo-voc.cfg" file with "yolo-obj.cfg"?
2) In step 4, do you mean just creating a file which contains that information?
3) In step 5, do you know any tool which generates such annotation files? OpenCV has such a tool, but it produces annotation files differently (x, y are the top-left coordinates and they are integer values).
@VanitarNordic
I mean you should create a new file "yolo-obj.cfg" with the same content as "yolo-voc.cfg", but with only one change: classes=2
Yes.
No, I don't know such software. Which tool in OpenCV do you mean, can you give a link?
Also you can ask about it here: https://groups.google.com/forum/#!forum/darknet
@VanitarNordic
Also you should change `filters=(classes + 5)*5` in your yolo-voc.cfg
I added "How to train (to detect your custom objects)": https://github.com/AlexeyAB/darknet#how-to-train-to-detect-your-custom-objects
Thank you Alexey
Very good explanation.
I have a few more questions:
1) If I wanted to detect one object type, such as just cars and nothing else, would the number of classes be equal to 1?
2) Does the first name in the first line of the "obj.names" file relate to class 1, and similarly does line 2 correspond to class 2?
Finally, I still don't understand why the <x> <y> <width> <height> values for each image are float numbers. If I understood why, I could maybe write software to create these files and values, in case we can't find the tool the authors used to make them.
@VanitarNordic
1) Yes, classes=1 in obj.data and in yolo-obj.cfg (and filters=30 in yolo-obj.cfg). With only one class, <object-class> will always be 0.
Float values are used for <x> <y> <width> <height> because they are relative to the absolute width x height of the image, and so range from 0.0 to 1.0. The advantage of relative values is that they stay valid for any resizing of the image.
Input images can be any size (any width and height), both for training and prediction; each image is resized to the neural-network size (416x416 or 448x448), but the relative values <x> <y> <width> <height> stay valid without changes: https://github.com/AlexeyAB/darknet/blob/master/src/demo.c#L49
Thanks,
Please correct me if the below calculation is not correct:
(x, y: center of the rectangle)
relative x = absolute x / width
relative y = absolute y / height
relative height = absolute height / height
relative width = absolute width / width
@VanitarNordic
Yes.
I created a new repository with GUI-software for generating annotation file for Yolo v2, which I wrote myself before: https://github.com/AlexeyAB/Yolo_mark
Thank you,
May I ask what speed (FPS) you have achieved testing Yolo v2 on a CPU? Mine is very slow (a few seconds per image). Other DNN-based algorithms are slow in training but acceptable at test/run time. Am I doing something wrong?
no idea?
@VanitarNordic
0.3 FPS on CPU, 32 FPS on GPU.
Darknet Yolo v2 is not optimized for CPU and uses only 1-2 cores.
You have a sophisticated graphics card but only 32 FPS. It should be at least 60 FPS for flicker-free real time. Why do the YOLO v1 and v2 authors always claim it is a fast algorithm?
I got 32 FPS for full Yolo v2 480x480 on a GTX 970 without cuDNN. It is not a fast GPU; the top GPU, the Nvidia Titan X GP102, is 3x faster.

GTX 970: 3.5 TFlops-SP
Titan X GM200: 6.1 TFlops-SP (x 1.74)

Results:
32 FPS on GTX 970 (without cuDNN)
59 FPS on Titan X GM200 (x 1.84)

Did you try any other object detectors: Faster-RCNN ResNet-152, SSD 300/500 old & new?
Is 480x480 the input resolution (image or video)? From the curve I can assume that Yolo v2 sits somewhere between speed and accuracy, isn't it?
I have tried Dlib and it seems faster and more accurate.
480x480 is the input resolution of the neural network. All Yolo v2 points lie on the optimal Pareto frontier, i.e. it is state of the art. If you want more than 30 FPS on a Titan X, there is nothing better at the moment for accuracy/speed.
All object detectors in dlib are much less accurate. Which object detector from dlib do you use?
Actually you got 59 FPS on the Titan X as I see, which is good.
I am not deeply familiar with the algorithm itself, so if the input to the neural network is different from the main input, what is the resolution of the main input images (or video from the camera), and what if we decide to use an HD camera as input (such as an HDMI camera)?
I used face pose detection on CPU and it was good, but because I do not have a professional GPU, I have not tested his latest post here: http://blog.dlib.net/ What he claims about speed and accuracy is very good if he is right; it seems the accuracy is better than RCNN.
If you use 480x480 Yolo v2 and capture FullHD video 1920x1080, then each frame will be resized to 480x480, then will be processed by the neural network, with the best accuracy/speed among all realtime (>30 FPS) object-detectors.
If you want to detect very small objects (15x15 pixels) then you can divide the input image (1920x1080) into overlapping (10%) small images (480x480) and process each of them. You have to write this code yourself.
What about Dlib's last blog post?
Also I have heard about Caffe. What is your opinion about them?
@VanitarNordic
It is necessary to distinguish: frameworks, approaches to region proposals, and neural nets.
Frameworks:
Approaches of region proposals - using Caffe:
Neural Networks:
For example, commonly used together:
Thanks,
I mean DetectNet (object detection), which is trained with NVCaffe; GoogLeNet does the classification.
@VanitarNordic DetectNet is worse than Yolo v2.
Results for DetectNet are absent from all detection benchmarks:
DetectNet uses: framework (Caffe) + approach (DetectNet, based on old Yolo v1) + network (DetectNet, based on GoogLeNet)
1) What about Dlib 19.2?
2) I am curious whether I could train Yolo v2 with DIGITS; it probably needs a caffemodel and a prototxt file.
3) What is your opinion about the GTX 1080 GPU? Can you predict how fast Yolo v2 would be (FPS) on this graphics card (for detection)?
@VanitarNordic
But for objects other than faces it may give bad results; dlib is absent from all public detection benchmarks:
Also, currently the best approach, Caffe + RFCN + ResNet-101 (https://github.com/daijifeng001/r-fcn), has a much better result, with 2x fewer errors than FasterRCNN-VGG16.
I.e. dlib is not the best, but it is good.
No, you can't train a Yolo model in Caffe or Caffe-DIGITS. There is software to convert a Yolo v1 cfg-file and weights-file to prototxt and caffemodel, but it works only for the old Yolo v1: https://github.com/xingwangsfu/caffe-yolo
You can simply compare the results from the picture for the nVidia Titan X GM200 (6144 GFlops) with any nVidia GPU from this list: https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units#GeForce_10_series
Thank you, again a very professional and comprehensive explanation. Really, I have nothing more to ask. Fantastic :-)
Also you gave me a parameter (GFlops) to compare GPUs for DNNs, in case I decide to purchase one wisely.
So, by the way, Yolo v2 should be the best both in terms of precision and speed, yes?
@VanitarNordic In different tests there may be different winners. But there are three of the best methods for real time:
For non-real-time, the best is Caffe-RFCN+ResNet101: https://github.com/daijifeng001/r-fcn
Which model in the picture does Caffe-PVANet refer to (the VOC 2007 test, I mean)?
SSD512 is accurate but slow even on a Titan X.
It is not on VOC2007 but on VOC2012 (a comparison of DNNs trained on a very large data set): http://host.robots.ox.ac.uk:8080/leaderboard/displaylb.php?cls=mean&challengeid=11&compid=4&submid=9804
Well, according to the GitHub description it achieved mAP=84.9 on VOC2007, but the speed (FPS) is not mentioned.
All on Titan X (GM200):
PVANet+: mAP=84.2, FPS=22
PVANet+ (compressed): mAP=83.7, FPS=31
https://arxiv.org/pdf/1611.08588v2.pdf
1) When the FPS is low and the model is accurate, is there any way to achieve higher speed? Is there any hardware that performs faster than a GPU?
2) Where did you get the Pascal VOC 2012 result?
3) Does the GPU memory influence model accuracy in training? (Typically we have to adjust the batch size to fit GPUs with smaller memory.)
Also, have you heard about YOLO9000?
There was a chart in one of your previous posts with the competition results, but I cannot see that image now. Can you upload it again or mention the source?
@VanitarNordic All on nVidia Titan X (GM200)
Figure 4: https://arxiv.org/pdf/1612.08242v1.pdf
Collected from many articles: https://drive.google.com/file/d/0BwRgzHpNbsWBTk13bHRnMWFEdVU/view
Hello,
How can I get coordinate information (x, y) of detected object(s)?