@fjoanr Hi,
This is because the destructor is called twice.
If you create a Detector object, then you should not call the destructor explicitly:
{
Detector detector1("yolo-voc.cfg", "yolo-voc.weights");
Detector detector2("yolo-voc.cfg", "yolo-voc.weights");
Detector detector3("yolo-voc.cfg", "yolo-voc.weights");
} // destructor called automatically
{
Detector detector1("yolo-voc.cfg", "yolo-voc.weights");
Detector detector2("yolo-voc.cfg", "yolo-voc.weights");
Detector detector3("yolo-voc.cfg", "yolo-voc.weights");
detector1.~Detector();
detector2.~Detector();
detector3.~Detector();
} // destructors called automatically again - each object is destroyed twice, which is the error
If you want to destroy the Detector manually, then use std::shared_ptr<>:
#include <memory>
{
std::shared_ptr<Detector> detector1 = std::make_shared<Detector>("yolo-voc.cfg", "yolo-voc.weights");
std::shared_ptr<Detector> detector2 = std::make_shared<Detector>("yolo-voc.cfg", "yolo-voc.weights");
std::shared_ptr<Detector> detector3 = std::make_shared<Detector>("yolo-voc.cfg", "yolo-voc.weights");
cv::Mat mat_img = cv::imread("x64/data/dog.jpg");
std::vector<bbox_t> result_vec = detector1->detect(mat_img, 0.2);
detector1.reset(); // destructor called manually
detector2.reset(); // destructor called manually
detector3.reset(); // destructor called manually
} // destructors already called via reset() are not called again; any remaining ones are called here
But I also note that there is a bug in Yolo: when the detector is repeatedly created and deleted many times, a memory leak occurs. I have not fixed this yet.
Hello @AlexeyAB
Thank you for the fast response. Since my program was running out of memory, I decided to compile the 2 YOLOs plus the neural net in a separate project, save the neural network's weights, and then load them into the main project. This solved the issue of multiple detectors not being able to be created and deleted.
I now have a new issue. For the correct implementation of the fusionNet, I need to remove the last layer of Yolo, so that it ends with the convolutional layer:
[convolutional]
size=1
stride=1
pad=1
filters=125
activation=linear
If I run detect(), as the function implies, it returns a vector of detections, but what I need is the raw output of that convolutional layer.
thank you again for your help, it is much appreciated.
Francesc.
@fjoanr Hi,
If you didn't remove the last region-layer from the .cfg-file, then you should implement your own detect() function.
You can get the last layer with layer l = net.layers[net.n - 1]; as is done in detect(): https://github.com/AlexeyAB/darknet/blob/master/src/yolo_v2_class.cpp#L166
And implement your own get_region_boxes(): https://github.com/AlexeyAB/darknet/blob/master/src/region_layer.c#L328
int i, j, n;
// 13*13*(20+5)*5 = 21125
float *predictions = l.output; // 21125 values for 13x13 WxH, 20 classes and 5 anchors
for (i = 0; i < l.w*l.h; ++i) {        // 13x13 cells (W x H)
    int row = i / l.w;
    int col = i % l.w;
    for (n = 0; n < l.n; ++n) {        // 5 anchors
        int index = i*l.n + n;
        int pred_index = index * (l.classes + 5); // one block of 20(classes) + 4(coords) + 1(To) per cell/anchor
        for (j = 0; j < l.classes + 5; ++j) {
            float val = predictions[pred_index + j]; // one of the 25 values for this cell/anchor
            // ... use val ...
        }
    }
}
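If the raw (linear) convolutional output is used instead of the region-layer output, the values still have to be decoded the same way get_region_boxes()/get_region_box() do it: a logistic (sigmoid) for x, y and the objectness, and exp() scaled by the anchor size for w, h. Below is a rough, untested sketch of that decoding, assuming the predictions are already rearranged into the same (cell, anchor, value) order as in the loop above; note that the region layer does that rearrangement itself via flatten() and also applies the logistic to the objectness and the softmax to the class scores, so those steps should not be repeated on its output.
#include <math.h>

static float sigmoid_f(float x) { return 1.f / (1.f + expf(-x)); }

/* Rough sketch: decode one anchor's box from the flattened predictions used
   in the loop above. 'biases' are the anchor pairs from the [region] section
   of the .cfg (l.biases); lw, lh are the grid size (13x13). Untested. */
static void decode_box(const float *predictions, const float *biases,
                       int col, int row, int n, int num, int lw, int lh, int classes)
{
    int index = (row*lw + col)*num + n;          /* same index as in the loop above */
    int p = index * (classes + 5);               /* start of this anchor's block   */

    float x = (col + sigmoid_f(predictions[p + 0])) / lw;     /* box center x, relative to image */
    float y = (row + sigmoid_f(predictions[p + 1])) / lh;     /* box center y, relative to image */
    float w = expf(predictions[p + 2]) * biases[2*n]     / lw;
    float h = expf(predictions[p + 3]) * biases[2*n + 1] / lh;
    float objectness = sigmoid_f(predictions[p + 4]);
    /* class scores are at predictions[p + 5] .. predictions[p + 4 + classes];
       the region layer would apply a softmax over them */

    (void)x; (void)y; (void)w; (void)h; (void)objectness;
}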
@fjoanr Which of the 3 FusionNets do you use? https://arxiv.org/find/all/1/all:+FusionNet/0/1/0/all/0/1
@AlexeyAB Thank you for the info, I will check it out today to see if I can manage it!
About the fusionNet, I am using https://arxiv.org/abs/1507.06821 but with the two streams being ConvNets on RGB images. I am also investigating using an extra convolutional layer instead of the fully connected layers, but we will see about it :D
@fjoanr Yes, this is an interesting topic: how much the accuracy of top CNNs can be improved by using RGB-D compared with RGB.
What do you use to get depth from the image? cv::gpu::StereoConstantSpaceBP from OpenCV?
Earlier I thought that good help could only come from a depth-map obtained from active cameras (ToF, Lidar, ...). But now there is reason to believe that the depth-map obtained from passive stereo cameras also has enough accuracy to help in the recognition of objects. The autopilot on Tesla cars probably uses 3D reconstruction and a depth-map to help detect objects, using only passive optical cameras (these should be very good cameras, with excellent optics and high resolution): http://blog.ted.com/what-will-the-future-look-like-elon-musk-speaks-at-ted2017/
What’s happening at Tesla? Tesla Model 3 is coming in July, Musk says, and it’ll have a special feature: autopilot. Using only passive optical cameras and GPS, no LIDAR, the Model 3 will be capable of autonomous driving. “Once you solve cameras for vision, autonomy is solved; if you don’t solve vision, it’s not solved.”
Previously, a good-quality depth-map could be obtained only by using active cameras, such as the Kinect:
Hey @AlexeyAB
It is indeed a very interesting approach, and with the correct testing it might be able to improve the quality of current state-of-the-art object detectors. In my case, I am not using a depth stream, but 2 RGB streams. The idea of using footage from ToF or stereo cameras is interesting, but don't you think the computational and hardware costs of implementing such a surveillance system would increase too much?
At one point this year I was using the Kinect to obtain depth images and worked on another human tracker, but the quality of the depth images from Kinect is (right now) far from ideal, i.e., it needs a lot of pre-processing to be useful in any kind of application.
I have been able to obtain and save the features from the last convolutional layer of YOLO. Now, when I try to train the new fusion-stream (with one convolutional layer and the region layer), I get an error in the console as in the image below:
The fusionnet.cfg file is:
[net]
# Testing
# batch=1
# subdivisions=1
# Training
batch=64
subdivisions=8
height=13
width=13
channels=1
momentum=0.9
decay=0.0005
angle=0
saturation = 1.5
exposure = 1.5
hue=.1
learning_rate=0.001
burn_in=1000
max_batches = 80200
policy=steps
steps=40000,60000
scales=.1,.1
[convolutional]
size=1
stride=1
pad=0
filters=125
activation=linear
[region]
anchors = 1.3221, 1.73145, 3.19275, 4.00944, 5.05587, 8.09892, 9.47112, 4.84053, 11.2364, 10.0071
bias_match=1
classes=20
coords=4
num=5
softmax=1
jitter=.3
rescore=1
object_scale=5
noobject_scale=1
class_scale=1
coord_scale=1
absolute=1
thresh = .6
random=1
And the images are stored in darknet-master/CONVdevkit/conv and conv_test, with images named 00_0x (x from 0 to 62500). Do you have any idea or suggestion on where I should look in the main code to understand the issue?
Thank you!
@fjoanr Hi,
Please show the content of the data/obj.data file and a small part of the train.txt file.
Also, what do you send to the input of the network as 13x13x1? I.e., which part of the saved 13x13x125 features (from the last convolutional layer of YOLO) do you send to the input?
Did you save layer-30 (13x13x125) or layer-29 (13x13x1024)?
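If it helps to check which layer index corresponds to which feature map, here is a rough, untested sketch using the plain darknet C API (header names may differ in this fork) that prints every layer's output size:
#include <stdio.h>
#include "network.h"
#include "parser.h"    /* parse_network_cfg(), load_weights() */

/* Sketch: print the output dimensions of every layer, so it is easy to see
   which index gives 13x13x125 and which gives 13x13x1024. Untested. */
void print_layer_sizes(char *cfgfile, char *weightfile)
{
    network net = parse_network_cfg(cfgfile);
    if (weightfile) load_weights(&net, weightfile);

    int i;
    for (i = 0; i < net.n; ++i) {
        layer l = net.layers[i];
        printf("layer %2d: %d x %d x %d  (%d outputs)\n",
               i, l.out_w, l.out_h, l.out_c, l.outputs);
    }
}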
You should use only the old yolo-voc.2.0.cfg for your own training. The new yolo-voc.cfg can only be used with the already trained yolo-voc.weights.
I would recommend using yolo-voc.2.0.cfg instead of yolo-voc.cfg as a template for your own .cfg-file (or yolo.2.0.cfg instead of yolo.cfg), because the new yolo-voc.cfg is intended for the already trained weights from https://pjreddie.com/darknet/yolo/ and requires additional code for training that isn't well debugged yet and is currently present only on Linux, not in this Windows fork.
There is a problem training on Windows using the new yolo-voc.cfg or yolo.cfg: https://github.com/AlexeyAB/darknet/issues/71#issuecomment-298823064
Yes, with stereo cameras the computational cost of cv::gpu::StereoConstantSpaceBP from OpenCV is very high, and it gives a very poor distance map that also requires very coarse and costly filtering (BilateralFilter, MedianFilter, ...). With stereo cameras at 700x500 I got about ~1 FPS and a 70x50-resolution depth-map, using a GeForce GTX 645 (~800 GFlops).
But as I understand it, the computational cost of a depth-map from ToF is small enough, because ToF gives a ready-made distance map (it doesn't require StereoConstantSpaceBP), so only the cost of filtering remains (BilateralFilter or MedianFilter, ...). Kinect Fusion looks good enough: https://www.youtube.com/watch?v=ra3xxLepRfA
Lidar can see much further and more precisely, but it is much more expensive: about $10,000 instead of $100 for ToF.
Hello, here is the content of data/obj.data:
classes= 20
train = data/conv/conv.txt
valid = data/conv/conv_test.txt
names = data/obj.names
backup = backup/
train.txt is my conv.txt file, with paths to the images as: C:/VGIS8/darknet-master/CONVdevkit/ConvOutput/2017_00.jpg, C:/VGIS8/darknet-master/CONVdevkit/ConvOutput/2017_01.jpg, ...
The last conv layer outputs a 13x13x125 feature map. Since I want to store them as images, I divide each map into 125 outputs. Thus, the input to the network is 13x13x1 representing each depth slice of the feature map of 125 slices (using 500 train images I would input 62500 13x13 images to the fusion layer). The layer I am saving is indeed the 13x13x125, and to obtain the same output size after the convolutional layer of my fusionnet, I apply the convolution without stride, to maintain the size.
I will modify the parameters in fusionnet.cfg with the parameters of yolo-voc.2.0.cfg, and test again to check if the network trains better without pre-trained weights.
That Kinect app looks pretty good in terms of accuracy indeed. Do you know if it is the new Kinect (I think it's called Kinect2.0) or the old one? I was only able to work with the old one that I had available here and the maximum fps it could work on was around 30 at 640x480 resolution, before pre-processing...
That is the issue: training a model with both depth and RGB for, e.g., ILSVRC might be good enough because the pictures are not in open spaces or big areas. However, if you want to design a model to work with real-life cameras for surveillance applications, the hardware costs would increase too much (at least for my project :D), and the drop in framerate might also be an issue that should be looked into.
@fjoanr Hi,
Your error occurs because you don't have the .txt-files with labels. I.e., if you initially have dog.jpg and dog.txt with labels, and for this image you got 125 images dog_01.jpg, dog_02.jpg, ... dog_125.jpg, then you should copy dog.txt to dog_01.txt, dog_02.txt, ... dog_125.txt.
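As a small illustration, here is a plain C sketch of replicating one label file into the per-slice copies; the dog_XXX.txt naming is only an example and should be adapted to the actual filenames of the sliced feature images:
#include <stdio.h>

/* Sketch: copy one label file (dog.txt) to the 125 per-slice copies
   (dog_001.txt ... dog_125.txt). The naming pattern is only an example;
   adapt it to the real filenames of the sliced feature images. Untested. */
int main(void)
{
    const char *src = "dog.txt";
    int k;
    for (k = 1; k <= 125; ++k) {
        char dst[256];
        char buf[4096];
        size_t got;
        FILE *in = fopen(src, "rb");
        if (!in) { perror(src); return 1; }
        sprintf(dst, "dog_%03d.txt", k);
        FILE *out = fopen(dst, "wb");
        if (!out) { perror(dst); fclose(in); return 1; }
        while ((got = fread(buf, 1, sizeof(buf), in)) > 0)
            fwrite(buf, 1, got, out);
        fclose(in);
        fclose(out);
    }
    return 0;
}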
The last conv layer outputs a 13x13x125 feature map. Since I want to store them as images, I divide each map into 125 outputs. Thus, the input to the network is 13x13x1 representing each depth slice of the feature map of 125 slices (using 500 train images I would input 62500 13x13 images to the fusion layer). ...
I'm not sure if this is the best way, but it may be.
So, which 2 RGB images do you want to merge via FusionNet? Are these two RGB images of the same object on the same background, but from different points of view - i.e. 2 images from stereo-cameras?
And how will you merge these two RGB images?
Yes it was Kinect 2.0. And Kinect 1.0 is certainly worse, but it's still much better than stereo-cameras :) https://www.youtube.com/watch?v=Zx2E19IV2zs
Hello @AlexeyAB, sorry for the delay in answering.
I've managed to create the img_00.txt ... img_62499.txt files as you explained and the network is training right now!! HOORAYY! :D
About the use of the convolutional features, it might not be the best idea to save them this way, but I didn't come up with another solution so far... let's see how the training goes. So far, the network has trained for ~200 iterations and the avg loss is going down, but I have noticed that the values for Obj, No Obj and Avg Recall are most of the time 0.0000... Is it because I am at an early training stage, or might something be wrong?
For the training of the fusionnet, I am applying 2 YOLOs to the same RGB image from the VOC database. It sounds reasonable that the output of this implementation should return the same values as the original YOLO, as I am not modifying anything and just working with 2 averaged YOLO outputs. This is the first run, if I find out it works as intended, the 2nd step would be to combine YOLO with another CNN implementation on one of the fusion streams (RCNN, ResNet or so).
About merging the images... I just averaged them so far, it might not be the best approach either but again, I need to try some implementations to see if it works out.
Do you think I need to train the fusion layer for 2000*classes iterations, given that the network is just 3 layers?
@fjoanr Hi,
the AVG is going down but I have noticed that the values for Obj, No Obj and Avg Recall are most of the times 0.0000... is it because I am in an early training stage or something might be wrong?
Do you think I need to train the fusion layer 2000*classes times? Since the network is just 3 layers?
Yes, I think it is necessary to train 2000 iterations per class and test several previous weights.
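(For the 20 VOC classes in the .cfg above, that works out to roughly 20 × 2000 = 40,000 iterations.)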
Hi @AlexeyAB ,
I've been training the fusionnet as we've discussed here, but in the end, when running darknet.exe detector recall ..., all the values of IoU and Recall are 0%, so I guess my approach was not correct.
I will try a new approach, where I gather the convolutional features (13x13x125) split into 125 images of 13x13 (as before), but in the darknet.sln code I would like to merge the 125 channels together again, because OpenCV cannot save an image with 125 channels and darknet.exe detector is unable to read the 13x13x125 matrix from OpenCV, as this error pops up:
The problem is that I am unable to find the place where the code reads the images. In void train_detector( ... ) from detector.c, lines 54-56, the code reads:
list *plist = get_paths(train_images);
//int N = plist->size;
char **paths = (char **)list_to_array(plist);
which I suppose reads the paths from the train.txt file. Where should I modify the code to add merging of single-channel images into a multidimensional matrix?
Thank you again!
@fjoanr Hi,
During training, this code loads the images.
First load, before the loop: pthread_t load_thread = load_data(args);
https://github.com/AlexeyAB/darknet/blob/d8bafc728478e5cba9cf41eca01d66a38800eddd/src/detector.c#L76
And the next loads, inside the loop: load_thread = load_data(args);
https://github.com/AlexeyAB/darknet/blob/d8bafc728478e5cba9cf41eca01d66a38800eddd/src/detector.c#L103
pthread_create(&thread, 0, load_threads, ptr)
https://github.com/AlexeyAB/darknet/blob/d8bafc728478e5cba9cf41eca01d66a38800eddd/src/data.c#L803
threads[i] = load_data_in_thread(args);
https://github.com/AlexeyAB/darknet/blob/d8bafc728478e5cba9cf41eca01d66a38800eddd/src/data.c#L782
pthread_create(&thread, 0, load_thread, ptr)
https://github.com/AlexeyAB/darknet/blob/d8bafc728478e5cba9cf41eca01d66a38800eddd/src/data.c#L764
load_data_region(a.n, a.paths,
... https://github.com/AlexeyAB/darknet/blob/d8bafc728478e5cba9cf41eca01d66a38800eddd/src/data.c#L742
image orig = load_image_color(random_paths[i], 0, 0);
https://github.com/AlexeyAB/darknet/blob/d8bafc728478e5cba9cf41eca01d66a38800eddd/src/data.c#L515
load_image(filename, w, h, 3);
https://github.com/AlexeyAB/darknet/blob/d8bafc728478e5cba9cf41eca01d66a38800eddd/src/image.c#L1219
image out = load_image_cv(filename, c);
https://github.com/AlexeyAB/darknet/blob/d8bafc728478e5cba9cf41eca01d66a38800eddd/src/image.c#L1204
image load_image_cv(char *filename, int channels)
https://github.com/AlexeyAB/darknet/blob/d8bafc728478e5cba9cf41eca01d66a38800eddd/src/image.c#L498So I think probably you should change load_data_region()
https://github.com/AlexeyAB/darknet/blob/d8bafc728478e5cba9cf41eca01d66a38800eddd/src/data.c#L500
You should remove all the code from load_data_region(), then load your 125 images with image orig = load_image_color(random_paths[i], 0, 0); in a loop, and fuse them into d.X.vals. You should allocate enough space for the 13x13x125 features in d.X.vals.
Or you can load one file which contains 13x13x125 features, if you saved layer-30 manually as one file.
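A very rough, untested sketch of what such a replacement loader could look like, intended to sit in data.c next to load_data_region(). It assumes the 125 per-channel image paths can be derived from the path in train.txt (the make_slice_path() helper below is hypothetical), and it leaves the label filling (d.y) exactly as load_data_region() already does it:
#include <stdlib.h>
#include <string.h>
#include "data.h"

/* Sketch: load 125 single-channel 13x13 images per training sample and fuse
   them into one 13*13*125 input vector in d.X.vals[i]. Untested; the
   make_slice_path() naming helper is hypothetical. */
data load_data_fused(int n, char **paths, int m, int w, int h, int classes, int num_boxes)
{
    char **random_paths = get_random_paths(paths, n, m);
    int i, c;
    data d = {0};
    d.shallow = 0;

    int channels = 125;
    d.X.rows = n;
    d.X.cols = w*h*channels;                     /* 13*13*125 */
    d.X.vals = calloc(d.X.rows, sizeof(float*));

    d.y = make_matrix(n, num_boxes*num_boxes*(5 + classes));

    for (i = 0; i < n; ++i) {
        float *fused = calloc(w*h*channels, sizeof(float));
        for (c = 0; c < channels; ++c) {
            char slice_path[4096];
            make_slice_path(random_paths[i], c, slice_path);   /* hypothetical: builds the path of slice c */
            image slice = load_image(slice_path, w, h, 1);     /* load as 13x13x1 grayscale */
            memcpy(fused + c*w*h, slice.data, w*h*sizeof(float));
            free_image(slice);
        }
        d.X.vals[i] = fused;

        /* fill d.y.vals[i] from the label .txt the same way load_data_region()
           does, e.g. via fill_truth_region(random_paths[i], d.y.vals[i], classes, num_boxes, ...) */
    }
    free(random_paths);
    return d;
}
Presumably the .cfg would then also need channels=125 (with width=13, height=13) so that the network input matches the fused 13x13x125 tensor.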
Hello @AlexeyAB
Thank you for the reply once again! I will try to sort the issues out, and I will mark this issue as closed.
Best regards, Francesc.
Hello Alex,
first of all, thank you for this repository, it works wonders so far!
I would like to know how to correctly delete a Detector instance when using YOLO as a DLL in my own VS project... What I have at the moment is:
Detector detector1("yolo-voc.cfg", "yolo-voc.weights");
Detector detector2("yolo-voc.cfg", "yolo-voc.weights");
...
some implementation to train an external Net
...
detector1.~Detector();
detector2.~Detector();
The problem is that when running I get an error saying "yolo_console_dll.exe has triggered a breakpoint", which points to yolo_v2_class.cpp line 78:
free(detector_gpu.avg);
Do you have any idea how to destroy a Detector object, or how it could be solved using the ~Detector() destructor?
thank you very much for your time,
best regards, Francesc.