@fjoanr Hi,
This is because the destructor is called twice.
If you create a Detector object, then you should not call the destructor explicitly:
{
Detector detector1("yolo-voc.cfg", "yolo-voc.weights");
Detector detector2("yolo-voc.cfg", "yolo-voc.weights");
Detector detector3("yolo-voc.cfg", "yolo-voc.weights");
} // destructor called automatically
{
Detector detector1("yolo-voc.cfg", "yolo-voc.weights");
Detector detector2("yolo-voc.cfg", "yolo-voc.weights");
Detector detector3("yolo-voc.cfg", "yolo-voc.weights");
detector1.~Detector();
detector2.~Detector();
detector3.~Detector();
} // destructors called automatically again - each object is destroyed twice, which is the error
If you want to destroy the Detector manually, then use std::shared_ptr<>:
#include <memory>
{
std::shared_ptr<Detector> detector1 = std::make_shared<Detector>("yolo-voc.cfg", "yolo-voc.weights");
std::shared_ptr<Detector> detector2 = std::make_shared<Detector>("yolo-voc.cfg", "yolo-voc.weights");
std::shared_ptr<Detector> detector3 = std::make_shared<Detector>("yolo-voc.cfg", "yolo-voc.weights");
cv::Mat mat_img = cv::imread("x64/data/dog.jpg");
std::vector<bbox_t> result_vec = detector1->detect(mat_img, 0.2);
detector1.reset(); // destructor called manually
detector2.reset(); // destructor called manually
detector3.reset(); // destructor called manually
} // destructors already called via reset() are not called again; any remaining ones are called here
But I also note that there is a bug in Yolo: when the detector is repeatedly created and deleted many times, a memory leak occurs. I have not fixed this yet.
Hello @AlexeyAB
Thank you for the fast response. Since my program was running out of memory, I decided to compile the 2 YOLOs plus the neural net in a separate project, save the neural network's weights, and then load them into the main project. This solved the issue of multiple detectors not being able to be created and deleted.
I now have a new issue. For the correct implementation of the fusionNet, I need to remove the last layer of Yolo, so that it ends with the convolutional layer:
[convolutional]
size=1
stride=1
pad=1
filters=125
activation=linear
If I run detect(), as the function implies, it returns a vector of detections, but what I need is the raw output of that convolutional layer.
thank you again for your help, it is much appreciated.
Francesc.
@fjoanr Hi,
If you didn't remove the last region-layer from the .cfg-file, then you should implement your own detect() function.
You can get the last layer with layer l = net.layers[net.n - 1]; as is done in detect(): https://github.com/AlexeyAB/darknet/blob/master/src/yolo_v2_class.cpp#L166
And implement your own get_region_boxes(): https://github.com/AlexeyAB/darknet/blob/master/src/region_layer.c#L328
int i, j, n;
// 13*13*(20+5)*5 = 21125
float *predictions = l.output; // 21125 values for 13x13 WxH, 20 classes and 5 anchors
for (i = 0; i < l.w*l.h; ++i) {        // 13x13 cells (W x H)
    int row = i / l.w;
    int col = i % l.w;
    for (n = 0; n < l.n; ++n) {        // 5 anchors
        int index = i*l.n + n;
        int pred_index = index * (l.classes + 5); // one block of 20(classes) + 4(coords) + 1(To) per cell/anchor
        for (j = 0; j < l.classes + 5; ++j) {
            float val = predictions[pred_index + j]; // one of the 25 values for this cell/anchor
            // ... use val ...
        }
    }
}
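If the raw (linear) convolutional output is used instead of the region-layer output, the values still have to be decoded the same way get_region_boxes()/get_region_box() do it: a logistic (sigmoid) for x, y and the objectness, and exp() scaled by the anchor size for w, h. Below is a rough, untested sketch of that decoding, assuming the predictions are already rearranged into the same (cell, anchor, value) order as in the loop above; note that the region layer does that rearrangement itself via flatten() and also applies the logistic to the objectness and the softmax to the class scores, so those steps should not be repeated on its output.
#include <math.h>

static float sigmoid_f(float x) { return 1.f / (1.f + expf(-x)); }

/* Rough sketch: decode one anchor's box from the flattened predictions used
   in the loop above. 'biases' are the anchor pairs from the [region] section
   of the .cfg (l.biases); lw, lh are the grid size (13x13). Untested. */
static void decode_box(const float *predictions, const float *biases,
                       int col, int row, int n, int num, int lw, int lh, int classes)
{
    int index = (row*lw + col)*num + n;          /* same index as in the loop above */
    int p = index * (classes + 5);               /* start of this anchor's block   */

    float x = (col + sigmoid_f(predictions[p + 0])) / lw;     /* box center x, relative to image */
    float y = (row + sigmoid_f(predictions[p + 1])) / lh;     /* box center y, relative to image */
    float w = expf(predictions[p + 2]) * biases[2*n]     / lw;
    float h = expf(predictions[p + 3]) * biases[2*n + 1] / lh;
    float objectness = sigmoid_f(predictions[p + 4]);
    /* class scores are at predictions[p + 5] .. predictions[p + 4 + classes];
       the region layer would apply a softmax over them */

    (void)x; (void)y; (void)w; (void)h; (void)objectness;
}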
@fjoanr Which of the 3 FusionNets do you use? https://arxiv.org/find/all/1/all:+FusionNet/0/1/0/all/0/1
@AlexeyAB Thank you for the info, I will check it out today to see if I can manage it!
About the fusionNet, I am using https://arxiv.org/abs/1507.06821 but with the two streams being ConvNets on RGB images. I am also investigating using an extra convolutional layer instead of the fully connected layers, but we will see about it :D
@fjoanr Yes, this is an interesting topic: how much the accuracy of top CNNs can be improved by using RGB-D compared with RGB.
What do you use to get depth from the image? cv::gpu::StereoConstantSpaceBP from OpenCV?
Earlier I thought that good help could only come from a depth-map obtained from active cameras (ToF, Lidar, ...). But now there is reason to believe that the depth-map obtained from passive stereo cameras also has enough accuracy to help in the recognition of objects. The autopilot on Tesla cars probably uses 3D reconstruction and a depth-map to help detect objects, using only passive optical cameras (these should be very good cameras, with excellent optics and high resolution): http://blog.ted.com/what-will-the-future-look-like-elon-musk-speaks-at-ted2017/
What’s happening at Tesla? Tesla Model 3 is coming in July, Musk says, and it’ll have a special feature: autopilot. Using only passive optical cameras and GPS, no LIDAR, the Model 3 will be capable of autonomous driving. “Once you solve cameras for vision, autonomy is solved; if you don’t solve vision, it’s not solved.”
Previously, a good-quality depth-map could be obtained only by using active cameras, such as the Kinect:
Hey @AlexeyAB
It is indeed a very interesting approach, and with the correct testing it might be able to improve the quality of current state-of-the-art object detectors. In my case, I am not using a depth stream, but 2 RGB streams. The idea of using footage from ToF or stereo cameras is interesting, but don't you think the computational and hardware costs of implementing such a surveillance system would increase too much?
At one point this year I was using the Kinect to obtain depth images and worked on another human tracker, but the quality of the depth images from Kinect is (right now) far from ideal, i.e., it needs a lot of pre-processing to be useful in any kind of application.
I have been able to obtain and save the features from the last convolutional layer of YOLO. Now, when I try to train the new fusion-stream (with one convolutional layer and the region layer), I get an error in the console as in the image below:
The fusionnet.cfg file is:
[net]
# Testing
# batch=1
# subdivisions=1
# Training
batch=64
subdivisions=8
height=13
width=13
channels=1
momentum=0.9
decay=0.0005
angle=0
saturation = 1.5
exposure = 1.5
hue=.1
learning_rate=0.001
burn_in=1000
max_batches = 80200
policy=steps
steps=40000,60000
scales=.1,.1
[convolutional]
size=1
stride=1
pad=0
filters=125
activation=linear
[region]
anchors = 1.3221, 1.73145, 3.19275, 4.00944, 5.05587, 8.09892, 9.47112, 4.84053, 11.2364, 10.0071
bias_match=1
classes=20
coords=4
num=5
softmax=1
jitter=.3
rescore=1
object_scale=5
noobject_scale=1
class_scale=1
coord_scale=1
absolute=1
thresh = .6
random=1
And the images are stored in darknet-master/CONVdevkit/conv and conv_test, with images named 00_0x (x from 0 to 62500). Do you have any idea or suggestion on where I should look in the main code to understand the issue?
Thank you!
@fjoanr Hi,
Please show the content of the data/obj.data file and a small part of the train.txt file.
Also, what do you send to the input of the network as 13x13x1? I.e., which part of the saved 13x13x125 features (from the last convolutional layer of YOLO) do you send to the input?
Did you save layer-30 (13x13x125) or layer-29 (13x13x1024)?
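If it helps to check which layer index corresponds to which feature map, here is a rough, untested sketch using the plain darknet C API (header names may differ in this fork) that prints every layer's output size:
#include <stdio.h>
#include "network.h"
#include "parser.h"    /* parse_network_cfg(), load_weights() */

/* Sketch: print the output dimensions of every layer, so it is easy to see
   which index gives 13x13x125 and which gives 13x13x1024. Untested. */
void print_layer_sizes(char *cfgfile, char *weightfile)
{
    network net = parse_network_cfg(cfgfile);
    if (weightfile) load_weights(&net, weightfile);

    int i;
    for (i = 0; i < net.n; ++i) {
        layer l = net.layers[i];
        printf("layer %2d: %d x %d x %d  (%d outputs)\n",
               i, l.out_w, l.out_h, l.out_c, l.outputs);
    }
}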
You should use only the old yolo-voc.2.0.cfg for your own training. The new yolo-voc.cfg can only be used with the already trained yolo-voc.weights.
I would recommend using yolo-voc.2.0.cfg instead of yolo-voc.cfg as a template for your own .cfg-file (or yolo.2.0.cfg instead of yolo.cfg), because the new yolo-voc.cfg is intended for the already trained weights from https://pjreddie.com/darknet/yolo/ and requires additional code for training that isn't well debugged yet and is currently present only on Linux, not in this Windows fork.
There is a problem training on Windows using the new yolo-voc.cfg or yolo.cfg: https://github.com/AlexeyAB/darknet/issues/71#issuecomment-298823064
Yes, with stereo cameras the computational cost of cv::gpu::StereoConstantSpaceBP from OpenCV is very high, and it gives a very poor distance map that also requires very coarse and costly filtering (BilateralFilter, MedianFilter, ...). With stereo cameras at 700x500 I got about ~1 FPS and a 70x50-resolution depth-map, using a GeForce GTX 645 (~800 GFlops).
But as I understand it, the computational cost of a depth-map from ToF is small enough, because ToF gives a ready-made distance map (it doesn't require StereoConstantSpaceBP), so only the cost of filtering remains (BilateralFilter or MedianFilter, ...). Kinect Fusion looks good enough: https://www.youtube.com/watch?v=ra3xxLepRfA
Lidar can see much further and more precisely, but it is much more expensive: about $10,000 instead of $100 for ToF.
Hello, here is the content of data/obj.data:
classes= 20
train = data/conv/conv.txt
valid = data/conv/conv_test.txt
names = data/obj.names
backup = backup/
train.txt is my conv.txt file, with paths to the images as: C:/VGIS8/darknet-master/CONVdevkit/ConvOutput/2017_00.jpg, C:/VGIS8/darknet-master/CONVdevkit/ConvOutput/2017_01.jpg, ...
The last conv layer outputs a 13x13x125 feature map. Since I want to store them as images, I divide each map into 125 outputs. Thus, the input to the network is 13x13x1 representing each depth slice of the feature map of 125 slices (using 500 train images I would input 62500 13x13 images to the fusion layer). The layer I am saving is indeed the 13x13x125, and to obtain the same output size after the convolutional layer of my fusionnet, I apply the convolution without stride, to maintain the size.
I will modify the parameters in fusionnet.cfg with the parameters of yolo-voc.2.0.cfg, and test again to check if the network trains better without pre-trained weights.
That Kinect app looks pretty good in terms of accuracy indeed. Do you know if it is the new Kinect (I think it's called Kinect2.0) or the old one? I was only able to work with the old one that I had available here and the maximum fps it could work on was around 30 at 640x480 resolution, before pre-processing...
That is the issue: training a model with both depth and RGB for, e.g., ILSVRC might be good enough because the pictures are not in open spaces or big areas. However, if you want to design a model to work with real-life cameras for surveillance applications, the hardware costs would increase too much (at least for my project :D), and the drop in framerate might also be an issue that should be looked into.
@fjoanr Hi,
Your error occurs because you don't have the .txt-files with labels. I.e., if you initially have dog.jpg and dog.txt with labels, and for this image you got 125 images dog_01.jpg, dog_02.jpg, ... dog_125.jpg, then you should copy dog.txt to dog_01.txt, dog_02.txt, ... dog_125.txt.
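As a small illustration, here is a plain C sketch of replicating one label file into the per-slice copies; the dog_XXX.txt naming is only an example and should be adapted to the actual filenames of the sliced feature images:
#include <stdio.h>

/* Sketch: copy one label file (dog.txt) to the 125 per-slice copies
   (dog_001.txt ... dog_125.txt). The naming pattern is only an example;
   adapt it to the real filenames of the sliced feature images. Untested. */
int main(void)
{
    const char *src = "dog.txt";
    int k;
    for (k = 1; k <= 125; ++k) {
        char dst[256];
        char buf[4096];
        size_t got;
        FILE *in = fopen(src, "rb");
        if (!in) { perror(src); return 1; }
        sprintf(dst, "dog_%03d.txt", k);
        FILE *out = fopen(dst, "wb");
        if (!out) { perror(dst); fclose(in); return 1; }
        while ((got = fread(buf, 1, sizeof(buf), in)) > 0)
            fwrite(buf, 1, got, out);
        fclose(in);
        fclose(out);
    }
    return 0;
}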
The last conv layer outputs a 13x13x125 feature map. Since I want to store them as images, I divide each map into 125 outputs. Thus, the input to the network is 13x13x1 representing each depth slice of the feature map of 125 slices (using 500 train images I would input 62500 13x13 images to the fusion layer). ...
I'm not sure if this is the best way, but it may be.
So, which 2 RGB images do you want to merge via FusionNet? Are these two RGB images of the same object on the same background, but from different points of view - i.e. 2 images from stereo-cameras?
And how will you merge these two RGB images?
Yes it was Kinect 2.0. And Kinect 1.0 is certainly worse, but it's still much better than stereo-cameras :) https://www.youtube.com/watch?v=Zx2E19IV2zs
Hello @AlexeyAB, sorry for the delay in answering.
I've managed to create the img_00.txt ... img_62499.txt files as you explained and the network is training right now!! HOORAYY! :D
About the use of the convolutional features, it might not be the best idea to save them this way, but I didn't come up with another solution so far... let's see how the training goes. So far, the network has trained for ~200 iterations and the avg loss is going down, but I have noticed that the values for Obj, No Obj and Avg Recall are most of the time 0.0000... Is it because I am at an early training stage, or might something be wrong?
For the training of the fusionnet, I am applying 2 YOLOs to the same RGB image from the VOC database. It sounds reasonable that the output of this implementation should return the same values as the original YOLO, as I am not modifying anything and just working with 2 averaged YOLO outputs. This is the first run, if I find out it works as intended, the 2nd step would be to combine YOLO with another CNN implementation on one of the fusion streams (RCNN, ResNet or so).
About merging the images... I just averaged them so far, it might not be the best approach either but again, I need to try some implementations to see if it works out.
Do you think I need to train the fusion layer for 2000*classes iterations, given that the network is just 3 layers?
@fjoanr Hi,
the AVG is going down but I have noticed that the values for Obj, No Obj and Avg Recall are most of the times 0.0000... is it because I am in an early training stage or something might be wrong?
Do you think I need to train the fusion layer 2000*classes times? Since the network is just 3 layers?
Yes, I think it is necessary to train 2000 iterations per class and test several previous weights.
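(For the 20 VOC classes in the .cfg above, that works out to roughly 20 × 2000 = 40,000 iterations.)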
Hi @AlexeyAB ,
I've been training the fusionnet as we've discussed here, but in the end, when running darknet.exe detector recall ..., all the values of IoU and Recall are 0%, so I guess my approach was not correct.
I will try a new approach, where I gather the convolutional features (13x13x125) split into 125 images of 13x13 (as before), but in the darknet.sln code I would like to merge the 125 channels together again, because OpenCV cannot save an image with 125 channels and darknet.exe detector is unable to read the 13x13x125 matrix from OpenCV, as this error pops up:
The problem is that I am unable to find the place where the code reads the images. In void train_detector( ... ) from detector.c, lines 54-56, the code reads:
list *plist = get_paths(train_images);
//int N = plist->size;
char **paths = (char **)list_to_array(plist);
which I suppose reads the paths from the train.txt file. Where should I modify the code to add merging of single-channel images into a multidimensional matrix?
Thank you again!
@fjoanr Hi,
During training, this code loads the images.
First load, before the loop: pthread_t load_thread = load_data(args);
https://github.com/AlexeyAB/darknet/blob/d8bafc728478e5cba9cf41eca01d66a38800eddd/src/detector.c#L76
And the next loads, inside the loop: load_thread = load_data(args);
https://github.com/AlexeyAB/darknet/blob/d8bafc728478e5cba9cf41eca01d66a38800eddd/src/detector.c#L103
pthread_create(&thread, 0, load_threads, ptr)
https://github.com/AlexeyAB/darknet/blob/d8bafc728478e5cba9cf41eca01d66a38800eddd/src/data.c#L803
threads[i] = load_data_in_thread(args);
https://github.com/AlexeyAB/darknet/blob/d8bafc728478e5cba9cf41eca01d66a38800eddd/src/data.c#L782
pthread_create(&thread, 0, load_thread, ptr)
https://github.com/AlexeyAB/darknet/blob/d8bafc728478e5cba9cf41eca01d66a38800eddd/src/data.c#L764
load_data_region(a.n, a.paths,
... https://github.com/AlexeyAB/darknet/blob/d8bafc728478e5cba9cf41eca01d66a38800eddd/src/data.c#L742
image orig = load_image_color(random_paths[i], 0, 0);
https://github.com/AlexeyAB/darknet/blob/d8bafc728478e5cba9cf41eca01d66a38800eddd/src/data.c#L515
load_image(filename, w, h, 3);
https://github.com/AlexeyAB/darknet/blob/d8bafc728478e5cba9cf41eca01d66a38800eddd/src/image.c#L1219
image out = load_image_cv(filename, c);
https://github.com/AlexeyAB/darknet/blob/d8bafc728478e5cba9cf41eca01d66a38800eddd/src/image.c#L1204
image load_image_cv(char *filename, int channels)
https://github.com/AlexeyAB/darknet/blob/d8bafc728478e5cba9cf41eca01d66a38800eddd/src/image.c#L498So I think probably you should change load_data_region()
https://github.com/AlexeyAB/darknet/blob/d8bafc728478e5cba9cf41eca01d66a38800eddd/src/data.c#L500
You should remove all the code from load_data_region(), then load your 125 images with image orig = load_image_color(random_paths[i], 0, 0); in a loop, and fuse them into d.X.vals. You should allocate enough space for the 13x13x125 features in d.X.vals.
Or you can load one file which contains 13x13x125 features, if you saved layer-30 manually as one file.
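A very rough, untested sketch of what such a replacement loader could look like, intended to sit in data.c next to load_data_region(). It assumes the 125 per-channel image paths can be derived from the path in train.txt (the make_slice_path() helper below is hypothetical), and it leaves the label filling (d.y) exactly as load_data_region() already does it:
#include <stdlib.h>
#include <string.h>
#include "data.h"

/* Sketch: load 125 single-channel 13x13 images per training sample and fuse
   them into one 13*13*125 input vector in d.X.vals[i]. Untested; the
   make_slice_path() naming helper is hypothetical. */
data load_data_fused(int n, char **paths, int m, int w, int h, int classes, int num_boxes)
{
    char **random_paths = get_random_paths(paths, n, m);
    int i, c;
    data d = {0};
    d.shallow = 0;

    int channels = 125;
    d.X.rows = n;
    d.X.cols = w*h*channels;                     /* 13*13*125 */
    d.X.vals = calloc(d.X.rows, sizeof(float*));

    d.y = make_matrix(n, num_boxes*num_boxes*(5 + classes));

    for (i = 0; i < n; ++i) {
        float *fused = calloc(w*h*channels, sizeof(float));
        for (c = 0; c < channels; ++c) {
            char slice_path[4096];
            make_slice_path(random_paths[i], c, slice_path);   /* hypothetical: builds the path of slice c */
            image slice = load_image(slice_path, w, h, 1);     /* load as 13x13x1 grayscale */
            memcpy(fused + c*w*h, slice.data, w*h*sizeof(float));
            free_image(slice);
        }
        d.X.vals[i] = fused;

        /* fill d.y.vals[i] from the label .txt the same way load_data_region()
           does, e.g. via fill_truth_region(random_paths[i], d.y.vals[i], classes, num_boxes, ...) */
    }
    free(random_paths);
    return d;
}
Presumably the .cfg would then also need channels=125 (with width=13, height=13) so that the network input matches the fused 13x13x125 tensor.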
Hello @AlexeyAB
Thank you for the reply once again! I will try to sort the issues out, and I will mark this issue as closed.
Best regards, Francesc.
Hello Alex,
first of all, thank you for this repository, it works wonders so far!
I would like to know how to correctly delete a Detector instance when using YOLO as a DLL in my own VS project... What I have at the moment is:
Detector detector1("yolo-voc.cfg", "yolo-voc.weights");
Detector detector2("yolo-voc.cfg", "yolo-voc.weights");
...
some implementation to train an external Net
...
detector1.~Detector();
detector2.~Detector();
The problem is that when running I get an error saying "yolo_console_dll.exe has triggered a breakpoint", which points to yolo_v2_class.cpp line 78:
free(detector_gpu.avg);
Do you have any idea how to destroy a Detector object, or how it could be solved using the ~Detector() destructor?
thank you very much for your time,
best regards, Francesc.