Closed saihv closed 6 years ago
Do you want to process 2 or 4 images simultaneously in one large batch to increase speed? Or do you just want to process many images for convenience?
You can use this command to process many images listed in the `train.txt` file and save the results to `result.txt`:

```
darknet.exe detector test data/voc.data yolo-voc.cfg yolo-voc.weights -dont_show < data/train.txt > result.txt
```
@AlexeyAB A small number of images simultaneously, to increase overall detection speed. The actual dataset consists of hundreds of images, and I want to pass small sets of them to the GPU for detection. I am working on a Jetson TX2, so I might have to keep the number low, which is why I mentioned 2 or 4. But it could technically be any number.
Also, I am trying to do this within a Python wrapper for convenience (similar to the Python script available in the original Darknet repo), so instead of running `detector` from the command line, I am trying to interface directly with the internal functions. I guess then I should look into the corresponding code for `detector test`, which handles multiple images?
No, such a feature isn't implemented; you will have to implement it yourself. Something like this:
```c
list *plist = get_paths(train_images);
char **paths = (char **)list_to_array(plist);
int number_of_images = plist->size;
float *X = calloc(net.batch*net.h*net.w*3, sizeof(float));
int i, k;
for(i = 0; i < number_of_images; i += net.batch) {
    for(k = 0; k < net.batch; ++k) {
        image orig = load_image_color(paths[i + k], 0, 0);
        image sized = resize_image(orig, net.w, net.h);
        memcpy(X + k*net.h*net.w*3, sized.data, net.h*net.w*3*sizeof(float));
        free_image(sized);
        free_image(orig);
    }
    network_predict(net, X);
    layer l = net.layers[net.n-1];
    box *boxes = calloc(l.w*l.h*l.n, sizeof(box));
    float **probs = calloc(l.w*l.h*l.n, sizeof(float *));
    int j;
    for(j = 0; j < l.w*l.h*l.n; ++j) probs[j] = calloc(l.classes, sizeof(float));
    for(k = 0; k < net.batch; ++k) {
        get_region_boxes(l, 1, 1, thresh, probs, boxes, 0, 0);
        if (nms) do_nms_sort(boxes, probs, l.w*l.h*l.n, l.classes, nms);
        draw_detections(im, l.w*l.h*l.n, thresh, boxes, probs, names, alphabet, l.classes);
        l.output = l.output + net.h*net.w*3;
    }
}
```
Thanks a lot for the snippet! If I am reading it right, `X` contains data from the "batch" of images, so is `network_predict` already capable of handling a batch? Or would I need modifications on the CUDA side of things as well?
`network_predict` handles a batch of images. It in turn calls `network_predict_gpu` if you're using the GPU for detection. You can have a look at the implementation here: https://github.com/AlexeyAB/darknet/blob/3e5abe0680c6112c9674204c22db7bd4b238d2b5/src/network_kernels.cu#L43. You don't have to modify anything on the CUDA side.
> Thanks a lot for the snippet! If I am reading it right, `X` contains data from the "batch" of images, so is `network_predict` already capable of handling a batch?
Yes.
I tried implementing batch processing, mainly inspired by the code snippet above. I could only do it on pjreddie's version, because I needed Python interfacing, so I added some test code to `network.c`. Batch processing seems to execute fine, but I am having some trouble with the last step (retrieving the bounding boxes), so I was wondering if any of you might have suggestions.

My code looks like this (4 test images; `test` is an input matrix with the 4 three-channel images arranged as 4 rows; `net->batch` is currently set to 4, so the whole predict step executes in one shot):
```c
void network_detect_batch(network *net, matrix test, float thresh, float hier_thresh, float nms, box **boxes, float ***probs)
{
    int i, j, b;
    image im, imr;
    im.w = 640;
    im.h = 360;
    im.c = 3;
    int k = net->outputs;
    matrix pred = make_matrix(test.rows, k);
    float *X = calloc(net->batch*test.cols, sizeof(float));
    for(i = 0; i < test.rows; i += net->batch){
        for(b = 0; b < net->batch; ++b){
            if(i+b == test.rows) break;
            im.data = test.vals[i+b];
            rgbgr_image(im);
            imr = letterbox_image(im, net->w, net->h); // following the original pjreddie way of dealing with images
            memcpy(X+b*test.cols, imr.data, test.cols*sizeof(float));
            free_image(imr);
        }
        float *out = network_predict(net, X);
        layer l = net->layers[net->n-1];
        box *boxesBatch = calloc(l.w*l.h*l.n, sizeof(box));
        float **probsBatch = calloc(l.w*l.h*l.n, sizeof(float *));
        for(j = 0; j < l.w*l.h*l.n; ++j) probsBatch[j] = calloc(l.classes, sizeof(float));
        for(b = 0; b < net->batch; ++b){
            get_region_boxes(l, 1, 1, net->w, net->h, thresh, probsBatch, boxesBatch, 0, 0, 0, hier_thresh, 0);
            if (nms) do_nms_sort(boxesBatch, probsBatch, l.w*l.h*l.n, l.classes, nms);
            boxes[i+b] = boxesBatch;
            probs[i+b] = probsBatch;
            l.output = l.output + net->h*net->w*3;
        }
    }
    free(X);
}
```
The problem I am having is that the line `l.output = l.output + net->h*net->w*3;` doesn't seem to be working well. When I parse the results received through `**boxes` and `***probs`, they are all empty. If I remove that line, then all four classifications and bounding boxes correspond only to the first image. Essentially:

Test image classes: person, boat, car, truck
With the line `l.output = ...`: [], [], [], []
Without the line: person, person, person, person

So there seems to be a problem with 'stepping' between the multiple results coming from the batch. I realize this problem might be specific to the original repo, but any thoughts or suggestions on how to debug this would be very helpful, thanks!
This code should work with the final feature map rather than the input of the network, so use the line `l.output = l.output + l.h*l.w*l.n;` instead of `l.output = l.output + net->h*net->w*3;`:

```c
for(b = 0; b < net->batch; ++b){
    get_region_boxes(l, 1, 1, net->w, net->h, thresh, probsBatch, boxesBatch, 0, 0, 0, hier_thresh, 0);
    if (nms) do_nms_sort(boxesBatch, probsBatch, l.w*l.h*l.n, l.classes, nms);
    boxes[i+b] = boxesBatch;
    probs[i+b] = probsBatch;
    l.output = l.output + l.h*l.w*l.n;
}
```
Still not solved, unfortunately: it seems that adding that line actually causes the code to produce wrong predictions. Right now the output reads "person, wakeboard, wakeboard, wakeboard" (wakeboard is another of my classes, but none of these four images belong to it).
Update: I just tried a minimal test with this repo, where I read the images manually from their paths, concatenate them into `X`, and do a batch prediction (this way, I wanted to make sure nothing was wrong with how I was sending data from external code). I am seeing a similar problem where the first image is predicted correctly and none of the others are (with steps of both `l.h*l.w*l.n` and `net.h*net.w*3`).
Also look at this part of the code: you allocate `boxesBatch` and `probsBatch` with `calloc` only once, and then you copy the same pointers many times into the external arrays via `boxes[i+b] = boxesBatch;` and `probs[i+b] = probsBatch;`. Instead you should either deep-copy the contents, `memcpy(boxes[i+b], boxesBatch, l.w*l.h*l.n * sizeof(box));`, or hand off the pointer and allocate a fresh buffer for the next iteration: `boxes[i+b] = boxesBatch; boxesBatch = calloc(l.w*l.h*l.n, sizeof(box));`
```c
box *boxesBatch = calloc(l.w*l.h*l.n, sizeof(box));
float **probsBatch = calloc(l.w*l.h*l.n, sizeof(float *));
int j;
for(j = 0; j < l.w*l.h*l.n; ++j) probsBatch[j] = calloc(l.classes, sizeof(float));
for(b = 0; b < net->batch; ++b){
    get_region_boxes(l, 1, 1, net->w, net->h, thresh, probsBatch, boxesBatch, 0, 0, 0, hier_thresh, 0);
    if (nms) do_nms_sort(boxesBatch, probsBatch, l.w*l.h*l.n, l.classes, nms);
    boxes[i+b] = boxesBatch;
    probs[i+b] = probsBatch;
    l.output = l.output + net->h*net->w*3;
}
```
I see that now! Thanks for the catch; I will fix that.

In fact, in my new minimal test example, I removed all of those extra parts to avoid mistakes like the one above. This is what my code looks like now, focusing only on image reading, batch prediction, and result retrieval, with which I have similar problems. Please note that this code was written against this repo, not the original pjreddie one:
```c
network net = parse_network_cfg_custom("./tiny-yolo.cfg", 4); // batch size set to 4
float *X = calloc(4*net.h*net.w*3, sizeof(float));
int j;

image ime1 = load_image_color(path1, 0, 0);
image sized1 = resize_image(ime1, net.w, net.h);
image ime2 = load_image_color(path2, 0, 0);
image sized2 = resize_image(ime2, net.w, net.h);
image ime3 = load_image_color(path3, 0, 0);
image sized3 = resize_image(ime3, net.w, net.h);
image ime4 = load_image_color(path4, 0, 0);
image sized4 = resize_image(ime4, net.w, net.h);
memcpy(X+0*net.h*net.w*3, sized1.data, net.h*net.w*3*sizeof(float));
memcpy(X+1*net.h*net.w*3, sized2.data, net.h*net.w*3*sizeof(float));
memcpy(X+2*net.h*net.w*3, sized3.data, net.h*net.w*3*sizeof(float));
memcpy(X+3*net.h*net.w*3, sized4.data, net.h*net.w*3*sizeof(float));

printf("Starting prediction..\n");
time = clock();
network_predict(net, X);
printf("Predicted in %f seconds.\n", sec(clock()-time));

layer l = net.layers[net.n-1];
box *boxes1 = calloc(l.w*l.h*l.n, sizeof(box));
float **probs1 = calloc(l.w*l.h*l.n, sizeof(float *));
for(j = 0; j < l.w*l.h*l.n; ++j) probs1[j] = calloc(l.classes, sizeof(float));
get_region_boxes(l, 640, 360, thresh, probs1, boxes1, 0, 0);
if (nms) do_nms_sort(boxes1, probs1, l.w*l.h*l.n, l.classes, nms);
draw_detections(ime1, l.w*l.h*l.n, thresh, boxes1, probs1, names, alphabet, l.classes); // gives me the correct result
l.output = l.output + net.h*net.w*3;
printf("First result retrieved\n");

box *boxes2 = calloc(l.w*l.h*l.n, sizeof(box));
float **probs2 = calloc(l.w*l.h*l.n, sizeof(float *));
for(j = 0; j < l.w*l.h*l.n; ++j) probs2[j] = calloc(l.classes, sizeof(float));
get_region_boxes(l, 640, 360, thresh, probs2, boxes2, 0, 0);
if (nms) do_nms_sort(boxes2, probs2, l.w*l.h*l.n, l.classes, nms);
draw_detections(ime2, l.w*l.h*l.n, thresh, boxes2, probs2, names, alphabet, l.classes);
l.output = l.output + net.h*net.w*3;
printf("Second result retrieved\n");

box *boxes3 = calloc(l.w*l.h*l.n, sizeof(box));
float **probs3 = calloc(l.w*l.h*l.n, sizeof(float *));
for(j = 0; j < l.w*l.h*l.n; ++j) probs3[j] = calloc(l.classes, sizeof(float));
get_region_boxes(l, 640, 360, thresh, probs3, boxes3, 0, 0);
if (nms) do_nms_sort(boxes3, probs3, l.w*l.h*l.n, l.classes, nms);
draw_detections(ime3, l.w*l.h*l.n, thresh, boxes3, probs3, names, alphabet, l.classes);
l.output = l.output + net.h*net.w*3;
printf("Third result retrieved\n");

box *boxes4 = calloc(l.w*l.h*l.n, sizeof(box));
float **probs4 = calloc(l.w*l.h*l.n, sizeof(float *));
for(j = 0; j < l.w*l.h*l.n; ++j) probs4[j] = calloc(l.classes, sizeof(float));
get_region_boxes(l, 640, 360, thresh, probs4, boxes4, 0, 0);
if (nms) do_nms_sort(boxes4, probs4, l.w*l.h*l.n, l.classes, nms);
draw_detections(ime4, l.w*l.h*l.n, thresh, boxes4, probs4, names, alphabet, l.classes);
printf("Fourth result retrieved\n");
```
Update: I think I figured it out. The right step size between the per-image outputs seems to be `l.h*l.w*l.n*(l.classes + l.coords + 1)`, not `l.h*l.w*l.n` or `net.h*net.w*3`.
EDIT: Confirmed, it does work. Thanks for all the help! Closing the issue now.
@saihv Can you create a pull request to make batch processing for inference work out of the box? Thanks!
Hi @alexanderfrey, FYI, these modifications were made on the original Darknet (pjreddie) version, not this fork, because I needed a Python wrapper.

The main reason I have not submitted a PR yet is that the performance in my case did not improve by much. At least in the original version, every image in the batch goes through a preprocessing stage (RGB-to-BGR conversion and letterboxing to the network width and height). Because this preprocessing is not well optimized and runs on the CPU sequentially, image after image, it ended up being a big bottleneck even though the actual prediction happens for the whole batch. I hope to improve those functions too, once I get some time to look into them, and then I can wrap up my changes as a PR.
Hi, right now I want to create a batch inference process in C++. Does this repository support a C++ API for batch inference? Thank you very much in advance.

> right now i want to create batch inference process in c++. does this repository support c++ API to do batch inference?
See here for an example of how to do that in C++ using the DarkHelp wrapper: https://www.ccoderun.ca/darkhelp/api/API.html
My understanding is that Darknet also has a C++ API, but I've never used it; I only use the DarkHelp one.
I am using Darknet (and tiny YOLO) to perform object detection on an image dataset. I am currently performing inference sequentially on the test images using the GPU, but is it possible to process them in parallel, for example 2 or 4 images simultaneously? I have looked into `network.c`, where the function `network_predict_data_multi` seems to be able to handle a set of images, but still only processes them one after another in a loop. Is it possible to extend `network_predict_gpu()` (in `network_kernels.cu`) to handle multiple images in parallel?