AlexeyAB / darknet

YOLOv4 / Scaled-YOLOv4 / YOLO - Neural Networks for Object Detection (Windows and Linux version of Darknet)
http://pjreddie.com/darknet/

Is it possible to batch process images during inference? #483

Closed saihv closed 6 years ago

saihv commented 6 years ago

I am using Darknet (and tiny YOLO) to perform object detection on an image dataset. I am currently performing inference sequentially on the test images using the GPU, but is it possible to process them in parallel, for example 2 or 4 images simultaneously? I have looked into network.c, where the function network_predict_data_multi seems to be able to handle a set of images, but it still only processes them one after the other in a loop. Is it possible to extend network_predict_gpu() (in network_kernels.cu) to handle multiple images in parallel?

AlexeyAB commented 6 years ago

Do you want to process 2 or 4 images simultaneously in one large batch to increase speed? Or do you just want to process many images for convenience?

You can use this command to process many images from train.txt file and save results to the result.txt file:

darknet.exe detector test data/voc.data yolo-voc.cfg yolo-voc.weights -dont_show < data/train.txt > result.txt

saihv commented 6 years ago

@AlexeyAB Some small number of images simultaneously, to increase overall detection speed. The actual dataset consists of hundreds of images; I want to pass small sets of them to the GPU for detection. I am working on a Jetson TX2, so I might have to keep the number low, which is why I mentioned 2 or 4. But technically it could be any number.

Also, I am trying to do this within a Python wrapper for convenience (similar to the Python script available in the original Darknet repo), so instead of running detector from the command line, I am trying to directly interface with the internal functions. I guess then I should look into the corresponding code for detector test which handles multiple images?

AlexeyAB commented 6 years ago

No, such a feature isn't implemented, so you would have to implement it yourself.

Something like this:

list *plist = get_paths(train_images);
char **paths = (char **)list_to_array(plist);
int number_of_images = plist->size;

float *X = calloc(net.batch*net.h*net.w*3, sizeof(float));

int i, k;
for(i = 0; i < number_of_images; i += net.batch) {
    for(k = 0; k < net.batch; ++k) {
        image orig = load_image_color(paths[i + k], 0, 0);
        image sized = resize_image(orig, net.w, net.h);
        memcpy(X + k*net.h*net.w*3, sized.data, net.h*net.w*3*sizeof(float));
        free_image(sized);
        free_image(orig);
    }

    network_predict(net, X);

    layer l = net.layers[net.n-1];
    box *boxes = calloc(l.w*l.h*l.n, sizeof(box));
    float **probs = calloc(l.w*l.h*l.n, sizeof(float *));
    int j;
    for(j = 0; j < l.w*l.h*l.n; ++j) probs[j] = calloc(l.classes, sizeof(float));

    for(k = 0; k < net.batch; ++k) {
        get_region_boxes(l, 1, 1, thresh, probs, boxes, 0, 0);
        if (nms) do_nms_sort(boxes, probs, l.w*l.h*l.n, l.classes, nms);
        draw_detections(im, l.w*l.h*l.n, thresh, boxes, probs, names, alphabet, l.classes);
        l.output = l.output + net.h*net.w*3;
    }
}
saihv commented 6 years ago

Thanks a lot for the snippet! If I am reading it right, X contains data from the "batch" of images, so is network_predict already capable of handling a batch? Or would I need modifications on the CUDA side of things as well?

sivagnanamn commented 6 years ago

network_predict handles a batch of images. It in turn calls network_predict_gpu if you're using the GPU for detection. You can have a look at the implementation here: https://github.com/AlexeyAB/darknet/blob/3e5abe0680c6112c9674204c22db7bd4b238d2b5/src/network_kernels.cu#L43 . You don't have to modify anything on the CUDA side.

AlexeyAB commented 6 years ago

Thanks a lot for the snippet! If I am reading it right, X contains data from the "batch" of images, so is network_predict already capable of handling a batch?

Yes.

saihv commented 6 years ago

I tried implementing batch processing, mainly inspired by the code snippet above, but I could only do it on pjreddie's version because I needed Python interfacing, so I added some test code to network.c. Batch processing seems to execute fine, but I am having some trouble with the last step (retrieving the bounding boxes), so I was wondering if any of you might have suggestions.

My code looks like this (4 test images; test is an input 'matrix' that has the four 3-channel images arranged as 4 rows; net->batch is currently set to 4, so the whole predict step executes in one shot):

void network_detect_batch(network *net, matrix test, float thresh, float hier_thresh, float nms, box **boxes, float ***probs)
{
    int i,j,b;
    image im, imr;
    im.w = 640;
    im.h = 360;
    im.c = 3;
    int k = net->outputs;
    matrix pred = make_matrix(test.rows, k);
    float *X = calloc(net->batch*test.cols, sizeof(float));
    for(i = 0; i < test.rows; i += net->batch){
        for(b = 0; b < net->batch; ++b){
            if(i+b == test.rows) break;
            im.data = test.vals[i+b];
            rgbgr_image(im);
            imr = letterbox_image(im, net->w, net->h);  // following the original pjreddie way of dealing with images
            memcpy(X+b*test.cols, imr.data, test.cols*sizeof(float));
        }

        float *out = network_predict(net, X);
        layer l = net->layers[net->n-1];
        box *boxesBatch = calloc(l.w*l.h*l.n, sizeof(box));
        float **probsBatch = calloc(l.w*l.h*l.n, sizeof(float *));
        int j;
        for(j = 0; j < l.w*l.h*l.n; ++j) probsBatch[j] = calloc(l.classes, sizeof(float *));
        for(b = 0; b < net->batch; ++b){
            get_region_boxes(l, 1, 1, net->w, net->h, thresh, probsBatch, boxesBatch, 0, 0, 0, hier_thresh, 0);
            if (nms) do_nms_sort(boxesBatch, probsBatch, l.w*l.h*l.n, l.classes, nms);

            boxes[i+b] = boxesBatch;
            probs[i+b] = probsBatch;
            l.output = l.output + net->h*net->w*3;
        }   
    }
    free(X);
}

The problem I am having is that the line l.output = l.output + net->h*net->w*3 doesn't seem to be working well. When I parse the results returned through **boxes and ***probs, they are all empty. If I remove that line, then all four classifications and bounding boxes correspond only to the first image. Essentially:

Test image classes: person, boat, car, truck
With the line l.output = ...: [], [], [], []
Without the line: person, person, person, person

So there seems to be a problem with 'stepping' between the multiple results coming from the batch. I realize this problem might be specific to the original repo, but any thoughts or suggestions on how to debug this would be very helpful, thanks!

AlexeyAB commented 6 years ago

This code should step through the final feature map rather than the input of the network, so try this line l.output = l.output + l.h*l.w*l.n; instead of this line l.output = l.output + net->h*net->w*3;

        for(b = 0; b < net->batch; ++b){
            get_region_boxes(l, 1, 1, net->w, net->h, thresh, probsBatch, boxesBatch, 0, 0, 0, hier_thresh, 0);
            if (nms) do_nms_sort(boxesBatch, probsBatch, l.w*l.h*l.n, l.classes, nms);

            boxes[i+b] = boxesBatch;
            probs[i+b] = probsBatch;
            l.output = l.output + l.h*l.w*l.n;
        }   
saihv commented 6 years ago

Still not solved, unfortunately; it seems like adding that line actually causes the code to give wrong predictions for the images. Right now the output reads "person, wakeboard, wakeboard, wakeboard" (wakeboard is another of my classes, but none of these four images belong to it).

saihv commented 6 years ago

Update: I just tried a minimal test with this repo, where I read the images manually from their paths, concatenate them into X and do a batch prediction (this way, I wanted to make sure nothing was wrong with the way I was sending data from external code). I am seeing a similar problem where the first image is predicted correctly and none of the others are (with steps of both l.h*l.w*l.n and net.h*net.w*3).

AlexeyAB commented 6 years ago

Also look at this part of the code: you allocate box *boxesBatch = calloc(...) and float **probsBatch = calloc(...) only once, and then copy the same pointers many times into the external arrays via boxes[i+b] = boxesBatch; and probs[i+b] = probsBatch;.

But you should allocate fresh buffers for each image in the batch:

        for(b = 0; b < net->batch; ++b){
            box *boxesBatch = calloc(l.w*l.h*l.n, sizeof(box));
            float **probsBatch = calloc(l.w*l.h*l.n, sizeof(float *));
            int j;
            for(j = 0; j < l.w*l.h*l.n; ++j) probsBatch[j] = calloc(l.classes, sizeof(float));
            get_region_boxes(l, 1, 1, net->w, net->h, thresh, probsBatch, boxesBatch, 0, 0, 0, hier_thresh, 0);
            if (nms) do_nms_sort(boxesBatch, probsBatch, l.w*l.h*l.n, l.classes, nms);

            boxes[i+b] = boxesBatch;
            probs[i+b] = probsBatch;
            l.output = l.output + l.h*l.w*l.n;
        }
saihv commented 6 years ago

I see that now! Thanks for the catch, I will fix it.

In fact, in my new minimal test example, I removed all of those extra parts to avoid mistakes like the one above. This is what my code looks like now, focusing only on image reading, batch prediction and result retrieval, with which I see similar problems. Please note that this code was written against this repo, not the original pjreddie one:

        network net = parse_network_cfg_custom("./tiny-yolo.cfg", 4); // batch size as 4
        image ime1 = load_image_color(path1,0,0);
        image sized1 = resize_image(ime1, net.w, net.h);

        image ime2 = load_image_color(path2,0,0);
        image sized2 = resize_image(ime2, net.w, net.h);

        image ime3 = load_image_color(path3,0,0);
        image sized3 = resize_image(ime3, net.w, net.h);

        image ime4 = load_image_color(path4,0,0);
        image sized4 = resize_image(ime4, net.w, net.h);

        memcpy(X+0*net.h*net.w*3, sized1.data, net.h*net.w*3*sizeof(float));
        memcpy(X+1*net.h*net.w*3, sized2.data, net.h*net.w*3*sizeof(float));
        memcpy(X+2*net.h*net.w*3, sized3.data, net.h*net.w*3*sizeof(float));
        memcpy(X+3*net.h*net.w*3, sized4.data, net.h*net.w*3*sizeof(float));       

        printf("Starting prediction..");
        time=clock();
        network_predict(net, X);
        printf("Predicted in %f seconds.\n", sec(clock()-time));

        layer l = net.layers[net.n-1];
        box *boxes1 = calloc(l.w*l.h*l.n, sizeof(box));
        float **probs1 = calloc(l.w*l.h*l.n, sizeof(float *));
        for(j = 0; j < l.w*l.h*l.n; ++j) probs1[j] = calloc(l.classes, sizeof(float *));        
        get_region_boxes(l, 640, 360, thresh, probs1, boxes1, 0, 0);
        if (nms) do_nms_sort(boxes1, probs1, l.w*l.h*l.n, l.classes, nms);
        draw_detections(ime1, l.w*l.h*l.n, thresh, boxes1, probs1, names, alphabet, l.classes);  // gives me the correct result
        l.output = l.output + net.h*net.w*3;
        printf("First result retrieved");               

        box *boxes2 = calloc(l.w*l.h*l.n, sizeof(box));
        float **probs2 = calloc(l.w*l.h*l.n, sizeof(float *));
        for(j = 0; j < l.w*l.h*l.n; ++j) probs2[j] = calloc(l.classes, sizeof(float *));
        get_region_boxes(l, 640, 360, thresh, probs2, boxes2, 0, 0);
        if (nms) do_nms_sort(boxes2, probs2, l.w*l.h*l.n, l.classes, nms);
        draw_detections(ime2, l.w*l.h*l.n, thresh, boxes2, probs2, names, alphabet, l.classes);
        l.output = l.output + net.h*net.w*3;
        printf("Second result retrieved");

        box *boxes3 = calloc(l.w*l.h*l.n, sizeof(box));
        float **probs3 = calloc(l.w*l.h*l.n, sizeof(float *));
        for(j = 0; j < l.w*l.h*l.n; ++j) probs3[j] = calloc(l.classes, sizeof(float *));
        get_region_boxes(l, 640, 360, thresh, probs3, boxes3, 0, 0);
        if (nms) do_nms_sort(boxes3, probs3, l.w*l.h*l.n, l.classes, nms);
        draw_detections(ime3, l.w*l.h*l.n, thresh, boxes3, probs3, names, alphabet, l.classes);
        l.output = l.output + net.h*net.w*3;
        printf("Third result retrieved");

        box *boxes4 = calloc(l.w*l.h*l.n, sizeof(box));
        float **probs4 = calloc(l.w*l.h*l.n, sizeof(float *));
        for(j = 0; j < l.w*l.h*l.n; ++j) probs4[j] = calloc(l.classes, sizeof(float *));
        get_region_boxes(l, 640, 360, thresh, probs4, boxes4, 0, 0);
        if (nms) do_nms_sort(boxes4, probs4, l.w*l.h*l.n, l.classes, nms);
        draw_detections(ime4, l.w*l.h*l.n, thresh, boxes4, probs4, names, alphabet, l.classes);
        printf("Fourth result retrieved");
saihv commented 6 years ago

Update: I think I figured it out. The right step size between the per-image outputs seems to be l.h*l.w*l.n*(l.classes + l.coords + 1), not l.h*l.w*l.n or net.h*net.w*3.

EDIT: Confirmed, it does work. Thanks for all the help! Closing the issue now.

alexanderfrey commented 6 years ago

@saihv Can you create a pull request to make batch processing for inference work out of the box ? Thanks !

saihv commented 6 years ago

Hi @alexanderfrey, FYI, these modifications were made on the original Darknet (pjreddie) version, not this fork, because I needed a Python wrapper.

The reason I have not submitted a PR yet is that the performance in my case did not improve by much. At least in the original version, every image in the batch goes through a preprocessing stage (RGB-to-BGR conversion and letterboxing to the network width and height). Because that stage is not very optimized and runs on the CPU sequentially, image after image, it ends up being a big bottleneck even though the actual prediction happens for the whole batch. I am hoping to improve those functions too once I get some time to look into them, and then I can wrap up my changes as a PR.

albertchristianto commented 4 years ago

Hi, I want to create a batch inference process in C++. Does this repository support a C++ API for batch inference? Thank you very much in advance.

stephanecharette commented 4 years ago

I want to create a batch inference process in C++. Does this repository support a C++ API for batch inference?

See here for an example on how to do that in C++ using the DarkHelp wrapper: https://www.ccoderun.ca/darkhelp/api/API.html

My understanding is that Darknet also has a C++ API, but I've never used it; I only use the DarkHelp one.