ceccocats / tkDNN

Deep neural network library and toolkit to do high performance inference on NVIDIA Jetson platforms
GNU General Public License v2.0
717 stars 209 forks

How to run inference of multiple rt files #191

Closed mprabhakarece closed 2 years ago

mprabhakarece commented 3 years ago

Hi,

I have two weight files and have converted them into rt files. I am now able to run inference on the converted rt files separately, but I want to run both rt files in a single application. Kindly let me know if this is possible and what the procedure is.

Thank you in advance.

Regards, Prabhakar M

mprabhakarece commented 3 years ago

When I try to do inference by creating one more DNN object (tk::dnn::DetectionNN *detNN1;) and passing my rt file to it, without changing any other components, I get a segmentation fault.

mive93 commented 3 years ago

Hi @mprabhakarece,

I just quickly tried that, and I don't get any error. This is my (crappy) way of modifying the demo to test it:

    tk::dnn::Yolo3Detection yolo2;
    tk::dnn::DetectionNN *detNN2;  
    std::string net2 = "yolo4_fp32.rt";

    switch(ntype)
    {
        case 'y':
            detNN = &yolo;
            detNN2 = &yolo2;
            break;
        case 'c':
            detNN = &cnet;
            break;
        case 'm':
            detNN = &mbnet;
            n_classes++;
            break;
        default:
            FatalError("Network type not allowed (3rd parameter)\n");
    }

    detNN->init(net, n_classes, n_batch, conf_thresh);
    detNN2->init(net2, n_classes, n_batch, conf_thresh);
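
For completeness, here is a minimal sketch of how the per-frame loop could then drive both engines. The update()/draw() calls below follow the pattern of the demo and are an assumption rather than a verified API; n_batch is taken to be 1, and cap is an already opened cv::VideoCapture:

    // Hedged sketch, not verified code: run both networks on the same frames,
    // one after the other. Assumes the demo-style interface where update()
    // takes a batch of cv::Mat and draw() overlays the last detections.
    cv::Mat frame;
    std::vector<cv::Mat> batch_frames, batch_dnn_input;
    while (cap.read(frame)) {
        batch_frames    = { frame };
        batch_dnn_input = { frame.clone() };

        detNN->update(batch_dnn_input, n_batch);   // first engine (net)
        detNN2->update(batch_dnn_input, n_batch);  // second engine (net2)

        detNN->draw(batch_frames);                 // draw both sets of detections
        detNN2->draw(batch_frames);
        cv::imshow("detections", batch_frames[0]);
        cv::waitKey(1);
    }
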
peterlee909 commented 3 years ago

Hi @mive93

I ran two models in two different terminals with: ./demo yolo4_fp32.rt ../demo/yolo_test.mp4 y

I found that the FPS dropped. Is that expected? Is there anything I could do to keep the FPS from dropping? I think running two models in one process is the same as running them separately in two processes, am I right?

mive93 commented 3 years ago

Hi @peterlee909, yeah, it's totally normal to see some FPS degradation: you're doing two inferences instead of only one. And yes, it's almost the same as running two separate processes. It depends on your application of course, but the only thing you can do to run more inferences at the same time without doubling the latency is to exploit batching, and that means you have to use the same model.
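
As a rough illustration of the batching route (same model, one engine built with a batch size greater than 1), frames are accumulated until a full batch can be processed in a single forward pass. The update()/draw() interface below again mirrors the demo and is an assumption, not verified code:

    // Hedged sketch, not verified code: batched inference with a single engine.
    // The rt file must have been exported with the same maximum batch size.
    const int n_batch = 4;  // batch size used when creating the engine (assumption)
    detNN->init("yolo4_fp32.rt", n_classes, n_batch, conf_thresh);

    cv::Mat frame;
    std::vector<cv::Mat> batch_frames, batch_dnn_input;
    while (cap.read(frame)) {                    // cap: an already opened cv::VideoCapture
        batch_frames.push_back(frame.clone());
        batch_dnn_input.push_back(frame.clone());
        if ((int)batch_dnn_input.size() < n_batch)
            continue;                            // keep collecting frames

        // One forward pass handles all n_batch frames, so the added latency is
        // much smaller than running n_batch independent inferences.
        detNN->update(batch_dnn_input, n_batch);
        detNN->draw(batch_frames);

        batch_frames.clear();
        batch_dnn_input.clear();
    }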

peterlee909 commented 3 years ago

@mive93 Thank you so much! According to NVIDIA's documentation, it's because only one GPU context can be active at a time. Well, that's a little embarrassing.

mive93 commented 2 years ago

Closing for inactivity. Feel free to reopen.