ceccocats / tkDNN

Deep neural network library and toolkit to do high performance inference on NVIDIA Jetson platforms
GNU General Public License v2.0

test yolo4 in Jetson Nano - Building tensorRT cuda engine .. Killed! #26

Closed ghimiredhikura closed 4 years ago

ghimiredhikura commented 4 years ago

Hi,

Congratulations on your great work.

I am testing tkDNN on my Jetson Nano. Everything was fine, but when I tried to test yolo3/yolo4 in FP16 mode, the process terminated with the message Killed while building the TensorRT CUDA engine (after waiting around 40 min).

./test_yolo4 

[screenshot of the error]

Any help, please?

Best, Deepak

mive93 commented 4 years ago

Hi @ghimiredhikura, It's a memory error. We have never tested on the Jetson Nano, but it is possible that it does not have enough memory. Try to create the .rt by running the test without any other memory-demanding processes.

ghimiredhikura commented 4 years ago

Hi @mive93,

Thanks for the quick response. I was only running the test, but it still gets Killed, and at the moment the Jetson Nano is running nothing but tkDNN. If possible, could you provide me with prebuilt yolo3_fp16.rt and yolo4_fp16.rt files? I would like to benchmark those models on the Jetson Nano.

Thanks.

ceccocats commented 4 years ago

TensorRT files are not portable. To build yolo4 you will need roughly 3.2 GB of GPU memory and 1.8 GB of host memory, which even with swap is too much for the Jetson Nano. tkDNN is not perfectly optimized in terms of memory during the building of the network, since it allocates the memory for inference in both cuDNN and TensorRT, and right now the cuDNN part is mandatory to build the RT file. Anyway, if you pull the latest commit, you can use a little trick to deallocate most of the memory not used by TensorRT while building the model:

    for(int i=0; i<net->num_layers; i++) {
        if(net->layers[i]->getLayerType() == tk::dnn::LAYER_CONV2D) {
            tk::dnn::Conv2d *c = (tk::dnn::Conv2d*) net->layers[i];
            c->releaseDevice();           // free the layer's cuDNN device buffers
            c->releaseHost(true, false);  // free the host-side copies of the weights
        }
        if(net->layers[i]->dstData != nullptr) {
            cudaFree(net->layers[i]->dstData);  // free the layer's output buffer
            net->layers[i]->dstData = nullptr;
        }
    }

Do this after the tk::dnn::Network creation and before the tk::dnn::NetworkRT creation. It will create the RT file and then segfault during inference; comment the code out and re-execute to run inference loading the created file. With this trick you will use around 2.2 GB of GPU and 1.5 GB of host memory. Be sure to have as much memory as needed: close all useless programs and IDEs, disable graphics...
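
For reference, a minimal sketch of where the loop fits, mirroring the structure of the existing tests (the dim variable and the .rt filename are illustrative placeholders, not exact test code):

    // Sketch: ordering of the trick relative to network creation.
    // "dim" and "yolo4_fp16.rt" are illustrative placeholders.
    tk::dnn::Network *net = new tk::dnn::Network(dim);  // 1. build the cuDNN network
    // ... create the layers and load the weights, as the test already does ...

    // 2. run the deallocation loop shown above here

    // 3. only then build the TensorRT engine
    tk::dnn::NetworkRT netRT(net, "yolo4_fp16.rt");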

ceccocats commented 4 years ago

Also try commenting out this block in Conv2d.cpp during the creation of the TensorRT file:

    // cuDNN workspace allocation inside Conv2d
    if (ws_sizeInBytes!=0) {
        checkCuda( cudaMalloc(&workSpace, ws_sizeInBytes) );
    }
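
One way to toggle this without repeatedly editing the file is a compile-time guard; a sketch, where TKDNN_LOW_MEM_BUILD is a hypothetical macro (not an existing tkDNN option) that you would define only when building the .rt file on a memory-constrained board:

    // Hypothetical guard: define TKDNN_LOW_MEM_BUILD (e.g. -DTKDNN_LOW_MEM_BUILD)
    // to skip the cuDNN workspace allocation while creating the TensorRT file.
    #ifndef TKDNN_LOW_MEM_BUILD
        if (ws_sizeInBytes!=0) {
            checkCuda( cudaMalloc(&workSpace, ws_sizeInBytes) );
        }
    #endif
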
ghimiredhikura commented 4 years ago

@ceccocats,

Thank you so much for saving me :). It works perfectly for FP32, FP16, and INT8 in the case of yolo4, while FP32 and FP16 work in the case of yolo4_berkeley.

But when I use INT8 mode for yolo4_berkeley, it produces the following error. I think you may want to check this.

[screenshot of the error]

mive93 commented 4 years ago

Hi @ghimiredhikura, I think you're missing some steps here. It's our fault, because the readme wasn't very clear (I have updated it): https://github.com/ceccocats/tkDNN/blob/master/README.md#int8-inference

Please check now whether you are doing everything correctly. I have tested it on the Xavier and it works fine.

AleBasso80 commented 4 years ago

Hi @ghimiredhikura, you can also allocate a large swap area (not zram, but a normal swap partition).

ghimiredhikura commented 4 years ago

Hi,

@mive93 @AleBasso80, thanks for helping me.

It works perfectly. However, while converting to INT8, I did not make the following change:

Also try commenting out this block in Conv2d.cpp during the creation of the TensorRT file:

    if (ws_sizeInBytes!=0) {
        checkCuda( cudaMalloc(&workSpace, ws_sizeInBytes) );
    }

Just follow the steps in @ceccocats's earlier comment above (the memory-deallocation trick) and it will work.

thancaocuong commented 4 years ago

@ceccocats I added the following lines into yolov4.cpp before the tk::dnn::NetworkRT creation, but got this error: Null pointer. Please tell me what I did wrong.

    for(int i=0; i<net->num_layers; i++) {
        if(net->layers[i]->getLayerType() == tk::dnn::LAYER_CONV2D) {
            tk::dnn::Conv2d *c = (tk::dnn::Conv2d*) net->layers[i];
            c->releaseDevice();
            c->releaseHost(true, false);
        }
        if(net->layers[i]->dstData != nullptr) {
            cudaFree(net->layers[i]->dstData);
            net->layers[i]->dstData = nullptr;
        }
    }
ghimiredhikura commented 4 years ago

@thancaocuong, try without commenting out those lines in src/Conv2d.cpp:

    if (ws_sizeInBytes!=0) {
        checkCuda( cudaMalloc(&workSpace, ws_sizeInBytes) );
    }
ceccocats commented 4 years ago

Hi, sorry I missed this question... If you are testing FP32, comment out this line: c->releaseHost(true, false);
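
Put differently, a hedged variant of the loop that keeps the host weights when building an FP32 engine, since TensorRT still needs to read them in that mode (the buildFP16 flag is illustrative, not a tkDNN variable):

    // Variant of the deallocation loop: set buildFP16 = false for FP32,
    // otherwise TensorRT reads freed weights and fails with "Null pointer".
    bool buildFP16 = true;  // illustrative flag, not part of tkDNN
    for(int i=0; i<net->num_layers; i++) {
        if(net->layers[i]->getLayerType() == tk::dnn::LAYER_CONV2D) {
            tk::dnn::Conv2d *c = (tk::dnn::Conv2d*) net->layers[i];
            c->releaseDevice();
            if(buildFP16)
                c->releaseHost(true, false);  // skip when testing FP32
        }
        if(net->layers[i]->dstData != nullptr) {
            cudaFree(net->layers[i]->dstData);
            net->layers[i]->dstData = nullptr;
        }
    }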

thancaocuong commented 4 years ago

Thank you @ceccocats @ghimiredhikura. I will try it.

ghimiredhikura commented 4 years ago

Hi @ceccocats, I am now testing a custom yolo4 model with 512x512 input. The issue I'm facing (again on the Jetson Nano) is that while creating the TensorRT file it says Killed in the case of INT8; FP32 and FP16 are fine. I tried all the tricks discussed above, and nothing worked! Is there any other way to free GPU memory? Thanks.

...
264 ActivationLeaky   16 x   16,  512  ->   16 x   16,  512
265 Conv2d            16 x   16,  512  ->   16 x   16, 1024
266 ActivationLeaky   16 x   16, 1024  ->   16 x   16, 1024
267 Conv2d            16 x   16, 1024  ->   16 x   16,   48
268 Yolo              16 x   16,   48  ->   16 x   16,   48
===========================================================

GPU free memory: 504.472 mb.
New NetworkRT (TensorRT v6.01)
Float16 support: 1
Int8 support: 0
DLAs: 0
Selected maxBatchSize: 1
GPU free memory: 443.138 mb.
Building tensorRT cuda engine...
Killed
ceccocats commented 4 years ago

Actually you can't run INT8 on the Jetson Nano; the hardware doesn't permit it, as reported when NetworkRT starts:

New NetworkRT (TensorRT v6.01)
Float16 support: 1
Int8 support: 0
DLAs: 0
ghimiredhikura commented 4 years ago

Hi @ceccocats,

Thank you for the feedback. Yes, I was also wondering why the FPS using INT8 and FP32 were the same :). So one more question: on the Jetson Nano, is this a hardware limitation, or are the functions to support INT8 simply not implemented yet?

ceccocats commented 4 years ago

The int8 inference is not supported by the hardware of Jetson Nano.
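
For reference, this can be checked independently of tkDNN with a minimal TensorRT program; a sketch using the TensorRT 6/7 API that appears in the logs above:

    // Minimal sketch: query FP16/INT8 support directly from TensorRT 6/7.
    // On a Jetson Nano this prints "Int8 support: 0", matching tkDNN's log.
    #include <iostream>
    #include <NvInfer.h>

    class Logger : public nvinfer1::ILogger {
        void log(Severity severity, const char* msg) override {
            if(severity <= Severity::kWARNING) std::cout << msg << std::endl;
        }
    } gLogger;

    int main() {
        nvinfer1::IBuilder* builder = nvinfer1::createInferBuilder(gLogger);
        std::cout << "Float16 support: " << builder->platformHasFastFp16() << std::endl;
        std::cout << "Int8 support: " << builder->platformHasFastInt8() << std::endl;
        builder->destroy();  // destroy() is fine on TensorRT 6/7 (deprecated in 8)
        return 0;
    }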

ghimiredhikura commented 4 years ago

Thanks ^^. Closing.

mochechan commented 4 years ago

    for(int i=0; i<net->num_layers; i++) {
        if(net->layers[i]->getLayerType() == tk::dnn::LAYER_CONV2D) {
            tk::dnn::Conv2d *c = (tk::dnn::Conv2d*) net->layers[i];
            c->releaseDevice();
            c->releaseHost(true, false);
        }
        if(net->layers[i]->dstData != nullptr) {
            cudaFree(net->layers[i]->dstData);
            net->layers[i]->dstData = nullptr;
        }
    }

    if (ws_sizeInBytes!=0) {
        checkCuda( cudaMalloc(&workSpace, ws_sizeInBytes) );
    }

The following error occurs when using the previous two modifications. How can I solve this problem? Thank you.

GPU free memory: 570.851 mb.
New NetworkRT (TensorRT v7.13)
Float16 support: 1
Int8 support: 0
DLAs: 0
TENSORRT LOG: Parameter check failed at: ../builder/Network.cpp::addConvolutionNd::718, condition: kernelWeights.values != nullptr
Null pointer
/home/a/tkDNN/src/NetworkRT.cpp:317
Aborting...