Closed ghimiredhikura closed 4 years ago
Hi @ghimiredhikura, it's a memory error. We have never tested on the Jetson Nano, but it may simply not have enough memory. Try creating the .rt by running the test without any other memory-demanding processes.
Hi @mive93,
Thanks for the quick response. I was only running the test, but it still gets Killed. At the moment I only have the Jetson Nano running tkDNN, so if possible could you provide prebuilt yolo3_fp16.rt and yolo4_fp16.rt files? I would like to benchmark those models on the Jetson Nano.
Thanks.
TensorRT files are not portable. To build yolo4 you will need roughly 3.2 GB of GPU memory and 1.8 GB of host memory, which even with swap is too much for the Jetson Nano. tkDNN is not perfectly optimized in terms of memory while building the network, since it allocates the inference memory for both cuDNN and TensorRT. Right now the cuDNN part is mandatory to build the RT. However, if you pull the latest commit, you can use a little trick to deallocate most of the memory TensorRT does not need to build the model:
for(int i=0; i<net->num_layers; i++) {
    // Convolutional layers: release the weight copies kept for the cuDNN path
    if(net->layers[i]->getLayerType() == tk::dnn::LAYER_CONV2D) {
        tk::dnn::Conv2d *c = (tk::dnn::Conv2d*) net->layers[i];
        c->releaseDevice();
        c->releaseHost(true, false);
    }
    // All layers: free the inference output buffer if it was allocated
    if(net->layers[i]->dstData != nullptr) {
        cudaFree(net->layers[i]->dstData);
        net->layers[i]->dstData = nullptr;
    }
}
Do this after the tk::dnn::Network creation and before the tk::dnn::NetworkRT creation. It will create the RT file and then segfault during inference. Comment out the code and re-execute to run inference loading the created file. With this trick you will use around 2.2 GB of GPU and 1.5 GB of host memory. Be sure to have as much memory as needed: close all unneeded programs and IDEs, disable graphics, and so on.
Also try commenting out this block in Conv2d.cpp while creating the TensorRT file:
if (ws_sizeInBytes!=0) {
checkCuda( cudaMalloc(&workSpace, ws_sizeInBytes) );
}
@ceccocats,
Thank you so much for saving me :). It works perfectly for FP32, FP16 and INT8 with yolo4, and FP32 and FP16 work with yolo4_berkeley.
But when I use INT8 mode for yolo4_berkeley it produces the following error. I think you may want to check this.
Hi @ghimiredhikura, I think you're missing some steps here. It's our fault because the README wasn't very clear (I have updated it). https://github.com/ceccocats/tkDNN/blob/master/README.md#int8-inference
Please check now whether you are doing everything correctly. I have tested it on the Xavier and it works fine.
Hi @ghimiredhikura, you can also allocate a large swap area (not zram, but a normal swap partition).
Hi,
@mive93 @AleBasso80, thanks for helping me.
It works perfectly. However, while converting to INT8 I did not make the following change:
Also try to comment out this in Conv2d.cpp during the creation of tensorRT file
if (ws_sizeInBytes!=0) {
    checkCuda( cudaMalloc(&workSpace, ws_sizeInBytes) );
}
Just follow the steps @ceccocats described above and it will work.
@ceccocats I added the following lines into yolov4.cpp before the tk::dnn::NetworkRT creation, but got this error: Null pointer. Please tell me what I did wrong.
for(int i=0; i<net->num_layers; i++) {
    if(net->layers[i]->getLayerType() == tk::dnn::LAYER_CONV2D) {
        tk::dnn::Conv2d *c = (tk::dnn::Conv2d*) net->layers[i];
        c->releaseDevice();
        c->releaseHost(true, false);
    }
    if(net->layers[i]->dstData != nullptr) {
        cudaFree(net->layers[i]->dstData);
        net->layers[i]->dstData = nullptr;
    }
}
Try without commenting out these lines in src/Conv2d.cpp:
if (ws_sizeInBytes!=0) {
checkCuda( cudaMalloc(&workSpace, ws_sizeInBytes) );
}
Hi, sorry I missed this question...
If you are testing FP32, comment out this line:
c->releaseHost(true, false);
thank you @ceccocats @ghimiredhikura . I will try it.
Hi @ceccocats, I am now testing a custom yolo4 model with 512x512 input. The issue I'm facing (again on the Jetson Nano) is that when creating the TensorRT file in INT8 mode it says Killed; FP32 and FP16 are fine. I tried all the tricks discussed above, but nothing worked! Is there any other way to free GPU memory? Thanks.
...
264 ActivationLeaky 16 x 16, 512 -> 16 x 16, 512
265 Conv2d 16 x 16, 512 -> 16 x 16, 1024
266 ActivationLeaky 16 x 16, 1024 -> 16 x 16, 1024
267 Conv2d 16 x 16, 1024 -> 16 x 16, 48
268 Yolo 16 x 16, 48 -> 16 x 16, 48
===========================================================
GPU free memory: 504.472 mb.
New NetworkRT (TensorRT v6.01)
Float16 support: 1
Int8 support: 0
DLAs: 0
Selected maxBatchSize: 1
GPU free memory: 443.138 mb.
Building tensorRT cuda engine...
Killed
Actually you can't run INT8 on the Jetson Nano; the hardware doesn't support it, as reported when NetworkRT starts:
New NetworkRT (TensorRT v6.01)
Float16 support: 1
Int8 support: 0
DLAs: 0
Hi @ceccocats,
Thank you for the feedback. Yes, I was also wondering why the FPS using INT8 and FP32 was the same :). One more question: on the Jetson Nano, is this a hardware limitation, or are the functions to support INT8 just not implemented yet?
INT8 inference is not supported by the Jetson Nano hardware.
Thanks ^^. Closing.
The following error occurs when using the previous two modifications. How can I solve this problem? Thank you.
GPU free memory: 570.851 mb.
New NetworkRT (TensorRT v7.13)
Float16 support: 1
Int8 support: 0
DLAs: 0
TENSORRT LOG: Parameter check failed at: ../builder/Network.cpp::addConvolutionNd::718, condition: kernelWeights.values != nullptr
Null pointer /home/a/tkDNN/src/NetworkRT.cpp:317
Aborting...
Hi,
Congratulations on your great work.
I am testing tkDNN on my Jetson Nano. Everything was fine, but when I tried to test yolo3/yolo4 in FP16 mode, while building the TensorRT cuda engine (after waiting around 40 min) it terminates with the message Killed. Any help please?
Best, Deepak