PaddlePaddle / Paddle

PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (the PaddlePaddle core framework: high-performance single-machine and distributed training and cross-platform deployment for deep learning and machine learning)
http://www.paddlepaddle.org/
Apache License 2.0

TensorRT 8 inference error on Jetson NX #34765

Closed. jedibobo closed this issue 1 year ago.

jedibobo commented 3 years ago

- Inference information
1) C++ inference: a Jetson inference library I compiled myself; I modified the files under PaddleDetection's deploy/cpp.
2) Full CMake command including paths: the following is the command used to build this project; alternatively see the GitHub repo I uploaded myself, where only the build script (selfbuild.sh) was changed.

Whether to use the GPU (i.e., whether to use CUDA)

WITH_GPU=ON

Whether to use MKL or OpenBLAS; this must be set to OFF on TX2

WITH_MKL=OFF

Whether to integrate TensorRT (only effective when WITH_GPU=ON)

WITH_TENSORRT=ON

Name of the Paddle inference library; since the lib name differs across platforms and versions, check the lib name under the paddle_inference/lib/ folder of the downloaded inference library

PADDLE_LIB_NAME=libpaddle_inference

TensorRT include path

TENSORRT_INC_DIR=/usr/include/aarch64-linux-gnu

TensorRT lib path

TENSORRT_LIB_DIR=/usr/lib/aarch64-linux-gnu

Paddle inference library path

PADDLE_DIR=/home/lyb/build_github/Paddle/TRT8-develop-build_cuda/paddle_inference_install_dir

CUDA lib path

CUDA_LIB=/usr/local/cuda/lib64

cuDNN lib path

CUDNN_LIB=/usr/lib/aarch64-linux-gnu

rm -rf build
mkdir -p build
cd build
cmake .. \
  -DWITH_GPU=${WITH_GPU} \
  -DWITH_MKL=${WITH_MKL} \
  -DWITH_TENSORRT=${WITH_TENSORRT} \
  -DTENSORRT_LIB_DIR=${TENSORRT_LIB_DIR} \
  -DTENSORRT_INC_DIR=${TENSORRT_INC_DIR} \
  -DPADDLE_DIR=${PADDLE_DIR} \
  -DWITH_STATIC_LIB=${WITH_STATIC_LIB} \
  -DCUDA_LIB=${CUDA_LIB} \
  -DCUDNN_LIB=${CUDNN_LIB} \
  -DPADDLE_LIB_NAME=${PADDLE_LIB_NAME}
make -j6
echo "make finished!"

3) API information (if APIs are called, please provide them)
4) Source of the inference library: compiled by myself
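Since everything here hinges on TensorRT 8 actually being the library found under TENSORRT_INC_DIR / TENSORRT_LIB_DIR, a quick sanity check is to print the version macros from NvInferVersion.h. A minimal sketch (the file name trt_version.cc and the g++ invocation are illustrative, not part of the original build script):

```cpp
// trt_version.cc -- print the TensorRT version the build actually sees.
// Example build, using the include path from the script above:
//   g++ trt_version.cc -I/usr/include/aarch64-linux-gnu -o trt_version
#include <NvInferVersion.h>
#include <cstdio>

int main() {
  std::printf("TensorRT %d.%d.%d.%d\n",
              NV_TENSORRT_MAJOR, NV_TENSORRT_MINOR,
              NV_TENSORRT_PATCH, NV_TENSORRT_BUILD);
  return 0;
}
```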

./build/main --model_dir=./ppyolov2_r50vd_dcn_365e_coco --image_file=/home/lyb/code/Paddle-TRT-TEST/imgs/1a.jpg --run_mode=trt_fp16 --use_gpu=1 --run_benchmark
WARNING: Logging before InitGoogleLogging() is written to STDERR
E0810 16:18:02.041721 16175 helper.h:95] Tactic Device request: 284MB Available: 134MB. Device memory is insufficient to use tactic.
E0810 16:18:02.147250 16175 helper.h:95] Tactic Device request: 284MB Available: 134MB. Device memory is insufficient to use tactic.
E0810 16:18:02.680083 16175 helper.h:95] Tactic Device request: 280MB Available: 133MB. Device memory is insufficient to use tactic.
E0810 16:18:02.722748 16175 helper.h:95] Tactic Device request: 280MB Available: 134MB. Device memory is insufficient to use tactic.
terminate called after throwing an instance of 'std::length_error'
  what():  vector::_M_default_append
fish: "./build/main --model_dir=./ppyo…" terminated by signal SIGABRT (Abort)

paddle-bot-old[bot] commented 3 years ago

Hi! We've received your issue; please be patient while waiting for a response. We will arrange for technicians to answer your questions as soon as possible. Please double-check that you have provided a clear problem description, reproduction code, environment & version, and error messages. You may also look through the official API docs, the FAQ, historical GitHub issues, and the AI community for an answer. Have a nice day!

fengxiaoshuai commented 3 years ago

Try this build command:
cmake .. -DWITH_CONTRIB=OFF -DWITH_MKL=OFF -DWITH_MKLDNN=OFF -DWITH_TESTING=OFF -DCMAKE_BUILD_TYPE=Release -DON_INFER=ON -DWITH_PYTHON=ON -DPY_VERSION=3.6 -DWITH_XBYAK=OFF -DWITH_NV_JETSON=ON -DWITH_TENSORRT=ON -DTENSORRT_ROOT=/usr -DCMAKE_CUDA_COMPILER=/usr/local/cuda-10.2/bin/nvcc -DWITH_NCCL=OFF -DCUDA_ARCH_NAME=All -DCMAKE_CXX_FLAGS="-Wno-error -w"

jedibobo commented 3 years ago

Hi, my build command differs from yours only in the last option. I have built Paddle successfully on many platforms before, so I don't think the problem comes from the build itself. Building on Jetson is quite time-consuming, so I'll give it a try and may be a bit slow to reply.

jedibobo commented 3 years ago

Try this build command:
cmake .. -DWITH_CONTRIB=OFF -DWITH_MKL=OFF -DWITH_MKLDNN=OFF -DWITH_TESTING=OFF -DCMAKE_BUILD_TYPE=Release -DON_INFER=ON -DWITH_PYTHON=ON -DPY_VERSION=3.6 -DWITH_XBYAK=OFF -DWITH_NV_JETSON=ON -DWITH_TENSORRT=ON -DTENSORRT_ROOT=/usr -DCMAKE_CUDA_COMPILER=/usr/local/cuda-10.2/bin/nvcc -DWITH_NCCL=OFF -DCUDA_ARCH_NAME=All -DCMAKE_CXX_FLAGS="-Wno-error -w"

Hi, I finished this build this morning and tried it; the error is still the same as before. I'd appreciate help pinpointing the problem. I suspect it is an issue with Paddle-TRT on TensorRT 8. Thanks.

fengxiaoshuai commented 3 years ago

Try adjusting the TensorRT workspace memory setting larger or smaller and see what errors come up. At the same time, try dropping the GPU memory pool configuration, i.e. config.enable_use_gpu(0), and turn on the memory optimization setting (see the sketch after this comment). If the library itself is fine, then it is simply that RAM and GPU memory on Jetson are stretched very thin, and all you can do is shuffle GPU memory around and see whether that solves it.
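A minimal sketch of these settings, assuming the paddle_infer C++ Config API; the model paths and the workspace value are placeholders for illustration, not values from this thread:

```cpp
#include <memory>
#include "paddle_inference_api.h"

// Sketch of the suggested settings: small GPU memory pool, memory
// optimization on, and an explicitly chosen TensorRT workspace size.
std::shared_ptr<paddle_infer::Predictor> CreatePpyoloPredictor() {
  paddle_infer::Config config;
  config.SetModel("ppyolov2_r50vd_dcn_365e_coco/model.pdmodel",
                  "ppyolov2_r50vd_dcn_365e_coco/model.pdiparams");

  // Start with an (almost) empty GPU memory pool instead of a large one.
  config.EnableUseGpu(0 /* initial pool size in MB */, 0 /* device id */);

  // Reuse intermediate buffers to reduce peak memory.
  config.EnableMemoryOptim();

  // Adjust the workspace size up or down and watch how the error changes.
  config.EnableTensorRTEngine(1 << 28 /* workspace size in bytes */,
                              1      /* max_batch_size */,
                              3      /* min_subgraph_size */,
                              paddle_infer::PrecisionType::kHalf,
                              false  /* use_static */,
                              false  /* use_calib_mode */);

  return paddle_infer::CreatePredictor(config);
}
```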

wangye707 commented 3 years ago

@jedibobo Hi, JetPack 4.6 is not yet officially supported; a later release will support it.

jedibobo commented 3 years ago

Try adjusting the TensorRT workspace memory setting larger or smaller and see what errors come up. At the same time, try dropping the GPU memory pool configuration, i.e. config.enable_use_gpu(0), and turn on the memory optimization setting. If the library itself is fine, then it is simply that RAM and GPU memory on Jetson are stretched very thin, and all you can do is shuffle GPU memory around and see whether that solves it.

Your guess makes a lot of sense: changing the GPU memory configuration produces different results. With a workspace of 1<<x for x < 31, the earlier error still appears. With x = 31, I get the following errors instead (two consecutive runs):

CPU Mem Optim is: 1
Profile is: 1
WARNING: Logging before InitGoogleLogging() is written to STDERR
E0811 15:37:17.277021 21151 helper.h:95] 10: [optimizer.cpp::computeCosts::1855] Error Code 10: Internal Error (Could not find any implementation for node PWN(PWN(tanh (Output: tanh_48.tmp_01855)), elementwise (Output: tmp_481857)).)
E0811 15:37:17.277294 21151 helper.h:95] 2: [builder.cpp::buildSerializedNetwork::417] Error Code 2: Internal Error (Assertion enginePtr != nullptr failed.)
Segmentation fault (core dumped)

lyb@lyb-nx:~/code/Paddle-cpp-deploy$ sh ./scripts/run.sh
CPU Mem Optim is: 1
Profile is: 1
WARNING: Logging before InitGoogleLogging() is written to STDERR
E0811 15:39:03.444169 2041 helper.h:95] 4: [pluginV2Builder.cpp::makeRunner::680] Error Code 4: Internal Error (Internal error: plugin node slice (Output: reshape2_6.tmp_0_slice_12015) requires 0 bytes of scratch space, but only -2147483648 is available. Try increasing the workspace size with IBuilderConfig::setMaxWorkspaceSize(). )
E0811 15:39:03.444402 2041 helper.h:95] 2: [builder.cpp::buildSerializedNetwork::417] Error Code 2: Internal Error (Assertion enginePtr != nullptr failed.)
Segmentation fault (core dumped)

So one run says no implementation could be found, and the other looks like running out of GPU memory? I searched for this and found https://github.com/NVIDIA/TensorRT/issues/209#issue-520704336. What I find really strange is that the Python side runs fine with the workspace set to 1<<20, while the C++ side fails no matter how I tune it. One consistent pattern I see when watching memory with jtop is that the process core-dumps roughly when usage grows from about 2 GB to 4.6 GB.
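One possible explanation for the "only -2147483648 is available" message (an assumption based on the log, not something confirmed in this thread): if the workspace size is passed through a signed 32-bit int anywhere in the call chain, 1 << 31 is read back as INT_MIN. A small check:

```cpp
#include <cstdint>
#include <cstdio>

// If the workspace value travels through a signed 32-bit int, the bit
// pattern of 1 << 31 is interpreted as INT_MIN (-2147483648), which matches
// the "only -2147483648 is available" message above. This is an assumption
// about where the wrap-around happens, not a confirmed root cause.
int main() {
  int32_t as_int32 = static_cast<int32_t>(UINT32_C(1) << 31);  // -2147483648
  int64_t as_int64 = INT64_C(1) << 31;                         //  2147483648
  std::printf("as int32: %d\nas int64: %lld\n",
              as_int32, static_cast<long long>(as_int64));
  return 0;
}
```

If that is what is happening, keeping the workspace below 1 << 31, or passing it as a 64-bit value where the API allows it, would avoid the wrap-around.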

jedibobo commented 3 years ago

I then tried turning the TRT configuration off (which essentially makes it no different from plain fluid mode, I suppose), and it runs without problems. As soon as config.EnableTensorRTEngine is enabled, the problem appears (a sketch of this on/off check follows at the end of this comment). I could not find the concrete implementation of that function, so I am not sure whether that is where the problem lies, since https://github.com/NVIDIA/TensorRT/issues/209#issue-520704336 does mention both the GPU memory size issue and the possibility of a layer being unimplemented.

So I also tried a PaddleDetection yolov3-darknet53 model. The error is again that no implementation is found:

WARNING: Logging before InitGoogleLogging() is written to STDERR
E0811 16:21:47.810286 12671 helper.h:95] 10: [optimizer.cpp::computeCosts::1855] Error Code 10: Internal Error (Could not find any implementation for node leaky_relu (Output: leaky_relu_57.tmp_0922).)
E0811 16:21:47.811632 12671 helper.h:95] 2: [builder.cpp::buildSerializedNetwork::417] Error Code 2: Internal Error (Assertion enginePtr != nullptr failed.)
Segmentation fault (core dumped)

I'm a bit lost at this point.
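For reference, the on/off check described above can be kept behind a single switch so the TensorRT path is the only variable. A minimal sketch, assuming the paddle_infer C++ API; the use_trt flag, paths, and workspace value are illustrative, not from this thread:

```cpp
#include <memory>
#include "paddle_inference_api.h"

// A/B check: same GPU configuration, with the TensorRT subgraph engine as
// the only difference between the two predictors.
std::shared_ptr<paddle_infer::Predictor> MakePredictor(bool use_trt) {
  paddle_infer::Config config;
  config.SetModel("yolov3_darknet53_270e_coco/model.pdmodel",
                  "yolov3_darknet53_270e_coco/model.pdiparams");
  config.EnableUseGpu(0, 0);
  config.EnableMemoryOptim();
  if (use_trt) {
    // With this call enabled the "Could not find any implementation for
    // node ..." errors appear; without it the model runs on plain GPU kernels.
    config.EnableTensorRTEngine(1 << 28, 1, 3,
                                paddle_infer::PrecisionType::kHalf,
                                false, false);
  }
  return paddle_infer::CreatePredictor(config);
}
```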

jedibobo commented 3 years ago

I turned the display off with init 5, which freed up roughly 1 GB of GPU memory. With the workspace set to 1<<32 it now runs, and a normal TRT-serialized model is generated. The problem is that during inference the GPU utilization stays at 0, and while the model is being optimized the GPU frequency is sometimes very low (a few hundred MHz). Inference takes about 700 ms, and I don't understand why.