PaddlePaddle / Paddle-Lite

PaddlePaddle High Performance Deep Learning Inference Engine for Mobile and Edge
https://www.paddlepaddle.org.cn/lite
Apache License 2.0

Segmentation fault during multi-threaded CPU inference with PaddleOCR's quantized recognition model on armv7hf #10179

Open · luoqianlin opened 1 year ago

luoqianlin commented 1 year ago

Running multi-threaded CPU inference with PaddleOCR's quantized recognition model on an armv7hf system results in a Segmentation fault.

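Version and inference library information:

1) Paddle Lite version: v2.12
2) Host environment: Linux armv7hf
3) Runtime device: Axera AX620A
4) Inference backend: CPU

Inference information:

1) Inference API: C++
2) Inference options: armv7 multi-threading (the segfault occurs with 3 or 4 threads)
3) Inference library source: built from source with ./lite/tools/build_linux.sh --arch=armv7hf --with_extra=ON --with_cv=ON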

Reproduction information:

The code and steps follow the PaddleOCR on-device deployment documentation. Command executed:

./ocr_db_crnn rec models/ch_PP-OCRv3_rec_slim_opt.nb armv7hf INT8 4 1 ../tmp/img_8.jpg models/ppocr_keys_v1.txt models/config.txt

The error is:

Segmentation fault

Problem description:

Single-threaded inference works fine; with 3 or 4 threads the crash occurs readily. Run log:

/ax620a/paddle-lite # ./ocr_db_crnn rec  models/ch_PP-OCRv3_rec_slim_opt.nb armv7hf INT8 4 1 ../tmp/img_8.jpg  models/ppocr_keys_v1.txt models/config.txt
mode: rec
[I  1/29  6: 6:24.910 ...ild/paddle-lite/lite/core/device_info.cc:282 get_cpu_arch] Unknow cpu arch: 3079
[I  1/29  6: 6:24.910 ...ild/paddle-lite/lite/core/device_info.cc:282 get_cpu_arch] Unknow cpu arch: 3079
[I  1/29  6: 6:24.910 ...ild/paddle-lite/lite/core/device_info.cc:282 get_cpu_arch] Unknow cpu arch: 3079
[I  1/29  6: 6:24.910 ...ild/paddle-lite/lite/core/device_info.cc:282 get_cpu_arch] Unknow cpu arch: 3079
[I  1/29  6: 6:24.913 ...ild/paddle-lite/lite/core/device_info.cc:1275 Setup] ARM multiprocessors name: MODEL NAME  : ARMV7 PROCESSOR REV 5 (V7L)
HARDWARE    : GENERIC DT BASED SYSTEM

[I  1/29  6: 6:24.913 ...ild/paddle-lite/lite/core/device_info.cc:1276 Setup] ARM multiprocessors number: 4
[I  1/29  6: 6:24.913 ...ild/paddle-lite/lite/core/device_info.cc:1278 Setup] ARM multiprocessors ID: 0, max freq: 1248, min freq: 1248, cluster ID: 0, CPU ARCH: A-1
[I  1/29  6: 6:24.913 ...ild/paddle-lite/lite/core/device_info.cc:1278 Setup] ARM multiprocessors ID: 1, max freq: 1248, min freq: 1248, cluster ID: 0, CPU ARCH: A-1
[I  1/29  6: 6:24.913 ...ild/paddle-lite/lite/core/device_info.cc:1278 Setup] ARM multiprocessors ID: 2, max freq: 1248, min freq: 1248, cluster ID: 0, CPU ARCH: A-1
[I  1/29  6: 6:24.913 ...ild/paddle-lite/lite/core/device_info.cc:1278 Setup] ARM multiprocessors ID: 3, max freq: 1248, min freq: 1248, cluster ID: 0, CPU ARCH: A-1
[I  1/29  6: 6:24.913 ...ild/paddle-lite/lite/core/device_info.cc:1284 Setup] L1 DataCache size is: 
[I  1/29  6: 6:24.914 ...ild/paddle-lite/lite/core/device_info.cc:1286 Setup] 32 KB
[I  1/29  6: 6:24.914 ...ild/paddle-lite/lite/core/device_info.cc:1286 Setup] 32 KB
[I  1/29  6: 6:24.914 ...ild/paddle-lite/lite/core/device_info.cc:1286 Setup] 32 KB
[I  1/29  6: 6:24.914 ...ild/paddle-lite/lite/core/device_info.cc:1286 Setup] 32 KB
[I  1/29  6: 6:24.914 ...ild/paddle-lite/lite/core/device_info.cc:1288 Setup] L2 Cache size is: 
[I  1/29  6: 6:24.914 ...ild/paddle-lite/lite/core/device_info.cc:1290 Setup] 512 KB
[I  1/29  6: 6:24.914 ...ild/paddle-lite/lite/core/device_info.cc:1290 Setup] 512 KB
[I  1/29  6: 6:24.914 ...ild/paddle-lite/lite/core/device_info.cc:1290 Setup] 512 KB
[I  1/29  6: 6:24.914 ...ild/paddle-lite/lite/core/device_info.cc:1290 Setup] 512 KB
[I  1/29  6: 6:24.914 ...ild/paddle-lite/lite/core/device_info.cc:1292 Setup] L3 Cache size is: 
[I  1/29  6: 6:24.914 ...ild/paddle-lite/lite/core/device_info.cc:1294 Setup] 0 KB
[I  1/29  6: 6:24.914 ...ild/paddle-lite/lite/core/device_info.cc:1294 Setup] 0 KB
[I  1/29  6: 6:24.914 ...ild/paddle-lite/lite/core/device_info.cc:1294 Setup] 0 KB
[I  1/29  6: 6:24.914 ...ild/paddle-lite/lite/core/device_info.cc:1294 Setup] 0 KB
[I  1/29  6: 6:24.914 ...ild/paddle-lite/lite/core/device_info.cc:1296 Setup] Total memory: 245308KB
[I  1/29  6: 6:24.915 ...ild/paddle-lite/lite/core/device_info.cc:1297 Setup] SVE2 support: 0
[I  1/29  6: 6:24.915 ...ild/paddle-lite/lite/core/device_info.cc:1298 Setup] SVE2 f32mm support: 0
[I  1/29  6: 6:24.915 ...ild/paddle-lite/lite/core/device_info.cc:1299 Setup] SVE2 i8mm support: 0
The predict img: ../tmp/img_8.jpg
0   【净含量】:220ml 0.975608
Segmentation fault

Analysis

I rebuilt a Debug version and ran it under Valgrind, which reported an out-of-bounds memory write:

==26664== Invalid write of size 4
==26664==    at 0x484BA54: memset (vg_replace_strmem.c:1374)
==26664==    by 0x48AB2D1: void paddle::lite::arm::math::conv_compute_2x2_3x3_int8<signed char>(signed char const*, signed char*, int, int, int, int, int, int, int, short const*, float const*, float const*, paddle::lite::operators::ConvParam const&, paddle::lite::Context<(paddle::lite_api::TargetType)4>*) [clone ._omp_fn.0] [clone .lto_priv.5522] (conv3x3_winograd_int8.cc:227)
==26664==    by 0x4D1D275: GOMP_parallel (parallel.c:168)
==26664==    by 0x498A853: conv_compute_2x2_3x3_int8 (conv3x3_winograd_int8.cc:173)
==26664==    by 0x498A853: paddle::lite::kernels::arm::WinogradConv<(paddle::lite_api::PrecisionType)2, (paddle::lite_api::PrecisionType)1>::Run() (conv_winograd.cc:341)
==26664==    by 0x4984F69: paddle::lite::kernels::arm::ConvCompute<(paddle::lite_api::PrecisionType)2, (paddle::lite_api::PrecisionType)1>::Run() (conv_compute.h:39)
==26664==    by 0x48FAFBB: Run (program.cc:797)
==26664==    by 0x48FAFBB: paddle::lite::RuntimeProgram::Run() (program.cc:610)
==26664==    by 0x49D97F7: Run (light_api.h:71)
==26664==    by 0x49D97F7: paddle::lite::LightPredictorImpl::Run() (light_api_impl.cc:132)
==26664==    by 0x2D40B: RunRecModel(std::vector<std::vector<std::vector<int, std::allocator<int> >, std::allocator<std::vector<int, std::allocator<int> > > >, std::allocator<std::vector<std::vector<int, std::allocator<int> >, std::allocator<std::vector<int, std::allocator<int> > > > > >, cv::Mat, std::shared_ptr<paddle::lite_api::PaddlePredictor>, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >&, std::vector<float, std::allocator<float> >&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::shared_ptr<paddle::lite_api::PaddlePredictor>, int, std::vector<double, std::allocator<double> >*, int) (ocr_db_crnn.cc:177)
==26664==    by 0x3096F: rec(int, char**) (ocr_db_crnn.cc:588)
==26664==    by 0x310A1: main (ocr_db_crnn.cc:627)

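Increasing the temporary memory allocation (a crude change that does not work out precisely how much temporary memory the operator actually needs) works around the problem for now.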

The code change is here
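A minimal sketch of why the crash could depend on the thread count, assuming (this is not confirmed from the Paddle-Lite source) that the Winograd kernel carves a per-thread scratch slice out of one shared workspace. If that workspace is sized for fewer threads than actually run, the memset of a higher-numbered thread lands past the end of the buffer, which would be consistent with the "Invalid write" inside the OpenMP region in the trace above. `WorkspacePlan`, `thread_slice` and `clear_all_slices` are hypothetical names, not Paddle-Lite functions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <vector>

// Hypothetical illustration only; these are not Paddle-Lite types or functions.
struct WorkspacePlan {
  std::size_t per_thread_bytes;  // scratch one worker thread needs (assumed known)
  int num_threads;               // threads that will actually run the kernel

  // The workspace must scale with the thread count, not cover just one thread.
  std::size_t total_bytes() const {
    return per_thread_bytes * static_cast<std::size_t>(num_threads);
  }
};

// Returns the slice of `workspace` owned by thread `tid`, refusing to hand out
// a slice that would run past the end of the buffer (the out-of-bounds case).
std::int8_t* thread_slice(std::vector<std::int8_t>& workspace,
                          const WorkspacePlan& plan, int tid) {
  if (tid < 0) throw std::out_of_range("negative thread id");
  std::size_t begin = static_cast<std::size_t>(tid) * plan.per_thread_bytes;
  if (begin + plan.per_thread_bytes > workspace.size()) {
    throw std::out_of_range("scratch slice exceeds workspace");
  }
  return workspace.data() + begin;
}

// Zeroes every thread's slice in turn (serially here for simplicity; in a real
// kernel each OpenMP thread would clear only its own slice).
void clear_all_slices(std::vector<std::int8_t>& workspace, const WorkspacePlan& plan) {
  assert(plan.num_threads > 0);
  for (int tid = 0; tid < plan.num_threads; ++tid) {
    std::memset(thread_slice(workspace, plan, tid), 0, plan.per_thread_bytes);
  }
}
```

Allocating `plan.total_bytes()` (per-thread bytes times thread count) avoids the out-of-range slice; a precise fix would still need the exact per-thread size the operator requires, which this sketch simply takes as an input.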

zhupengyang commented 9 months ago

This may be caused by ctx being a singleton. In that case, you can try running inference in multiple processes instead.
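One way to follow the multi-process suggestion is sketched below, assuming a POSIX system: each image is handled in a forked child, so no process-wide state (such as a singleton context) is shared between concurrent runs. `run_in_processes` and the `Worker` callback are hypothetical; in practice the worker would create its own predictor inside the child. Fork before the parent creates any predictor or OpenMP threads, because `fork()` duplicates only the calling thread.

```cpp
#include <functional>
#include <string>
#include <vector>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical worker: in a real program it would build its own predictor in
// the child and run recognition on one image, returning 0 on success.
using Worker = std::function<int(const std::string& image_path)>;

// Runs `worker` once per image, each in its own child process.
// Returns the number of children that exited with status 0.
int run_in_processes(const std::vector<std::string>& images, const Worker& worker) {
  std::vector<pid_t> children;
  for (const auto& img : images) {
    pid_t pid = fork();
    if (pid < 0) break;                // fork failed: stop launching more
    if (pid == 0) _exit(worker(img));  // child: run once; only the low 8 bits survive
    children.push_back(pid);           // parent: remember the child to reap later
  }
  int ok = 0;
  for (pid_t pid : children) {
    int status = 0;
    if (waitpid(pid, &status, 0) == pid && WIFEXITED(status) &&
        WEXITSTATUS(status) == 0) {
      ++ok;
    }
  }
  return ok;
}
```

For example, `run_in_processes({"a.jpg", "b.jpg"}, worker)` starts two children at once; with many images you would want to cap how many run concurrently so the device's 4 cores are not oversubscribed.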