PaddlePaddle / Paddle

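PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (core framework of 『飞桨』 (PaddlePaddle): high-performance single-machine and distributed training, and cross-platform deployment, for deep learning & machine learning)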
http://www.paddlepaddle.org/
Apache License 2.0

Memory spikes at inference time when using PaddleInference #49500

Open yydan2022 opened 1 year ago

yydan2022 commented 1 year ago

Describe the Bug

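When I train this model on my own scene data, memory spikes the instant inference starts; memory is normal while the model loads. Model used: https://paddledet.bj.bcebos.com/models/mask_rcnn_r50_vd_fpn_2x_coco.pdparams

(image)

Inference was run with deploy/python/infer.py from PaddleDetection, which produces the spike at the moment of inference shown in the image above. However, when the same instance segmentation model is trained on the same batch of training data with PaddlePaddle 1.8.4, GPU memory usage during inference on PaddlePaddle 1.8.4 is normal, as shown in the image below. I also tested inference with the model trained on PaddlePaddle 1.8.4 under PaddlePaddle 2.3.2 and 2.4.2, and that was normal too.

(image)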

Version & Environment Information

Paddle version: 2.4.1
Paddle With CUDA: 11.2
PaddleDetection: 2.5
CMake version: version 3.25.1
Python version: 3.8.15

Additional Supplementary Information

No response

paddle-bot[bot] commented 1 year ago

Hi! We've received your issue; please be patient while waiting for a response. We will arrange for technicians to answer your question as soon as possible. Please double-check that you have provided a clear problem description, reproduction code, environment and version details, and error messages. You may also check the official API documentation, the FAQ, past GitHub Issues, and the AI community for an answer. Have a nice day!

wuyefeilin commented 1 year ago

Could you check whether any large images are being fed in as input?

yydan2022 commented 1 year ago

There are no large images. The image size is the same in both setups: 720*1280.

xiaoxiaohehe001 commented 1 year ago

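In which Paddle version does the abnormal GPU memory usage occur during inference? Was the deploy/python/infer.py script used for inference on every Paddle version? What is the input batch size? And what exactly does "memory is normal while the model loads" mean?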

yydan2022 commented 1 year ago

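The abnormal GPU memory usage occurred on both PaddlePaddle 2.3.2 and 2.4.1; in each case training and inference used the same framework version. Inference was always run with deploy/python/infer.py from the PaddleDetection 2.5 branch, with batch_size set to 1. "Memory is normal while the model loads" means that when only the load step is performed, about 800M of GPU memory is used.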

hw446 commented 1 year ago

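I have run into this as well: the input images were smaller, yet GPU memory inexplicably grew by several GB. Setting config.switch_ir_optim=False fixed it. To debug the cause, I capped the available GPU memory at a very small value, which triggered the error below. Roughly speaking, config.switch_ir_optim=True causes CUDNNConvFusionOpKernel to be used, and that op appears to have a bug that makes GPU memory balloon. I tried versions in the range 2 < version <= 2.4.2, and all of them have this problem.

--------------------------------------
C++ Traceback (most recent call last):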

0 paddle::AnalysisPredictor::ZeroCopyRun()
1 paddle::framework::NaiveExecutor::Run()
2 paddle::framework::OperatorBase::Run(paddle::framework::Scope const&, phi::Place const&)
3 paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, phi::Place const&) const
4 paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, phi::Place const&, paddle::framework::RuntimeContext) const
5 std::_Function_handler<void (paddle::framework::ExecutionContext const&), paddle::framework::OpKernelRegistrarFunctor<phi::GPUPlace, false, 0ul, paddle::operators::CUDNNConvFusionOpKernel, paddle::operators::CUDNNConvFusionOpKernel, paddle::operators::CUDNNConvFusionOpKernel >::operator()(char const, char const, int) const::{lambda(paddle::framework::ExecutionContext const&)#1}>::_M_invoke(std::_Any_data const&, paddle::framework::ExecutionContext const&)
6 paddle::operators::CUDNNConvFusionOpKernel::Compute(paddle::framework::ExecutionContext const&) const
7 phi::DnnWorkspaceHandle::RunFunc(std::function<void (void)> const&, unsigned long)
8 phi::DnnWorkspaceHandle::ReallocWorkspace(unsigned long)
9 paddle::memory::allocation::Allocator::Allocate(unsigned long)
10 paddle::memory::allocation::StatAllocator::AllocateImpl(unsigned long)
11 paddle::memory::allocation::Allocator::Allocate(unsigned long)
12 paddle::memory::allocation::Allocator::Allocate(unsigned long)
13 paddle::memory::allocation::Allocator::Allocate(unsigned long)
14 paddle::memory::allocation::CUDAAllocator::AllocateImpl(unsigned long)
15 std::string phi::enforce::GetCompleteTraceBackString(std::string&&, char const*, int)
16 phi::enforce::GetCurrentTraceBackStringabi:cxx11
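The workaround above can be sketched as follows, assuming the Python paddle.inference API, where switch_ir_optim is a method on Config and is called with a boolean rather than assigned. The helper names (configure_without_ir_optim, make_paddle_config), the 200 MB initial pool size, and the RecordingConfig stand-in are hypothetical and exist only so the sketch can run without Paddle installed.

```python
class RecordingConfig:
    """Hypothetical stand-in for paddle.inference.Config; it only records calls."""

    def __init__(self):
        self.calls = []

    def enable_use_gpu(self, memory_pool_init_size_mb, device_id=0):
        self.calls.append(("enable_use_gpu", memory_pool_init_size_mb, device_id))

    def switch_ir_optim(self, flag=True):
        self.calls.append(("switch_ir_optim", flag))


def configure_without_ir_optim(config, gpu_mem_mb=200, device_id=0):
    """Enable the GPU and turn IR optimization off on a Config-like object."""
    # Assumption: a small initial pool makes an unexpected surge easier to spot.
    config.enable_use_gpu(gpu_mem_mb, device_id)
    # Skips the IR fuse passes, including the ones that select the fused conv kernel.
    config.switch_ir_optim(False)
    return config


def make_paddle_config(model_file, params_file):
    """Same setup on a real Config; requires Paddle, paths are placeholders."""
    from paddle.inference import Config  # imported lazily on purpose

    return configure_without_ir_optim(Config(model_file, params_file))


demo = configure_without_ir_optim(RecordingConfig())
print(demo.calls)
```

Disabling IR optimization entirely gives up all graph-level fusions, so it is the bluntest form of the fix and may cost some inference speed.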

hw446 commented 1 year ago

In the end I found that with config.switch_ir_optim=True the following messages are actually printed:
W0612 15:14:15.092157 28399 op_compat_sensible_pass.cc:232] Check the Attr(axis) of Op(elementwise_add) in pass(conv_elementwise_add_act_fuse_pass) failed!
W0612 15:14:15.092191 28399 conv_elementwise_add_act_fuse_pass.cc:181] Pass in op compat failed.
W0612 15:14:15.129107 28399 op_compat_sensible_pass.cc:232] Check the Attr(axis) of Op(elementwise_add) in pass(conv_elementwise_add_fuse_pass) failed!
W0612 15:14:15.129125 28399 conv_elementwise_add_fuse_pass.cc:94] Pass in op compat failed.
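To find which passes are affected without reading the log by eye, the pass names can be pulled out of warnings of this shape. This is a small sketch that assumes the message format stays exactly as in the log above; the function name failed_compat_passes is hypothetical.

```python
import re

# Matches the op-compat warning text seen in the log, for example:
#   Check the Attr(axis) of Op(elementwise_add) in pass(<pass_name>) failed!
COMPAT_FAILURE = re.compile(
    r"Check the Attr\((?P<attr>\w+)\) of Op\((?P<op>\w+)\) "
    r"in pass\((?P<pass_name>\w+)\) failed!"
)


def failed_compat_passes(log_text):
    """Return the pass names whose op-compat check failed, in first-seen order."""
    names = []
    for match in COMPAT_FAILURE.finditer(log_text):
        name = match.group("pass_name")
        if name not in names:
            names.append(name)
    return names


SAMPLE_LOG = """\
W0612 15:14:15.092157 28399 op_compat_sensible_pass.cc:232] Check the Attr(axis) of Op(elementwise_add) in pass(conv_elementwise_add_act_fuse_pass) failed!
W0612 15:14:15.092191 28399 conv_elementwise_add_act_fuse_pass.cc:181] Pass in op compat failed.
W0612 15:14:15.129107 28399 op_compat_sensible_pass.cc:232] Check the Attr(axis) of Op(elementwise_add) in pass(conv_elementwise_add_fuse_pass) failed!
W0612 15:14:15.129125 28399 conv_elementwise_add_fuse_pass.cc:94] Pass in op compat failed.
"""

print(failed_compat_passes(SAMPLE_LOG))
```

The returned names are candidates for removal from the pass list; whether removing them is the right fix for a given model is still a judgment call.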

With the following settings (paddle 2.5.0rc), the memory spike no longer occurs:
config.delete_pass("conv_elementwise_add_act_fuse_pass")
config.delete_pass("conv_elementwise_add_fuse_pass")
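The two delete_pass calls can be wrapped in a small helper so that IR optimization stays on and only the conv + elementwise_add fusions are dropped. This is a sketch, assuming a Config object that exposes delete_pass(name) as paddle.inference.Config does; the helper name drop_conv_add_fuse_passes and the RecordingConfig stand-in are hypothetical.

```python
# Pass names taken from the warnings reported in this thread.
CONV_ADD_FUSE_PASSES = (
    "conv_elementwise_add_act_fuse_pass",
    "conv_elementwise_add_fuse_pass",
)


class RecordingConfig:
    """Hypothetical stand-in for paddle.inference.Config; it only records calls."""

    def __init__(self):
        self.deleted = []

    def delete_pass(self, name):
        self.deleted.append(name)


def drop_conv_add_fuse_passes(config, passes=CONV_ADD_FUSE_PASSES):
    """Remove the given fuse passes from the config and return it."""
    for name in passes:
        config.delete_pass(name)
    return config


demo = drop_conv_add_fuse_passes(RecordingConfig())
print(demo.deleted)
```

With a real Config, the helper would be called before create_predictor(config), since the pass list is consumed when the predictor is built.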

FreedomLiX commented 12 months ago

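@hw446 The "memory" that spikes at inference time should be GPU memory! Details: on a 1060 GPU, calling libpaddle_inference.so for inference hits the same GPU memory surge, which crashes the program. The same program does not show the problem on other hardware (1660, etc.). Workaround: config.switch_ir_optim=False (no IR optimization speedup).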