PaddlePaddle / Paddle

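PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (core framework of 『飞桨』 PaddlePaddle: high-performance single-machine and distributed training and cross-platform deployment for deep learning and machine learning)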
http://www.paddlepaddle.org/
Apache License 2.0

Running unet with paddlex.deploy.Predictor from multiple threads causes RuntimeError: could not execute a primitive #34492

Closed dwSun closed 3 years ago

dwSun commented 3 years ago
import os

import numpy as np
from PIL import Image

from multiprocessing.dummy import Pool as ThreadPool
import paddlex as pdx

model = pdx.deploy.Predictor("path/to/model", use_gpu=False)

def handler(img):
    print("#")
    image = Image.open(img)
    image_data = np.asarray(image)
    # RGB -> BGR
    image_data = image_data[..., ::-1]
    model.predict(img, topk=5)

img_dirs = "/path/to/imgs"
files = []
for r, ds, fs in os.walk(img_dirs):
    for f in fs:
        if f.endswith(".JPG"):
            files.append(os.path.join(r, f))

p = ThreadPool()
p.map(handler, files)
p.close()
p.join()

With the script above, loading a deeplabv3p model does not report an error, but loading a unet model fails with:

  File "test_seg.3.py", line 26, in handler
    model.predict(img, topk=5)
  File "/home/user/miniconda3/envs/tower/lib/python3.8/site-packages/paddlex/deploy.py", line 278, in predict
    model_pred = self.raw_predict(preprocessed_input)
  File "/home/user/miniconda3/envs/tower/lib/python3.8/site-packages/paddlex/deploy.py", line 257, in raw_predict
    self.predictor.zero_copy_run()
RuntimeError: could not execute a primitive
paddle-bot-old[bot] commented 3 years ago

Hi! We've received your issue; please be patient while waiting for a response. We will arrange for technicians to answer your question as soon as possible. Please check again that you have provided a clear problem description, reproduction code, environment and version details, and the error message. You may also check the official API documentation, the FAQ, past GitHub issues, and the AI community for an answer. Have a nice day!

dwSun commented 3 years ago

I noticed this log line:

2021-07-29 21:54:47 [WARNING] HRNet/DeepLabv3p/PPYOLO are not supported for the use of mkldnn

Could this be related to mkldnn?

dwSun commented 3 years ago

Additional environment details:

paddle2onnx 0.7
paddlehub 2.1.0
paddlenlp 2.0.6
paddlepaddle 2.1.1
paddleslim 1.1.1
paddlex 1.3.11

$ python summary_env.py

Paddle version: 2.1.1
Paddle With CUDA: False

OS: Ubuntu 20.04
Python version: 3.8.10

CUDA version: None
cuDNN version: None.None.None
Nvidia driver version: None


hong19860320 commented 3 years ago

Try disabling mkldnn when you create the Predictor:

model = pdx.deploy.Predictor("path/to/model", use_gpu=False, use_mkl=False)

dwSun commented 3 years ago

Indeed, with mkl turned off (use_mkl=False) the problem no longer occurs, but the resulting speed is really unacceptable.

hong19860320 commented 3 years ago

OK, it looks like an mkl problem. I'll ask a colleague to follow up.

hong19860320 commented 3 years ago

@baoachun please follow up on this issue, thanks.

baoachun commented 3 years ago

@will-jl944 could you help confirm whether the model does not support enabling mkldnn, or whether mkldnn itself has a problem?

FlyingQianMM commented 3 years ago

@dwSun Issue received. We will try to reproduce it first.

FlyingQianMM commented 3 years ago

import paddlex as pdx 

model = pdx.deploy.Predictor("inference", use_gpu=False, use_mkl=True)
img = "optic_disc_seg/JPEGImages/P0183.jpg"
res = model.predict(img)
print(res)

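With the code above, unet loads and completes prediction normally. @baoachun, could you check whether the problem is caused by the user calling self.predictor.zero_copy_run() from multiple threads at the same time? https://github.com/PaddlePaddle/PaddleX/blob/a58bd177cde63a078b02cf0afe6f52cba0393186/paddlex/deploy.py#L277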
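One way to test the concurrency hypothesis above is to serialize the inference call behind a lock and see whether the error still appears. The following is a minimal sketch, not a confirmed fix: `SerializedPredictor` and `run_all` are hypothetical helper names, and the wrapper only assumes the wrapped object exposes a `predict()` method. The thread does not establish that serializing calls avoids the error, and the lock removes any parallelism in the inference step itself.

```python
import threading
from multiprocessing.dummy import Pool as ThreadPool


class SerializedPredictor:
    """Wrap any object with a predict() method so only one thread calls it at a time.

    Hypothetical helper for diagnosis; not part of PaddleX.
    """

    def __init__(self, predictor):
        self._predictor = predictor
        self._lock = threading.Lock()

    def predict(self, *args, **kwargs):
        # Only one thread at a time enters the wrapped predict(), so the
        # underlying inference call is never executed concurrently.
        with self._lock:
            return self._predictor.predict(*args, **kwargs)


def run_all(predictor, files, workers=4):
    """Map predict() over files from a thread pool, serializing the inference call."""
    safe = SerializedPredictor(predictor)
    with ThreadPool(workers) as pool:
        # pool.map blocks until every item has been processed.
        return pool.map(safe.predict, files)


# Usage with PaddleX (assumes paddlex is installed and the path is valid):
# import paddlex as pdx
# model = pdx.deploy.Predictor("path/to/model", use_gpu=False)
# results = run_all(model, files)
```

If the error disappears with the lock in place, that points to concurrent execution of the predictor; if it persists, the cause lies elsewhere.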

FlyingQianMM commented 3 years ago

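The warning "HRNet/DeepLabv3p/PPYOLO are not supported for the use of mkldnn" is printed because paddlex 1.3.x was originally adapted to paddle 1.8.4/1.8.5, where these three models did not yet support enabling mkldnn. With paddle 2.1.1, which you are using, enabling mkldnn is already supported for them. You can refer to https://github.com/PaddlePaddle/PaddleX/pull/1006 and https://github.com/PaddlePaddle/PaddleX/pull/1005 and directly edit deploy.py under the paddlex installation path to enable mkldnn.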

baoachun commented 3 years ago

@FlyingQianMM OK, I'll try to reproduce it.

baoachun commented 3 years ago

@dwSun could you share your model? I don't have a suitable model to reproduce the problem.

dwSun commented 3 years ago

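Sorry, this is a project-internal model, so it is not easy to share. @FlyingQianMM seems to have a usable model and has already tested unet running in a single thread.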

dwSun commented 3 years ago

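In our tests, the problem also occurs when multiple models are loaded in one process and executed sequentially, after they have been run several times.

Again, there is no problem when use_mkl = False.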

 |
 | --------------------------------------
 | C++ Traceback (most recent call last):
 | --------------------------------------
 | 0   paddle::framework::SignalHandle(char const*, int)
 | 1   paddle::platform::GetCurrentTraceBackString[abi:cxx11]()
 |
 | ----------------------
 | Error Message Summary:
 | ----------------------
 | FatalError: `Segmentation fault` is detected by the operating system.
 |   [TimeInfo: *** Aborted at 1627987076 (unix time) try "date -d @1627987076" if you are using GNU date ***]
 |   [SignalInfo: *** SIGSEGV (@0x7fd9049cb2c0) received by PID 8 (TID 0x7fd866e28700) from PID 77378240 ***]
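
The sequential scenario described above (several models loaded in one process, run one after another for several rounds) can be sketched as follows. This is an illustrative reproduction loop under stated assumptions: `run_models_sequentially` and `make_predictor` are hypothetical names, and the model directories and image path in the usage comment are placeholders, not paths from the report.

```python
def run_models_sequentially(make_predictor, model_dirs, image_path, rounds=5):
    """Load several models in one process, then run them in turn for `rounds` passes.

    make_predictor is any callable that takes a model directory and returns
    an object with a predict() method (hypothetical parameter for illustration).
    """
    # Load every model up front so they all live in the same process.
    predictors = [make_predictor(d) for d in model_dirs]
    results = []
    for round_idx in range(rounds):
        for model_dir, predictor in zip(model_dirs, predictors):
            # Strictly sequential: no threads are involved here.
            results.append((round_idx, model_dir, predictor.predict(image_path)))
    return results


# Usage with PaddleX (assumes paddlex is installed; paths are placeholders):
# import paddlex as pdx
# make = lambda d: pdx.deploy.Predictor(d, use_gpu=False, use_mkl=True)
# run_models_sequentially(make, ["path/to/unet", "path/to/deeplabv3p"], "path/to/img.jpg")
```

Running the same loop once with use_mkl=True and once with use_mkl=False would show whether the crash depends on that setting, as reported.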
dwSun commented 3 years ago

Is there any progress on this issue?

lidanqing-intel commented 3 years ago

We received this issue in the last couple of days and will look into it over the next couple of days.

baoachun commented 3 years ago

@jczaja Hi, I have reproduced the problem: conv2d reports an error when mkldnn is enabled and prediction runs in multiple threads.

[image]

Here is the log. mkl.log

You can make multiple copies of this picture (attachment: a) to use as a data set.

In addition, to get detailed error information, replace config.enable_glog_info() on line 134 of python-3.7.9/lib/python3.7/site-packages/paddlex/deploy.py with pass, and then set the environment variable GLOG_v=4.
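The two setup steps mentioned above (duplicating one picture into a small data set and setting GLOG_v=4) can be sketched in Python as below. Both helper names are hypothetical, and the sketch assumes the environment variable has to be set before paddle or paddlex is imported for it to take effect; exporting GLOG_v=4 in the shell before starting Python is the equivalent alternative.

```python
import os
import shutil
from pathlib import Path


def duplicate_image(src, out_dir, copies=16):
    """Copy one image `copies` times into out_dir and return the new paths."""
    src = Path(src)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i in range(copies):
        # Keep the original suffix so extension filters (e.g. ".JPG") still match.
        dst = out_dir / f"{src.stem}_{i:03d}{src.suffix}"
        shutil.copyfile(src, dst)
        paths.append(str(dst))
    return paths


def enable_verbose_glog(level=4):
    """Set GLOG_v in this process's environment.

    Assumption: call this before importing paddle/paddlex.
    """
    os.environ["GLOG_v"] = str(level)


# Usage (paths are placeholders):
# enable_verbose_glog()
# files = duplicate_image("path/to/a.JPG", "path/to/imgs", copies=32)
```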

dwSun commented 3 years ago

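I noticed another interesting thing. With mkl enabled, I can extract the values of intermediate variables by specifying certain node names. With mkl disabled, I cannot: the node name I set is an intermediate node, but what comes back is clearly the last node. I also tested with gpu and mkl turned on, and intermediate node values cannot be retrieved there either. For details, see: https://github.com/PaddlePaddle/PaddleX/issues/803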

dwSun commented 3 years ago

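@baoachun @FlyingQianMM @lidanqing-intel Has there been any recent progress on this issue? Without mkl acceleration, a single image takes 3 to 6 seconds on the customer's machine, which is completely unacceptable. This issue is already seriously affecting the delivery of a government project.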
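To put a number on the per-image cost mentioned above, a small timing helper can compare runs with and without mkl. This is a minimal sketch: `mean_latency` is a hypothetical helper that takes any callable, and the usage comment assumes a PaddleX Predictor and a list of image paths like the ones in the original report.

```python
import time


def mean_latency(predict, images, warmup=1):
    """Return the mean wall-clock seconds per image for predict(image)."""
    images = list(images)
    if not images:
        raise ValueError("images must not be empty")
    # Warm-up calls are excluded from timing, since first runs often
    # include one-time setup cost.
    for image in images[:warmup]:
        predict(image)
    start = time.perf_counter()
    for image in images:
        predict(image)
    return (time.perf_counter() - start) / len(images)


# Usage with PaddleX (assumes paddlex is installed; paths are placeholders):
# import paddlex as pdx
# model = pdx.deploy.Predictor("path/to/model", use_gpu=False, use_mkl=False)
# print(mean_latency(model.predict, files))
```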

lidanqing-intel commented 3 years ago

@jczaja Hi, please work on this issue after finishing the matmul v2 cache clearing work. Thank you!

lidanqing-intel commented 3 years ago

We are looking at this issue now. It has been reproduced.

lidanqing-intel commented 3 years ago

@dwSun Hi, this issue is solved by PR #34492, which has been merged. Could you test it?

lidanqing-intel commented 3 years ago

@dwSun Hi, the fix PR has been merged into develop. Please use the newest develop to verify. Thanks.