PaddlePaddle / Paddle

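PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (PaddlePaddle 『飞桨』 core framework: high-performance single-machine and distributed training and cross-platform deployment for deep learning and machine learning)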
http://www.paddlepaddle.org/
Apache License 2.0

Deploying multiple models at the same time will raise MKLDNN error #31992

Closed juncaipeng closed 3 years ago

juncaipeng commented 3 years ago

Environment: C++ API, Ubuntu 16.04, CPU, MKLDNN, GCC 8.2.0

Please contact danqing to download the demo. The demo currently includes only two model groups for testing.

Download model_test.cc.zip, unzip it, and replace the old model_test.cc file in the demo with the new model_test.cc.

Build and install Paddle release/2.0 (commit id: c7a6a1f9610a9ee018c19d89950d76b38f33aed1).

cmake -DCMAKE_BUILD_TYPE=Release -DWITH_PYTHON=OFF -DWITH_MKL=ON -DWITH_GPU=OFF -DON_INFER=ON .. && make -j && make inference_lib_dist -j

Set LIB_DIR as the path of PaddleInference in build.sh.

Run sh build.sh.

Run ulimit -c unlimited to enable saving core files.

Run ./build/model_test --test_groups=0 --single_instance=true. It does not raise an error. When single_instance is true, every model has only one predictor. Otherwise, some models have several predictors created by calling predictor.clone().

Run ./build/model_test --test_groups=0 --single_instance=false. It raises a segmentation fault.
Run gdb ./build/model_test core_file to get the following error:

image

Run ./build/model_test --test_groups=1 --single_instance=true. It does not raise an error.

Run ./build/model_test --test_groups=1 --single_instance=false. It also raises a segmentation fault.

image

Sometimes the demo above raises a different error, such as:

image

Run ./build/model_test --test_groups="4 5 6" --single_instance=true. The demo loads several model groups, every model has one predictor, and it also raises the following errors. The demo only has two model groups for now; we will provide the other models later.

image

image image
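Since the crashes described above only show up on some runs, it may help to run each command several times and count the failures. Below is a minimal sketch of that idea; count_failures is a hypothetical helper and not part of the demo. It treats any non-zero exit status (including the one the shell reports after a segmentation fault) as a failure.

```shell
# count_failures is a hypothetical helper, not part of the demo.
# Usage: count_failures <runs> <command> [args...]
# Runs the command <runs> times and prints how many runs exited non-zero.
count_failures() {
  runs=$1
  shift
  failures=0
  i=0
  while [ "$i" -lt "$runs" ]; do
    # Discard the command's output; only the exit status matters here.
    if ! "$@" >/dev/null 2>&1; then
      failures=$((failures + 1))
    fi
    i=$((i + 1))
  done
  echo "$failures"
}

# Example (assumes the demo binary has been built as described above):
# count_failures 20 ./build/model_test --test_groups=0 --single_instance=false
```

A result of 0 over many runs does not prove the problem is gone, but a non-zero count confirms that the configuration still crashes.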

paddle-bot-old[bot] commented 3 years ago

Hi! We have received your issue and will arrange for technical staff to answer it as soon as possible, so please be patient. Please check again that you have provided a clear problem description, reproduction code, environment and version information, and error messages. You may also look for an answer in the official API documentation, the FAQ, past GitHub issues, and the AI community. Have a nice day!

jczaja commented 3 years ago

@lidanqing-intel, @juncaipeng We are investigating this issue. We have reproduced it on the develop branch. A candidate fix has been made (#32136) and we are testing it now.

jczaja commented 3 years ago

@juncaipeng

  1. Could you please test this PR https://github.com/PaddlePaddle/Paddle/pull/32136 (develop) and check whether it solves the problem for you?
  2. We are missing some multi-instance unit tests (#32087). Would it be possible to turn this issue's test into a unit test (UT)?

juncaipeng commented 3 years ago

@jczaja
On the latest develop branch, the demo still raises an error when I run the following command:
./build/model_test --test_groups="0 1" --single_instance=false
Sometimes the demo also raises an error when I run either of the following commands:
./build/model_test --test_groups="0" --single_instance=false
./build/model_test --test_groups="1" --single_instance=false

Should I use the inference library of release/2.0?

These models cannot be used in a UT, so you will need to find some other models.
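For anyone re-checking a new inference library against the commands in this comment, a small sweep over model groups and both single_instance settings could look like the sketch below. The sweep function is a hypothetical helper, not part of the demo; it only assumes the --test_groups and --single_instance flags shown in this thread.

```shell
# sweep is a hypothetical helper, not part of the demo.
# Usage: sweep <binary> <groups> [<groups> ...]
# For every groups value, runs the binary with single_instance=true and
# single_instance=false and prints one result line per combination.
sweep() {
  bin=$1
  shift
  for groups in "$@"; do
    for single in true false; do
      if "$bin" --test_groups="$groups" --single_instance="$single" >/dev/null 2>&1; then
        status=ok
      else
        status=FAIL
      fi
      echo "groups=$groups single_instance=$single $status"
    done
  done
}

# Example (assumes the demo binary has been built):
# sweep ./build/model_test "0" "1" "0 1"
```

Because the failures are intermittent, a single "ok" line is weak evidence, so repeating the sweep a few times is advisable.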

jczaja commented 3 years ago

@juncaipeng This is the cherry-pick for release/2.0: https://github.com/PaddlePaddle/Paddle/pull/32163 . It works fine on my setup, but @lidanqing-intel reports that every other run crashes on her setup, so I will test it further.

jczaja commented 3 years ago

@juncaipeng I have made some more changes (develop PR: #32309). Could you please test them and report problems if any?

juncaipeng commented 3 years ago

@jczaja I have tested all the models in the demo and found no problems. The customer will use the new inference library to test in their project. If there is any news, I will report back.

juncaipeng commented 3 years ago

@jczaja The customers reported that they have seen no problems with the new inference library (develop PR: #32309) so far.

jczaja commented 3 years ago

@juncaipeng I have implemented an alternative fix: https://github.com/PaddlePaddle/Paddle/pull/32499 . That is the one I would like to merge. Could you please test it against this issue?

juncaipeng commented 3 years ago

@jczaja The inference library (develop PR: #32499) also passed all tests.

lidanqing-intel commented 3 years ago

@juncaipeng Could this issue be closed?
