PaddlePaddle / Paddle

PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (『飞桨』核心框架,深度学习&机器学习高性能单机、分布式训练和跨平台部署)
http://www.paddlepaddle.org/
Apache License 2.0

paddle-deepmd inference with FP32 MKLDNN report "no kernel registered in tanh_grad operator" #45058

Closed lidanqing-intel closed 1 year ago

lidanqing-intel commented 2 years ago

Describe the Bug

The PaddleScience subproject paddle-deepmd can now run inference with paddle_inference using native CPU float64 ops. Details are in https://github.com/X4Science/paddle-deepmd.

The repo above integrates paddle-deepmd with LAMMPS and paddle_inference and is somewhat complicated to reproduce. For simplicity, you can reproduce the bug with https://github.com/lidanqing-intel/deep_md_test, which produces the same effect as X4Science/paddle-deepmd.

Now Baidu would like to speed it up and also increase its scalability on Intel Xeon architectures, which could be done by enabling FP32 MKLDNN inference for paddle-deepmd.

However, when MKLDNN is turned on via config->EnableMKLDNN(), it crashes with an unregistered kernel error for the tanh_grad op.

[screenshot of the tanh_grad error message]

paddle-bot[bot] commented 2 years ago

Hi! We've received your issue; please be patient while waiting for a response. We will arrange for technicians to answer your question as soon as possible. Please make sure that you have posted enough information to demonstrate your request. You may also check out the API docs, FAQ, Github Issues and the AI community to get an answer. Have a nice day!

zhwesky2010 commented 2 years ago

@lidanqing-intel do we need to add tanh_grad in paddle/fluid/operators/mkldnn/activation_mkldnn_op.cc?

Aganlengzi commented 2 years ago

@lidanqing-intel do we need to add tanh_grad in paddle/fluid/operators/mkldnn/activation_mkldnn_op.cc?

In https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/phi/kernels/onednn/activation_grad_kernel.cc#L251 only bf16 and float32 are registered now.

tsocha commented 2 years ago

@zhouwei25 @Aganlengzi Hi, I tried to reproduce the problem from this issue, but after carefully following the instructions from https://github.com/lidanqing-intel/deep_md_test I got the following error:

I0921 17:59:09.871930 2304937 naive_executor.cc:110] ---  skip [feed], feed -> default_mesh
I0921 17:59:09.871953 2304937 naive_executor.cc:110] ---  skip [feed], feed -> box
I0921 17:59:09.871954 2304937 naive_executor.cc:110] ---  skip [feed], feed -> natoms_vec
I0921 17:59:09.871956 2304937 naive_executor.cc:110] ---  skip [feed], feed -> type
I0921 17:59:09.871959 2304937 naive_executor.cc:110] ---  skip [feed], feed -> coord
terminate called after throwing an instance of 'phi::enforce::EnforceNotMet'
  what():

--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
0   paddle_infer::CreatePredictor(paddle::AnalysisConfig const&)
1   paddle_infer::Predictor::Predictor(paddle::AnalysisConfig const&)
2   std::unique_ptr<paddle::PaddlePredictor, std::default_delete<paddle::PaddlePredictor> > paddle::CreatePaddlePredictor<paddle::AnalysisConfig, (paddle::PaddleEngineKind)2>(paddle::AnalysisConfig const&)
3   paddle::AnalysisPredictor::Init(std::shared_ptr<paddle::framework::Scope> const&, std::shared_ptr<paddle::framework::ProgramDesc> const&)
4   paddle::AnalysisPredictor::PrepareExecutor()
5   paddle::framework::NaiveExecutor::Prepare(paddle::framework::Scope*, paddle::framework::ProgramDesc const&, int, bool)
6   paddle::framework::NaiveExecutor::CreateOps(paddle::framework::ProgramDesc const&, int, bool)
7   paddle::framework::OpRegistry::CreateOp(paddle::framework::OpDesc const&)
8   paddle::framework::OpRegistry::CreateOp(std::string const&, std::map<std::string, std::vector<std::string, std::allocator<std::string > >, std::less<std::string >, std::allocator<std::pair<std::string const, std::vector<std::string, std::allocator<std::string > > > > > const&, std::map<std::string, std::vector<std::string, std::allocator<std::string > >, std::less<std::string >, std::allocator<std::pair<std::string const, std::vector<std::string, std::allocator<std::string > > > > > const&, paddle::framework::AttributeMap const&, bool)
9   paddle::framework::OpInfoMap::Get(std::string const&) const
10  phi::enforce::EnforceNotMet::EnforceNotMet(phi::ErrorSummary const&, char const*, int)
11  phi::enforce::GetCurrentTraceBackString[abi:cxx11](bool)

----------------------
Error Message Summary:
----------------------
NotFoundError: Operator (prod_env_mat_a) is not registered.
  [Hint: op_info_ptr should not be null.] (at /home/tsocha/dev/tmp/Paddle/paddle/fluid/framework/op_info.h:155)

Aborted (core dumped)

It seems that a custom operator (prod_env_mat_a) defined by deepmd is not visible in PaddlePaddle. Do you know how I can register it?

PS: I had to modify Danqing's repository to fix a compilation error and replace her paths with mine: https://github.com/tsocha/deep_md_test/commit/9eca57131f43078f1167270c160469e6bf86c2b8

DaisyXten commented 2 years ago

@tsocha Hi Tomasz, Michael contacted me and asked who is the right person to consult on this question.

@leeleolay is working on deepmd. In Q2, @leeleolay and I worked together to make native inference work. You can discuss this with him; he is still working on making MKLDNN inference work and is familiar with the details.

About the reproduction errors: that is strange, because native inference (without MKLDNN) worked for me. Could you confirm that you

  1. used exactly commit c91aaced74aa1a34c8bde2e53b3072baf8012e73?
  2. turned off config->EnableMKLDNN(), but still get the prod_env_mat_a not registered error?

tsocha commented 2 years ago

@leeleolay could you help me with my issue? https://github.com/PaddlePaddle/Paddle/issues/45058#issuecomment-1253914028

commit c91aaced74aa1a34c8bde2e53b3072baf8012e73 (HEAD -> walker)
Author: kuizhiqing <kuizhiqing@baidu.com>
Date:   Mon Aug 8 21:09:42 2022 +0800

    [LAUNCH] make launch Compatible (#44881)

    * make launch compatible

    * fix ut

    * fix log offset

I didn't change anything in your repository (except paths; I also removed the BOOST dependency, which was unused AFAIK). MKLDNN is not enabled yet.

leeleolay commented 2 years ago

@tsocha Hi tsocha, I think you can reproduce the previous work by following the repo link and the installation guide in the readme of this repo (https://github.com/X4Science/paddle-deepmd).

The custom operator (prod_env_mat_a) is located in the paddle-deepmd repo under /source/op/paddle_ops/srcs; please use 'python3 deepmd/load_paddle_op.py install' to install the custom op.

The repo https://github.com/lidanqing-intel/deep_md_test is a simple inference demo without the LAMMPS part; the 3 custom ops are located in the above-mentioned repo and need to be compiled. I suggest you install all of the software (Paddle, DeepMD-kit, LAMMPS) to prepare the whole environment. Then you can use https://github.com/lidanqing-intel/deep_md_test to fix the bug quickly.

tsocha commented 2 years ago

@tsocha Hi tsocha, I think you can reproduce the previous work by following the repo link and the installation guide in the readme of this repo (https://github.com/X4Science/paddle-deepmd).

The custom operator (prod_env_mat_a) is located in the paddle-deepmd repo under /source/op/paddle_ops/srcs; please use 'python3 deepmd/load_paddle_op.py install' to install the custom op.

The repo https://github.com/lidanqing-intel/deep_md_test is a simple inference demo without the LAMMPS part; the 3 custom ops are located in the above-mentioned repo and need to be compiled. I suggest you install all of the software (Paddle, DeepMD-kit, LAMMPS) to prepare the whole environment. Then you can use https://github.com/lidanqing-intel/deep_md_test to fix the bug quickly.

@leeleolay

I think I already followed these steps. Here is my history listing:

# initial work, prepare directory and virtualenv, cmake version 3.24.1 from pip
mkdir tmp
cd tmp/
virtualenv -p python3 .venv
source .venv/bin/activate
pip install cmake

# compile PaddlePaddle
git clone https://github.com/PaddlePaddle/Paddle.git
cd Paddle/
git checkout c91aaced
mkdir build
cd build/
cmake .. -DCMAKE_BUILD_TYPE=Debug -DWITH_GPU=OFF -DWITH_AVX=ON -DWITH_MKLDNN=ON -DON_INFER=ON -DWITH_TESTING=OFF -DWITH_INFERENCE_API_TEST=OFF -DWITH_NCCL=OFF -DWITH_PYTHON=OFF -DWITH_LITE=OFF -DWITH_ONNXRUNTIME=OFF -DWITH_XBYAK=OFF -DWITH_RCCL=OFF -DWITH_CRYPTO=OFF
make -j`nproc` all

# compile paddle-deepmd
cd /home/tsocha/dev/tmp/
export PADDLE_ROOT=/home/tsocha/dev/tmp/Paddle/build/paddle_inference_install_dir

mkdir deepmdroot
export DEEPMD_ROOT=/home/tsocha/dev/tmp/deepmdroot

git clone https://github.com/X4Science/paddle-deepmd.git
cd paddle-deepmd/source
mkdir build
cd build
cmake -DCMAKE_INSTALL_PREFIX=$DEEPMD_ROOT -DPADDLE_ROOT=$PADDLE_ROOT -DUSE_CUDA_TOOLKIT=FALSE -DFLOAT_PREC=low ..
make -j
make install
make lammps

# compile deep_md_test
cd /home/tsocha/dev/tmp/
git clone https://github.com/tsocha/deep_md_test
cd deep_md_test/
bash ./run.sh
./infer_test

The only difference from https://github.com/X4Science/paddle-deepmd is that I don't use Python, because I work on a C++ application. I can see that the custom op is compiled as a shared library: ./deepmdroot/lib/libpd_infer_custom_op.so

I have some questions:

  1. Can you see any mistake that I made?
  2. Is the Python installation important for the C++ runner?

leeleolay commented 2 years ago

For running the training part of deepmd, Python is needed. For running the inference part, C++ is sufficient. But the custom ops are installed with Python using the load_paddle_op.py script; the installation command is the last line of the training-part installation instructions in the readme.

So I think you may not have installed the custom ops yet.

I am not sure that custom ops compiled directly with C++ work well with Paddle.

tsocha commented 2 years ago

For running the training part of deepmd, Python is needed. For running the inference part, C++ is sufficient. But the custom ops are installed with Python using the load_paddle_op.py script; the installation command is the last line of the training-part installation instructions in the readme.

So I think you may not have installed the custom ops yet.

I am not sure that custom ops compiled directly with C++ work well with Paddle.

I see now, thanks!!!

tsocha commented 2 years ago

@leeleolay I have no problems with the custom op anymore, but it seems that I need to provide these files as input data:

"data_convert/coord.bin"       
"data_convert/type.bin"        
"data_convert/natoms_vec.bin"  
"data_convert/box.bin"         
"data_convert/default_mesh.bin"

Do you know where I can get these files?

leeleolay commented 2 years ago

@tsocha I uploaded some files to this repo; I think you can reference it: https://github.com/leeleolay/deep_md_test

tsocha commented 1 year ago

@leeleolay it seems that the problem with the missing tanh_grad operator does not appear on newer PaddlePaddle versions. I tested it on Paddle v2.3.2. Could you verify it?

Unfortunately I can see another problem on this version:

--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
0   paddle_infer::Predictor::Run()
1   paddle::AnalysisPredictor::ZeroCopyRun()
2   paddle::framework::NaiveExecutor::Run()
3   paddle::framework::OperatorBase::Run(paddle::framework::Scope const&, phi::Place const&)
4   paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, phi::Place const&) const
5   paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, phi::Place const&, paddle::framework::RuntimeContext*) const
6   paddle::framework::OperatorWithKernel::InnerGetExpectedKernelType(paddle::framework::ExecutionContext const&) const
7   paddle::operators::ShapeOp::GetExpectedKernelType(paddle::framework::ExecutionContext const&) const
8   paddle::framework::OperatorWithKernel::IndicateVarDataType(paddle::framework::ExecutionContext const&, std::string const&) const
9   phi::enforce::EnforceNotMet::EnforceNotMet(phi::ErrorSummary const&, char const*, int)
10  std::string phi::enforce::GetTraceBackString<std::string >(std::string&&, char const*, int)
11  phi::enforce::GetCurrentTraceBackString[abi:cxx11](bool)

----------------------
Error Message Summary:
----------------------
InvalidArgumentError: The Input Variable(Input) of (shape) Operator used to determine kernel data type is empty or not LoDTensor or SelectedRows or LoDTensorArray.
  [Hint: Expected data_type != dafault_data_type, but received data_type:-1 == dafault_data_type:-1.] (at /home/tsocha/dev/tmp/Paddle/paddle/fluid/framework/operator.cc:2189)
  [operator < shape > error]
Aborted (core dumped)

It's under investigation now.

tsocha commented 1 year ago

I was able to run this demo via oneDNN on SHA e1e0deed64; it is a little bit older than the SHA you used before. Please verify whether this SHA works for you. I will investigate why it doesn't work on your original SHA.

tsocha commented 1 year ago

The problem was caused by PR #44365, SHA: https://github.com/PaddlePaddle/Paddle/commit/2dfa88d2526953bec87507d87402c4038dc49259. The last working commit is from PR #44847, SHA: https://github.com/PaddlePaddle/Paddle/commit/f419e341 @YuanRisheng ☝
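
Narrowing a regression down to a single PR like this is the classic `git bisect run` workflow. A self-contained toy sketch (throwaway repo, illustrative file names, with the "test" being a simple grep for the regression marker):

```shell
# Build a throwaway repo with a known-bad commit, then let git bisect find it.
set -e
cd "$(mktemp -d)"
git init -q demo && cd demo
git config user.email "demo@example.com"
git config user.name "Demo"
echo ok > feature.txt && git add feature.txt && git commit -qm "c1: good"
echo ok >> feature.txt && git commit -qam "c2: still good"
echo broken >> feature.txt && git commit -qam "c3: regression introduced"
echo later >> feature.txt && git commit -qam "c4: later work"
git bisect start HEAD HEAD~3        # bad = HEAD, good = 3 commits back
# Exit 0 means good, non-zero means bad; here the "test" greps for the marker.
result=$(git bisect run sh -c '! grep -q broken feature.txt')
echo "$result" | grep "is the first bad commit"
```

git bisect then reports the "c3: regression introduced" commit as the first bad one; with a real build-and-run script in place of the grep, the same loop pins a Paddle regression to one commit.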

tsocha commented 1 year ago

@leeleolay I was able to run deep_md_demo on https://github.com/PaddlePaddle/Paddle/commit/2dfa88d2526953bec87507d87402c4038dc49259 and https://github.com/PaddlePaddle/Paddle/commit/c91aaced74aa1a34c8bde2e53b3072baf8012e73 after cherry-picking https://github.com/PaddlePaddle/Paddle/commit/23def39672. Unfortunately you can't use https://github.com/PaddlePaddle/Paddle/commit/23def39672 directly, because there is another problem, not connected to oneDNN:

--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
0   paddle_infer::Predictor::Run()
1   paddle::AnalysisPredictor::ZeroCopyRun()
2   paddle::framework::NaiveExecutor::Run()
3   paddle::framework::OperatorBase::Run(paddle::framework::Scope const&, phi::Place const&)
4   paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, phi::Place const&) const
5   paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, phi::Place const&, paddle::framework::RuntimeContext*) const
6   std::function<void (paddle::framework::ExecutionContext const&)>::operator()(paddle::framework::ExecutionContext const&) const
7   phi::enforce::EnforceNotMet::EnforceNotMet(phi::ErrorSummary const&, char const*, int)
8   phi::enforce::GetCurrentTraceBackString[abi:cxx11](bool)

----------------------
Error Message Summary:
----------------------
ExternalError: number of samples should match
  [/home/tsocha/dev/tmp/paddle-deepmd/source/op/paddle_ops/srcs/pd_prod_force_se_a_multi_devices_cpu.cc:70] (at /home/tsocha/dev/tmp/Paddle/paddle/fluid/framework/custom_operator.cc:302)
  [operator < prod_force_se_a > error]

The above error is thrown with oneDNN disabled as well.

leeleolay commented 1 year ago

This weekend I will test the version of Paddle you pointed out with deepmd and verify it. I will reply with the result next week. Thanks for your debugging.

leeleolay commented 1 year ago

@tsocha I have verified the demo with Paddle SHA e1e0deed64; it works well. Thanks for your effort. From the issue you mentioned here, I guess the reason is the migration of the MKLDNN activation kernel from Fluid to PHI?

tsocha commented 1 year ago

This issue is already fixed on develop (one pass was broken): https://github.com/PaddlePaddle/Paddle/issues/45058#issuecomment-1266617198. This issue is probably caused by the demo or the deep_md data (I'm not sure about that): https://github.com/PaddlePaddle/Paddle/issues/45058#issuecomment-1269876611

The original one was caused by two options:

Can we close this issue?

Aganlengzi commented 1 year ago

I am closing this issue as fixed, as discussed above; feel free to reopen if necessary.

leeleolay commented 1 year ago

I tested commit https://github.com/PaddlePaddle/Paddle/commit/23def39672 and it doesn't work. The error reported is listed here:

--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
0   paddle_infer::Predictor::Run()
1   paddle::AnalysisPredictor::ZeroCopyRun()
2   paddle::framework::NaiveExecutor::Run()
3   paddle::framework::OperatorBase::Run(paddle::framework::Scope const&, phi::Place const&)
4   paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, phi::Place const&) const
5   paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, phi::Place const&, paddle::framework::RuntimeContext*) const
6   std::function<void (paddle::framework::ExecutionContext const&)>::operator()(paddle::framework::ExecutionContext const&) const
7   phi::enforce::EnforceNotMet::EnforceNotMet(phi::ErrorSummary const&, char const*, int)
8   phi::enforce::GetCurrentTraceBackString[abi:cxx11](bool)

----------------------
Error Message Summary:
----------------------
ExternalError: function "pd_prod_env_mat_a_cpu_forward_kernel" is not implemented for data type `float32`
  [/home/paddle-deepmd/source/op/paddle_ops/srcs/pd_prod_env_mat_multi_devices_cpu.cc:269] (at /home/Paddle/paddle/fluid/framework/custom_operator.cc:302)
  [operator < prod_env_mat_a > error]
Aborted (core dumped)

So, I used Paddle SHA https://github.com/PaddlePaddle/Paddle/commit/e1e0deed64bd879357b9fc28ff68770f8eae87a6 to compile DeepMD-kit and LAMMPS, and I revised the code in the init function of paddle-deepmd/source/api_cc/src/PaddleDeepPot.cc to enable oneDNN. The pipeline of this software compiles smoothly, but the result of inference using LAMMPS is wrong. I think the reason may relate to the format of the data interface when I use oneDNN.

yaomichael commented 1 year ago

reopen it as there is still something to clarify

yaomichael commented 1 year ago

I tested commit 23def39672 and it doesn't work. The error reported is listed here:

--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
0   paddle_infer::Predictor::Run()
1   paddle::AnalysisPredictor::ZeroCopyRun()
2   paddle::framework::NaiveExecutor::Run()
3   paddle::framework::OperatorBase::Run(paddle::framework::Scope const&, phi::Place const&)
4   paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, phi::Place const&) const
5   paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, phi::Place const&, paddle::framework::RuntimeContext*) const
6   std::function<void (paddle::framework::ExecutionContext const&)>::operator()(paddle::framework::ExecutionContext const&) const
7   phi::enforce::EnforceNotMet::EnforceNotMet(phi::ErrorSummary const&, char const*, int)
8   phi::enforce::GetCurrentTraceBackString[abi:cxx11](bool)

----------------------
Error Message Summary:
----------------------
ExternalError: function "pd_prod_env_mat_a_cpu_forward_kernel" is not implemented for data type `float32`
  [/home/paddle-deepmd/source/op/paddle_ops/srcs/pd_prod_env_mat_multi_devices_cpu.cc:269] (at /home/Paddle/paddle/fluid/framework/custom_operator.cc:302)
  [operator < prod_env_mat_a > error]
Aborted (core dumped)

So, I used Paddle SHA e1e0dee to compile DeepMD-kit and LAMMPS, and I revised the code in the init function of paddle-deepmd/source/api_cc/src/PaddleDeepPot.cc to enable oneDNN. The pipeline of this software compiles smoothly, but the result of inference using LAMMPS is wrong. I think the reason may relate to the format of the data interface when I use oneDNN.

On behalf of @tsocha: it seems that the newest issue is not caused by Paddle but by the custom operator registered by deep_md. prod_env_mat_a is the name of this custom operator; it seems that it was registered, but no kernel is available.

It's weird that it works on an older SHA; I think something changed in the op registration mechanism.

@leeleolay can you try to run this demo on the current develop branch? Maybe this issue is already resolved.

leeleolay commented 1 year ago

The target of this work is to use multi-threading with Paddle and oneDNN to speed up the inference process.

The inference pipeline combines Paddle, DeepMD-kit (training and inference; inference is one step of the molecular dynamics simulation), and LAMMPS (molecular dynamics simulation). The deep_md_test repo is only used to test whether Paddle works well.

I used Paddle SHA e1e0dee to compile DeepMD-kit and LAMMPS; the installation of this pipeline is described here: https://github.com/X4Science/paddle-deepmd/blob/paddle_progress/README.md (you can compile only the inference part of DeepMD-kit). Please revise the code in the init function of paddle-deepmd/source/api_cc/src/PaddleDeepPot.cc to activate oneDNN and then compile DeepMD-kit if you want to use oneDNN. If you want to use LAMMPS to run a molecular dynamics simulation task with DeepMD-kit inference, please refer to the readme above.

But LAMMPS can't work well, due to the accuracy of DeepMD-kit inference on Paddle SHA e1e0dee with oneDNN activated. The following screenshots show the result of the molecular dynamics simulation of the whole system (containing many atoms; DeepMD-kit gives the inference result per atom):

[screenshots of the incorrect LAMMPS result with oneDNN enabled]

The result of LAMMPS with normal accuracy is listed here (Paddle SHA eca6638c599591c69fe40aa196f5fd42db7efbe2, without oneDNN):

Step PotEng KinEng TotEng Temp Press Volume 
0    -29944.14    8.1472669   -29935.993          330    8458.1696    1927.3176 
1   -29944.047    8.0544381   -29935.993    326.24002    9013.8328    1927.3176 
2   -29943.947    7.9562334   -29935.991     322.2623    9216.2879    1927.3176 
3   -29943.856    7.8668405   -29935.989     318.6415    9518.9763    1927.3176 
4   -29943.785      7.79658   -29935.989    315.79564    9360.9658    1927.3176 
5    -29943.74    7.7511252   -29935.989    313.95452    8936.0121    1927.3176 
6   -29943.721    7.7320222   -29935.989    313.18077    8286.8337    1927.3176 
7   -29943.727    7.7362499   -29935.991    313.35201    7551.3219    1927.3176 
8   -29943.746    7.7561079    -29935.99    314.15635    6581.1019    1927.3176 
9   -29943.772    7.7805873   -29935.991    315.14787    5772.1387    1927.3176 

(Illustration of the LAMMPS result: PotEng (potential energy) and Press are really different in the two cases for the whole system (the collection of atoms); therefore, the energy and force for a single atom are also really different in the two cases. Note that DeepMD-kit infers the energy and force per atom.) Because the result is wrong with oneDNN activated, I guess the reason is related to a oneDNN op. The source code of the neural network is in the DeepMD-kit part.

The performance of LAMMPS inference with TensorFlow and with Paddle on MKL (without oneDNN), for single-core and multi-process runs, is listed in https://github.com/X4Science/paddle-deepmd/blob/paddle_progress/README.md (using the mpirun command to execute LAMMPS; LAMMPS is designed for parallel computing with OpenMPI or MPICH, and I used OpenMPI).

yaomichael commented 1 year ago

I closed #45058 as the original bug was root-caused and fixed. Instead, I created two issues to track the remaining problems.