PaddlePaddle / Paddle

PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (the PaddlePaddle ("飞桨") core framework: high-performance single-machine and distributed training and cross-platform deployment for deep learning and machine learning)
http://www.paddlepaddle.org/
Apache License 2.0

Accuracy is wrong when inference deepmd with oneDNN #47357

Closed yaomichael closed 1 year ago

yaomichael commented 2 years ago

Bug Description

This issue is created as a follow-up of #45058, where the original error was root-caused and fixed. This issue assumes inference can run (currently true only at certain commits, e.g. SHA e1e0dee); the resulting accuracy is abnormal.

below is copied from @leeleolay 's last comment at #45058

The goal of this work is to use multiple threads in Paddle with oneDNN to speed up inference.

The inference pipeline combines Paddle, DeepMD-kit (training and inference; one step of a molecular dynamics simulation), and LAMMPS (molecular dynamics simulation). The deep_md_test repo is only used to test whether Paddle works correctly.

I used Paddle SHA e1e0dee to compile DeepMD-kit and LAMMPS; the installation of this pipeline is described here: https://github.com/X4Science/paddle-deepmd/blob/paddle_progress/README.md (you only need to compile the inference part of DeepMD-kit). If you want to use oneDNN, please revise the code in the init function of paddle-deepmd/source/api_cc/src/PaddleDeepPot.cc to activate oneDNN, and then compile DeepMD-kit. If you want to use LAMMPS to run a molecular dynamics simulation with DeepMD-kit inference, please refer to the README file above.

But LAMMPS does not work well, due to the inference accuracy of DeepMD-kit based on Paddle SHA e1e0dee with oneDNN activated. The following screenshots show the result of the molecular dynamics simulation of the whole system (containing many atoms; DeepMD-kit provides the inference result for one atom):

[screenshots of the abnormal LAMMPS output with oneDNN enabled]

For comparison, the following LAMMPS result with normal accuracy was produced with Paddle SHA eca6638c599591c69fe40aa196f5fd42db7efbe2, without oneDNN:

Step  PotEng      KinEng     TotEng      Temp       Press      Volume
0     -29944.14   8.1472669  -29935.993  330        8458.1696  1927.3176
1     -29944.047  8.0544381  -29935.993  326.24002  9013.8328  1927.3176
2     -29943.947  7.9562334  -29935.991  322.2623   9216.2879  1927.3176
3     -29943.856  7.8668405  -29935.989  318.6415   9518.9763  1927.3176
4     -29943.785  7.79658    -29935.989  315.79564  9360.9658  1927.3176
5     -29943.74   7.7511252  -29935.989  313.95452  8936.0121  1927.3176
6     -29943.721  7.7320222  -29935.989  313.18077  8286.8337  1927.3176
7     -29943.727  7.7362499  -29935.991  313.35201  7551.3219  1927.3176
8     -29943.746  7.7561079  -29935.99   314.15635  6581.1019  1927.3176
9     -29943.772  7.7805873  -29935.991  315.14787  5772.1387  1927.3176

Interpretation of the LAMMPS results: PotEng (potential energy) and Press differ greatly between the two cases for the whole system (the collection of atoms); therefore the energy and force for a single atom also differ greatly (DeepMD-kit infers the energy and force per atom). Because the result is wrong only when oneDNN is activated, I guess the cause is related to a oneDNN op. The source code of the neural network lives in the DeepMD-kit part.

The inference performance of LAMMPS with TensorFlow, and with Paddle based on MKL without oneDNN, for a single core and for multiple processes, is listed in https://github.com/X4Science/paddle-deepmd/blob/paddle_progress/README.md (LAMMPS is executed with the mpirun command; LAMMPS is designed for parallel computing with OpenMPI or MPICH, and I used OpenMPI).

Additional Supplementary Information

No response

tsocha commented 2 years ago

@leeleolay Could you test if this patch works for you? I'm still working on this issue but I want to confirm that this pass is a possible culprit here.

diff --git a/paddle/fluid/inference/api/paddle_pass_builder.cc b/paddle/fluid/inference/api/paddle_pass_builder.cc
index 6119714c38..1bb6389b97 100644
--- a/paddle/fluid/inference/api/paddle_pass_builder.cc
+++ b/paddle/fluid/inference/api/paddle_pass_builder.cc
@@ -312,7 +312,7 @@ void CpuPassStrategy::EnableMKLDNN() {
              "batch_norm_act_fuse_pass",              //
              "softplus_activation_mkldnn_fuse_pass",  //
              "shuffle_channel_mkldnn_detect_pass",    //
-             "elt_act_mkldnn_fuse_pass",              //
+             // "elt_act_mkldnn_fuse_pass",              //
              // TODO(intel): Please fix the bug on windows.
              // https://github.com/PaddlePaddle/Paddle/issues/29710
              // "mkldnn_inplace_pass",  // This pass should be activated after
tsocha commented 2 years ago

It seems that the deep_md model is still an fp64 model; it was not converted to fp32. [screenshot of the model graph] oneDNN does not support this data type on CPU. I discovered that the pass mentioned above (elt_act_mkldnn_fuse_pass) incorrectly fuses an operator which is then executed by PaddlePaddle's reference kernel. Logs:

I1027 01:38:58.053917 338935 operator.cc:1842] op type:elementwise_add, expected_kernel_key:{data_type[double]; data_layout[Undefined(AnyLayout)]; place[Place(cpu)]; library_type[PLAIN]}
I1027 01:38:58.056318 338935 operator.cc:276] Place(cpu) Op(elementwise_add), inputs:{X[matmul_0.tmp_0:double[2944, 25]({})(Place(cpu))], Y[filter_type_00_0.b_0:double[1, 25]({})(Place(cpu))]}, outputs:{Out[tanh_0.tmp_0:double[2944, 25]({})(Place(cpu))]}.

This behavior leads to incorrect result -> bad accuracy.

I will work on fixing this accuracy error, but until the model is properly converted to fp32 there will be no performance improvement from oneDNN, because these ops run on reference kernels.
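The failure mode can be sketched in a few lines (hypothetical names, not Paddle's real pass code): the buggy fuse decision ignores the data type, even though oneDNN only provides fp32 kernels on this CPU path, so the fused fp64 op ends up mis-executed.

```python
# Illustrative sketch only, not Paddle's actual pass code.
# oneDNN CPU kernels here exist only for fp32; fusing an fp64
# elementwise_add therefore produces an op that no oneDNN kernel can
# execute, and it falls through to the reference kernel incorrectly.

ONEDNN_FP_TYPES = {"float32"}  # fp64 is not supported by oneDNN on CPU

def buggy_should_fuse(op_type: str, dtype: str) -> bool:
    # The broken pass ignores dtype entirely.
    return op_type == "elementwise_add"

def fixed_should_fuse(op_type: str, dtype: str) -> bool:
    # The corrected pass fuses only when oneDNN can run the result.
    return op_type == "elementwise_add" and dtype in ONEDNN_FP_TYPES

print(buggy_should_fuse("elementwise_add", "float64"))  # True  (wrong)
print(fixed_should_fuse("elementwise_add", "float64"))  # False (correct)
```

With the dtype check in place, the fp64 elementwise_add from the logs above would simply be left unfused on the reference kernel, which is slow but numerically correct.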

yaomichael commented 1 year ago

Not sure why the bot closed it. I am reopening it, since the issue is NOT resolved yet.

leeleolay commented 1 year ago

@tsocha I checked the model; it was compiled with the low-precision setting. Most ops use float32, but some ops in the middle of the process use float64. Could this be because Paddle was compiled with double precision? Could you give me some more specific instructions? Also, do you think this choice of op data type belongs to Paddle or to DeepMD-kit?

tsocha commented 1 year ago

> @tsocha I checked the model; it was compiled with the low-precision setting. Most ops use float32, but some ops in the middle of the process use float64. Could this be because Paddle was compiled with double precision? Could you give me some more specific instructions?

I just opened this model in the Netron tool: https://github.com/tsocha/deep_md_test/blob/master/model.ckpt/model.pdmodel

bin1guo commented 1 year ago

> elt_act_mkldnn_fuse_pass

@tsocha is there some progress on this pass accuracy issue? thx

tsocha commented 1 year ago

Yes, we are currently working on it. We have already found the reason why this pass didn't work here: it seems that mkldnn_placement_pass is broken. A fix is in progress.

bin1guo commented 1 year ago

> Yes, we are currently working on it. We have already found the reason why this pass didn't work here: it seems that mkldnn_placement_pass is broken. A fix is in progress.

Got it, that is good news. BTW, can it be fixed next week? Thank you.

yaomichael commented 1 year ago

Let me try to summarize the problem:

The accuracy issue is caused by a bug in mkldnn_placement_pass. It causes problems in elt_act_mkldnn_fuse_pass, which mandates fp32 and oneDNN, while the input is float64, which oneDNN does not support.

The easiest fix is to disable elt_act_mkldnn_fuse_pass, as suggested in https://github.com/PaddlePaddle/Paddle/issues/47357#issuecomment-1293407909; the result is that oneDNN will not be used, and performance stays at the baseline (Paddle's reference implementation).

A logically better fix is to correct mkldnn_placement_pass, which @tsocha is working on. But again, most computation will run on the Paddle reference implementation, so there is no performance improvement; for this model it gives the same outcome as the easy workaround (accuracy fixed, performance unchanged). This fix is, however, expected to benefit other models.

If we want a performance boost, the model must run inference on oneDNN, which means the model data type must be fp32 instead of fp64. This can be done only by Baidu.
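To illustrate what the fp64 to fp32 conversion costs in precision, here is a standard-library-only sketch of the round-trip it implies (illustrative only; this is not the actual model conversion tooling):

```python
# Stdlib-only sketch of an fp64 -> fp32 round-trip (illustrative only).
# struct packs an IEEE-754 single, rounding the double to the nearest
# representable fp32 value.
import struct

def to_fp32(x: float) -> float:
    """Round a Python float (IEEE-754 double) to the nearest fp32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

e = -29944.14  # a PotEng value from the table above
e32 = to_fp32(e)
print(e32 - e)  # a tiny rounding error; only ~7 decimal digits survive
```

For values of this magnitude the rounding error is well below 0.01, which is far smaller than the discrepancy seen in the oneDNN runs, so fp32 precision itself should be adequate for this model.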

tsocha commented 1 year ago

> Yes, we are currently working on it. We have already found the reason why this pass didn't work here: it seems that mkldnn_placement_pass is broken. A fix is in progress.
>
> Got it, that is good news. BTW, can it be fixed next week? Thank you.

That's our plan.

leeleolay commented 1 year ago

I uploaded a float32 model to this repo; please check it: https://github.com/leeleolay/deep_md_test/tree/main/modelfp32.ckpt. The result of running LAMMPS looks normal, but LAMMPS terminates after running 10 steps. The error info is in the following attachment, and it looks related to a Paddle op: error.txt

bin1guo commented 1 year ago

@tsocha can you help check whether this float32 model is okay in LAMMPS? thanks. BTW, are there any updates about mkldnn_placement_pass?

tsocha commented 1 year ago

@leeleolay I have checked the error logs. It looks like a new shape calculated in the graph is invalid:

[1, 5454] vs [-1, 5442]
[1, 5385] vs [-1, 5379]

At first glance the main suspect is a numerical problem after scaling a dimension.
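The failing check can be mimicked with a small helper (hypothetical code mirroring the usual -1-as-wildcard convention; not Paddle's actual shape-inference code):

```python
# Hypothetical shape-compatibility check mirroring the error above:
# -1 is a wildcard dimension; every other dimension must match exactly.

def shapes_compatible(expected, actual):
    if len(expected) != len(actual):
        return False
    return all(e == -1 or a == -1 or e == a
               for e, a in zip(expected, actual))

print(shapes_compatible([-1, 5442], [1, 5454]))  # False: 5442 != 5454
print(shapes_compatible([-1, 5442], [1, 5442]))  # True
```

Under this rule, [1, 5454] vs [-1, 5442] fails on the second dimension, which matches the log: the wildcard batch dimension is fine, but the scaled dimension drifted by a few elements.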

Does LAMMPS crash when oneDNN is disabled as well?

@bin1guo Regarding mkldnn_placement_pass: @wozna is working on that; she can give you more information soon.

It's hard for me to help with checking LAMMPS at the moment because I'm on parental leave. Someone could back me up, but you would need to discuss priorities with @yaomichael.

wozna commented 1 year ago

Yes, that's right. I'm working on a task to add checks for registered kernels with a specific data type; a draft is here: https://github.com/PaddlePaddle/Paddle/pull/48150. I'm still consulting on some things, but this change should be ready soon.
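The idea of that check can be sketched as follows (names and the registry are illustrative, not Paddle's actual kernel registry): the placement pass should mark an op for oneDNN only when a oneDNN kernel is registered for both the op type and its input data type.

```python
# Illustrative sketch of the placement check (hypothetical registry,
# not Paddle's real one): mark an op for oneDNN only when a oneDNN
# kernel is registered for that op type *and* its input data type.

REGISTERED_ONEDNN = {
    "elementwise_add": {"float32", "bfloat16"},
    "matmul": {"float32"},
}

def place_on_onednn(op_type: str, dtype: str) -> bool:
    return dtype in REGISTERED_ONEDNN.get(op_type, set())

print(place_on_onednn("elementwise_add", "float32"))  # True
print(place_on_onednn("elementwise_add", "float64"))  # False -> stays on reference kernel
```

With such a check, the fp64 ops in this model would never be marked for oneDNN in the first place, and the downstream fuse passes would have nothing to fuse incorrectly.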

leeleolay commented 1 year ago

The LAMMPS works well when oneDNN is disabled.

bin1guo commented 1 year ago

thanks, @wozna, is there any update for this issue?

bin1guo commented 1 year ago

@tsocha Great news, Congratulations~

yaomichael commented 1 year ago

@leeleolay as summarized in https://github.com/PaddlePaddle/Paddle/issues/47357#issuecomment-1302879996

Any of these options will work:

1. Disable oneDNN: you confirmed it works.
2. Fix mkldnn_placement_pass: no direct help for this issue if option 1 is in place.
3. Convert the model data type to fp32 (from fp64).

Are you okay to close this issue?

@onecatcn