PaddlePaddle / Paddle

PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (the PaddlePaddle ("Flying Paddle") core framework: high-performance single-machine and distributed training and cross-platform deployment for deep learning and machine learning)
http://www.paddlepaddle.org/
Apache License 2.0

MKL-DNN 1.0 Update #20162

Closed · grygielski closed this 4 years ago

grygielski commented 4 years ago

This is the current version of https://github.com/PaddlePaddle/Paddle/pull/19510, rebased onto the latest develop branch. @luotao1 please take a look and ask questions if there are any.

luotao1 commented 4 years ago
  1. Please do not use const_cast; you can delete it.
    [07:48:23]  [Step 1/1] 3 . You must have one RD (XiaoguangHu01,chengduoZH,Xreki,luotao1,sneaxiy,tensor-tang) approval for the usage (either add or delete) of const_cast.
  2. Is the timeout due to the update to 1.0?
    [13:53:13] :     [Step 1/1]  68/183 Test #169: test_analyzer_int8_vgg16 .....................***Timeout 604.80 sec
    [13:53:13] :     [Step 1/1] [==========] Running 1 test from 1 test case.
    [13:53:13] :     [Step 1/1] [----------] Global test environment set-up.
  3. Will you make the PR smaller for merging?
grygielski commented 4 years ago
  1. I didn't add any const_cast; I only deleted some of them. Can old ones that are visible in the git diff make this check fail?
  2. In MKL-DNN 1.0 the MKL-based implementations were removed, so int8 models run really slowly on AVX2 machines. We are discussing possible solutions to this problem with @bingyanghuang.
  3. Unfortunately I cannot; all the changes depend on each other, since the whole MKL-DNN execution API changed.
luotao1 commented 4 years ago

Can old ones that are visible in the git diff make this check fail?

Yes, it can. You can update the old ones.

grygielski commented 4 years ago

There is one problem here: I can't delete the old const_casts in elementwise_mul_mkldnn_op without rewriting a lot of code. Can we update them in a future refactoring after this PR is merged? Another thing: my change is similar to yours in https://github.com/PaddlePaddle/Paddle/pull/19643 (vectorization changes that make const_cast visible in the git diff), yet the check fails for me.

luotao1 commented 4 years ago

Can we update them in a future refactoring after this PR is merged?

It's OK.

grygielski commented 4 years ago

@luotao1 Could you please tell me whether there is anything more I have to do before we can merge this PR (the int8 test timeouts probably have to be fixed)? It is also important for me to know how you see the process of merging this PR.

grygielski commented 4 years ago

@luotao1 Could you provide us with logs from a test, e.g. test_analyzer_int8_resnet101, run on the PR_CI_Coverage machine with the MKLDNN_VERBOSE=2 environment variable set? This test takes an unexpectedly long time to execute and I want to see which versions of the MKL-DNN primitives are created. It would be very helpful for debugging this issue.

luotao1 commented 4 years ago

@grygielski How about setting MKLDNN_VERBOSE=2 in the tests of this PR? Then you can see it in the CI output log.
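
(For reference, running one of these tests locally with the verbose log enabled could look roughly like the sketch below; the test name is taken from the CI log above, and the ctest invocation and paths are assumptions:)

# sketch: run a single inference test with MKL-DNN primitive tracing enabled
cd build
MKLDNN_VERBOSE=2 ctest -R test_analyzer_int8_resnet101 -V 2>&1 | tee verbose.log
# each executed primitive is then printed on a line starting with "mkldnn_verbose", e.g.
#   mkldnn_verbose,exec,convolution,jit_int8_1x1:avx512_core_vnni,...
grep mkldnn_verbose verbose.log | head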

grygielski commented 4 years ago

I've tried adding it as an environment variable in a custom CI build, but nothing changed in the output log.

luotao1 commented 4 years ago

I've tried adding it as an environment variable in a custom CI build

Could you show me the commit?

[10:42:55]  I1103 10:34:43.416793 116252 mkldnn_quantizer.cc:434] == optimize 2 end ==
[10:42:55]  I1103 10:34:43.420614 116252 tester_helper.h:353] Thread 0, number of threads 1, run 1 times...
[10:42:55]  W1103 10:34:43.436996 116252 naive_executor.cc:43] The NaiveExecutor can not work properly if the cmake flag ON_INFER is not set.
[10:42:55]  W1103 10:34:43.437024 116252 naive_executor.cc:45] Unlike the training phase, all the scopes and variables will be reused to save the allocation overhead.
[10:42:55]  W1103 10:34:43.437029 116252 naive_executor.cc:48] Please re-compile the inference library by setting the cmake flag ON_INFER=ON if you are running Paddle Inference
[10:42:55]  W1103 10:40:00.620398 116252 naive_executor.cc:43] The NaiveExecutor can not work properly if the cmake flag ON_INFER is not set.
[10:42:55]  W1103 10:40:00.620436 116252 naive_executor.cc:45] Unlike the training phase, all the scopes and variables will be reused to save the allocation overhead.
[10:42:55]  W1103 10:40:00.620441 116252 naive_executor.cc:48] Please re-compile the inference library by setting the cmake flag ON_INFER=ON if you are running Paddle Inference

http://ci.paddlepaddle.org/viewLog.html?buildId=211675&tab=buildLog&buildTypeId=Paddle_PrCiCoverage&logTab=tree&filter=all&_focus=13813

I guess the reason there is no MKLDNN_VERBOSE log is that the unit test hangs. @jianhang-liu @LeoZhao-Intel This looks similar to the problem fixed in #18575.

grygielski commented 4 years ago

I didn't add it in the code but in the TeamCity build settings; it is actually set in the most recent CI_Coverage build. [screenshot] To me it looks like these tests are just slow, not stuck: for example, int8_resnet50 passes, but with a long execution time. I need these logs to debug that.

grygielski commented 4 years ago

@luotao1 @bingyanghuang Recent conclusions: the CI machine has some problem with its AVX512 flag settings, since MKL-DNN 1.0 picks the slower igemm implementation instead of avx512_vnni.
CI machine: convolution,igemm_s8u8s32:jit
Our 6248 machine: convolution,jit_int8_1x1:avx512_core_vnni
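
(Whether a machine actually exposes the VNNI instructions can be checked roughly as sketched below; a small sketch, using the flag name Linux reports:)

# list the AVX512 feature flags the kernel reports for this CPU (a VM may hide some of them)
grep -o 'avx512[a-z_]*' /proc/cpuinfo | sort -u
# on a VNNI-capable machine the list should include avx512_vnni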

bingyanghuang commented 4 years ago

@luotao1 @bingyanghuang Recent conclusions: the CI machine has some problem with its AVX512 flag settings, since MKL-DNN 1.0 picks the slower igemm implementation instead of avx512_vnni. CI machine: convolution,igemm_s8u8s32:jit Our 6248 machine: convolution,jit_int8_1x1:avx512_core_vnni

What can we do to fix these AVX512 flags?

luotao1 commented 4 years ago

We want to move PR_CI_Coverage to the new 5117 machines, but some unit tests fail on the new machines (see #19505).

The new 5117 machine is a physical machine (it currently runs PR_CI_python35), while the old machine is a virtual machine.

@bingyanghuang Could you check whether #19505 is solved now?

bingyanghuang commented 4 years ago

We want to move PR_CI_Coverage to the new 5117 machines, but some unit tests fail on the new machines (see #19505).

The new 5117 machine is a physical machine (it currently runs PR_CI_python35), while the old machine is a virtual machine.

@bingyanghuang Could you check whether #19505 is solved now?

#19505 has already been temporarily fixed by #20041; the formal fix has not been submitted yet.

luotao1 commented 4 years ago
  1. #19505 is fixed.

  2. However, we cannot move PR_CI_Coverage to the new physical 5117 machines for the following reasons:

    • We have only 2 physical 5117 machines now, which is too few to run PR_CI_Coverage.
    • We plan to set up a new CI system with 15 physical 6148 machines soon.
    • The earliest test run would be next Friday, but the deployment date of the new CI system is not certain yet.
  3. Thus, after discussing with @bingyanghuang, we have some suggestions:

grygielski commented 4 years ago

These tests would take about 15 minutes each, hence I decided to disable them for now.

grygielski commented 4 years ago

@luotao1 I did it more or less automatically after upgrading the MKL-DNN commit to 1.0 and getting the error "libmkldnn.so.1 not found". As far as I understand, the version suffix is there to distinguish library versions on a system, and since MKL-DNN 1.0 is incompatible with v0.20, a new version of the .so library was made. @jczaja could you confirm that?

jczaja commented 4 years ago

@luotao1 When MKL-DNN is built, it sets the name of its output library, and that name includes the major version number of MKL-DNN. As we are moving from 0.20 to 1.0, libmkldnn.so.0 is automatically replaced by libmkldnn.so.1. The thing is, customer apps should link neither to libmkldnn.so.0 nor to libmkldnn.so.1. Customer apps should link against libmkldnn.so, which is a symbolic link to libmkldnn.so.0 or libmkldnn.so.1.
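
(Illustratively, the layout produced by an MKL-DNN 1.0 build then looks roughly like the sketch below; the install path is an assumption based on the ldd output later in this thread:)

ls -l third_party/install/mkldnn/lib
# libmkldnn.so   -> libmkldnn.so.1   link-time name; this is what customer apps should link against
# libmkldnn.so.1                     versioned library the dynamic loader resolves at run time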

What is your use case? E.g. are you expecting already-built applications (built against PaddlePaddle with MKL-DNN 0.20) to run without rebuilding against PaddlePaddle with MKL-DNN 1.0? Or are you rebuilding your Paddle-based projects for the new PaddlePaddle (with MKL-DNN 1.0)? An example CMakeLists.txt of a typical customer app would also be useful.

LeoZhao-Habana commented 4 years ago

After talking with LuoTao: the problem is that in a lot of inference apps the makefiles were written this way (linking against libmkldnn.so.0); it is a historical issue. If the MKL-DNN lib is changed to libmkldnn.so.1, then all of those inference builds will fail.

libmkldnn.so.0 is a symlink in the MKL-DNN build dir, while in the Paddle install dir libmkldnn.so.0 is a real file instead of a link (I guess maybe a cp command issue).

[screenshot]
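
(A plausible explanation, sketched below; the actual copy step in Paddle's build scripts is not shown here, so the paths and commands are assumptions:)

# a plain `cp` follows the symlink and writes a regular file at the destination,
# which would explain the real libmkldnn.so.0 in the install dir
cp    mkldnn_build/lib/libmkldnn.so.0  third_party/install/mkldnn/lib/
# preserving the link would require copying it as a link instead, e.g.
cp -P mkldnn_build/lib/libmkldnn.so.0  third_party/install/mkldnn/lib/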

jczaja commented 4 years ago

@luotao1 Could you please tell me why customer apps are actually linking to MKL-DNN? I looked into demo_ci and there is no direct usage of MKL-DNN in the C++ code. Shouldn't customer apps link to Paddle only, leaving it up to Paddle which MKL-DNN is used? I understand it would be too much effort to change this now, but perhaps for future apps a modification of CMakeLists.txt would be achievable, so that customer apps do not link to the MKL-DNN lib unless they use it directly?

grygielski commented 4 years ago

@luotao1 I also want to point out a future problem with this. As of MKL-DNN 1.1 the library has been renamed to DNNL, which results in the .so file being named libdnnl.so.1 instead of libmkldnn.so.1. If it would be better to update to 1.1 in this PR instead of in two separate PRs, so that we face this problem only once, I can do it. I will discuss possible solutions offline with @jczaja and post the conclusions here.

luotao1 commented 4 years ago

If it would be better to update to 1.1 in this PR instead of in two separate PRs, so that we face this problem only once, I can do it.

@grygielski Do you mean fixing the libmkldnn.so.1 name issue in the next PR and merging this PR first? Sorry, we can't merge it first, since all the inference tests in CE (continuous evaluation) would fail tomorrow.

luotao1 commented 4 years ago

Shouldn't customer apps link to Paddle only, leaving it up to Paddle which MKL-DNN is used?

Although some of the customer apps do not use MKL-DNN, libpaddle_fluid.so links to libmkldnn.so.0. If we don't add libmkldnn.so.0 to CMakeLists.txt, the build fails. @jczaja

$ ldd libpaddle_fluid.so
    linux-vdso.so.1 (0x00007ffed4574000)
    libpthread.so.0 => /opt/compiler/gcc-4.8.2/lib/libpthread.so.0 (0x00007fe3ee6c2000)
    libiomp5.so => /home/luotao/Paddle/cpu_build/third_party/install/mklml/lib/libiomp5.so (0x00007fe3ee2e6000)
    libmkldnn.so.0 => /home/luotao/Paddle/cpu_build/third_party/install/mkldnn/lib64/libmkldnn.so.0 (0x00007fe3ed715000)
    libstdc++.so.6 => /opt/compiler/gcc-4.8.2/lib/libstdc++.so.6 (0x00007fe3ed411000)
    libm.so.6 => /opt/compiler/gcc-4.8.2/lib/libm.so.6 (0x00007fe3ed10e000)
    libgcc_s.so.1 => /opt/compiler/gcc-4.8.2/lib/libgcc_s.so.1 (0x00007fe3ecef8000)
    libc.so.6 => /opt/compiler/gcc-4.8.2/lib/libc.so.6 (0x00007fe3ecb4a000)
    /opt/compiler/gcc-4.8.2/lib64/ld-linux-x86-64.so.2 (0x00007fe3f109f000)
    libdl.so.2 => /opt/compiler/gcc-4.8.2/lib/libdl.so.2 (0x00007fe3ec946000)
    libmklml_intel.so => /home/luotao/Paddle/cpu_build/third_party/install/mklml/lib/libmklml_intel.so (0x00007fe3e49c9000)
grygielski commented 4 years ago

@luotao1 So you are saying that user apps link to libpaddle_fluid.so and also to libmkldnn.so.0? So the dependency graph looks like this? [diagram]

jczaja commented 4 years ago

@luotao1 OK, so I think linking to MKL-DNN is a workaround covering some configuration problem in the PaddlePaddle build-system dependencies related to MKL-DNN. I have created an issue to track solving this problem: #21297.

luotao1 commented 4 years ago

@grygielski I think it is user app -> libpaddle_fluid.so -> libmkldnn.so.0, not user app -> libmkldnn.so.

@jczaja Do you mean that #21297 should be solved first, before merging this PR?

jczaja commented 4 years ago

@luotao1 In general an inference app should not link to the MKL-DNN lib directly, but this happens because there was a problem with the build. The problem, as mentioned earlier, is probably in Paddle's cmakelists files. Demo_CI is representative of your other workloads, so I filed #21297 to understand the problem. But even if #21297 is fixed, e.g. libmkldnn is removed from the dependencies of demo_ci, the CMakeLists.txt of other projects using Paddle would still have to be changed in a similar way (removing mkldnn from the dependencies). I understand that changing all the CMakeLists.txt files of the inference projects is not achievable now, hence we are preparing a workaround that should work with the CMakeLists.txt files you currently use in inference apps. So #21297 should not block merging this PR.

Once we finish the current priority tasks, we will look into investigating #21297.

luotao1 commented 4 years ago

But even if #21297 is fixed, e.g. libmkldnn is removed from the dependencies of demo_ci, the CMakeLists.txt of other projects using Paddle would still have to be changed in a similar way (removing mkldnn from the dependencies).

Even if libmkldnn is no longer needed as a dependency of demo_ci, we should keep the CMakeLists.txt working as before; i.e. if users don't remove libmkldnn, they can still build demo_ci successfully without changing their CMakeLists.txt. @jczaja

grygielski commented 4 years ago

@luotao1 We are trying to prepare a workaround for this problem, but the solution will probably be very hacky and error-prone. From our discussion I still think we haven't reached a common understanding of where the core of the problem is and how it should be fixed properly, so I will try to explain it as illustratively as I can. The first thing is building Paddle with MKL-DNN as a dynamic library (libpaddle_fluid.so). You posted the output of the 'ldd' command, which shows all dependencies of libpaddle_fluid.so. libmkldnn.so.0 is there because you built the develop branch of Paddle, which uses MKL-DNN 0.20. If you built Paddle with this PR (even without all the so.0 -> so.1 changes), you would get output like this:

ldd libpaddle_fluid.so
        linux-vdso.so.1 =>  (0x00007fff28d1b000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fc6b72bf000)
        libiomp5.so => /nfs/pdx/home/agrygiel/work/Paddle_develop/build/third_party/install/mklml/lib/libiomp5.so (0x00007fc6b6ee3000)
        libmkldnn.so.1 => /nfs/pdx/home/agrygiel/work/Paddle_develop/build/third_party/install/mkldnn/lib/libmkldnn.so.1 (0x00007fc6b5bbe000)
        libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007fc6b583c000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fc6b5533000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fc6b531d000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fc6b4f53000)
        /lib64/ld-linux-x86-64.so.2 (0x00007fc6b9c82000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fc6b4d4f000)
        libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1 (0x00007fc6b4b2d000)

As you can see, libmkldnn.so now has the number .1 next to it, and it wasn't hardcoded anywhere in the CMake files. This is simply the result of linking MKL-DNN into Paddle with this line (line 73 of the mkldnn.cmake file):

SET(MKLDNN_LIB "${MKLDNN_INSTALL_DIR}/${LIBDIR}/libmkldnn.so" CACHE FILEPATH "mkldnn library." FORCE)

It uses the symbolic link and automatically links the newer version of libmkldnn into libpaddle_fluid.so. So any user who builds Paddle with this PR will get a libpaddle_fluid.so that depends on libmkldnn.so.1. That by itself is not a problem; everything would work just fine, and the user wouldn't experience any problems building or running any app, unless their old CMake files contain unfortunate direct dependencies. Example from demo_ci/CMakeLists.txt:

# This line adds libpaddle_fluid.so as a dependency
Line 121: set(DEPS ${PADDLE_LIB}/paddle/lib/libpaddle_fluid${CMAKE_SHARED_LIBRARY_SUFFIX})
# This adds libmkldnn.so.0 to dependencies
Line 107: set(MKLDNN_LIB ${MKLDNN_PATH}/lib/libmkldnn.so.0)
...
Line 126: set(DEPS ${DEPS} ... ${MKLDNN_LIB} ...)

As a result, we get a direct dependency that should only be an indirect one, as I showed in the previous graph (this is an extended version that still uses some simplifications): [diagram]

This problem never surfaced before because both libpaddle_fluid and the user app linked against the same file. The problem appears now that we are changing the dependencies of libpaddle_fluid and the graph becomes something like this: [diagram]

This is quite a serious bug that should be fixed. As a temporary workaround we are planning to create a "dummy" libmkldnn.so.0 that symbolically links to libmkldnn.so.1. This would create the following dependency graph: [diagram]

I think the proper final solution is to delete the direct dependency of the user app on libmkldnn.so.0 and just let it be an indirect dependency. If that is somehow impossible for any reason (I'm not aware of one, but there could be some), the second-best solution is linking to the general .so file: [diagram]

@jczaja please correct me if I'm wrong anywhere.
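
(A minimal sketch of that "dummy" link workaround, assuming the third_party install layout visible in the ldd output above; the paths are illustrative:)

cd third_party/install/mkldnn/lib
ln -sf libmkldnn.so.1 libmkldnn.so.0   # old CMakeLists that hard-code libmkldnn.so.0 keep building
# at run time the loader still ends up in libmkldnn.so.1, since .so.0 is now just a link to it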

luotao1 commented 4 years ago

@grygielski First of all, thanks for your detailed analysis!

For the temporary workaround we are planning to create "dummy" libmkldnn.so.0 that symbolically links to libmkldnn.so.1.

We found why we use a real libmkldnn.so.0 instead of a symbolic link: see #7181, which #7255 fixed by shipping a real libmkldnn.so.0. So, would a "dummy" symbolic link libmkldnn.so.0 cause the same problem as in #7181?

I think the proper final solution is to delete the direct dependency of the user app on libmkldnn.so.0 and just let it be an indirect dependency.

I hope for the final solution as well.

second best solution is linking to general .so file:

This solution has two problems: 1) users would have to modify their CMakeLists.txt; 2) would it cause the same problem as in #7181?

grygielski commented 4 years ago

@luotao1 I changed the temporary solution to simply copying the lib file as two separate files, libmkldnn.so.0 and libmkldnn.so.1. It should be the safest fix for now. Could you check whether everything is fine with user apps now?
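
(Roughly, this interim fix amounts to something like the following during the install step; the command form and paths are assumptions, only the two file names come from the comment above:)

# install the same MKL-DNN library under both names, as two independent real files
cp libmkldnn.so.1 third_party/install/mkldnn/lib/libmkldnn.so.0
cp libmkldnn.so.1 third_party/install/mkldnn/lib/libmkldnn.so.1
# old apps that hard-code libmkldnn.so.0 keep linking; libpaddle_fluid.so uses libmkldnn.so.1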

luotao1 commented 4 years ago

Could you check whether everything is fine with user apps now?

If PR_CI_Inference is successful, it will be OK.

luotao1 commented 4 years ago

Please resolve the conflict

grygielski commented 4 years ago

I rebased onto develop after the int8 FC PR, so @Sand3r- please look at the FC code once again and let me know if everything is all right.

luotao1 commented 4 years ago

Please resolve conflict again

grygielski commented 4 years ago

@luotao1 Could you advise on the PR_Windows_CI failures? I can't see any error related to my changes.

luotao1 commented 4 years ago

Could you advise on the PR_Windows_CI failures? I can't see any error related to my changes.

@zhouwei25 Is this the random compile failure?

zhwesky2010 commented 4 years ago

@luotao1 Could you advise on the PR_Windows_CI failures? I can't see any error related to my changes.

It looks like a random Windows failure; I'll let it run on the specified machine to verify whether it is indeed random.

zhwesky2010 commented 4 years ago

Could you advise on the PR_Windows_CI failures? I can't see any error related to my changes.

@zhouwei25 Is this the random compile failure?

@grygielski @luotao1 It may not be a random failure; there is a failure log in detail.log:

CustomBuild:
[07:09:55] : [Step 3/6] Generating build/.timestamp
[07:09:57] : [Step 3/6] fatal: no tag exactly matches '30d5d45150dcffbaa62d8481c2ae6c5bee705a57'
[07:09:57] : [Step 3/6] Traceback (most recent call last):
[07:09:57] : [Step 3/6]   File "setup.py", line 214, in <module>
[07:09:57] : [Step 3/6]     shutil.copy('', libs_path)
[07:09:57] : [Step 3/6]   File "C:\Python27\lib\shutil.py", line 133, in copy
[07:09:57] : [Step 3/6]     copyfile(src, dst)
[07:09:57] : [Step 3/6]   File "C:\Python27\lib\shutil.py", line 96, in copyfile
[07:09:57] : [Step 3/6]     with open(src, 'rb') as fsrc:
[07:09:57] : [Step 3/6] IOError: [Errno 22] invalid mode ('rb') or filename: ''

Does it have any effect on setup.py?

grygielski commented 4 years ago

@zhouwei25 I can see the problem, thank you so much for your investigation.