grygielski closed this 4 years ago
const_cast, you can delete it.
[07:48:23] [Step 1/1] 3 . You must have one RD (XiaoguangHu01,chengduoZH,Xreki,luotao1,sneaxiy,tensor-tang) approval for the usage (either add or delete) of const_cast.
[13:53:13] : [Step 1/1] 68/183 Test #169: test_analyzer_int8_vgg16 .....................***Timeout 604.80 sec
[13:53:13] : [Step 1/1] [==========] Running 1 test from 1 test case.
[13:53:13] : [Step 1/1] [----------] Global test environment set-up.
const_cast, just deleted some of them. Can some old ones visible in the git diff make this fail?
Yes, it is. You can update the old ones.
There is one problem here: I can't delete the old const_casts in elementwise_mul_mkldnn_op without rewriting a lot of code. Can we update it in a future refactoring after this PR is merged? The second thing is that my change is similar to yours in https://github.com/PaddlePaddle/Paddle/pull/19643 (vectorize changes that make const_cast visible in the git diff), but it fails for me.
Can we update it in a future refactoring after this PR is merged?
It's OK.
@luotao1 Could you please tell me if there is anything more I have to do before we can merge this PR (probably the int8 test timeouts have to be fixed)? What is also important for me is how you see the process of merging this PR.
@luotao1 Could you provide us with logs from a test, e.g. test_analyzer_int8_resnet101, from the PR_CI_Coverage machine with the MKLDNN_VERBOSE=2 environment variable set? This test takes an unexpectedly long time to execute, and I want to see the versions of the created MKL-DNN primitives. It will be very helpful for debugging this issue.
@grygielski How about you set MKLDNN_VERBOSE=2 in the test of this PR, then you can see it in the CI output log?
I've tried adding this as an env variable in a custom build in CI, but nothing has changed in the output log.
I've tried adding this in custom build in CI as env
Could you show me the commit?
[10:42:55] I1103 10:34:43.416793 116252 mkldnn_quantizer.cc:434] == optimize 2 end ==
[10:42:55] I1103 10:34:43.420614 116252 tester_helper.h:353] Thread 0, number of threads 1, run 1 times...
[10:42:55] W1103 10:34:43.436996 116252 naive_executor.cc:43] The NaiveExecutor can not work properly if the cmake flag ON_INFER is not set.
[10:42:55] W1103 10:34:43.437024 116252 naive_executor.cc:45] Unlike the training phase, all the scopes and variables will be reused to save the allocation overhead.
[10:42:55] W1103 10:34:43.437029 116252 naive_executor.cc:48] Please re-compile the inference library by setting the cmake flag ON_INFER=ON if you are running Paddle Inference
[10:42:55] W1103 10:40:00.620398 116252 naive_executor.cc:43] The NaiveExecutor can not work properly if the cmake flag ON_INFER is not set.
[10:42:55] W1103 10:40:00.620436 116252 naive_executor.cc:45] Unlike the training phase, all the scopes and variables will be reused to save the allocation overhead.
[10:42:55] W1103 10:40:00.620441 116252 naive_executor.cc:48] Please re-compile the inference library by setting the cmake flag ON_INFER=ON if you are running Paddle Inference
I guess the reason why there is no MKLDNN_VERBOSE log is that the unit-test hangs. @jianhang-liu @LeoZhao-Intel This is similar to the problem fixed in #18575.
I didn't add it in code but in the build settings on TC. It's actually in the most recent build in CI_Coverage. For me it looks like these tests are just slow, not stuck, because for example int8_resnet50 passes but with a long execution time. I need these logs to debug that.
@luotao1 @bingyanghuang
Recent conclusions: the CI machine has some problems with its AVX512 flag settings, since MKL-DNN 1.0 uses the slower igemm implementation instead of avx512_vnni.
CI Machine: convolution,igemm_s8u8s32:jit
Our 6248 Machine: convolution,jit_int8_1x1:avx512_core_vnni
What can we do to fix these AVX512 flags?
We want to change PR_CI_Coverage to the new 5117 machines, but there are some unit-tests that fail on the new machines (see #19505). The new 5117 machine is a physical machine (it now runs PR_CI_python35), but the old machine is a virtual machine.
@bingyanghuang Could you check whether #19505 is solved now?
However, we could not change PR_CI_COVERAGE to the new physical 5117 machines due to the following reasons:
Thus, after discussing with @bingyanghuang, we have some suggestions:
These tests would take about 15 minutes each, hence I decided to disable them for now.
@luotao1 I did it kind of automatically after upgrading the MKL-DNN commit to 1.0 and getting the error 'libmkldnn.so.1 not found'. As far as I understand, it is made to distinguish versions of libraries on the system, and since MKL-DNN 1.0 is incompatible with v0.20, a new version of the .so lib has been made. @jczaja Could you confirm that?
@luotao1 When MKL-DNN is built, it sets the name of its output library, and the name includes the major version number of mkl-dnn. As we are moving from 0.20 to 1.0, libmkldnn.so.0 is automatically replaced by libmkldnn.so.1. The thing is, customer apps should link neither to libmkldnn.so.0 nor to libmkldnn.so.1. Customer apps should link libmkldnn.so, and libmkldnn.so is a symbolic link to libmkldnn.so.0 or libmkldnn.so.1.
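The versioned-name scheme described above can be sketched with plain shell commands. The paths and the `touch` stand-in for the real library below are hypothetical, purely to illustrate the layout:

```shell
# Hypothetical install layout: the real library carries the major version
# in its name, and apps link against the unversioned symbolic link.
mkdir -p mkldnn_demo/lib
touch mkldnn_demo/lib/libmkldnn.so.1               # stand-in for MKL-DNN 1.x
ln -sf libmkldnn.so.1 mkldnn_demo/lib/libmkldnn.so # the name apps should link
ls -l mkldnn_demo/lib
```

When the major version changes, only the symlink target changes; apps linking the unversioned name do not need their build files touched.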
What is your use case, e.g., are you expecting already-built applications (for PaddlePaddle with MKL-DNN 0.20) to run without rebuilding against PaddlePaddle (with MKL-DNN 1.0)? Or are you rebuilding your Paddle-based projects for the new PaddlePaddle (with MKL-DNN 1.0)? An example CMakeLists.txt of a typical customer app would also be useful.
After talking with LuoTao: LuoTao's problem is that in a lot of inference apps the makefiles were written in this manner (linking libmkldnn.so.0); it is a historic issue. If the MKLDNN lib is changed to libmkldnn.so.1, then all inference compilation will fail.
libmkldnn.so.0 is a link in the mkldnn build dir, while in the paddle install dir libmkldnn.so.0 is a real file instead of a link (I guess it may be a cp command issue).
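The guessed-at `cp` behavior can be reproduced in isolation: a plain `cp` dereferences a source symlink and writes a regular file, while `cp -P` preserves the link itself. All paths below are hypothetical:

```shell
# Plain `cp` follows the symlink, so the destination gets a regular file;
# `cp -P` copies the symlink itself instead.
mkdir -p src dst_plain dst_preserve
touch src/libmkldnn.so.1
ln -sf libmkldnn.so.1 src/libmkldnn.so.0
cp src/libmkldnn.so.0 dst_plain/        # regular file in dst_plain
cp -P src/libmkldnn.so.0 dst_preserve/  # symlink preserved in dst_preserve
```

This would explain a real libmkldnn.so.0 appearing in the install dir if the install step copies without `-P`.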
@luotao1 Could you please tell me why customer apps are actually linking to mkl-dnn? I looked into demo_ci, and in the C++ code there is no direct usage of MKL-DNN. Shouldn't customer apps link to Paddle only, with it being up to Paddle which mkl-dnn is used? I understand it would be too much effort to change this now, but perhaps for future apps a modification of CMakeLists.txt would be achievable, so that customer apps do not link to the mkl-dnn lib unless they use it directly?
@luotao1 I also want to point out future problems with this. Since MKL-DNN 1.1 there is a library name change to DNNL, which results in a .so file named libdnnl.so.1 instead of libmkldnn.so.1. If it would be better to update to 1.1 in this PR instead of two separate PRs, in order to face this problem only once, I can do it. I will discuss some possible solutions offline with @jczaja and post the conclusions here.
If it would be better to update to 1.1 in this PR instead of two separate PRs in order to have this problem just once, I can do it.
@grygielski Do you mean fixing the name of libmkldnn.so.1 in the next PR and merging this PR first? Sorry, we can't merge it first, since all the inference tests in CE (continuous evaluation) would fail tomorrow.
Shouldn't customer apps link to Paddle only, with it being up to Paddle which mkl-dnn is used?
Though some of the customer apps do not use mkl-dnn, libpaddle_fluid.so links to libmkldnn.so.0. If we don't add libmkldnn.so.0 in CMakeLists.txt, the build will fail. @jczaja
$ ldd libpaddle_fluid.so
linux-vdso.so.1 (0x00007ffed4574000)
libpthread.so.0 => /opt/compiler/gcc-4.8.2/lib/libpthread.so.0 (0x00007fe3ee6c2000)
libiomp5.so => /home/luotao/Paddle/cpu_build/third_party/install/mklml/lib/libiomp5.so (0x00007fe3ee2e6000)
libmkldnn.so.0 => /home/luotao/Paddle/cpu_build/third_party/install/mkldnn/lib64/libmkldnn.so.0 (0x00007fe3ed715000)
libstdc++.so.6 => /opt/compiler/gcc-4.8.2/lib/libstdc++.so.6 (0x00007fe3ed411000)
libm.so.6 => /opt/compiler/gcc-4.8.2/lib/libm.so.6 (0x00007fe3ed10e000)
libgcc_s.so.1 => /opt/compiler/gcc-4.8.2/lib/libgcc_s.so.1 (0x00007fe3ecef8000)
libc.so.6 => /opt/compiler/gcc-4.8.2/lib/libc.so.6 (0x00007fe3ecb4a000)
/opt/compiler/gcc-4.8.2/lib64/ld-linux-x86-64.so.2 (0x00007fe3f109f000)
libdl.so.2 => /opt/compiler/gcc-4.8.2/lib/libdl.so.2 (0x00007fe3ec946000)
libmklml_intel.so => /home/luotao/Paddle/cpu_build/third_party/install/mklml/lib/libmklml_intel.so (0x00007fe3e49c9000)
@luotao1 So you are saying that user apps link to libpaddle_fluid.so and also to libmkldnn.so.0? So the dependency graph looks like this?
@luotao1 OK, so I think linking to mkl-dnn is a workaround covering some configuration problem of the PaddlePaddle build system dependencies related to mkl-dnn. I have created an issue to track solving this problem: #21297
@grygielski I think it is users app -> libpaddle_fluid.so -> libmkldnn.so.0, but not users app -> libmkldnn.so
@jczaja Do you mean you should solve #21297 first, before merging this PR?
@luotao1 In general an inference app should not link to the mkl-dnn lib directly, but this happens because there was a problem with building. The problem, as mentioned earlier, is probably in the paddle cmakelists files. Demo_CI is representative of your other workloads, so I filed #21297 to understand the problem. But even if #21297 is fixed, e.g. libmkldnn is removed from the dependencies of demo_ci, the CMakeLists.txt of other projects that use paddle still has to be changed in a similar way (removing mkldnn from the dependencies). I understand that changing all the CMakeLists.txt of inference projects is unachievable now, hence we will prepare some workaround solution that should work with the CMakeLists.txt you are currently using in inference apps. So #21297 should not block merging this PR.
Once we finish the current priority tasks, we will look into investigating #21297.
But Even if #21297 is fixed e.g. libmkldnn is removed from dependencies of demo_ci, then still CMakeLists.txt of other projects that are using paddle has to be changed in similar way (removing mkldnn from dependencies).
If libmkldnn is not used in the dependencies of demo_ci, then we can keep the CMakeLists.txt as before, i.e., if users don't remove libmkldnn, they can still build demo_ci successfully without changing the CMakeLists.txt. @jczaja
@luotao1 We are trying to prepare a workaround for this problem, but the solution will probably be very hacky and error-prone. From our discussion I still think we didn't reach a common understanding of where the problem has its core and how it should be fixed properly, so I will try to explain it as clearly as I can. The first thing is building paddle with mkldnn as a dynamic library (libpaddle_fluid.so). You posted the output of the 'ldd' command, which shows all dependencies of libpaddle_fluid.so. libmkldnn.so.0 is there because you built the develop branch of paddle, which has mkldnn 0.20. If you built paddle with this PR (even without all these so.0 -> so.1 changes), you would get output like this:
ldd libpaddle_fluid.so
linux-vdso.so.1 => (0x00007fff28d1b000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fc6b72bf000)
libiomp5.so => /nfs/pdx/home/agrygiel/work/Paddle_develop/build/third_party/install/mklml/lib/libiomp5.so (0x00007fc6b6ee3000)
libmkldnn.so.1 => /nfs/pdx/home/agrygiel/work/Paddle_develop/build/third_party/install/mkldnn/lib/libmkldnn.so.1 (0x00007fc6b5bbe000)
libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007fc6b583c000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fc6b5533000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fc6b531d000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fc6b4f53000)
/lib64/ld-linux-x86-64.so.2 (0x00007fc6b9c82000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fc6b4d4f000)
libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1 (0x00007fc6b4b2d000)
As you can see, libmkldnn.so has the number .1 next to it, and it wasn't hardcoded anywhere in the CMake. This is just the result of linking mkldnn to paddle with this line (line 73 of the mkldnn.cmake file):
SET(MKLDNN_LIB "${MKLDNN_INSTALL_DIR}/${LIBDIR}/libmkldnn.so" CACHE FILEPATH "mkldnn library." FORCE)
It uses the symbolic link and automatically links the newer version of libmkldnn to libpaddle_fluid.so. So if any user builds paddle with this PR, they will get a libpaddle_fluid.so library dependent on libmkldnn.so.1. And that is not the problem; everything would work just fine with that. The user won't experience any problems with building or running any app, unless they have some terrible direct dependencies in their old CMakes. An example from demo_ci/CMakeLists.txt:
# This line adds libpaddle_fluid.so as a dependency
Line 121: set(DEPS ${PADDLE_LIB}/paddle/lib/libpaddle_fluid${CMAKE_SHARED_LIBRARY_SUFFIX})
# This adds libmkldnn.so.0 to dependencies
Line 107: set(MKLDNN_LIB ${MKLDNN_PATH}/lib/libmkldnn.so.0)
...
Line 126: set(DEPS ${DEPS} ... ${MKLDNN_LIB} ...)
As a result we get a direct dependency that should only be indirect, as I've shown in the previous graph (this is an extended version, but still uses some simplifications). This problem didn't surface before because both libpaddle_fluid and the user app linked the same file. The problem appears now that we are changing the libpaddle_fluid dependencies and the graph changes, and this is a quite serious bug that should be fixed.

For the temporary workaround we are planning to create a "dummy" libmkldnn.so.0 that symbolically links to libmkldnn.so.1. I think the proper final solution would be to delete the dependency from the user app to libmkldnn.so.0 and let it be just an indirect dependency. If that is somehow impossible for any reason (I'm just not aware of any, but there could be some), the second-best solution is linking to the general .so file. @jczaja please correct me if I'm wrong in any place.
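The temporary workaround mentioned above (a "dummy" libmkldnn.so.0) could look roughly like this in the install step. The paths are hypothetical and this is only a sketch of the idea, not the actual cmake change:

```shell
# Dummy libmkldnn.so.0 pointing at the new library, so old CMakeLists that
# reference libmkldnn.so.0 directly still resolve to a usable file.
mkdir -p fluid_install/mkldnn/lib
touch fluid_install/mkldnn/lib/libmkldnn.so.1   # stand-in for the built lib
ln -sf libmkldnn.so.1 fluid_install/mkldnn/lib/libmkldnn.so.0
```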
@grygielski Thanks for your detailed analysis, first of all!
For the temporary workaround we are planning to create "dummy" libmkldnn.so.0 that symbolically links to libmkldnn.so.1.
We found why we use a real libmkldnn.so.0 instead of a symbolic link: see #7181, which #7255 fixed with a real libmkldnn.so.0. Thus, would a "dummy" symbolic link libmkldnn.so.0 cause the same problem as in #7181?
I think the proper final solution should be deleting this dependency from user app to libmkldnn.so.0 and just let it be just indirect dependency.
I hope for the final solution as well.
second best solution is linking to general .so file:
This solution has two problems: 1) users would have to modify their CMakeLists.txt; 2) would it cause the same problem as in #7181?
@luotao1 I changed the temporary solution to just copying the lib file as two separate files, libmkldnn.so.0 and libmkldnn.so.1. It should be the safest fix for now. Could you check if everything is fine with the user apps now?
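The fix described here, installing the library as two separate real files, can be sketched as follows (hypothetical paths; the real change lives in the install step of the build system):

```shell
# Two real copies instead of symlinks: old apps find libmkldnn.so.0, new
# ones find libmkldnn.so.1, and neither depends on symlink handling.
mkdir -p mkldnn_build fluid_lib
touch mkldnn_build/libmkldnn.so.1               # stand-in for the built lib
cp mkldnn_build/libmkldnn.so.1 fluid_lib/libmkldnn.so.0
cp mkldnn_build/libmkldnn.so.1 fluid_lib/libmkldnn.so.1
```

Since both names are regular files, this also sidesteps the symlink-copy concern raised around #7181, at the cost of doubling the installed library size.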
Could you check if everything is fine with user apps now
If PR_CI_Inference is successful, it will be OK.
Please resolve the conflict
I rebased onto develop after the int8 FC PR, so @Sand3r- please look at the FC code once again and let me know if everything is all right.
Please resolve conflict again
@luotao1 Could you advise on the PR_Windows_CI failures? I can't see any error related to my changes.
Could you advise on the PR_Windows_CI failures? I can't see any error related to my changes.
@zhouwei25 Is this the random compile fail?
@luotao1 Could you advise on the PR_Windows_CI failures? I can't see any error related to my changes.
It looks like a random Windows failure; I'll let it run on the specified machine to verify whether it is failing randomly.
Could you advise on the PR_Windows_CI failures? I can't see any error related to my changes.
@zhouwei25 Is this the random compile fail?
@grygielski @luotao1 It may not be a random failure; there is a failure log in detail.log:
CustomBuild:
[07:09:55] : [Step 3/6] Generating build/.timestamp
[07:09:57] : [Step 3/6] fatal: no tag exactly matches '30d5d45150dcffbaa62d8481c2ae6c5bee705a57'
[07:09:57] : [Step 3/6] Traceback (most recent call last):
[07:09:57] : [Step 3/6]   File "setup.py", line 214, in <module>
[07:09:57] : [Step 3/6]     shutil.copy('', libs_path)
[07:09:57] : [Step 3/6]   File "C:\Python27\lib\shutil.py", line 133, in copy
[07:09:57] : [Step 3/6]     copyfile(src, dst)
[07:09:57] : [Step 3/6]   File "C:\Python27\lib\shutil.py", line 96, in copyfile
[07:09:57] : [Step 3/6]     with open(src, 'rb') as fsrc:
[07:09:57] : [Step 3/6] IOError: [Errno 22] invalid mode ('rb') or filename: ''
Does it have any effect on setup.py?
@zhouwei25 I can see the problem, thank you so much for your investigation.
This is the current version of https://github.com/PaddlePaddle/Paddle/pull/19510 rebased onto the latest develop branch. @luotao1 please take a look at it and ask questions if there are any.