Is this the same problem as #14174?
@jczaja Could you help take a look at this? Thanks!
From the log, it seems that MKL-DNN hangs. Is it an OpenMP problem?
@luotao1 We will look into that and get back to you.
@luotao1 I was not able to reproduce the problem.
Could you make a change so that compare_mkldnn runs without MKL-DNN, e.g. compare(false), and then run the experiments? That way we will know whether MKL-DNN is the actual problem.
We have a compare UT which works without MKL-DNN, and it runs successfully. You can see these UTs in the URL:
[22:59:02][Step 1/1] [ OK ] Analyzer_dam.compare (553 ms)
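For context, compare and compare_mkldnn typically come from the same tester source, differing only in whether MKL-DNN is enabled. A minimal sketch following the usual pattern of Paddle's analyzer testers (SetConfig/SetInput are the per-tester helpers; exact names and signatures may differ in this revision):

```cpp
// Sketch of the compare / compare_mkldnn test pair, modeled on
// paddle/fluid/inference/tests/api/analyzer_dam_tester.cc (details may
// differ in this revision).
#include "paddle/fluid/inference/tests/api/tester_helper.h"

namespace paddle {
namespace inference {

void compare(bool use_mkldnn = false) {
  contrib::AnalysisConfig cfg;
  SetConfig(&cfg);       // per-tester helper: model path, etc.
  if (use_mkldnn) {
    cfg.EnableMKLDNN();  // the only difference in the MKL-DNN variant
  }
  std::vector<std::vector<PaddleTensor>> input_slots_all;
  SetInput(&input_slots_all);  // per-tester helper: feeds
  CompareNativeAndAnalysis(
      reinterpret_cast<const PaddlePredictor::Config *>(&cfg),
      input_slots_all);
}

TEST(Analyzer_dam, compare) { compare(); }  // runs without MKL-DNN
#ifdef PADDLE_WITH_MKLDNN
TEST(Analyzer_dam, compare_mkldnn) { compare(true /* use_mkldnn */); }
#endif

}  // namespace inference
}  // namespace paddle
```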
How did you observe from the logging that MKL-DNN hangs? Did you attach a debugger and observe where the execution loops?
[22:59:02][Step 1/1] I1223 22:59:01.768762 59580 analysis_predictor.cc:345] == optimize end ==
[22:59:02][Step 1/1] I1223 22:59:01.771080 59580 tester_helper.h:181] Warm up run...
[22:59:02][Step 1/1] W1223 22:59:01.771121 59580 naive_executor.cc:43] The NaiveExecutor can not work properly if the cmake flag ON_INFER is not set.
[22:59:02][Step 1/1] W1223 22:59:01.771126 59580 naive_executor.cc:45] Unlike the training phase, all the scopes and variables will be reused to save the allocation overhead.
[22:59:02][Step 1/1] W1223 22:59:01.771132 59580 naive_executor.cc:48] Please re-compile the inference library by setting the cmake flag ON_INFER=ON if you are running Paddle Inference
[22:59:02][Step 1/1]
[23:54:36][Step 1/1] 138 - test_analyzer_dam (Timeout)
Do you run the tests with ctest in parallel mode? What is the ctest command line used to run the tests?
Yes, we run the tests with ctest in parallel mode. We use https://github.com/PaddlePaddle/Paddle/blob/bf518ec8724e6209f029f9480f4163a9936c9229/paddle/scripts/paddle_build.sh#L415-L418 The parallel level is
@jczaja http://ci.paddlepaddle.org/viewLog.html?tab=buildLog&buildTypeId=Paddle_PrCiNight&buildId=45100&_focus=23067#_state=23743 test_analyzer_small_dam crashes again. Why does it crash more frequently than test_analyzer_mobilenet_depthwise_conv? And test_analyzer_resnet50 also has compare_mkldnn, but it works normally.
@luotao1 From the log you indicated, I found the line below. Is this a correct build option?
[19:19:29] : [Step 1/1] -- Do not have AVX2 intrinsics and disabled MKL-DNN
@luotao1 Please ignore my above comment. I just found there are two builds in this log. The second one should be the valid one, with all features including MKL turned on.
@luotao1 @jczaja One interesting point is that "dam" and "small dam" share the same test app, so they have exactly the same test cases, i.e. "dam" also has the compare_mkldnn test case. But according to the log, there is no issue running all test cases of "dam" (including compare_mkldnn, of course). Line 22514: [19:50:36] : [Step 1/1] 154/514 Test #138: test_analyzer_dam ............................... Passed 71.89 sec
Another hint is that "small dam" is tested immediately after "dam". This means those two models will be tested concurrently due to ctest's parallel mode, which may make over-subscription of the OMP cores even worse. [19:49:24] : [Step 1/1] Start 138: test_analyzer_dam ... [19:49:26] : [Step 1/1] Start 139: test_analyzer_small_dam
> But according to the log, there is no issue running all test cases of "dam" (including compare_mkldnn, of course).
@jianhang-liu test_analyzer_dam also has compare_mkldnn, see http://ci.paddlepaddle.org/viewLog.html?buildId=44600&tab=buildLog&buildTypeId=Paddle_PrCiNight&logTab=tree&filter=all&_focus=22935
If test_analyzer_dam runs successfully, there is no detailed log (containing each part of the UT) in the CI log.
@jczaja 5117 is a virtual machine with 16 sockets and 1 thread per socket (1 core per socket). The result of cat /proc/cpuinfo is:
Is the number of physical threads 16?
The default OpenMP num threads is 1: https://github.com/PaddlePaddle/Paddle/blob/5ee596cae55e8c7082a0f03761da8cbe914e0ff3/paddle/fluid/inference/api/paddle_api.h#L199-L201
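As a quick way to confirm what the OpenMP runtime actually sees inside a test process (e.g. whether OMP_NUM_THREADS overrides the default above), a minimal standalone probe (demo code, not part of Paddle) could be:

```cpp
// Build with: g++ -fopenmp probe.cc
#include <cstdio>
#include <omp.h>

int main() {
  // With Paddle's default of cpu_math_library_num_threads_ = 1, MKL/MKL-DNN
  // should effectively use one thread unless the environment overrides it.
  std::printf("omp_get_max_threads() = %d\n", omp_get_max_threads());
  std::printf("omp_get_num_procs()   = %d\n", omp_get_num_procs());
  return 0;
}
```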
I could try running test_analyzer_small_dam, test_analyzer_dam, test_analyzer_resnet50, and test_analyzer_mobilenet_depthwise_conv in sequential order. Will changing to sequential order solve the issue?
@luotao1 "I could try run test_analyzer_small_dam, test_analyzer_dam, test_analyzer_resnet50 and test_analyzer_mobilenet_depthwise_conv in sequential order". It somehow prove our suspect: the hang in "small dam" is possibly due to "oversubscription" of OMP cores since it's running concurrently with "dam". I think change the order of those tests could be a acceptable workaround for now. @jczaja What's your idea?
@luotao1 Yes, if the problem is thread over-subscription, then sequential execution of ctest should help. I experimented with running the MKL-DNN unit tests under ctest -j4 (parallel execution mode): there is thread oversubscription, the total execution time of all the unit tests is 3 times longer, and individual test execution times are much longer. So this situation may be similar to what we observe on 5117. There are two experiments I'm interested in checking (quoted in the reply below).
Why did I write that sequential execution should help? Because I suspect that the virtual machine on 5117 may be causing thread over-subscription. I'm not an expert on VMs, but as we can see from the logs (and @luotao1 confirmed), the VM has 16 threads. When setting up a virtual machine, it should be possible to dedicate some resources (cores, sockets, and memory) to it. So it is possible, for example, to dedicate 4 hardware (physical) threads to execute a VM with 16 threads; such a setting would give us thread oversubscription. It is unlikely, as the overall performance of such a VM would be very poor, but possible. Hence my additional question about the VM setup, also quoted below.
@jczaja Thanks very much for the detailed explanation!
> Make ctest run in sequential mode.
> Remove the timeout (temporarily) for parallel execution of ctest, just to confirm that it is due to slow execution and not an actual hang.
It's difficult to do these experiments, since the nightly stress testing runs on the develop branch.
> Please tell us how you set up the VM (what kind of VM software is used?)
I'm not an expert on VMs either. We will first observe the nightly stress testing for several days after #15196 is merged.
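To illustrate the oversubscription effect described above with something standalone, here is a small hypothetical C++/OpenMP timing demo (not Paddle code). Once the thread count passes the physical core count, wall time stops improving and starts degrading, which is what concurrently running tests do to each other:

```cpp
// Build with: g++ -O2 -fopenmp oversub.cc
#include <chrono>
#include <cstdio>
#include <initializer_list>
#include <omp.h>

// Fixed total work, split across nthreads; returns wall time in ms.
double run_ms(int nthreads, long total_iters) {
  auto t0 = std::chrono::steady_clock::now();
  double sum = 0.0;
#pragma omp parallel for num_threads(nthreads) reduction(+ : sum)
  for (long i = 0; i < total_iters; ++i) sum += 1e-9 * static_cast<double>(i);
  auto t1 = std::chrono::steady_clock::now();
  std::fprintf(stderr, "checksum %g\n", sum);  // keep the loop from being optimized away
  return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
  const long kIters = 400000000L;  // roughly a second of work on one core
  const int cores = omp_get_num_procs();
  // Speedup stops at the physical core count; beyond it, extra threads only
  // add scheduling overhead. Several test processes doing this at once
  // (ctest -j4, each test spawning `cores` OMP threads) oversubscribe the
  // machine in the same way.
  for (int n : {1, cores, 4 * cores}) {
    std::printf("%3d threads: %.0f ms\n", n, run_ms(n, kIters));
  }
  return 0;
}
```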
@jczaja @jianhang-liu @yihuaxu test_analyzer_small_dam times out at compare_mkldnn again. Does the MKL-DNN kernel of conv3d have a problem?
http://ci.paddlepaddle.org/viewLog.html?buildId=48242&tab=buildLog&buildTypeId=Paddle_PrCiNight&logTab=tree&filter=all&_focus=23274
@jczaja The detailed machine configuration: http://ci.paddlepaddle.org/viewLog.html?buildId=48242&tab=buildLog&buildTypeId=Paddle_PrCiNight&logTab=tree&filter=all&state=65&_focus=75#_state=65,34
[22:56:29][Step 1/1] Architecture: x86_64
[22:56:29][Step 1/1] CPU op-mode(s): 32-bit, 64-bit
[22:56:29][Step 1/1] Byte Order: Little Endian
[22:56:29][Step 1/1] CPU(s): 16
[22:56:29][Step 1/1] On-line CPU(s) list: 0-15
[22:56:29][Step 1/1] Thread(s) per core: 1
[22:56:29][Step 1/1] Core(s) per socket: 1
[22:56:29][Step 1/1] Socket(s): 16
[22:56:29][Step 1/1] NUMA node(s): 1
[22:56:29][Step 1/1] Vendor ID: GenuineIntel
[22:56:29][Step 1/1] CPU family: 6
[22:56:29][Step 1/1] Model: 85
[22:56:29][Step 1/1] Model name: Intel(R) Xeon(R) Gold 5117 CPU @ 2.00GHz
[22:56:29][Step 1/1] Stepping: 4
[22:56:29][Step 1/1] CPU MHz: 2000.031
[22:56:29][Step 1/1] BogoMIPS: 4016.87
[22:56:29][Step 1/1] L1d cache: 32K
[22:56:29][Step 1/1] L1i cache: 32K
[22:56:29][Step 1/1] L2 cache: 4096K
[22:56:29][Step 1/1] NUMA node0 CPU(s): 0-15
@luotao1 When we look at the specification of the Xeon Gold 5117 (https://ark.intel.com/products/122460/Intel-Xeon-Gold-5117-Processor-19-25M-Cache-2-00-GHz-), there are 14 cores within a single socket. But the specification you sent / from the logs shows 16 sockets with 1 core in each socket. So please tell me: 1) Why are those specifications different? 2) How was virtualization done (what virtualization software is used)? 3) How many virtual machines are running on 5117? 4) How many docker containers run simultaneously on 5117?
@jczaja @jianhang-liu Discussed with @tianshuo78520a:
- How was virtualization done? (what virtualization software is used)
We use OpenStack.
- How many virtual machines are running on 5117?
One or two virtual machines run on 5117.
- How many docker containers run simultaneously on 5117?
Only one docker container runs at a time.
@luotao1 5117 provides AVX-512. Some other platforms that I have seen only provide AVX2. My question is: do you have any other platform in CI that has AVX-512? I understand that the problem (timeout) manifests itself only on 5117; is there any other platform in CI that works fine and also has AVX-512?
@jczaja #15335 disables the conv3d MKL-DNN kernel in dam to check whether it is the real reason.
> I understand that the problem (timeout) manifests itself only on 5117; is there any other platform in CI that works fine and has AVX-512?

I checked that all the failure logs are on 5117 only. We have an Intel Xeon Processor (Skylake) with AVX-512, and it works OK on this issue.
http://ci.paddlepaddle.org/viewLog.html?tab=buildLog&buildTypeId=Paddle_PrCiNight&buildId=37730&_focus=75#_state=65
[21:01:39][Step 1/1] Architecture: x86_64
[21:01:39][Step 1/1] CPU op-mode(s): 32-bit, 64-bit
[21:01:39][Step 1/1] Byte Order: Little Endian
[21:01:39][Step 1/1] CPU(s): 26
[21:01:39][Step 1/1] On-line CPU(s) list: 0-25
[21:01:39][Step 1/1] Thread(s) per core: 1
[21:01:39][Step 1/1] Core(s) per socket: 1
[21:01:39][Step 1/1] Socket(s): 26
[21:01:39][Step 1/1] NUMA node(s): 1
[21:01:39][Step 1/1] Vendor ID: GenuineIntel
[21:01:39][Step 1/1] CPU family: 6
[21:01:39][Step 1/1] Model: 85
[21:01:39][Step 1/1] Model name: Intel Xeon Processor (Skylake)
[21:01:39][Step 1/1] Stepping: 4
[21:01:39][Step 1/1] CPU MHz: 1999.993
[21:01:39][Step 1/1] BogoMIPS: 3999.98
[21:01:39][Step 1/1] Hypervisor vendor: KVM
[21:01:39][Step 1/1] Virtualization type: full
[21:01:39][Step 1/1] L1d cache: 32K
[21:01:39][Step 1/1] L1i cache: 32K
[21:01:39][Step 1/1] L2 cache: 4096K
[21:01:39][Step 1/1] L3 cache: 16384K
[21:01:39][Step 1/1] NUMA node0 CPU(s): 0-25
@luotao1 The log you mentioned shows that test_analyzer_small_dam also fails on the other SKX; it is just not a timeout but a computational problem (diff in results). Perhaps it is just a different outcome of the same problem.
Yes, it is the MKL diff problem, the same as https://github.com/PaddlePaddle/Paddle/pull/15116#issuecomment-453030549
@luotao1 Since a couple of people are looking at this issue, I'm sharing my current status. Hopefully it will be helpful.
I managed to reproduce the problem when running test_analyzer_small_dam in a loop (up to a hundred times): we get either a crash (segfault) or a hang (timeout when running under ctest). The CI on 5117 builds Paddle WITHOUT ON_INFER=ON, and in that situation test_analyzer_small_dam can hang. If ON_INFER=ON is specified, then test_analyzer_small_dam randomly fails with a segmentation fault.
Hang: execution hangs on a mutex when a Var is to be fetched here: https://github.com/PaddlePaddle/Paddle/blob/7e651a38dd9c3e8f68b75ba8e68086ef0c9151d1/paddle/fluid/framework/operator.cc#L1064
Crash: accessing new_scope's vars causes a crash, as vars_ contains invalid data.
Workaround: when we disable the caching of scopes in TryCreateTransferScope, no hang or crash is observed, e.g. https://github.com/PaddlePaddle/Paddle/blob/7e651a38dd9c3e8f68b75ba8e68086ef0c9151d1/paddle/fluid/framework/operator.cc#L1057
So currently we can see that there is a problem with the TransferScope cache, which randomly manifests as a hang or a crash depending on whether ON_INFER is set (a simplified sketch of the pattern follows below).
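A heavily simplified, hypothetical sketch of the caching pattern under suspicion (the real logic lives in operator.cc around the lines linked above; Scope here is a stand-in for paddle::framework::Scope, which additionally holds named variables behind a mutex):

```cpp
#include <cstddef>
#include <list>
#include <unordered_map>

// Hypothetical stand-in for paddle::framework::Scope
// (paddle/fluid/framework/scope.h).
class Scope {
 public:
  Scope& NewScope() {
    kids_.emplace_back();
    return kids_.back();
  }

 private:
  std::list<Scope> kids_;
};

// Sketch of the suspected pattern: key the transfer scope and reuse it via a
// thread_local map instead of creating a new scope on every Run. The map is
// per-thread, but each cached Scope* points into a Scope tree shared across
// threads and runs; if that tree is torn down or mutated elsewhere, the
// cached pointer dangles, and later variable lookups either block on the
// scope mutex (hang) or read freed vars_ (segfault), matching the symptoms.
Scope* GetTransferScope(std::size_t cache_key, Scope* src_scope) {
  thread_local std::unordered_map<std::size_t, Scope*> cached;
  auto it = cached.find(cache_key);
  if (it != cached.end()) return it->second;  // reuse path: may be stale

  // Disabling the cache amounts to always taking this branch.
  Scope* new_scope = &src_scope->NewScope();
  cached.emplace(cache_key, new_scope);
  return new_scope;
}

int main() {
  Scope root;
  Scope* t1 = GetTransferScope(42, &root);  // creates and caches
  Scope* t2 = GetTransferScope(42, &root);  // reuses the cached child
  return (t1 == t2) ? 0 : 1;
}
```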
@jczaja Great! Actually, just this afternoon we found that building withOUT ON_INFER=ON is the key to reproducing it. We can now easily reproduce both types of CI error (timeout, and compare failure due to diff) on our local 6151 server. For the timeout (hang) issue, we located the error in almost the same place as you did (though not in as much detail). We wonder whether it's caused by the SCOPE_XXX_LOCK definition.
@luotao1 @jczaja We confirmed the same root cause as Jacek, i.e. the current code in TryCreateTransferScope (which uses a thread_local cache to avoid creating a transfer scope) may fail randomly, which causes this CI failure (hangs as timeout, segfaults as crash). By simply commenting it out (i.e. don't use the cache; always create "new_scope"), we no longer run into any error. We tried enabling/disabling this cache several times and proved that no issue occurs when the cache is disabled. @luotao1 Could you please check on the framework side for a cleaner fix? Thanks.
> By simply commenting it out (i.e. don't use the cache; always create "new_scope")

@jianhang-liu Could you create a simple PR to show which lines are commented out?
@Superjomn Could you help look at the cache in TryCreateTransferScope?
https://github.com/PaddlePaddle/Paddle/issues/15032#issuecomment-455990914 is hot-fixed in #15450
http://ci.paddlepaddle.org/viewLog.html?buildId=126022&buildTypeId=Paddle_PrCiNight&tab=buildLog&_focus=20439 http://ci.paddlepaddle.org/viewLog.html?buildId=126027&buildTypeId=Paddle_PrCiNight&tab=buildLog&_focus=20449 http://ci.paddlepaddle.org/viewLog.html?buildId=125989&buildTypeId=Paddle_PrCiNight&tab=buildLog&_focus=20433
[21:16:22] [ RUN ] Analyzer_MM_DNN.mkldnn_cache_clear
It currently hangs on Analyzer_MM_DNN.mkldnn_cache_clear (three times on the nightly CI).
@jianhang-liu @LeoZhao-Intel Do we have any conclusion on why it hangs in TryCreateTransferScope()? Are there any race conditions, or a multi-instance case?
Discussed with @LeoZhao-Intel and @jianhang-liu, we have some common views:
- TryCreateTransferScope() was added to fix the GPU inference memory leak problem, and is not required for CPU inference. See the notes by @Superjomn: https://github.com/PaddlePaddle/Paddle/blob/e9c7e218f208d29b53284afd1a4c425f93acc4de/paddle/fluid/framework/operator.cc#L1106-L1107
- Disabling TryCreateTransferScope() will fix the hang problem, see #15450.
- Disabling TryCreateTransferScope() will cause a CPU inference memory leak problem, because we don't release the transfer_scope in OperatorWithKernel::RunImpl: https://github.com/PaddlePaddle/Paddle/blob/e9c7e218f208d29b53284afd1a4c425f93acc4de/paddle/fluid/framework/operator.cc#L923-L924

TODO: disable TryCreateTransferScope() and release the transfer_scope in OperatorWithKernel::RunImpl when using NaiveExecutor in CPU inference (a sketch of this follows below). @LeoZhao-Intel Please briefly explain the reason:
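A minimal sketch of the TODO above (hypothetical code, reusing a simplified Scope stand-in rather than Paddle's real classes): create the transfer scope fresh on each run, with no thread_local cache, and release it right after the kernel executes, so fixing the hang does not re-introduce the leak.

```cpp
#include <list>

// Simplified stand-in; the real Scope::NewScope()/DeleteScope() are in
// paddle/fluid/framework/scope.h.
class Scope {
 public:
  Scope& NewScope() {
    kids_.push_back(new Scope);
    return *kids_.back();
  }
  void DeleteScope(Scope* s) {
    kids_.remove(s);
    delete s;
  }
  ~Scope() {
    for (Scope* k : kids_) delete k;
  }

 private:
  std::list<Scope*> kids_;
};

void RunKernelOnce(Scope& /*transfer_scope*/) {
  // Stand-in for the op kernel invocation inside OperatorWithKernel::RunImpl.
}

// The TODO in one place: per-run transfer scope, released after the kernel.
// No stale cached pointer (no hang/crash), and no accumulation (no leak).
void RunImplCpuInference(Scope& scope) {
  Scope& transfer_scope = scope.NewScope();
  RunKernelOnce(transfer_scope);
  scope.DeleteScope(&transfer_scope);
}

int main() {
  Scope root;
  for (int i = 0; i < 3; ++i) RunImplCpuInference(root);
  return 0;
}
```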
TODO (@intel @luotao):
- compare_mkldnn in three nightly CI runs on the same machine (5117):
  http://ci.paddlepaddle.org/viewLog.html?buildId=40622&tab=buildLog&buildTypeId=Paddle_PrCiNight&logTab=tree&filter=all&_focus=21635
  http://ci.paddlepaddle.org/viewLog.html?tab=buildLog&buildTypeId=Paddle_PrCiNight&buildId=40460&_focus=22113
  http://ci.paddlepaddle.org/viewLog.html?buildId=40324&tab=buildLog&buildTypeId=Paddle_PrCiNight&logTab=tree&filter=all&_focus=21764
- compare_mkldnn in nightly CI: http://ci.paddlepaddle.org/viewLog.html?buildId=44600&tab=buildLog&buildTypeId=Paddle_PrCiNight&logTab=tree&filter=all&_focus=22935
- compare_mkldnn in nightly CI: http://ci.paddlepaddle.org/viewLog.html?buildId=44596&tab=buildLog&buildTypeId=Paddle_PrCiNight&logTab=tree&filter=all&_focus=23004