PaddlePaddle / Paddle

PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (the core framework of 『飞桨』: high-performance single-machine and distributed training and cross-platform deployment for deep learning & machine learning)
http://www.paddlepaddle.org/
Apache License 2.0

test_analyzer_* random crashes at `compare_mkldnn` #15032

Closed · luotao1 closed this issue 4 years ago

luotao1 commented 5 years ago

Is this the same problem as #14174?

jianhang-liu commented 5 years ago

@jczaja Could you help take a look at this? Thanks!

luotao1 commented 5 years ago

From the log, it seems that MKL-DNN hangs. Is it an OpenMP problem?

jczaja commented 5 years ago

@luotao1 We will look into that and get back to you.

jczaja commented 5 years ago

@luotao1 I was not able to reproduce the problem.

  1. Could you make a change so that compare_mkldnn runs without MKL-DNN, e.g. compare(false), and then run the experiments (see the sketch after this list)? That way we will know whether MKL-DNN is the actual problem.
  2. How did you observe from the logging that MKL-DNN hangs? Did you attach a debugger and observe where execution is looping?
  3. Do you run the tests through ctest in parallel mode? What is the ctest command line used to run the tests?
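
A rough sketch of experiment 1, assuming the usual analyzer-tester layout where the GTest cases call a shared compare(bool use_mkldnn) helper; the exact helper and config calls in analyzer_dam_tester.cc / tester_helper.h may differ:

```cpp
#include <gtest/gtest.h>

namespace {
// Hypothetical stand-in for the shared helper in tester_helper.h: builds the
// inference config, optionally enables MKL-DNN, and compares outputs.
void compare(bool use_mkldnn = false) {
  // ... set up the analysis config, enable MKL-DNN only when use_mkldnn,
  // ... then run the native/analysis comparison on the dam model.
  (void)use_mkldnn;
}
}  // namespace

TEST(Analyzer_dam, compare) { compare(); }             // the case that passes
TEST(Analyzer_dam, compare_mkldnn) { compare(true); }  // the case that hangs

// Experiment 1: temporarily change compare(true) to compare(false) above and
// rerun; if the timeout disappears, MKL-DNN is implicated.
```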
luotao1 commented 5 years ago

Could you make a change so that compare_mkldnn runs without MKL-DNN, e.g. compare(false), and then run the experiments? That way we will know whether MKL-DNN is the actual problem.

We have compare UTs that run without MKL-DNN, and they all pass. You can see these UTs at the linked URL.

[22:59:02][Step 1/1] [       OK ] Analyzer_dam.compare (553 ms)

How did you observe from the logging that MKL-DNN hangs? Did you attach a debugger and observe where execution is looping?

  1. All the logs look like the following. Since there is no error log, I guess MKL-DNN hangs, but I did not attach a debugger to observe where execution is looping.
    [22:59:02][Step 1/1] I1223 22:59:01.768762 59580 analysis_predictor.cc:345] == optimize end ==
    [22:59:02][Step 1/1] I1223 22:59:01.771080 59580 tester_helper.h:181] Warm up run...
    [22:59:02][Step 1/1] W1223 22:59:01.771121 59580 naive_executor.cc:43] The NaiveExecutor can not work properly if the cmake flag ON_INFER is not set.
    [22:59:02][Step 1/1] W1223 22:59:01.771126 59580 naive_executor.cc:45] Unlike the training phase, all the scopes and variables will be reused to save the allocation overhead.
    [22:59:02][Step 1/1] W1223 22:59:01.771132 59580 naive_executor.cc:48] Please re-compile the inference library by setting the cmake flag ON_INFER=ON if you are running Paddle Inference
    [22:59:02][Step 1/1] 
    [23:54:36][Step 1/1]    138 - test_analyzer_dam (Timeout)
  2. Since some of our production jobs hang due to the OpenMP library (you can ask @jianhang-liu for more details), I guess that OpenMP may hang in our CI as well.

Do you run the tests through ctest in parallel mode? What is the ctest command line used to run the tests?

Yes, we run the tests through ctest in parallel mode; we use https://github.com/PaddlePaddle/Paddle/blob/bf518ec8724e6209f029f9480f4163a9936c9229/paddle/scripts/paddle_build.sh#L415-L418. The parallel level is shown in the attached image.

luotao1 commented 5 years ago

@jczaja http://ci.paddlepaddle.org/viewLog.html?tab=buildLog&buildTypeId=Paddle_PrCiNight&buildId=45100&_focus=23067#_state=23743 test_analyzer_small_dam crashes again. Why does it crash more frequently than test_analyzer_mobilenet_depthwise_conv? And test_analyzer_resnet50 also has compare_mkldnn, but it works normally.

jczaja commented 5 years ago

@luotao1

  1. 5117 seems to be a virtual machine with 16 sockets and 1 thread per socket (1 core per socket), i.e. 16 logical threads. Could you please tell me how many physical cores/sockets were dedicated to that virtual machine? For example, if 5117 is a virtual machine with 16 logical threads that underneath uses only 4 physical threads, then we would have thread oversubscription when running four MKL-DNN tasks at once. PaddlePaddle sets the test timeout to 600 s; when threads are oversubscribed, tasks running in parallel can take much longer to execute (as they are fighting for resources). Please check the settings of this virtual machine and let me know how many physical threads are in use for 5117. (See the small diagnostic sketch after this list.)
  2. What is the target number of threads for each test? How do you set the OMP num_threads so it takes into account the number of threads on 5117 divided by the number of ctest tasks running in parallel?
  3. Could you please run the same build on this machine but with ctest executing the tests in sequential order (no parallel execution)?
  4. If the previous points give us no information, could you please disable the ctest timeout and run the tests, so we can see whether this is just very long execution or an actual hang?
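
A small diagnostic sketch (my own illustration, not part of the Paddle test suite) that prints the OpenMP thread budget a test process sees, which can be compared against the number of parallel ctest workers:

```cpp
#include <omp.h>

#include <cstdio>

int main() {
  int procs = omp_get_num_procs();     // logical CPUs visible to this process
  int budget = omp_get_max_threads();  // threads OpenMP would use by default
  std::printf("logical CPUs: %d, OpenMP thread budget: %d\n", procs, budget);
  // With ctest running 4 tests in parallel on a 16-CPU VM, four processes
  // each claiming a 16-thread budget means up to 64 runnable threads on
  // 16 CPUs, i.e. oversubscription and much longer, timeout-prone test runs.
  return 0;
}
```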
jianhang-liu commented 5 years ago

@luotao1 From the log you indicated, I found the line below. Is this a correct build option?

    [19:19:29] : [Step 1/1] -- Do not have AVX2 intrinsics and disabled MKL-DNN

jianhang-liu commented 5 years ago

@luotao1 Please ignore my comment above. I just found that there are two builds in this log. The second one should be the valid one, with all features including MKL turned on.

jianhang-liu commented 5 years ago

@luotao1 @jczaja One interesting point is that "dam" and "small dam" share the same test app, so they have exactly the same test cases, i.e. "dam" also has the compare_mkldnn test case. But according to the log, there is no issue running all the test cases of "dam" (including "compare_mkldnn", of course).

    Line 22514: [19:50:36] : [Step 1/1] 154/514 Test #138: test_analyzer_dam ............................... Passed 71.89 sec

jianhang-liu commented 5 years ago

Another hint is that "small dam" is tested immediately after "dam". This means those two models are tested concurrently due to the parallel mode of ctest, which may make the over-subscription of OMP cores even worse.

    [19:49:24] : [Step 1/1] Start 138: test_analyzer_dam
    ...
    [19:49:26] : [Step 1/1] Start 139: test_analyzer_small_dam

luotao1 commented 5 years ago

But according to the log, there is no issue running all the test cases of "dam" (including "compare_mkldnn", of course).

@jianhang-liu test_analyzer_dam also has compare_mkldnn; see http://ci.paddlepaddle.org/viewLog.html?buildId=44600&tab=buildLog&buildTypeId=Paddle_PrCiNight&logTab=tree&filter=all&_focus=22935

When test_analyzer_dam runs successfully, there is no detailed log (showing each part of the UT) in the CI log.

luotao1 commented 5 years ago

@jczaja

  1. 5117 is a virtual machine with 16 sockets and 1 thread per socket (1 core per socket). The result of cat /proc/cpuinfo is shown in the attached image. Is the number of physical threads 16?

  2. The default OpenMP num_threads is 1 (see the sketch after this list): https://github.com/PaddlePaddle/Paddle/blob/5ee596cae55e8c7082a0f03761da8cbe914e0ff3/paddle/fluid/inference/api/paddle_api.h#L199-L201

  3. I could try running test_analyzer_small_dam, test_analyzer_dam, test_analyzer_resnet50 and test_analyzer_mobilenet_depthwise_conv in sequential order. Will changing to sequential order solve the issue?
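
For reference, a hedged C++ sketch of how the per-predictor thread count is configured on the inference config. This assumes the AnalysisConfig API; the exact namespace and header differ between releases, so treat it as an illustration rather than the exact code path quoted above:

```cpp
#include "paddle/fluid/inference/api/paddle_inference_api.h"

// In older releases this class lives in paddle::contrib; adjust as needed.
void ConfigureMathThreads(paddle::AnalysisConfig* config) {
  // The default is 1, matching the paddle_api.h lines linked above. Raising
  // it without accounting for ctest's parallel test level would make the
  // oversubscription discussed in this thread worse.
  config->SetCpuMathLibraryNumThreads(1);
}
```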

jianhang-liu commented 5 years ago

@luotao1 "I could try run test_analyzer_small_dam, test_analyzer_dam, test_analyzer_resnet50 and test_analyzer_mobilenet_depthwise_conv in sequential order". It somehow prove our suspect: the hang in "small dam" is possibly due to "oversubscription" of OMP cores since it's running concurrently with "dam". I think change the order of those tests could be a acceptable workaround for now. @jczaja What's your idea?

jczaja commented 5 years ago

@luotao1 Yes, if the problem is thread oversubscription, then sequential execution of ctest should help. I experimented with the MKL-DNN unit tests running under ctest -j4 (parallel execution): there is thread oversubscription, the total time to execute all unit tests is 3 times longer, and the execution times of individual tests are much longer. So this situation may be similar to what we observe on 5117. There are two experiments that I would like to see:

  1. Make ctest run in sequential mode.
  2. Remove the timeout (temporarily) for parallel execution of ctest, just to confirm that this is due to slow execution and not an actual hang.

Why did I write that sequential execution should help? Because I suspect that the virtual machine on 5117 is causing thread over-subscription. I'm not an expert on VMs, but as we can see from the logs (and @luotao1 confirmed), there are 16 threads on the VM. The thing is that when setting up a virtual machine, it should be possible to dedicate some resources (cores, sockets and memory) to the VM. So it is possible, for example, to dedicate 4 hardware (physical) threads to run a VM with 16 threads. With such a setting we would have thread oversubscription. It is unlikely, as the overall performance of such a VM would be very poor, but it is possible. Hence:

  1. Please tell us how you set up the VM (what kind of VM software is used?) and check, if possible, how many physical resources were dedicated to create the VM of 16 sockets with 1 CPU and 1 thread each.
  2. How many VMs are running on 5117? Another reason for thread over-subscription could be that more than one VM shares 5117 and there are not enough threads to share among them.
luotao1 commented 5 years ago

@jczaja Thanks very much for such a detailed explanation!

Make ctest run in sequential mode.

#15196 makes the related ctests run in sequential mode.

Remove the timeout (temporarily) for parallel execution of ctest, just to confirm that this is due to slow execution and not an actual hang.

It's difficult to do this experiment, since the nightly stress testing runs on the develop branch.

Please tell us how you set up the VM (what kind of VM software is used?)

I'm not an expert on VMs either. We will first observe the nightly stress testing for several days after #15196 is merged.

luotao1 commented 5 years ago

@jczaja @jianhang-liu @yihuaxu test_analyzer_small_dam times out at compare_mkldnn again. Does the MKL-DNN kernel of conv3d have a problem? http://ci.paddlepaddle.org/viewLog.html?buildId=48242&tab=buildLog&buildTypeId=Paddle_PrCiNight&logTab=tree&filter=all&_focus=23274

luotao1 commented 5 years ago

@jczaja The detailed machine configuration: http://ci.paddlepaddle.org/viewLog.html?buildId=48242&tab=buildLog&buildTypeId=Paddle_PrCiNight&logTab=tree&filter=all&state=65&_focus=75#_state=65,34

[22:56:29][Step 1/1] Architecture:          x86_64
[22:56:29][Step 1/1] CPU op-mode(s):        32-bit, 64-bit
[22:56:29][Step 1/1] Byte Order:            Little Endian
[22:56:29][Step 1/1] CPU(s):                16
[22:56:29][Step 1/1] On-line CPU(s) list:   0-15
[22:56:29][Step 1/1] Thread(s) per core:    1
[22:56:29][Step 1/1] Core(s) per socket:    1
[22:56:29][Step 1/1] Socket(s):             16
[22:56:29][Step 1/1] NUMA node(s):          1
[22:56:29][Step 1/1] Vendor ID:             GenuineIntel
[22:56:29][Step 1/1] CPU family:            6
[22:56:29][Step 1/1] Model:                 85
[22:56:29][Step 1/1] Model name:            Intel(R) Xeon(R) Gold 5117 CPU @ 2.00GHz
[22:56:29][Step 1/1] Stepping:              4
[22:56:29][Step 1/1] CPU MHz:               2000.031
[22:56:29][Step 1/1] BogoMIPS:              4016.87
[22:56:29][Step 1/1] L1d cache:             32K
[22:56:29][Step 1/1] L1i cache:             32K
[22:56:29][Step 1/1] L2 cache:              4096K
[22:56:29][Step 1/1] NUMA node0 CPU(s):     0-15
jczaja commented 5 years ago

@luotao1 When we look at the specification of the Xeon Gold 5117: https://ark.intel.com/products/122460/Intel-Xeon-Gold-5117-Processor-19-25M-Cache-2-00-GHz-

There are 14 cores within a single socket, whereas the specification you sent (and the logs) shows 16 sockets with 1 core in each socket. So please tell me: 1) Why are those specifications different? 2) How was virtualization done (what virtualization software is used)? 3) How many virtual machines are running on 5117? 4) How many docker containers run simultaneously on 5117?

luotao1 commented 5 years ago

@jczaja @jianhang-liu Discussed with @tianshuo78520a:

  1. How was virtualization done (what virtualization software is used)?

We use OpenStack.

  2. How many virtual machines are running on 5117?

One or two virtual machines run on 5117.

  3. How many docker containers run simultaneously on 5117?

Only one docker container runs at a time.

jczaja commented 5 years ago

@luotao1 5117 provides AVX-512. Some other platforms that I have seen only provided AVX2. My question is: do you have any other platform in CI that has AVX-512? I understand that the problem (timeout) manifests itself only on 5117; is there any other platform in CI that works fine and also has AVX-512?

luotao1 commented 5 years ago

@jczaja #15335 disables conv3d MKL-DNN in dam to check whether conv3d MKL-DNN is the real cause.

luotao1 commented 5 years ago

I understand that the problem (timeout) manifests itself only on 5117; is there any other platform in CI that works fine and also has AVX-512?

I checked that all the failure logs are from 5117 only. We also have an Intel Xeon Processor (Skylake) machine with AVX-512, and it works fine with respect to this issue. http://ci.paddlepaddle.org/viewLog.html?tab=buildLog&buildTypeId=Paddle_PrCiNight&buildId=37730&_focus=75#_state=65

[21:01:39][Step 1/1] Architecture:          x86_64
[21:01:39][Step 1/1] CPU op-mode(s):        32-bit, 64-bit
[21:01:39][Step 1/1] Byte Order:            Little Endian
[21:01:39][Step 1/1] CPU(s):                26
[21:01:39][Step 1/1] On-line CPU(s) list:   0-25
[21:01:39][Step 1/1] Thread(s) per core:    1
[21:01:39][Step 1/1] Core(s) per socket:    1
[21:01:39][Step 1/1] Socket(s):             26
[21:01:39][Step 1/1] NUMA node(s):          1
[21:01:39][Step 1/1] Vendor ID:             GenuineIntel
[21:01:39][Step 1/1] CPU family:            6
[21:01:39][Step 1/1] Model:                 85
[21:01:39][Step 1/1] Model name:            Intel Xeon Processor (Skylake)
[21:01:39][Step 1/1] Stepping:              4
[21:01:39][Step 1/1] CPU MHz:               1999.993
[21:01:39][Step 1/1] BogoMIPS:              3999.98
[21:01:39][Step 1/1] Hypervisor vendor:     KVM
[21:01:39][Step 1/1] Virtualization type:   full
[21:01:39][Step 1/1] L1d cache:             32K
[21:01:39][Step 1/1] L1i cache:             32K
[21:01:39][Step 1/1] L2 cache:              4096K
[21:01:39][Step 1/1] L3 cache:              16384K
[21:01:39][Step 1/1] NUMA node0 CPU(s):     0-25
jczaja commented 5 years ago

@luotao1 The log you mentioned shows that test_analyzer_small_dam also fails on the other SKX machine; it is just not a timeout but a computational problem (a diff in results). Perhaps it is just a different outcome of the same problem.

luotao1 commented 5 years ago

Yes, it is the MKL diff problem, the same as https://github.com/PaddlePaddle/Paddle/pull/15116#issuecomment-453030549.

jczaja commented 5 years ago

@luotao1 Since a couple of people are looking at this issue, I'm sharing my current status. Hopefully it will be helpful.

I managed to reproduce the problem by running test_analyzer_small_dam in a loop (up to a hundred times). We can get either a crash (segfault) or a hang (timeout when running under ctest). The CI on 5117 builds Paddle WITHOUT ON_INFER=ON, and in that situation the test_analyzer_small_dam test can hang. If ON_INFER=ON is specified, then test_analyzer_small_dam randomly results in a segmentation fault.

Hang: execution hangs on a mutex when a Var is fetched here: https://github.com/PaddlePaddle/Paddle/blob/7e651a38dd9c3e8f68b75ba8e68086ef0c9151d1/paddle/fluid/framework/operator.cc#L1064

Crash: accessing the new scope's vars_ causes a crash, as vars_ contains invalid data.

Workaround: when we disable the caching of scopes in TryCreateTransferScope, no hang or crash is observed, e.g. https://github.com/PaddlePaddle/Paddle/blob/7e651a38dd9c3e8f68b75ba8e68086ef0c9151d1/paddle/fluid/framework/operator.cc#L1057

So currently we can see that there is a problem with the transfer scope, which randomly manifests as a hang or a crash depending on whether ON_INFER is set.
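
To make the failure mode easier to follow, here is a simplified, self-contained illustration of why caching raw pointers to transfer scopes can produce exactly this hang-or-crash behaviour. It is my own sketch, not Paddle's actual transfer_scope_cache code; the types and names are invented for illustration.

```cpp
#include <mutex>
#include <string>
#include <unordered_map>
#include <vector>

class Scope {
 public:
  Scope* NewScope() {
    std::lock_guard<std::mutex> lock(mu_);
    kids_.push_back(new Scope);  // child scopes are owned by the parent
    return kids_.back();
  }
  int* Var(const std::string& name) {
    std::lock_guard<std::mutex> lock(mu_);  // a hang shows up here when the
                                            // scope object is no longer valid
    return &vars_[name];
  }
  ~Scope() {
    for (auto* k : kids_) delete k;
  }

 private:
  std::mutex mu_;
  std::unordered_map<std::string, int> vars_;  // stand-in for real Variables
  std::vector<Scope*> kids_;
};

// A per-thread cache of transfer scopes, keyed by some (op, scope) hash.
thread_local std::unordered_map<size_t, Scope*> transfer_scope_cache;

Scope* TryCreateTransferScopeSketch(size_t key, Scope* parent) {
  auto it = transfer_scope_cache.find(key);
  if (it != transfer_scope_cache.end()) {
    // DANGER: if the parent that owned this child has been destroyed (or the
    // child deleted) since the entry was cached, this pointer dangles.
    // Reading vars_ then yields garbage (crash), or the mutex is in a broken
    // state (hang) -- matching the two symptoms described above.
    return it->second;
  }
  Scope* s = parent->NewScope();
  transfer_scope_cache[key] = s;
  return s;
}

int main() {
  Scope parent;
  Scope* t = TryCreateTransferScopeSketch(/*key=*/42, &parent);
  *t->Var("x") = 1;  // safe: parent is still alive and owns t
  // The workaround discussed in this thread corresponds to skipping the
  // cache entirely and always calling parent->NewScope().
  return 0;
}
```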

jianhang-liu commented 5 years ago

@jczaja Great! Actually, this afternoon we also found that building withOUT ON_INFER=ON is the key to reproducing it. We can now easily reproduce both types of CI error (timeout, and compare failure due to a diff) on our local 6151 server. For the timeout (hang) issue, we located the error at almost the same place as you (though not in as much detail). We wonder whether it's caused by the SCOPE_XXX_LOCK definition.

jianhang-liu commented 5 years ago

@luotao1 @jczaja We confirmed the same root cause as Jacek: the current code in TryCreateTransferScope (which uses a thread_local cache to avoid creating a transfer scope) can fail randomly, which causes this CI failure (a hang seen as a timeout, a segfault seen as a crash). By simply commenting it out (i.e. not using the cache and always creating "new_scope"), we no longer run into any error. We tried enabling/disabling this cache several times and confirmed that no issue occurs when the cache is disabled. @luotao1 Could you please check on the framework side for a cleaner fix? Thanks.

luotao1 commented 5 years ago

By simply commenting it out (i.e. not using the cache and always creating "new_scope")

@jianhang-liu Could you create a simple PR to show which lines are commented out?

@Superjomn Could you help look at the cache in TryCreateTransferScope?

luotao1 commented 5 years ago

https://github.com/PaddlePaddle/Paddle/blob/6597ccb01f1144cb4ba206de07fa9688631b4850/paddle/fluid/framework/transfer_scope_cache.cc#L30-L46 @Superjomn Is line 45 unused?

luotao1 commented 5 years ago

https://github.com/PaddlePaddle/Paddle/issues/15032#issuecomment-455990914 is hot-fixed in #15450

luotao1 commented 4 years ago

http://ci.paddlepaddle.org/viewLog.html?buildId=126022&buildTypeId=Paddle_PrCiNight&tab=buildLog&_focus=20439 http://ci.paddlepaddle.org/viewLog.html?buildId=126027&buildTypeId=Paddle_PrCiNight&tab=buildLog&_focus=20449 http://ci.paddlepaddle.org/viewLog.html?buildId=125989&buildTypeId=Paddle_PrCiNight&tab=buildLog&_focus=20433

[21:16:22]  [ RUN      ] Analyzer_MM_DNN.mkldnn_cache_clear

It currently hangs on Analyzer_MM_DNN.mkldnn_cache_clear (three times on the nightly CI). @jianhang-liu @LeoZhao-Intel

LeoZhao-Habana commented 4 years ago

Do we have any conclusion on why it hangs in TryCreateTransferScope()? Are there any race conditions or a multi-instance case?

luotao1 commented 4 years ago

English Description

After discussing with @LeoZhao-Intel and @jianhang-liu, we reached some common views:

  1. TryCreateTransferScope() was added to fix a GPU inference memory leak, and it is not required for CPU inference. See the note by @Superjomn: https://github.com/PaddlePaddle/Paddle/blob/e9c7e218f208d29b53284afd1a4c425f93acc4de/paddle/fluid/framework/operator.cc#L1106-L1107
  2. Removing TryCreateTransferScope() fixes the hang problem; see #15450.
  3. However, removing TryCreateTransferScope() causes CPU inference memory to grow, because we do not release the transfer_scope in OperatorWithKernel::RunImpl: https://github.com/PaddlePaddle/Paddle/blob/e9c7e218f208d29b53284afd1a4c425f93acc4de/paddle/fluid/framework/operator.cc#L923-L924

TODO:

  1. Remove TryCreateTransferScope() and release the transfer_scope in OperatorWithKernel::RunImpl when using NaiveExecutor in CPU inference (a sketch of the idea follows below). @LeoZhao-Intel
  2. Test the memory leak problem in the detection model. #18372 @LeoZhao-Intel @luotao1
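
A hedged sketch of TODO item 1 (an illustration of the idea only, not the actual patch; MiniScope and the helper names here are invented stand-ins for paddle::framework::Scope and the RunImpl code path):

```cpp
#include <memory>
#include <string>
#include <unordered_map>
#include <vector>

struct MiniScope {  // invented stand-in for paddle::framework::Scope
  std::unordered_map<std::string, std::vector<float>> vars;
  std::vector<std::unique_ptr<MiniScope>> kids;

  MiniScope* NewScope() {
    kids.emplace_back(new MiniScope);
    return kids.back().get();
  }
  void DeleteScope(MiniScope* child) {
    for (auto it = kids.begin(); it != kids.end(); ++it) {
      if (it->get() == child) {
        kids.erase(it);
        return;
      }
    }
  }
};

void RunKernelWithTransform(MiniScope* transfer_scope) {
  // placeholder for the data layout/dtype transform plus kernel execution
  transfer_scope->vars["transformed_input"].resize(16);
}

// CPU / NaiveExecutor path: no thread_local cache; the transfer scope is
// created per run and released afterwards, so memory does not keep growing.
void RunImplOnCpuSketch(MiniScope* exec_scope) {
  MiniScope* transfer_scope = exec_scope->NewScope();
  RunKernelWithTransform(transfer_scope);
  exec_scope->DeleteScope(transfer_scope);  // release the transfer_scope
}
```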

Chinese Description

A brief explanation of the cause:

  1. To fix a GPU inference memory leak, @春伟 added the TryCreateTransferScope function and, at the same time, stopped releasing the transfer_scope.
  2. The TryCreateTransferScope function is not needed for CPU inference; after removing it, the random timeout problem no longer occurs (verified).
  3. Currently, after removing the TryCreateTransferScope function, CPU inference shows growing memory usage, because the transfer_scope is not released.

TODO (@intel @luotao):

  1. For CPU inference, do not use the TryCreateTransferScope function, and release the transfer_scope.
  2. For CPU inference, verify the memory leak problem, the random timeout problem, and inference performance.