PaddlePaddle / Paddle

PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (PaddlePaddle core framework: high-performance single-machine and distributed training and cross-platform deployment for deep learning and machine learning)
http://www.paddlepaddle.org/
Apache License 2.0

test_analyzer_small_dam random fails on accuracy #16473

Closed luotao1 closed 5 years ago

luotao1 commented 5 years ago
[02:50:55]  I0327 02:50:55.226866 88684 helper.h:270] ====== batch_size: 1, repeat: 1, threads: 1, thread id: 0, latency: 34.1772ms, fps: 29.2592 ======
[02:50:55]  I0327 02:50:55.226940 88684 tester_helper.h:301] Thread 0 run 1 times...
[02:50:55]  I0327 02:50:55.258460 88684 helper.h:270] ====== batch_size: 1, repeat: 1, threads: 1, thread id: 0, latency: 31.4899ms, fps: 31.7562 ======
[02:50:55]  F0327 02:50:55.268436 88684 tester_helper.h:97] Check failed: std::abs(pdata_ref[j] - pdata[j]) <= FLAGS_accuracy (0.201252 vs. 0.001) 

CI log: http://ci.paddlepaddle.org/viewLog.html?buildId=75853&tab=buildLog&buildTypeId=Paddle_PrCi&logTab=tree&filter=all&_focus=25726

Is this issue related to #16316?

LeoZhao-Habana commented 5 years ago

I used the latest code and compiled locally with the same cmake options (without GPU). After running hundreds of tests, I can't reproduce it.

BTW, this PR's CI includes 2 changes; I'm not sure if that is related. Does it happen in the nightly build, or did it happen just once?

In the PR CI build, the project is built twice with different cmake options (some options are even opposite), but only the second build is tested. Is that normal?

cmake .. -DCMAKE_BUILD_TYPE=Release -DWITH_DSO=ON -DWITH_GPU=OFF -DWITH_AMD_GPU=OFF -DWITH_DISTRIBUTE=ON -DWITH_MKL=OFF -DWITH_NGRAPH=OFF -DWITH_AVX=ON -DWITH_GOLANG=OFF -DCUDA_ARCH_NAME=All -DWITH_PYTHON=ON -DCUDNN_ROOT=/usr/ -DWITH_TESTING=OFF -DCMAKE_MODULE_PATH=/opt/rocm/hip/cmake -DCMAKE_EXPORT_COMPILE_COMMANDS=ON -DWITH_CONTRIB=ON -DWITH_INFERENCE_API_TEST=ON -DINFERENCE_DEMO_INSTALL_DIR=/root/.cache/inference_demo -DWITH_ANAKIN=ON -DANAKIN_BUILD_FAT_BIN= -DANAKIN_BUILD_CROSS_PLANTFORM= -DPY_VERSION=2.7 -DCMAKE_INSTALL_PREFIX=/paddle/build -DWITH_JEMALLOC=OFF -DWITH_GRPC=ON

cmake .. -DCMAKE_BUILD_TYPE=Release -DWITH_DSO=ON -DWITH_GPU=ON -DWITH_AMD_GPU=OFF -DWITH_DISTRIBUTE=ON -DWITH_MKL=ON -DWITH_NGRAPH=ON -DWITH_AVX=ON -DWITH_GOLANG=OFF -DCUDA_ARCH_NAME=Auto -DWITH_PYTHON=ON -DCUDNN_ROOT=/usr/ -DWITH_TESTING=ON -DCMAKE_MODULE_PATH=/opt/rocm/hip/cmake -DCMAKE_EXPORT_COMPILE_COMMANDS=ON -DWITH_CONTRIB=ON -DWITH_INFERENCE_API_TEST=ON -DINFERENCE_DEMO_INSTALL_DIR=/root/.cache/inference_demo -DWITH_ANAKIN=ON -DANAKIN_BUILD_FAT_BIN= -DANAKIN_BUILD_CROSS_PLANTFORM= -DPY_VERSION=2.7 -DCMAKE_INSTALL_PREFIX=/paddle/build -DWITH_JEMALLOC=OFF -DWITH_GRPC=ON

luotao1 commented 5 years ago

Does it happen in the nightly build, or did it happen just once?

No, it happens during daytime builds.

but only the second build is tested. Is that normal?

Yes, it is normal.

LeoZhao-Habana commented 5 years ago

Please keep monitoring; I am running local tests in parallel.

LeoZhao-Habana commented 5 years ago

Reproduced on a local machine after thousands of test cycles; it looks like a random failure. Failures occur in both the mkl and mkldnn compare cases, and once it even happened on the first run. Trying to narrow down the root cause with more logs.

LeoZhao-Habana commented 5 years ago

From local testing, I confirmed it is related to fc_fuse_pass; after removing this pass, the issue is gone. The next step is to narrow it down at the low level of fc.

I suspect the transformer failure may have the same root cause as this one.

jianhang-liu commented 5 years ago

After removing fc_fuse_pass, there is no difference between the naive run and the predictor run, so no issue will appear since you are "comparing" the same thing.

LeoZhao-Habana commented 5 years ago

Yes, the point is to confirm that the issue is caused by fc_fuse_pass and not by other factors, e.g. feed/fetch. Then we can concentrate on fc_fuse_pass.

LeoZhao-Habana commented 5 years ago

From the first round of investigation, the issue seems to be in FCCompute in the FC op:

    // Fetch the cached jit vadd kernel for vectors of length N.
    auto compute =
        jit::KernelFuncs<jit::VAddTuple<T>, platform::CPUPlace>::Cache().At(N);
#ifdef PADDLE_WITH_MKLML
#pragma omp parallel for
#endif
    for (int i = 0; i < M; i++) {
      T* dst = Y + i * N;
      // Add the bias row B onto row i of the M x N output Y: dst = B + dst.
      compute(B, dst, dst, N);
    }

I replaced this part with a simple loop implementation instead of jit, and the issue seems gone, but it still needs double checking to confirm, since the reproduction rate of this issue is very low, ~0.1%.
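
For illustration, a minimal sketch of the kind of plain-loop bias addition used instead of the jit vadd kernel, reusing the names Y, B, M and N from the snippet above (this is not the exact patch):

// Hypothetical plain-loop replacement for the jit vadd call above:
// add the bias row B (length N) onto each of the M rows of the output Y.
template <typename T>
void AddBiasNaive(T* Y, const T* B, int M, int N) {
  for (int i = 0; i < M; ++i) {
    T* dst = Y + i * N;      // row i of the output
    for (int j = 0; j < N; ++j) {
      dst[j] += B[j];        // same math as compute(B, dst, dst, N)
    }
  }
}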

jianhang-liu commented 5 years ago

@tensor-tang Leo's trial matches my original suspicion, i.e. something may be wrong in the JIT cache/selection in the MKL OFF case.

LeoZhao-Habana commented 5 years ago

I am running more tests. If it proves true, I suspect the transformer random failure has the same root cause, given that reproducing it on transformer takes much longer. I am also planning to try the blas vadd here and compare the result, since the original non-fused case works that way.

tensor-tang commented 5 years ago

Have you tried removing #pragma omp parallel for?

LeoZhao-Habana commented 5 years ago

Not yet. I was planning to do that, but each test cycle takes a long time; I will add it to the trial list.

LeoZhao-Habana commented 5 years ago

After thousands of tests with different options over the weekend, the conclusion is "still need more tests to build up a trustworthy test base".

Two major issues were found in the long stress tests (more than 10K runs):

1. Accuracy

Accuracy mismatches and exceptions from the compare check. Besides the case failures, there is a potential issue in the test framework: the defined accuracy threshold (0.001) is based on the absolute diff instead of a percentage. I suggest adjusting the rule in tester_helper.h (see the sketch after this list).

2. Fuse pass failure

According to the log, an exception happens between seqpool_concat_fuse_pass and seqconv_eltadd_relu_fuse_pass; it seems to be in seqpool_concat_fuse_pass.
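
On the tolerance point in item 1 above, a minimal sketch of the suggested relative-tolerance rule (illustrative only, not the actual tester_helper.h code):

#include <algorithm>
#include <cmath>

// Current rule is roughly: |ref - out| <= FLAGS_accuracy (absolute diff).
// Suggested rule: also allow a diff proportional to the magnitude of the
// reference value, so large outputs are not held to a tiny absolute gap.
bool CloseEnough(float ref, float out, float abs_tol = 1e-3f,
                 float rel_tol = 1e-3f) {
  float diff = std::fabs(ref - out);
  return diff <= std::max(abs_tol, rel_tol * std::fabs(ref));
}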

For the small dam case, the test results are summarized in the attached image (test result table).

From the table analysis alone, the few solid conclusions we can draw are:

  1. "MKL ON" impacts result much, MKL may be a key factor in this random issue.
  2. Current code has highest reproduce rate.

But there is another potential issue we may be overlooking: we always trust that the Native path is correct and treat it as the reference, but what if it is not?

Next steps to do:

  1. Run >10K tests with fc_fuse_pass removed to confirm result consistency.
  2. Add accuracy result printing to compare the native and analysis paths.
  3. Re-run tests comparing MKL ON and OFF.

luotao1 commented 5 years ago

Thanks for your progress!

the defined accuracy threshold (0.001) is based on the absolute diff instead of a percentage. I suggest adjusting the rule in tester_helper.h

For the accuracy gap, we do use the absolute diff. In CI we use 1e-3, but for some special business lines we use 1e-5/1e-6.

LeoZhao-Habana commented 5 years ago

Got it; so the current accuracy-gap definition is based on the required prediction precision.

tensor-tang commented 5 years ago

Thanks for such a detailed analysis!

According to the log, an exception happens between seqpool_concat_fuse_pass and seqconv_eltadd_relu_fuse_pass; it seems to be in seqpool_concat_fuse_pass.

What do you mean by this? Does the exception happen inside this pass or in the compute function?

LeoZhao-Habana commented 5 years ago

Thanks for such a detailed analysis!

According to the log, an exception happens between seqpool_concat_fuse_pass and seqconv_eltadd_relu_fuse_pass; it seems to be in seqpool_concat_fuse_pass.

What do you mean by this? Does the exception happen inside this pass or in the compute function?

This issue is not related to the accuracy issue; it is another issue found during testing. Just from log analysis, there was an issue happening in seqpool_concat_fuse_pass during the long stress test.

luotao1 commented 5 years ago

it is another issue found during testing. Just from log analysis

How about creating a new issue and pasting the log and the log analysis?

LeoZhao-Habana commented 5 years ago

The new issue is https://github.com/PaddlePaddle/Paddle/issues/16586.

LeoZhao-Habana commented 5 years ago

More test results (over 12,000 runs) and investigation findings:

  1. Confirmed that removing fc_fuse_pass or setting WITH_MKL=OFF makes this issue go away.
  2. Confirmed that the output from NativePredictor is always correct.

Another new finding: when WITH_MKL=ON and WITH_MKLDNN=OFF, this issue is gone. When WITH_MKL=ON, the blas functions dlopen libmklml_intel.so at runtime to obtain the function pointers, while with WITH_MKLDNN=ON the library is linked into the program binary directly. These are two different ways of using the mkl functions and a potential point of conflict, but to be honest, solid evidence to prove it is still missing.

tensor-tang commented 5 years ago

Another new finding: when WITH_MKL=ON and WITH_MKLDNN=OFF, this issue is gone.

That's interesting.

Is there any potential issue with omp? Both MKLDNN and MKL need it.

And we explicitly dlopen libmklml_intel.so but not libiomp5.so.

LeoZhao-Habana commented 5 years ago

With WITH_MKL=ON and WITH_MKLDNN=OFF, the program binary reports:

 error while loading shared libraries: libiomp5.so: cannot open shared object file: No such file or directory

I need to set LD_LIBRARY_PATH manually to let it run. That means iomp is not added to the link list properly when dlopen is used; I think this is another issue in the build rules for the cmake option combination "WITH_MKL=ON && WITH_MKLDNN=OFF".

From code investigation, the dam compare case should not be affected by MKLDNN, since no mkldnn fuse is used at all, yet it does impact the test result; something seems to go wrong at random.

BTW, with the normal cmake options, the dam compare NativePredictor output sticks to "0.740587" while the AnalysisPredictor output wavers between "0.740587" and "0.740586", and then the random error happens. In contrast, with WITH_MKL=OFF or WITH_MKL=ON && WITH_MKLDNN=OFF (actually WITH_MKL=OFF also implies WITH_MKLDNN=OFF), both predictors' outputs stick to "0.740587" exactly.

LeoZhao-Habana commented 5 years ago

I am trying to change the dlopen flags, using RTLD_GLOBAL and RTLD_NOW. Any other suggestions? @tensor-tang
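
For context, a standalone sketch of what the changed flags look like (the library name is just an example; this is not the actual Paddle dynamic-loader code):

#include <dlfcn.h>
#include <cstdio>

int main() {
  // RTLD_NOW resolves all symbols immediately instead of lazily;
  // RTLD_GLOBAL makes the loaded symbols visible to libraries loaded later.
  void* handle = dlopen("libmklml_intel.so", RTLD_NOW | RTLD_GLOBAL);
  if (handle == nullptr) {
    std::fprintf(stderr, "dlopen failed: %s\n", dlerror());
    return 1;
  }
  dlclose(handle);
  return 0;
}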

tensor-tang commented 5 years ago

Actually iomp is linked here: https://github.com/PaddlePaddle/Paddle/blob/b75a69bad6ecc73a057faae3316d29a3a5c9386f/cmake/generic.cmake#L285

use "RTLD_GLOBAL" and "RTLD_NOW"

I am afraid it's not acceptable as a final solution, since some GPU libs are loaded in the same way, but you can try it to see if it helps.

LeoZhao-Habana commented 5 years ago

I tried different dlopen flags; no effect at all. I just realized that the output wavering between "0.740587" and "0.740586" is caused by mkldnn: the mkldnn compare always outputs "0.740586", while the mkl compare always outputs "0.740587".

I wonder whether mkldnn execution may introduce the random issue; I am testing with the normal cmake options but with all mkldnn cases removed, to see if the issue goes away.
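
One possible (unconfirmed) explanation for the last-digit wobble is that the mkl and mkldnn kernels accumulate in different orders, and floating-point addition is not associative, so the same inputs can legitimately produce results that differ in the last digit. A tiny standalone illustration:

#include <cstdio>

int main() {
  // Summing the same values in a different order can change the result,
  // because float addition is not associative. In real kernels the effect
  // is usually confined to the last digit, e.g. 0.740586 vs 0.740587.
  const float vals[4] = {1.0e8f, 1.0f, -1.0e8f, 1.0f};
  float forward = 0.0f;
  for (int i = 0; i < 4; ++i) forward += vals[i];               // 1e8 + 1 - 1e8 + 1
  float reordered = (vals[0] + vals[2]) + (vals[1] + vals[3]);  // (1e8 - 1e8) + (1 + 1)
  std::printf("forward=%.1f reordered=%.1f\n", forward, reordered);
  return 0;
}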

LeoZhao-Habana commented 5 years ago

Actually iomp is linked here: Paddle/cmake/generic.cmake

Line 285 in b75a69b

target_link_libraries(${TARGET_NAME} "-L${MKLML_LIB_DIR} -liomp5 -Wl,--as-needed")

use "RTLD_GLOBAL" and "RTLD_NOW"

The reality is that if WITH_MKL=ON and WITH_MKLDNN=OFF are set in cmake, the test tools fail to start, so this link rule does not seem to work well in that case. I will check this issue tomorrow if I have time.

LeoZhao-Habana commented 5 years ago

The overnight test proves that this random issue is introduced by the mkldnn path. Considering the additional fact that removing fc_fuse_pass also makes the issue disappear, the next investigation step will focus on fc_fuse_pass in the mkldnn-enabled case.

luotao1 commented 5 years ago

I found that jit_kernel_test randomly fails at http://ci.paddlepaddle.org/viewLog.html?buildId=79730&tab=buildLog&buildTypeId=Paddle_PrCiReleasePython3514&logTab=tree&filter=all&_focus=14109

[01:32:45]112/540 Test  #91: jit_kernel_test .................................***Failed    9.09 sec
[01:32:45]WARNING: Logging before InitGoogleLogging() is written to STDERR
[01:32:45]W0403 01:32:36.980902 124724 init.cc:83] Cannot enable P2P access from 0 to 1
[01:32:45]W0403 01:32:36.981353 124724 init.cc:83] Cannot enable P2P access from 1 to 0
[01:32:45][==========] Running 48 tests from 4 test cases.
[01:32:45][----------] Global test environment set-up.
[01:32:45][----------] 4 tests from JITKernel_pool
[01:32:45][ RUN      ] JITKernel_pool.jitcreator
[01:32:45][       OK ] JITKernel_pool.jitcreator (0 ms)
[01:32:45][ RUN      ] JITKernel_pool.jitpool
[01:32:45][       OK ] JITKernel_pool.jitpool (0 ms)
[01:32:45][ RUN      ] JITKernel_pool.more
[01:32:45]/paddle/paddle/fluid/operators/jit/test.cc:1000: Failure
[01:32:45]      Expected: kers.size()
[01:32:45]      Which is: 10
[01:32:45]To be equal to: 8UL
[01:32:45]      Which is: 8

Is it related to this issue?

LeoZhao-Habana commented 5 years ago

I think this is another issue, at least in the small dam case. High-stress testing may expose new issues, e.g. running the same case more than 10K times.

From comparing my multiple test results, this random issue seems to follow a pattern: it is tied to both mkldnn execution and fc_fuse_pass, and if we disable either of them, the issue is gone. The mkldnn stack itself is fine when it does not run, so I strongly suspect that mkldnn execution interferes with other cases somewhere; this needs further narrowing down.

LeoZhao-Habana commented 5 years ago

Narrowed down the test-case sequence for reproducing the random issue. Within one test cycle (i.e. one process), placing the mkldnn case (profile_mkldnn) before compare triggers the issue, so it is clear that mkldnn introduces the problem.

luotao1 commented 5 years ago

Is https://github.com/PaddlePaddle/Paddle/issues/16609#issuecomment-479518016 related to this issue?

LeoZhao-Habana commented 5 years ago

@luotao1 Not sure, but worth a try.

Update on the investigation results for 4/3:

  1. There is a high probability that mkldnn can be ruled out: I kept all test cases but disabled mkldnn in the mkldnn-related cases, and I still saw the accuracy issue once (the reproduction rate is low, so this still needs double confirmation).
  2. Found another approach to reproduce more quickly. I added a loop in the dam test, e.g. in the compare case:
    for (int i = 0; i < 10000; ++i) {
      compare(true);   // mkldnn compare case
      compare(false);  // non-mkldnn compare case
    }

    Just running this case on its own, the accuracy issue will happen soon if you are 'lucky'. But while running this new case, I found another issue with mkldnn (I am a good issue finder :(; should I raise a new issue?): it gets stuck somewhere after running 2-3 times, and from the log it seems to hang in the mkldnn buffer transform; I have not spent much time investigating it. A workaround is to set a wrong mkldnn ops list, which forces all ops to non-mkldnn ops, and then the accuracy issue can still be reproduced.

Next step, I will focus on the fuse_pass-related parts.

luotao1 commented 5 years ago

I found another issue with mkldnn (I am a good issue finder :(; should I raise a new issue?): it gets stuck somewhere after running 2-3 times, and from the log it seems to hang in the mkldnn buffer transform

Maybe it is related to https://github.com/PaddlePaddle/Paddle/issues/15032#issuecomment-455534148

LeoZhao-Habana commented 5 years ago

BTW, proving failure is easy and quick, since you only need to see one error, but proving success takes a long time, at least >10K runs and 6+ hours of testing, so it is somewhat time consuming.

LeoZhao-Habana commented 5 years ago

I found another issue with mkldnn (I am a good issue finder :(; should I raise a new issue?): it gets stuck somewhere after running 2-3 times, and from the log it seems to hang in the mkldnn buffer transform

Maybe it is related to #15032 (comment)

Yes, it is the same issue as #15032.

tensor-tang commented 5 years ago

I found that jit_kernel_test randomly fails at http://ci.paddlepaddle.org/viewLog.html?buildId=79730&tab=buildLog&buildTypeId=Paddle_PrCiReleasePython3514&logTab=tree&filter=all&_focus=14109

@luotao1 That's an issue in the release branch, not related to this one.

LeoZhao-Habana commented 5 years ago

Potentially related issue: https://github.com/PaddlePaddle/Paddle/issues/16688

LeoZhao-Habana commented 5 years ago

A bunch of tests point to patterns::FC in fc_fuse_pass. Even after removing the new fc op creation in the lambda, the issue is still there, so it seems the issue is introduced by GraphPatternDetector.

LeoZhao-Habana commented 5 years ago

The issue seems to be in patterns::FC:

PDNode *patterns::FC::operator()(paddle::framework::ir::PDNode *x,
                                 bool with_bias) {
  // Create shared nodes.
  x->assert_is_op_input("mul", "X");
  auto *mul = pattern->NewNode(mul_repr())->assert_is_op("mul");

  auto *mul_w_var = pattern->NewNode(w_repr())
                        ->AsInput()
                        ->assert_is_persistable_var()
                        ->assert_is_op_input("mul", "Y");

  auto *mul_out_var =
      pattern->NewNode(mul_out_repr())->assert_is_op_output("mul");  // <-- suspect line

  if (!with_bias) {  // not with bias
    // Add links.
    mul->LinksFrom({x, mul_w_var}).LinksTo({mul_out_var});
    return mul_out_var;

"mul_out_var" line is the edge for the issue, but it is strange that there is no obvious defect there.

LeoZhao-Habana commented 5 years ago

PR #16756 fixed this issue according to my tests. From the root-cause analysis, it seems not to be related to the fuse pass itself, nor to op creation/deletion or graph changes. From the patch, the randomness was introduced by unordered_map/set in the final "graph to program" step.

But some cases still can't be well explained, e.g. why removing fc_fuse_pass makes the issue disappear completely, or why commenting out some code lines makes the issue appear or disappear even in scenarios with no graph changes.
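
To make the unordered_map/set point concrete, a minimal sketch (not the actual Paddle code or the PR #16756 fix) of how iterating an unordered container can make the emitted op order differ between runs, and how sorting by a stable key removes the randomness:

#include <algorithm>
#include <cstdio>
#include <string>
#include <unordered_set>
#include <vector>

int main() {
  // Iteration order of an unordered_set is unspecified, so emitting ops by
  // walking such a container can yield a different program order per run/build.
  std::unordered_set<std::string> ops = {"mul", "elementwise_add", "relu", "fc"};

  // Copying the elements out and sorting them by a stable key makes the
  // resulting order deterministic.
  std::vector<std::string> ordered(ops.begin(), ops.end());
  std::sort(ordered.begin(), ordered.end());
  for (const auto& name : ordered) std::printf("%s\n", name.c_str());
  return 0;
}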