Used the latest code and compiled locally with the same cmake options without GPU, ran hundreds of tests, can't reproduce.
BTW, this PR CI includes 2 changes, not sure if that is related. Does it happen in the nightly build, or did it happen just once?
In the PR CI build, it builds twice with different cmake options (some options are opposite), but only tests the second build. Is that normal?
cmake .. -DCMAKE_BUILD_TYPE=Release -DWITH_DSO=ON -DWITH_GPU=OFF -DWITH_AMD_GPU=OFF -DWITH_DISTRIBUTE=ON -DWITH_MKL=OFF -DWITH_NGRAPH=OFF -DWITH_AVX=ON -DWITH_GOLANG=OFF -DCUDA_ARCH_NAME=All -DWITH_PYTHON=ON -DCUDNN_ROOT=/usr/ -DWITH_TESTING=OFF -DCMAKE_MODULE_PATH=/opt/rocm/hip/cmake -DCMAKE_EXPORT_COMPILE_COMMANDS=ON -DWITH_CONTRIB=ON -DWITH_INFERENCE_API_TEST=ON -DINFERENCE_DEMO_INSTALL_DIR=/root/.cache/inference_demo -DWITH_ANAKIN=ON -DANAKIN_BUILD_FAT_BIN= -DANAKIN_BUILD_CROSS_PLANTFORM= -DPY_VERSION=2.7 -DCMAKE_INSTALL_PREFIX=/paddle/build -DWITH_JEMALLOC=OFF -DWITH_GRPC=ON
cmake .. -DCMAKE_BUILD_TYPE=Release -DWITH_DSO=ON -DWITH_GPU=ON -DWITH_AMD_GPU=OFF -DWITH_DISTRIBUTE=ON -DWITH_MKL=ON -DWITH_NGRAPH=ON -DWITH_AVX=ON -DWITH_GOLANG=OFF -DCUDA_ARCH_NAME=Auto -DWITH_PYTHON=ON -DCUDNN_ROOT=/usr/ -DWITH_TESTING=ON -DCMAKE_MODULE_PATH=/opt/rocm/hip/cmake -DCMAKE_EXPORT_COMPILE_COMMANDS=ON -DWITH_CONTRIB=ON -DWITH_INFERENCE_API_TEST=ON -DINFERENCE_DEMO_INSTALL_DIR=/root/.cache/inference_demo -DWITH_ANAKIN=ON -DANAKIN_BUILD_FAT_BIN= -DANAKIN_BUILD_CROSS_PLANTFORM= -DPY_VERSION=2.7 -DCMAKE_INSTALL_PREFIX=/paddle/build -DWITH_JEMALLOC=OFF -DWITH_GRPC=ON
Does it happen in the nightly build, or did it happen just once?
No, it happens in the daytime.
but only tests the second build, is that normal?
Yes, it is normal.
Please keep monitoring; I am running local tests simultaneously.
Reproduced on a local machine after thousands of test cycles; it looks like a random failure. Failures occur in both the mkl and the mkldnn compare, and once it even happened in the first run. Trying to narrow down the root cause with more logs.
From local tests, confirmed it is related to fc_fuse_pass; after removing this pass, the issue is gone. Next step is to narrow down the low-level fc code.
I suspect the transformer failure may have the same root cause as this.
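For context, dropping a pass for this kind of experiment goes through the predictor config; a minimal sketch, assuming the AnalysisConfig::pass_builder() API (the helper name is mine):

#include "paddle/fluid/inference/api/paddle_analysis_config.h"

// Hedged sketch: remove fc_fuse_pass from the analysis pipeline so the
// predictor executes the same un-fused graph as the naive run.
void DisableFcFusePass(paddle::AnalysisConfig* cfg) {
  cfg->pass_builder()->DeletePass("fc_fuse_pass");
}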
After removing fc_fuse_pass, there is no difference between the naive run and the predictor run, so no issue can appear since you are "comparing" the same thing.
Yes, the point is to confirm that the issue is caused by fc_fuse_pass without other factors, e.g. feed/fetch. Then we can concentrate on fc_fuse_pass.
From the first round of investigation, the issue seems to be in FCCompute in the FC op:
auto compute =
    jit::KernelFuncs<jit::VAddTuple<T>, platform::CPUPlace>::Cache().At(N);
#ifdef PADDLE_WITH_MKLML
#pragma omp parallel for
#endif
for (int i = 0; i < M; i++) {
  // Add bias B (length N) to row i of the M x N output Y, in place,
  // through the cached JIT vadd kernel keyed by N.
  T* dst = Y + i * N;
  compute(B, dst, dst, N);
}
I replaced this part with a simple loop implementation instead of jit and the issue seems gone, but it still needs a double check to confirm, since the reproduction rate of this issue is very low, ~0.1%.
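For reference, a minimal sketch of the kind of plain-loop replacement meant above (my reconstruction, not the actual patch):

// Hedged sketch: add bias B (length N) to each of the M rows of Y in
// place, with no JIT kernel and no OpenMP involved.
for (int i = 0; i < M; i++) {
  T* dst = Y + i * N;
  for (int j = 0; j < N; j++) {
    dst[j] += B[j];
  }
}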
@tensor-tang Leo's trial matches my original suspicion, i.e. something may be wrong in JIT cache/selection in the MKL OFF case.
I am running more tests. If it proves true, I suspect the transformer random failure has the same root cause, given that the transformer takes longer to reproduce. I am also planning to try blas vadd here to compare the results, since the original non-fused case works that way.
Have you tried removing #pragma omp parallel for?
Not yet. I was planning to do that, but each test cycle costs a long time; I will put it on the trial list.
With thousands of tests under different options over the weekend, the conclusion is "still need more tests to build up a trustworthy test base".
2 major issues were found in long stress tests (more than 10K runs):
Accuracy mismatch and an exception from the compare filter. Besides the case failure, there is a potential issue in the test framework: the defined accuracy gap threshold (0.001) is based on absolute diff instead of percentage; suggest adjusting the rules in tester_helper.h.
The exception happens between seqpool_concat_fuse_pass and seqconv_eltadd_relu_fuse_pass according to the log; it seems to be in seqpool_concat_fuse_pass.
For the small dam case, the test results show:
Just from the table analysis, the few solid conclusions we can get are:
But there is another potential issue we may be ignoring: we always trust that the Native path is correct and treat it as the reference, but what if it is not?
Next steps:
Thanks for your progress!
the defined accuracy gap threshold (0.001) is based on absolute diff instead of percentage; suggest adjusting the rules in tester_helper.h
For the accuracy gap, we indeed use absolute diff. In CI we use 1e-3, but for some special business lines we use 1e-5/1e-6.
Got it, so the current accuracy gap definition is based on prediction precision.
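For illustration, a minimal sketch of the difference between the two rules; CheckAbsolute/CheckRelative are hypothetical names, not the actual tester_helper.h code:

#include <algorithm>
#include <cmath>

// Hedged sketch of the two tolerance rules; illustrative only.
bool CheckAbsolute(float ref, float out, float eps = 1e-3f) {
  return std::fabs(ref - out) <= eps;  // current rule: absolute diff
}

bool CheckRelative(float ref, float out, float eps = 1e-3f) {
  // suggested rule: diff measured against the reference magnitude
  return std::fabs(ref - out) <= eps * std::max(std::fabs(ref), 1e-6f);
}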
Thanks for your detailed analysis!
The exception happens between seqpool_concat_fuse_pass and seqconv_eltadd_relu_fuse_pass according to the log; it seems to be in seqpool_concat_fuse_pass.
What do you mean by this? Does the exception happen inside this pass or in the compute function?
This is not related to the accuracy issue; it is another issue found in testing. Just from log analysis, something went wrong in seqpool_concat_fuse_pass during the long stress test.
it is another issue found in testing. Just from log analysis
How about creating a new issue and pasting the log and the log analysis?
https://github.com/PaddlePaddle/Paddle/issues/16586 is the new issue.
More test results (over 12000 runs) and investigation findings:
Another new finding: when WITH_MKL=ON and WITH_MKLDNN=OFF, this issue is gone. When WITH_MKL=ON, the blas functions use dlopen to load libmklml_intel.so and fetch function pointers, while if WITH_MKLDNN=ON, libmklml_intel.so is linked into the program binary directly. That makes two different manners of using the mkl functions, which is a potential point of conflict, though to be honest there is still no solid evidence to prove it.
Another new finding: when WITH_MKL=ON and WITH_MKLDNN=OFF, this issue is gone.
That's interesting.
Is there any potential issue with omp? Both MKLDNN and MKL need it.
And we explicitly dlopen libmklml_intel.so, but not libiomp5.so.
When WITH_MKL=ON and WITH_MKLDNN=OFF, the program binary reports:
error while loading shared libraries: libiomp5.so: cannot open shared object file: No such file or directory
I need to set LD_LIBRARY_PATH manually to let it run. That means iomp is not added to the link list properly when dlopen is used; I think this is another issue in the cmake files for the option combination "WITH_MKL=ON && WITH_MKLDNN=OFF".
From code investigation, the dam compare case should not be affected by MKLDNN since the mkldnn fuses are not used at all, yet it does impact the test result; something seems to go randomly wrong.
BTW, with the normal cmake options, the dam compare NativePredictor output sticks to "0.740587" while the AnalysisPredictor output wavers between "0.740587" and "0.740586", and then the random error happens. With WITH_MKL=OFF or WITH_MKL=ON && WITH_MKLDNN=OFF (WITH_MKL=OFF also implies WITH_MKLDNN=OFF), both predictors' outputs stick to "0.740587" precisely.
I am trying to change the flags in dlopen, using "RTLD_GLOBAL" and "RTLD_NOW"; any other suggestions? @tensor-tang
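A minimal sketch of that flag change, assuming plain dlfcn usage (Paddle's actual loading lives in its platform/dynload helpers):

#include <dlfcn.h>
#include <cstdio>

int main() {
  // Hedged sketch: load the MKLML library eagerly (RTLD_NOW) and export
  // its symbols globally (RTLD_GLOBAL) instead of RTLD_LAZY | RTLD_LOCAL.
  void* handle = dlopen("libmklml_intel.so", RTLD_NOW | RTLD_GLOBAL);
  if (handle == nullptr) {
    std::fprintf(stderr, "dlopen failed: %s\n", dlerror());
    return 1;
  }
  // Functions would then be fetched via dlsym(handle, "cblas_sgemm"), etc.
  dlclose(handle);
  return 0;
}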
Actually iomp is linked here: https://github.com/PaddlePaddle/Paddle/blob/b75a69bad6ecc73a057faae3316d29a3a5c9386f/cmake/generic.cmake#L285
use "RTLD_GLOBAL" and "RTLD_NOW"
I am afraid that's not acceptable as a final solution, since some GPU libs are loaded this way as well, but you can try it to see if it helps.
Tried different dlopen flags, no effect at all. Just realized that the output wavering between "0.740587" and "0.740586" is caused by mkldnn: the mkldnn compare always outputs "0.740586", while the mkl compare always outputs "0.740587".
Wondering whether mkldnn execution introduces the random issue; testing with the normal cmake options but removing all mkldnn cases to see if the issue goes away.
Actually iomp is linked here: Paddle/cmake/generic.cmake, line 285 in b75a69b:
target_link_libraries(${TARGET_NAME} "-L${MKLML_LIB_DIR} -liomp5 -Wl,--as-needed")
use "RTLD_GLOBAL" and "RTLD_NOW"
The reality is that if WITH_MKL=ON and WITH_MKLDNN=OFF are set in cmake, the test tools hit the failure, so it seems this link rule doesn't work well in that case. Will check this issue tomorrow if I have time.
The overnight test proves this random issue is introduced by the mkldnn path. Considering that removing fc_fuse_pass also makes the issue disappear, the next investigation step will focus on fc_fuse_pass in the mkldnn-enabled case.
I find jit_kernel_test randomly fails at http://ci.paddlepaddle.org/viewLog.html?buildId=79730&tab=buildLog&buildTypeId=Paddle_PrCiReleasePython3514&logTab=tree&filter=all&_focus=14109
[01:32:45]112/540 Test #91: jit_kernel_test .................................***Failed 9.09 sec
[01:32:45]WARNING: Logging before InitGoogleLogging() is written to STDERR
[01:32:45]W0403 01:32:36.980902 124724 init.cc:83] Cannot enable P2P access from 0 to 1
[01:32:45]W0403 01:32:36.981353 124724 init.cc:83] Cannot enable P2P access from 1 to 0
[01:32:45][==========] Running 48 tests from 4 test cases.
[01:32:45][----------] Global test environment set-up.
[01:32:45][----------] 4 tests from JITKernel_pool
[01:32:45][ RUN ] JITKernel_pool.jitcreator
[01:32:45][ OK ] JITKernel_pool.jitcreator (0 ms)
[01:32:45][ RUN ] JITKernel_pool.jitpool
[01:32:45][ OK ] JITKernel_pool.jitpool (0 ms)
[01:32:45][ RUN ] JITKernel_pool.more
[01:32:45]/paddle/paddle/fluid/operators/jit/test.cc:1000: Failure
[01:32:45] Expected: kers.size()
[01:32:45] Which is: 10
[01:32:45]To be equal to: 8UL
[01:32:45] Which is: 8
Is it related to this issue?
I think this is another issue, at least in the small dam case. High-stress testing may expose new issues, e.g. running the same case more than 10K times.
Comparing my multiple test results, this random issue seems to follow a pattern: it is bound to mkldnn execution plus fc_fuse_pass, and if we disable either of them, the issue is gone. The mkldnn stack itself is fine if it doesn't run, so I highly suspect mkldnn execution somewhere interferes with other cases; need to narrow down further.
Narrowed down the test case sequence for reproducing the random issue. Within one test cycle (i.e. one process), placing the mkldnn case (profile_mkldnn) before compare triggers the issue; it is clear that mkldnn introduces it.
Is https://github.com/PaddlePaddle/Paddle/issues/16609#issuecomment-479518016 related to this issue?
@luotao1 Not sure, but worth a try.
Update on investigation results for 4/3:
for (int i = 0; i < 10000; i++) {
  compare(true);   // mkldnn compare case
  compare(false);  // non-mkldnn compare case
}
Just running this case dedicatedly, the accuracy issue happens soon if you are 'lucky'. But while running this new case I found another mkldnn issue (I am a good issue finder :(, should I raise a new issue?): it gets stuck somewhere after 2-3 runs, and from the log it seems to hang in the mkldnn buffer transform; I haven't spent much time investigating it. A workaround is setting a wrong mkldnn ops list, which forces all ops onto the non-mkldnn path, and then the accuracy issue can be reproduced (see the sketch after this comment).
Next step, I will focus on fuse_pass related stuff.
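A minimal sketch of that workaround, assuming the AnalysisConfig::SetMKLDNNOp API; the op name below is a hypothetical placeholder that matches nothing:

#include "paddle/fluid/inference/api/paddle_analysis_config.h"

// Hedged sketch: keep MKL-DNN enabled but register an op list that matches
// no real op, forcing every op onto the non-mkldnn path.
void ForceNonMKLDNNOps(paddle::AnalysisConfig* cfg) {
  cfg->EnableMKLDNN();
  cfg->SetMKLDNNOp({"__no_such_op__"});  // hypothetical placeholder name
}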
I found another mkldnn issue (I am a good issue finder :(, should I raise a new issue?): it gets stuck somewhere after 2-3 runs, and from the log it seems to hang in the mkldnn buffer transform
Maybe it is related to https://github.com/PaddlePaddle/Paddle/issues/15032#issuecomment-455534148
BTW, proving failure is easy and quick (you just need to see one error), but proving success takes a long time, at least >10K runs and 6+ hours of testing, so it is a little time consuming.
I found another mkldnn issue (I am a good issue finder :(, should I raise a new issue?): it gets stuck somewhere after 2-3 runs, and from the log it seems to hang in the mkldnn buffer transform
Maybe it is related to #15032 (comment)
Yes, it is the same issue as #15032.
I find jit_kernel_test randomly fails at http://ci.paddlepaddle.org/viewLog.html?buildId=79730&tab=buildLog&buildTypeId=Paddle_PrCiReleasePython3514&logTab=tree&filter=all&_focus=14109
@luotao1 That's an issue in the release branch, not related to this one.
Potentially related issue: https://github.com/PaddlePaddle/Paddle/issues/16688
A bunch of tests point to patterns::FC in fc_fuse_pass; even after removing the new fc op creation in the lambda, the issue is still there. It seems the issue is introduced by GraphPatternDetector.
The issue seems to be in patterns::FC:
PDNode *patterns::FC::operator()(paddle::framework::ir::PDNode *x,
                                 bool with_bias) {
  // Create shared nodes.
  x->assert_is_op_input("mul", "X");
  auto *mul = pattern->NewNode(mul_repr())->assert_is_op("mul");
  auto *mul_w_var = pattern->NewNode(w_repr())
                        ->AsInput()
                        ->assert_is_persistable_var()
                        ->assert_is_op_input("mul", "Y");
  auto *mul_out_var =
      pattern->NewNode(mul_out_repr())->assert_is_op_output("mul");  // <-- suspect line
  if (!with_bias) {  // not with bias
    // Add links.
    mul->LinksFrom({x, mul_w_var}).LinksTo({mul_out_var});
    return mul_out_var;
  }
  // ... (with_bias branch omitted)
"mul_out_var" line is the edge for the issue, but it is strange that there is no obvious defect there.
PR #16756 fixed this issue according to my tests. From the root cause analysis, it seems not related to the fuse pass itself, nor to op creation/deletion or graph changes. From the patch, in the final "graph to program" step, unordered_map/set introduced randomness.
But some cases still can't be well explained, e.g. why removing fc_fuse_pass makes the issue disappear entirely, or why commenting out some code lines makes the issue appear or disappear even when there are no graph changes.
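To illustrate the kind of randomness meant here, a standalone sketch (not Paddle code): iteration order over an unordered container is unspecified, so emitting ops from one can reorder a program across builds or library versions:

#include <iostream>
#include <string>
#include <unordered_set>

int main() {
  // Hedged sketch: the visit order below depends on hashing and bucket
  // layout, so code that serializes graph nodes from an unordered_set
  // can produce a different op order than expected.
  std::unordered_set<std::string> nodes = {"mul", "elementwise_add", "relu",
                                           "concat", "fc"};
  for (const auto& n : nodes) {
    std::cout << n << " ";  // order is not guaranteed
  }
  std::cout << "\n";
  return 0;
}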
http://ci.paddlepaddle.org/viewLog.html?buildId=75853&tab=buildLog&buildTypeId=Paddle_PrCi&logTab=tree&filter=all&_focus=25726 Is this issue related to #16316?