exception occurs when do BERT training on CPU with transpose mkldnn op enabled

LeoZhao-Habana commented 4 years ago

I met a issue related with mkldnn when I tried to enable transpose mkldnn op in ParallelExecutor for BERT training.

throw exception in TransposeMKLDNNHandler:: AcquireTranspose()

PADDLE_ENFORCE((transpose_p != nullptr) || (is_reusing_ == false),
                   "Fail to find convolution primitive in device context");

For current implementation in Compute(), a hash key is used for create TransposeMKLDNNHandler, and this key is generated by

const std::string key = platform::TransposeMKLDNNHandler::GetHash(
        nchw_tz, axis, ctx.op().Output(framework::GradVarName("X")));

This method is ok for common case in Executor or NativeExecutor since the only instance for one op, and even for multi-instances inference case, SetMkldnnThreadID() is designed for identifying op. But for ParallelExecutor case, it is specific, it duplicates program execution by user defined instance number, it looks like you have multiple GPU cards in one machine. Let's come to the root cause for my case, for ParallelExecutor on CPU, no place to SetMkldnnThreadID() to identify different mkldnn op instances on same program, simply speaking, the key generation method is not enough anymore. For my current temporary solution, I add kernel instance pointer 'this' into hash key, so it can be simplified with just 'this', and then it works. I think it is not a issue just on Transpose, it may be a generic issue for all mkldnn ops on ParallelExecutor.

LeoZhao-Habana commented 4 years ago

@jczaja

luotao1 commented 4 years ago

We meet the same problem on content_dnn model. We use parallel_executor for training.

We enable transpose2/transpose2_grad mkldnn op for one thread training, it runs OK.
We enable transpose2/transpose2_grad mkldnn op for 16 threads training, it runs Crash.
We disable transpose2/transpose2_grad mkldnn op for 16 threads training, it runs OK.

Thus, we guess is there something wrong in transpose mkldnn op?

For my current temporary solution, I add kernel instance pointer 'this' into hash key, so it can be simplified with just 'this', and then it works

@LeoZhao-Intel Could you give the git diff for your update? And we can test it for content_dnn at the same time.

LeoZhao-Habana commented 4 years ago

A temporary fix, just for test.

diff --git a/paddle/fluid/operators/mkldnn/transpose_mkldnn_op.cc b/paddle/fluid/operators/mkldnn/transpose_mkldnn_op.cc
index 95cee80..b52a8c6 100644
--- a/paddle/fluid/operators/mkldnn/transpose_mkldnn_op.cc
+++ b/paddle/fluid/operators/mkldnn/transpose_mkldnn_op.cc
@@ -16,6 +16,7 @@
 #include "paddle/fluid/framework/op_registry.h"
 #include "paddle/fluid/memory/malloc.h"
 #include "paddle/fluid/platform/mkldnn_reuse.h"
+#include <sstream>

 namespace paddle {
 namespace operators {
@@ -45,8 +46,11 @@ class TransposeMKLDNNOpKernel : public paddle::framework::OpKernel<T> {

     std::vector<int> nchw_tz = paddle::framework::vectorize2int(input->dims());

+    std::stringstream ss;
+    ss << this;
+
     const std::string key = platform::TransposeMKLDNNHandler::GetHash(
-        nchw_tz, axis, ctx.op().Output("Out"));
+        nchw_tz, axis, ctx.op().Output("Out") + ss.str());

     platform::TransposeMKLDNNHandler handler(nchw_tz, axis, dev_ctx,
                                              mkldnn_engine, key);
@@ -122,8 +126,10 @@ class TransposeMKLDNNGradOpKernel : public paddle::framework::OpKernel<T> {
     std::vector<int> nchw_tz =
         paddle::framework::vectorize2int(out_grad->dims());

+    std::stringstream ss;
+    ss << this;
     const std::string key = platform::TransposeMKLDNNHandler::GetHash(
-        nchw_tz, axis, ctx.op().Output(framework::GradVarName("X")));
+        nchw_tz, axis, ctx.op().Output(framework::GradVarName("X"))+ss.str());

     platform::TransposeMKLDNNHandler handler(nchw_tz, reversed_axis, dev_ctx,
                                              mkldnn_engine, key);

LeoZhao-Habana commented 4 years ago

@luotao1 BTW, can you share the method to enable/disable mkldnn for specific op in python ? thx

luotao1 commented 4 years ago

please see #17341

LeoZhao-Habana commented 4 years ago

softmax mkldnn kernel also has such issue.

luotao1 commented 4 years ago

content_dnn don't have such issue on softmax mkldnn.

LeoZhao-Habana commented 4 years ago

increase device_count , e.g. change CPU_NUM to big, it may happen

luotao1 commented 4 years ago

@LeoZhao-Intel @jczaja I test the temporary fix in https://github.com/PaddlePaddle/Paddle/issues/17611#issuecomment-495454303 on content_dnn 16 threads training, it runs OK. Do you have a plan to fix it? Using this temporary fix, content_dnn could speedup 17% with enable transpose_mkldnn_op.

LeoZhao-Habana commented 4 years ago

@luotao1 For Executor, is there similar interface with ParallelExecutor which can be used for mkldnn op list setting? thx

luotao1 commented 4 years ago

No, there is no similar interface for Executor.

LeoZhao-Habana commented 4 years ago

so that means no way from python level to set mkldnn op list?

luotao1 commented 4 years ago

For executor, there is no such interface now, since it doesn't support pass now.

LeoZhao-Habana commented 4 years ago

got it, thx

LeoZhao-Habana commented 4 years ago

@jczaja for how to setup env, please refer https://docs.google.com/document/d/1luZg1v2_leUaV7nQhXqJvgzLIEsU-OAVNiIyDuo88dM,

run script is shared here. run_classifier_n.zip

goto NeuralMachineTranslation/BERT/fluid/train/LARK_Paddle_BERT/BERT

cmd to reproduce:

FLAGS_use_mkldnn=true python -u run_classifier_n.py --task_name XNLI --use_cuda false --do_train true --do_val true --do_test true --batch_size 1 --in_tokens False --init_pretraining_params  ../../chinese_L-12_H-768_A-12/params --data_dir ../../data --vocab_path ../../chinese_L-12_H-768_A-12/vocab.txt --checkpoints ../../save --save_steps 1000 --weight_decay 0.01 --warmup_proportion 0.1 --validation_steps 1000 --epoch 2 --max_seq_len 128 --bert_config_path ../../chinese_L-12_H-768_A-12/bert_config.json --learning_rate 5e-5 --skip_steps 10 --random_seed 1

you can modify mkldnn op list to enable/disable op in script.

jianhang-liu commented 4 years ago

@jczaja will take time to continue the investigation and give formal fix. @LeoZhao-Intel Please help Jacek to reproduce this issue. Thanks.

jczaja commented 4 years ago

Some update:

I have reproduced problem eg. crash in Transpose and Softmax of BERT training but not fixed it yet.

I'm investigating problem and have couple of questions regarding threads managements in paddle.

Device context is holding collection of maps . This collection (Map as well) is using cur_thread_id as a key to get another Map that is cache with Blobs(primitives , memory etc.) Problem is that when printed cur_thread_id it is always 0. So Paddle when using ParallelExecutor is using number of threads , but it does read and write to the same cache of Device context. This causes data races in Transpose2 Op .

I have made an experiement to initialize cur_thread_id with actual std::thread::id. This initialize made Paddle to have cache(Map) with Blobs to be maintained separately for each thread. This seems to work for Transpose Op , but there is a problem with Softmax Op.
Problem(crash) with softmax is that softmax fwd is sharing mkl-dnn PD to be used by softmax grad. And it happens that different threads are executing softmax fwd and softmax grad . For example thread A when executing softmax op is storing PD in cache[A] and then thread B when executing softmax grad is expecting to find PD in cache[B] but it is not there as it is stored in cache[A].

questions: 1) Should there be one cache to be shared by parallel executor (situation like it is) or should some core parallel executor threads have their own cache? First approach (one cache for all threads of paddle) require thread safety mechanism to be inserted all over the MKL-DNN integration code, probably changing hash for data we do not want to share and lots of testing. Second approach means that each "task" of parallel executor has its own device_context cache eg. if parallel executor "task" is using multiple threads then those threads are to share device_context cache.

Potential fix could be calling setMkldnnThreadID/set_cur_thread_id with unqiue value for each "task" or ParallelExecutor and making sure that all threads working on "task" are storing /having the same cur_thread_id. That should fix the problem

Anyway, As I'm not familiar with PaddlePaddle ParallelExecutor and PoolThread then I will further look to that, but If you could answer my questions or explain architecture of ParallelExecutor in terms of threads management that would be very helpful.

LeoZhao-Habana commented 4 years ago

https://github.com/PaddlePaddle/FluidDoc/blob/develop/doc/fluid/design/concepts/parallel_executor.md This is introduction for parallel_executor. Current setMkldnnThreadID/set_cur_thread_id may not guarantee safety, since it just set a global variable, which highly depends on execution order and time for each task, for current inference, it will call setMkldnnThreadID before warmup then do warm up, after that switch to another instance, while ParallelExecutor almostly does tasks in same time.

chengduoZH commented 4 years ago

Should there be one cache to be shared by parallel executor (situation like it is) or should some core parallel executor threads have their own cache? First approach (one cache for all threads of paddle) require thread safety mechanism to be inserted all over the MKL-DNN integration code, probably changing hash for data we do not want to share and lots of testing. Second approach means that each "task" of parallel executor has its own device_context cache eg. if parallel executor "task" is using multiple threads then those threads are to share device_context cache.

@jczaja Because during the running of an operator, it firstly gets the DeviceContext from the DeviceContextPool, and DeviceContextPool stores the device context, i.e. CPUDeviceContext, MKLDNNDeviceContext, CUDADeviceContext. DeviceContextPool holds an std::map to store the relation of Place and DeviceContext. If you run paddle in a no-GPU machine, and set CPU_NUM=4, there is only one CPUDeviceContext and MKLDNNDeviceContext but not four in DeviceContextPool because the keys in the std::map are the same. So in Parallel, all the operators use the same MKLDNNDeviceContext now.

jczaja commented 4 years ago

@chengduoZH Thanks for conceptual explanation.

Status: MKL-DNN Reusing implementation is not thread safe, currently implementing prototype of making it thread safe eg. locks, critical sections etc. Once this work for BERT traninging via parallel executor then we will think how to make it working for all mkl-dnn ops.

LeoZhao-Habana commented 4 years ago

@chengduoZH , on behalf of this issue, I ask few questions about ParallelExecutor. 1) what's difference with use_fast_executor=true of false? from code, it is another low level executor, what's its relation with ThreadedSSAGraphExecutor? I see big performance delta between enabling it or disabling it. 2) for ParallelExecutor usage on training, is it in PaddlePaddle long term development plan? or simply speaking, is it recommended for users and will be widely used in the future?

jczaja commented 4 years ago

Added locks to Acquire mechanism (reusing implementations for Transpose, Softmax, conv2d , conv2d transpose and batch norm). Training runs for 1000 steps and then when validation is to happen there is a crash (mkl-dnn transpose2 op) on mismatch of input dims and axis dims. It works fine without mkl-dnn. It looks like there is another issue when using mkl-dnn implementation for training apart from thread safty
Tried to limit newly added critical sections to reduce over head, but run in functional problems.

@luotao1 Is it possible from within paddle(In operator level) to figure out which executor is used eg. NaiveExecutor , Executor or ParallelExecutor?

chengduoZH commented 4 years ago

what's difference with use_fast_executor=true of false? from code, it is another low level executor, what's its relation with ThreadedSSAGraphExecutor? I see big performance delta between enabling it or disabling it.

@LeoZhao-Intel Which one is fast? use_fast_executor=true of false. FastThreadedSSAGraphExecutor is an optimization based on ThreadedSSAGraphExecutor. The current test results show that FastThreadedSSAGraphExecutor runs faster on most data models than ThreadedSSAGraphExecutor. The ThreadedSSAGraphExecutor will be gradually removed.

for ParallelExecutor usage on training, is it in PaddlePaddle long term development plan? or simply speaking, is it recommended for users and will be widely used in the future?

ParallelExecutor can be used for training and testing now. And Parallel Executor will continue to be optimized in the future.

LeoZhao-Habana commented 4 years ago

@chengduoZH Thanks for your reply, use_fast_executor=true has almost 2 times of performance improvement, we are surprised to see this. It is good for us to understand your strategy and roadmap, then do optimization on Xeon accordingly.

luotao1 commented 4 years ago

Is it possible from within paddle(In operator level) to figure out which executor is used eg. NaiveExecutor , Executor or ParallelExecutor

@jczaja It is impossible in op level to figure out which executor is used. And an op doesn't need to know which executor will call it.

jczaja commented 4 years ago

@luotao1 Could you please run training of ContentDNN on following branch: https://github.com/jczaja/Paddle/tree/prv-mkldnn-reuse-mt

-If there is a failure please provide log (this is still work in progress , halpf of mkl-dnn ops were secured) -if it works fine then please tell me if speedup is visible.

jczaja commented 4 years ago

Status: critical sections were reduced (changes on branch URL as prviously). Currently trying to improve code (avoid copying locks all over). If all is fine Then first PR should be ready this week.

luotao1 commented 4 years ago

@jczaja Could you update the branch prv-mkldnn-reuse-mt? I meet compiler error after #17826 merged:

In file included from /content_dnn/Paddle/paddle/fluid/framework/data_layout_transform.cc:22:0:
/content_dnn/Paddle/paddle/fluid/platform/mkldnn_reuse.h: In member function ‘std::shared_ptr<mkldnn::memory> paddle::platform::ReorderMKLDNNHandler::AcquireSrcMemory(const mkldnn::memory::format&, void*)’:
/content_dnn/Paddle/paddle/fluid/platform/mkldnn_reuse.h:429:50: error: ‘is_reusing_’ was not declared in this scope
     PADDLE_ENFORCE((mem_p != nullptr) || (is_reusing_ == false),
                                                  ^
In file included from /content_dnn/Paddle/paddle/fluid/framework/data_layout_transform.cc:22:0:
/content_dnn/Paddle/paddle/fluid/platform/mkldnn_reuse.h:438:7: error: ‘is_reusing_’ was not declared in this scope
       is_reusing_ = true;
       ^
/content_dnn/Paddle/paddle/fluid/platform/mkldnn_reuse.h: In member function ‘std::shared_ptr<mkldnn::memory> paddle::platform::ReorderMKLDNNHandler::AcquireDstMemory(paddle::framework::Tensor*, const mkldnn::memory::format&, paddle::platform::Place)’:
/content_dnn/Paddle/paddle/fluid/platform/mkldnn_reuse.h:449:50: error: ‘is_reusing_’ was not declared in this scope
     PADDLE_ENFORCE((mem_p != nullptr) || (is_reusing_ == false),
                                                  ^
In file included from /content_dnn/Paddle/paddle/fluid/framework/data_layout_transform.cc:22:0:
/content_dnn/Paddle/paddle/fluid/platform/mkldnn_reuse.h:462:7: error: ‘is_reusing_’ was not declared in this scope
       is_reusing_ = true;
       ^
/content_dnn/Paddle/paddle/fluid/platform/mkldnn_reuse.h: In member function ‘std::shared_ptr<mkldnn::reorder> paddle::platform::ReorderMKLDNNHandler::AcquireReorder(std::shared_ptr<mkldnn::memory>, std::shared_ptr<mkldnn::memory>)’:
/content_dnn/Paddle/paddle/fluid/platform/mkldnn_reuse.h:473:54: error: ‘is_reusing_’ was not declared in this scope
     PADDLE_ENFORCE((reorder_p != nullptr) || (is_reusing_ == false),
                                                      ^
In file included from /content_dnn/Paddle/paddle/fluid/framework/data_layout_transform.cc:22:0:
/content_dnn/Paddle/paddle/fluid/platform/mkldnn_reuse.h:480:7: error: ‘is_reusing_’ was not declared in this scope
       is_reusing_ = true;
       ^
paddle/fluid/framework/CMakeFiles/data_layout_transform.dir/build.make:62: recipe for target 'paddle/fluid/framework/CMakeFiles/data_layout_transform.dir/data_layout_transform.cc.o' failed
make[2]: *** [paddle/fluid/framework/CMakeFiles/data_layout_transform.dir/data_layout_transform.cc.o] Error 1
CMakeFiles/Makefile2:4128: recipe for target 'paddle/fluid/framework/CMakeFiles/data_layout_transform.dir/all' failed
make[1]: *** [paddle/fluid/framework/CMakeFiles/data_layout_transform.dir/all] Error 2
Makefile:149: recipe for target 'all' failed
make: *** [all] Error 2

LeoZhao-Habana commented 4 years ago

@jczaja one comment for prv-mkldnn-reuse-mt branch. From your implementation, mkldnnhandler uses mutex to keep serial Acquire() action, but there may have one scenario:

Instance A acquired primitive X with Key K and set its data handle, and sending to mkldnn pipeline for execution
Instance B is acquiring primitive X with same Key K, and setting its data handle.

Then what will happen ? either primitive X's data handle for instance A is changed to B's, or crash ? Given this primitive object shared in one instance case is serially executed, but for multiple instances it is parallel execution, how to make sure execution order in mkldnn pipeline, it may be a problem, or use mutex to pretect execution? looks not efficient.

Come to initial problem, how to identify op from different execution instances with same graph? previously it has thread id, but in ParallelExecutor, there is no same way to set/get this thread id. So if we can identify instance and create primitive shared obj for each instance, that will be fine.

jczaja commented 4 years ago

@LeoZhao-Intel That is very good point. Currently it would be easier for me to make critical section around setting pointer to memory primitive and mkldnn execution. If there is noticable performance drop then The only option would be finding a way to cache instances' primitives/memory independently. I haven't found an parallel's member executor identification so I guess we would need to add one which means some changes into core functionality of paddle , so once solution with critical sections is not fast enough We will have to do it. Thanks for pointing out this serious issue.

jczaja commented 4 years ago

@chengduoZH Is ParallelExecutor to use for data parallelizm of model or is it also model parallelizm ?

jczaja commented 4 years ago

Extending work from branch: prv-mkldnn-reuse-mt to handle case pointed by @LeoZhao-Intel is difficult and will introduce overhead. Decided to try diffrent approach:

Extend keys with thread ID
For data that is produced by FWD and needed in Grad ops , data will be additionaly stored under key without TID . This will allow to avoid problem with set_data_handle as data usually stored in FWD for BWD ops is PD or workspace which does not used with set_data_handle() eg. we do not overwrite pointer, so it should work. There is one problem eg. activations ops that are sending Input memory directly to BWD and this has to be investigated deeper .

jczaja commented 4 years ago

@luotao1 Could you please run training of ContentDNN on following branch: https://github.com/jczaja/Paddle/tree/prv-mt-experiments

-If there is a failure please provide log (this is still work in progress ,Relu(activations) were not updated) -if it works fine then please tell me if speedup/slowdown is visible.

luotao1 commented 4 years ago

@jczaja @jianhang-liu Thanks very much for this PR! I have no time to test ContentDNN (I will test it ASAP), but this PR is needed for our face deployment service. When the service calls twice on the same model, there is a segment fault.

Could you create a PR for the fix of our face deployment service at first?

jczaja commented 4 years ago

@luotao1 Yes I will start preparing PR, but for full fix I need your help with #17960

jczaja commented 4 years ago

My understanding is that #17965 solved some issues , but introduced memory-leaks due to over-caching mkl-dnn data. I'm working on modification of this behaviour so that Parallel executor's use case is still cached per std::thread::id , but AnalysisPredictor (to be used in your service?) is cached per mkl-dnn thread ID (As set via SetMklDnnThreadId()). To finish PR (#18217) I need some more info on threads management of your service.

If within one hour, service is requested to predict 1000 images then you create 1000 std::threads will each of this thread call SetMklDnnThreadID(int n)? 2) If 1 is positive, then will SetMklDnnThreadID(int n) be called always with diffrent parameter eg. value ? For example for 1000 threads, will SetMklDnnThreadID be called with <1..1000> ? Or it depends when images are available , so in reality SetMklDnnThreadID() is called via values <1..100> ?

luotao1 commented 4 years ago

We have thread pools, for example, the pools have 100 threads, and we will re-use the threads.
Do you mean SetMklDnnThreadID(int n) set by users? We will set similar things in the config like config.SetMklDnnThreadID(int n), and at that time, users don't know the threadID.

LeoZhao-Habana commented 4 years ago

fixed by PR #17965

PaddlePaddle / Paddle

exception occurs when do BERT training on CPU with transpose mkldnn op enabled #17611