Closed LeoZhao-Habana closed 4 years ago
@jczaja
We met the same problem on the content_dnn model. We use ParallelExecutor for training.
Thus, we suspect something is wrong in the transpose mkldnn op.
For my current temporary workaround, I add the kernel instance pointer 'this' into the hash key (the key could even be simplified to just 'this'), and then it works.
@LeoZhao-Intel Could you give the git diff
for your update? And we can test it for content_dnn at the same time.
A temporary fix, just for testing.
diff --git a/paddle/fluid/operators/mkldnn/transpose_mkldnn_op.cc b/paddle/fluid/operators/mkldnn/transpose_mkldnn_op.cc
index 95cee80..b52a8c6 100644
--- a/paddle/fluid/operators/mkldnn/transpose_mkldnn_op.cc
+++ b/paddle/fluid/operators/mkldnn/transpose_mkldnn_op.cc
@@ -16,6 +16,7 @@
#include "paddle/fluid/framework/op_registry.h"
#include "paddle/fluid/memory/malloc.h"
#include "paddle/fluid/platform/mkldnn_reuse.h"
+#include <sstream>
namespace paddle {
namespace operators {
@@ -45,8 +46,11 @@ class TransposeMKLDNNOpKernel : public paddle::framework::OpKernel<T> {
std::vector<int> nchw_tz = paddle::framework::vectorize2int(input->dims());
+ std::stringstream ss;
+ ss << this;
+
const std::string key = platform::TransposeMKLDNNHandler::GetHash(
- nchw_tz, axis, ctx.op().Output("Out"));
+ nchw_tz, axis, ctx.op().Output("Out") + ss.str());
platform::TransposeMKLDNNHandler handler(nchw_tz, axis, dev_ctx,
mkldnn_engine, key);
@@ -122,8 +126,10 @@ class TransposeMKLDNNGradOpKernel : public paddle::framework::OpKernel<T> {
std::vector<int> nchw_tz =
paddle::framework::vectorize2int(out_grad->dims());
+ std::stringstream ss;
+ ss << this;
const std::string key = platform::TransposeMKLDNNHandler::GetHash(
- nchw_tz, axis, ctx.op().Output(framework::GradVarName("X")));
+ nchw_tz, axis, ctx.op().Output(framework::GradVarName("X"))+ss.str());
platform::TransposeMKLDNNHandler handler(nchw_tz, reversed_axis, dev_ctx,
mkldnn_engine, key);
@luotao1 BTW, can you share the method to enable/disable mkldnn for a specific op in Python? Thanks.
please see #17341
The softmax mkldnn kernel also has this issue.
content_dnn doesn't have this issue on softmax mkldnn.
If you increase the device count, e.g. set CPU_NUM to a large value, it may happen.
@LeoZhao-Intel @jczaja I tested the temporary fix in https://github.com/PaddlePaddle/Paddle/issues/17611#issuecomment-495454303 on content_dnn 16-thread training, and it runs OK. Do you have a plan to fix it? With this temporary fix, content_dnn speeds up by 17% with transpose_mkldnn_op enabled.
@luotao1 For Executor, is there an interface similar to ParallelExecutor's that can be used for setting the mkldnn op list? Thanks.
No, there is no similar interface for Executor.
So that means there is no way to set the mkldnn op list from the Python level?
For Executor, there is no such interface now, since it doesn't support passes yet.
got it, thx
@jczaja For how to set up the environment, please refer to https://docs.google.com/document/d/1luZg1v2_leUaV7nQhXqJvgzLIEsU-OAVNiIyDuo88dM.
The run script is shared here: run_classifier_n.zip
FLAGS_use_mkldnn=true python -u run_classifier_n.py --task_name XNLI --use_cuda false --do_train true --do_val true --do_test true --batch_size 1 --in_tokens False --init_pretraining_params ../../chinese_L-12_H-768_A-12/params --data_dir ../../data --vocab_path ../../chinese_L-12_H-768_A-12/vocab.txt --checkpoints ../../save --save_steps 1000 --weight_decay 0.01 --warmup_proportion 0.1 --validation_steps 1000 --epoch 2 --max_seq_len 128 --bert_config_path ../../chinese_L-12_H-768_A-12/bert_config.json --learning_rate 5e-5 --skip_steps 10 --random_seed 1
You can modify the mkldnn op list to enable/disable ops in the script.
@jczaja will take time to continue the investigation and provide a formal fix. @LeoZhao-Intel Please help Jacek reproduce this issue. Thanks.
Some update:
I have reproduced the problem, e.g. the crash in Transpose and Softmax of BERT training, but have not fixed it yet.
I'm investigating the problem and have a couple of questions regarding thread management in Paddle.
The device context holds a collection of maps. This collection (a map as well) uses cur_thread_id as a key to get another map that serves as a cache of blobs (primitives, memory, etc.). The problem is that when cur_thread_id is printed, it is always 0. So Paddle, when using ParallelExecutor, runs a number of threads, but they all read and write the same device context cache. This causes the data races in the Transpose2 op.
Question: 1) Should there be one cache shared by the parallel executor (the situation as it is), or should parallel executor threads have their own caches? The first approach (one cache for all Paddle threads) requires thread-safety mechanisms to be inserted all over the MKL-DNN integration code, probably changing the hash for data we do not want to share, and lots of testing. The second approach means that each "task" of the parallel executor has its own device context cache, e.g. if a parallel executor "task" uses multiple threads, those threads share one device context cache.
A potential fix could be calling setMkldnnThreadID/set_cur_thread_id with a unique value for each "task" of the ParallelExecutor, and making sure that all threads working on a "task" store/see the same cur_thread_id. That should fix the problem.
Anyway, as I'm not familiar with PaddlePaddle's ParallelExecutor and thread pool, I will look into this further, but if you could answer my questions or explain the architecture of ParallelExecutor in terms of thread management, that would be very helpful.
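The second approach above (per-"task" caches keyed by a thread id) can be sketched roughly as follows. This is a hypothetical illustration, not Paddle's actual code; the class and method names are invented:

```cpp
// Hypothetical sketch of a per-thread blob cache: each ParallelExecutor
// "task" thread keys into its own inner map, so primitives cached by one
// thread are never seen (or overwritten) by another.
#include <map>
#include <memory>
#include <mutex>
#include <string>

class PerThreadBlobCache {
 public:
  // Store a blob under `name` in the map belonging to `thread_id`.
  void Set(int thread_id, const std::string& name,
           std::shared_ptr<void> blob) {
    std::lock_guard<std::mutex> guard(mutex_);  // protects the outer map only
    caches_[thread_id][name] = std::move(blob);
  }

  // Look up a blob; returns nullptr if this thread never cached it.
  std::shared_ptr<void> Get(int thread_id, const std::string& name) {
    std::lock_guard<std::mutex> guard(mutex_);
    auto it = caches_.find(thread_id);
    if (it == caches_.end()) return nullptr;
    auto jt = it->second.find(name);
    return jt == it->second.end() ? nullptr : jt->second;
  }

 private:
  std::mutex mutex_;
  // thread id -> (blob name -> blob), mirroring the cur_thread_id keying
  std::map<int, std::map<std::string, std::shared_ptr<void>>> caches_;
};
```

With this layout, two tasks acquiring a primitive under the same name simply get two independent entries, so no hash changes are needed for unshared data.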
https://github.com/PaddlePaddle/FluidDoc/blob/develop/doc/fluid/design/concepts/parallel_executor.md is an introduction to parallel_executor. The current setMkldnnThreadID/set_cur_thread_id may not guarantee safety, since it just sets a global variable, which highly depends on execution order and the timing of each task. Current inference calls setMkldnnThreadID before warm-up, then does the warm-up, and after that switches to another instance, while ParallelExecutor runs its tasks almost simultaneously.
Should there be one cache shared by the parallel executor, or should parallel executor threads have their own caches?
@jczaja During the running of an operator, it first gets the DeviceContext from the DeviceContextPool, which stores the device contexts, i.e. CPUDeviceContext, MKLDNNDeviceContext, CUDADeviceContext. DeviceContextPool holds an std::map to store the relation between Place and DeviceContext. If you run Paddle on a no-GPU machine and set CPU_NUM=4, there is still only one CPUDeviceContext and one MKLDNNDeviceContext in the DeviceContextPool, not four, because the keys in the std::map are the same.
So in parallel, all the operators use the same MKLDNNDeviceContext now.
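The explanation above can be condensed into a toy model (identifiers are invented for illustration; this is not Paddle's real DeviceContextPool): because the pool is keyed by Place and all CPU executor threads pass the same CPUPlace key, they all receive the same context object regardless of CPU_NUM.

```cpp
// Toy model of a pool keyed by Place: repeated lookups with the same
// place string lazily create exactly one context and then reuse it.
#include <map>
#include <memory>
#include <string>

struct DeviceContext {
  std::string place;  // which Place this context belongs to
};

class DeviceContextPool {
 public:
  DeviceContext* Get(const std::string& place) {
    auto it = pool_.find(place);
    if (it == pool_.end()) {
      // First request for this Place: create the single shared context.
      it = pool_.emplace(place, std::unique_ptr<DeviceContext>(
                                    new DeviceContext{place}))
               .first;
    }
    return it->second.get();  // every later caller gets the same object
  }

  size_t size() const { return pool_.size(); }

 private:
  std::map<std::string, std::unique_ptr<DeviceContext>> pool_;
};
```

Four threads calling `Get("CPUPlace")` therefore share one context, and with it one MKL-DNN blob cache, which is exactly why the data races appear.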
@chengduoZH Thanks for conceptual explanation.
Status: the MKL-DNN reusing implementation is not thread safe; I am currently implementing a prototype that makes it thread safe, e.g. with locks, critical sections etc. Once this works for BERT training via the parallel executor, we will think about how to make it work for all mkl-dnn ops.
@chengduoZH, on behalf of this issue, I'd like to ask a few questions about ParallelExecutor. 1) What's the difference between use_fast_executor=true and false? From the code, it is another low-level executor; what's its relation to ThreadedSSAGraphExecutor? I see a big performance delta between enabling and disabling it. 2) For ParallelExecutor usage in training, is it in PaddlePaddle's long-term development plan? Simply speaking, is it recommended for users, and will it be widely used in the future?
@luotao1 Is it possible from within Paddle (at the operator level) to figure out which executor is used, e.g. NaiveExecutor, Executor or ParallelExecutor?
What's the difference between use_fast_executor=true and false? From the code, it is another low-level executor; what's its relation to ThreadedSSAGraphExecutor? I see a big performance delta between enabling and disabling it.
@LeoZhao-Intel Which one is faster, use_fast_executor=true or false? FastThreadedSSAGraphExecutor is an optimization based on ThreadedSSAGraphExecutor. Current test results show that FastThreadedSSAGraphExecutor runs faster than ThreadedSSAGraphExecutor on most models.
ThreadedSSAGraphExecutor will be gradually removed.
For ParallelExecutor usage in training, is it in PaddlePaddle's long-term development plan? Simply speaking, is it recommended for users, and will it be widely used in the future?
ParallelExecutor can be used for training and testing now. And Parallel Executor will continue to be optimized in the future.
@chengduoZH Thanks for your reply. use_fast_executor=true gives almost a 2x performance improvement; we were surprised to see this. It is good for us to understand your strategy and roadmap, so we can optimize on Xeon accordingly.
Is it possible from within Paddle (at the operator level) to figure out which executor is used, e.g. NaiveExecutor, Executor or ParallelExecutor?
@jczaja It is impossible at the op level to figure out which executor is used. And an op doesn't need to know which executor calls it.
@luotao1 Could you please run training of ContentDNN on the following branch: https://github.com/jczaja/Paddle/tree/prv-mkldnn-reuse-mt
- If there is a failure, please provide the log (this is still work in progress; half of the mkl-dnn ops were secured).
- If it works fine, then please tell me if a speedup is visible.
Status: critical sections were reduced (changes on the branch URL as previously). Currently trying to improve the code (avoid copying locks all over). If all is fine, the first PR should be ready this week.
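The critical-section approach being prototyped can be sketched in miniature as follows (hypothetical names, not the branch's actual code): the acquire-or-create path of a handler is serialized with a mutex so that two executor threads racing on the same key cannot both create, or half-initialize, the same blob.

```cpp
// Sketch of an acquire-or-create cache with a critical section: the lookup
// and the creation happen under one lock, so concurrent callers with the
// same key always agree on a single cached blob.
#include <functional>
#include <map>
#include <memory>
#include <mutex>
#include <string>

class BlobCache {
 public:
  std::shared_ptr<void> AcquireOrCreate(
      const std::string& key,
      const std::function<std::shared_ptr<void>()>& creator) {
    std::lock_guard<std::mutex> guard(mutex_);  // the critical section
    auto it = blobs_.find(key);
    if (it != blobs_.end()) return it->second;  // reuse the cached blob
    auto blob = creator();                      // create exactly once
    blobs_[key] = blob;
    return blob;
  }

 private:
  std::mutex mutex_;
  std::map<std::string, std::shared_ptr<void>> blobs_;
};
```

The trade-off discussed in the thread is visible here: every acquisition pays for the lock, which is why the later comments look for ways to shrink or avoid these sections.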
@jczaja Could you update the branch prv-mkldnn-reuse-mt? I met a compile error after #17826 was merged:
In file included from /content_dnn/Paddle/paddle/fluid/framework/data_layout_transform.cc:22:0:
/content_dnn/Paddle/paddle/fluid/platform/mkldnn_reuse.h: In member function ‘std::shared_ptr<mkldnn::memory> paddle::platform::ReorderMKLDNNHandler::AcquireSrcMemory(const mkldnn::memory::format&, void*)’:
/content_dnn/Paddle/paddle/fluid/platform/mkldnn_reuse.h:429:50: error: ‘is_reusing_’ was not declared in this scope
PADDLE_ENFORCE((mem_p != nullptr) || (is_reusing_ == false),
^
In file included from /content_dnn/Paddle/paddle/fluid/framework/data_layout_transform.cc:22:0:
/content_dnn/Paddle/paddle/fluid/platform/mkldnn_reuse.h:438:7: error: ‘is_reusing_’ was not declared in this scope
is_reusing_ = true;
^
/content_dnn/Paddle/paddle/fluid/platform/mkldnn_reuse.h: In member function ‘std::shared_ptr<mkldnn::memory> paddle::platform::ReorderMKLDNNHandler::AcquireDstMemory(paddle::framework::Tensor*, const mkldnn::memory::format&, paddle::platform::Place)’:
/content_dnn/Paddle/paddle/fluid/platform/mkldnn_reuse.h:449:50: error: ‘is_reusing_’ was not declared in this scope
PADDLE_ENFORCE((mem_p != nullptr) || (is_reusing_ == false),
^
In file included from /content_dnn/Paddle/paddle/fluid/framework/data_layout_transform.cc:22:0:
/content_dnn/Paddle/paddle/fluid/platform/mkldnn_reuse.h:462:7: error: ‘is_reusing_’ was not declared in this scope
is_reusing_ = true;
^
/content_dnn/Paddle/paddle/fluid/platform/mkldnn_reuse.h: In member function ‘std::shared_ptr<mkldnn::reorder> paddle::platform::ReorderMKLDNNHandler::AcquireReorder(std::shared_ptr<mkldnn::memory>, std::shared_ptr<mkldnn::memory>)’:
/content_dnn/Paddle/paddle/fluid/platform/mkldnn_reuse.h:473:54: error: ‘is_reusing_’ was not declared in this scope
PADDLE_ENFORCE((reorder_p != nullptr) || (is_reusing_ == false),
^
In file included from /content_dnn/Paddle/paddle/fluid/framework/data_layout_transform.cc:22:0:
/content_dnn/Paddle/paddle/fluid/platform/mkldnn_reuse.h:480:7: error: ‘is_reusing_’ was not declared in this scope
is_reusing_ = true;
^
paddle/fluid/framework/CMakeFiles/data_layout_transform.dir/build.make:62: recipe for target 'paddle/fluid/framework/CMakeFiles/data_layout_transform.dir/data_layout_transform.cc.o' failed
make[2]: *** [paddle/fluid/framework/CMakeFiles/data_layout_transform.dir/data_layout_transform.cc.o] Error 1
CMakeFiles/Makefile2:4128: recipe for target 'paddle/fluid/framework/CMakeFiles/data_layout_transform.dir/all' failed
make[1]: *** [paddle/fluid/framework/CMakeFiles/data_layout_transform.dir/all] Error 2
Makefile:149: recipe for target 'all' failed
make: *** [all] Error 2
@jczaja One comment on the prv-mkldnn-reuse-mt branch. In your implementation, the MKLDNN handler uses a mutex to serialize the Acquire() actions, but there may be one scenario:
Then what will happen? Either primitive X's data handle for instance A is changed to B's, or it crashes. A primitive object shared within one instance is executed serially, but across multiple instances execution is parallel, so how do we guarantee the execution order in the mkldnn pipeline? It may be a problem. Or do we use a mutex to protect execution? That looks inefficient.
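The hazard described here can be modeled with a toy (this is an illustration, not the mkldnn API): a cached primitive holds one raw data pointer, and set_data_handle() overwrites it in place, so a second instance can silently retarget a primitive the first instance is about to execute.

```cpp
// Toy model of the shared-primitive hazard: one cached object, one mutable
// data pointer. Serializing Acquire() alone does not help if the *execution*
// of the primitive happens after another instance has swapped the handle.
struct CachedPrimitive {
  void* data_handle = nullptr;
  void set_data_handle(void* p) { data_handle = p; }  // overwrites in place
};
```

The interleaving is then: instance A calls set_data_handle(&a_buf), instance B calls set_data_handle(&b_buf), and when A finally submits the primitive it computes on B's buffer, exactly the "data handle for instance A is changed to B's" outcome.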
Coming back to the initial problem: how do we identify ops from different execution instances of the same graph? Previously there was a thread id, but in ParallelExecutor there is no equivalent way to set/get this thread id. So if we can identify the instance and create a shared primitive object per instance, that will be fine.
@LeoZhao-Intel That is a very good point. Currently it would be easier for me to make a critical section around setting the memory primitive's data pointer and the mkldnn execution. If there is a noticeable performance drop, the only option would be finding a way to cache each instance's primitives/memory independently. I haven't found an identification for a ParallelExecutor member instance, so I guess we would need to add one, which means some changes to core Paddle functionality; so if the solution with critical sections is not fast enough, we will have to do it. Thanks for pointing out this serious issue.
@chengduoZH Is ParallelExecutor intended for data parallelism of a model, or also for model parallelism?
Extending the work from branch prv-mkldnn-reuse-mt to handle the case pointed out by @LeoZhao-Intel is difficult and will introduce overhead. I decided to try a different approach:
- Extend keys with the thread ID.
- For data that is produced by FWD ops and needed in Grad ops, the data will additionally be stored under a key without the TID. This should avoid the problem with set_data_handle, as the data stored in FWD for BWD ops is usually a PD or workspace, which is not used with set_data_handle(), e.g. we do not overwrite the pointer, so it should work. There is one problem: activation ops send input memory directly to BWD, and this has to be investigated more deeply.
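The key scheme proposed above can be sketched as a small helper (the naming convention and suffix format are invented for illustration): per-thread primitives get a TID suffix, while FWD-to-BWD data would additionally be stored under the base key so the gradient op's thread can find it.

```cpp
// Sketch of TID-suffixed cache keys: the same base key produces distinct
// keys per thread, so each ParallelExecutor thread caches its own
// primitives without colliding with the others.
#include <sstream>
#include <string>

std::string CreateKey(const std::string& base, int tid) {
  std::ostringstream ss;
  ss << base << "-t" << tid;  // thread-local variant of the key
  return ss.str();
}
```

Under this scheme a FWD op would store shared data under both `base` and `CreateKey(base, tid)`, which is why only pointer-free data (PDs, workspaces) is safe to publish under the TID-free key.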
@luotao1 Could you please run training of ContentDNN on the following branch: https://github.com/jczaja/Paddle/tree/prv-mt-experiments
- If there is a failure, please provide the log (this is still work in progress; Relu (activations) was not updated).
- If it works fine, then please tell me if a speedup/slowdown is visible.
@jczaja @jianhang-liu Thanks very much for this PR! I have no time to test ContentDNN (I will test it ASAP), but this PR is needed for our face deployment service. When the service calls the same model twice, there is a segmentation fault.
Could you create a PR for the fix of our face deployment service first?
@luotao1 Yes, I will start preparing the PR, but for the full fix I need your help with #17960
My understanding is that #17965 solved some issues, but introduced memory leaks due to over-caching mkl-dnn data. I'm working on modifying this behaviour so that ParallelExecutor's use case is still cached per std::thread::id, but AnalysisPredictor (to be used in your service?) is cached per mkl-dnn thread ID (as set via SetMklDnnThreadId()). To finish PR #18217 I need some more info on the thread management of your service.
Is SetMklDnnThreadID(int n) set by users? We will set similar things in the config, like config.SetMklDnnThreadID(int n), and at that time users don't know the thread ID.
Fixed by PR #17965.
I met an issue related to mkldnn when I tried to enable the transpose mkldnn op in ParallelExecutor for BERT training.
It throws an exception in TransposeMKLDNNHandler::AcquireTranspose().
In the current implementation of Compute(), a hash key is used to create the TransposeMKLDNNHandler, and this key is generated from the input dims, the axis, and the output variable name (see the diff above).
This method is OK for the common case in Executor or NativeExecutor, since there is only one instance per op, and even for the multi-instance inference case, SetMkldnnThreadID() is designed to identify ops. But the ParallelExecutor case is special: it duplicates program execution by a user-defined instance number, as if you had multiple GPU cards in one machine. Coming to the root cause for my case: for ParallelExecutor on CPU, there is no place to call SetMkldnnThreadID() to identify different mkldnn op instances of the same program. Simply speaking, the key generation method is not sufficient anymore. For my current temporary workaround, I add the kernel instance pointer 'this' into the hash key (the key could even be simplified to just 'this'), and then it works. I think this is not an issue just with Transpose; it may be a generic issue for all mkldnn ops under ParallelExecutor.
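The 'this'-pointer workaround in the diff above boils down to a one-line key helper (condensed here for illustration; the real change streams the pointer via std::stringstream and appends it to the Output name):

```cpp
// Sketch of the instance-pointer key: streaming a pointer into the key makes
// each kernel instance's cache entries unique, since distinct live objects
// have distinct addresses.
#include <sstream>
#include <string>

std::string KeyWithInstance(const std::string& base, const void* instance) {
  std::ostringstream ss;
  ss << base << "-" << instance;  // pointer value distinguishes replicas
  return ss.str();
}
```

This fixes the collision between ParallelExecutor replicas, at the cost that cached data can no longer be shared across instances even when sharing would be safe, which is the trade-off the later TID-based proposal tries to improve on.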