chengtbf opened this issue 4 years ago (status: Open)
We can record how many times Conv2d algorithm selection is called during each compilation. I remember that we do have a cache for it. Log the shape info to figure out whether the cache is broken.
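For reference, a minimal instrumentation sketch (hypothetical helper names, not OneFlow's actual code): count how many times conv algorithm selection actually runs per input/weight shape during one compile. If the same shape key shows up more than once, the selection cache is probably not being hit.

```cpp
// Hypothetical instrumentation sketch: count algorithm-selection calls per shape key.
#include <cstdint>
#include <cstdio>
#include <map>
#include <sstream>
#include <string>
#include <vector>

std::map<std::string, int> g_conv_algo_select_count;

std::string ShapeKey(const std::vector<int64_t>& in_shape,
                     const std::vector<int64_t>& weight_shape) {
  std::ostringstream oss;
  for (int64_t d : in_shape) { oss << d << ","; }
  oss << "|";
  for (int64_t d : weight_shape) { oss << d << ","; }
  return oss.str();
}

// Call this at the point where the algorithm search actually runs (i.e. on a cache miss).
void LogConvAlgoSelect(const std::vector<int64_t>& in_shape,
                       const std::vector<int64_t>& weight_shape) {
  const std::string key = ShapeKey(in_shape, weight_shape);
  const int count = ++g_conv_algo_select_count[key];
  std::fprintf(stderr, "conv2d algorithm selection #%d for shape %s\n", count, key.c_str());
}
```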
There's no evidence yet that the long compile time is caused by the cuDNN algorithm inference of conv2d. I will run some tests to check whether the cuDNN algorithm inference works incorrectly or the cache of the inference result is broken.
Can you show the data on the time cost?
From the profiling logs, I found that the main factor slowing down the compile is the `InferTmpSize` of `batch_normalization`. It is called many times in `UserOp::InferBlobDescs` when a OneFlow job runs on multiple devices (or nodes). I will add a cache for the `InferTmpSize` of `batch_normalization`, just like the one in `cudnn_conv_util`.
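As a rough illustration of that plan, here is a minimal sketch in the spirit of the `cudnn_conv_util` cache; `BnTmpSizeKey` and `ComputeBnTmpSize` are made-up stand-ins for whatever the real tmp-size query depends on:

```cpp
// Minimal sketch, assuming a hypothetical ComputeBnTmpSize() that performs the expensive
// workspace query. Key the result on everything the query depends on, so repeated
// InferTmpSize calls during multi-device compilation become cheap map lookups.
#include <cstddef>
#include <cstdint>
#include <map>
#include <mutex>
#include <tuple>
#include <vector>

struct BnTmpSizeKey {
  std::vector<int64_t> in_shape;
  int32_t axis = 0;
  int data_type = 0;
  bool operator<(const BnTmpSizeKey& rhs) const {
    return std::tie(in_shape, axis, data_type)
           < std::tie(rhs.in_shape, rhs.axis, rhs.data_type);
  }
};

// Placeholder for the real query (e.g. asking cuDNN how much workspace batch_norm needs).
size_t ComputeBnTmpSize(const BnTmpSizeKey& key) { return key.in_shape.empty() ? 0 : 4096; }

size_t CachedBnInferTmpSize(const BnTmpSizeKey& key) {
  static std::map<BnTmpSizeKey, size_t> cache;
  static std::mutex mutex;
  std::lock_guard<std::mutex> lock(mutex);
  auto it = cache.find(key);
  if (it != cache.end()) { return it->second; }  // cache hit: no expensive query
  const size_t size = ComputeBnTmpSize(key);     // cache miss: pay the cost once per key
  cache.emplace(key, size);
  return size;
}
```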
I collected statistics on the `InferTmpSize` time of every conv op (including `conv_data_grad` and `conv_filter_grad`). The overall results are as follows:
| config | total compile time | total InferTmpSize time of conv ops | proportion |
|---|---|---|---|
| 1n1c | 38.790431 | 3.85044 | 9.926% |
| 1n4c | 84.266890 | 3.764833 | 4.467% |
| 2n8c | 149.072857 | 4.318920 | 2.897% |
Note: it is not abnormal that the total `InferTmpSize` time with 1n4c is lower than with 1n1c, because other workloads on our development servers can affect the measurements.
From the proportion of the total `InferTmpSize` time in the total compile time, we can see that the `InferTmpSize` of conv ops is not the main factor behind the growth of compile time.
The detailed statistics (too large to list fully in this comment) also show that the first `InferTmpSize` call of a conv op takes several orders of magnitude longer than the subsequent calls for the same op, which indicates that the cache mechanism for the conv ops' `InferTmpSize` works correctly.
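For context, per-op numbers like the ones above can be gathered with a simple timing wrapper such as the hypothetical one below (not the actual profiling code used here):

```cpp
// Rough sketch: accumulate wall time per op name so the cost of the first InferTmpSize
// call can be compared against subsequent calls for the same op.
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <functional>
#include <string>
#include <unordered_map>

struct TmpSizeStat {
  int calls = 0;
  double total_ms = 0.0;
};

std::unordered_map<std::string, TmpSizeStat> g_tmp_size_stats;

size_t TimedInferTmpSize(const std::string& op_name, const std::function<size_t()>& infer_fn) {
  const auto start = std::chrono::steady_clock::now();
  const size_t size = infer_fn();  // the real InferTmpSize call being measured
  const auto end = std::chrono::steady_clock::now();
  const double ms = std::chrono::duration<double, std::milli>(end - start).count();
  TmpSizeStat& stat = g_tmp_size_stats[op_name];
  stat.calls += 1;
  stat.total_ms += ms;
  std::fprintf(stderr, "InferTmpSize of %s, call #%d: %.3f ms (total %.3f ms)\n",
               op_name.c_str(), stat.calls, ms, stat.total_ms);
  return size;
}
```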
Why does the op graph need `InferTmpSize`? It should be used only when building the task graph for the kernels. The op graph only needs to infer the output blob descs, not the tmp blob descs.
`OpGraph::Init -> OpGraph::InferLogicalBlobDesc -> OpGraph::InferOpNodeLogicalBlobDesc -> Operator::InferBlobDescsIf -> UserOp::InferBlobDescs -> kernel_reg_val->infer_tmp_size_fn(&infer_ctx)`

This is the call stack from `OpGraph::Init` to `InferTmpSize`. OneFlow constructs an `OpGraph` for each pass function in `LazyJobBuildAndInferCtx::Complete` and for each `WithOpGraphAndMutJobBuilder` in `JobCompleter::Complete`, so `InferTmpSize` is called repeatedly many times.

You mean that `OpGraph::InferOpNodeLogicalBlobDesc` should call `InferOutBlobDescsIf` instead of `InferBlobDescsIf`? @chengtbf
Yes, `LogicalBlobDesc` doesn't need the kernel tmp size.
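To make the agreed direction concrete, here is a hedged sketch with simplified stand-in types (not the real OneFlow interfaces): output-desc inference is separated from tmp-size inference, so op-graph construction only does shape propagation and never touches the kernel's `infer_tmp_size_fn`, which stays on the task-graph / kernel build path.

```cpp
// Hedged sketch of the proposed split; all names here are simplified stand-ins.
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

struct BlobDesc {
  std::vector<int64_t> shape;
};

class OpSketch {
 public:
  explicit OpSketch(std::function<size_t()> infer_tmp_size_fn)
      : infer_tmp_size_fn_(std::move(infer_tmp_size_fn)) {}

  // Cheap: shape propagation only. This is all the op graph needs for logical blob descs.
  BlobDesc InferOutBlobDescs(const BlobDesc& in) const { return in; }

  // Expensive: also asks the kernel for its tmp buffer size (e.g. a cuDNN workspace query).
  size_t InferTmpSize() const { return infer_tmp_size_fn_(); }

 private:
  std::function<size_t()> infer_tmp_size_fn_;
};

// Op-graph construction path: only output descs are inferred; InferTmpSize is never called here.
BlobDesc BuildOpGraphNode(const OpSketch& op, const BlobDesc& in) {
  return op.InferOutBlobDescs(in);
}
```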
Running ResNet-50 on 2 nodes with 16 GPUs takes about 15 minutes of compile time, while running BERT-base on 4 nodes with 32 GPUs only takes 5 minutes. We find that the Conv2d op's try-run to select the best algorithm takes a very long time when the number of GPUs is large.
Can we speed up the conv op's algorithm selection? By caching, or by multi-threading + asynchronous execution.
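As a starting point for discussion, here is a hedged sketch combining both ideas with hypothetical types (not OneFlow's actual `cudnn_conv_util`): cache the selected algorithm per conv configuration, and run the try-run selection asynchronously so duplicate conv configs share a single search instead of blocking compilation once per op.

```cpp
// Hedged sketch: configuration-keyed cache plus asynchronous algorithm selection.
#include <cstdint>
#include <future>
#include <map>
#include <mutex>
#include <tuple>
#include <vector>

struct ConvConfig {
  std::vector<int64_t> in_shape, weight_shape;
  std::vector<int32_t> stride, padding, dilation;
  bool operator<(const ConvConfig& rhs) const {
    return std::tie(in_shape, weight_shape, stride, padding, dilation)
           < std::tie(rhs.in_shape, rhs.weight_shape, rhs.stride, rhs.padding, rhs.dilation);
  }
};

// Placeholder for the expensive "try run every algorithm and pick the best" step.
int SelectBestConvAlgo(const ConvConfig& cfg) { return static_cast<int>(cfg.stride.size()); }

class ConvAlgoCache {
 public:
  // Start (or reuse) a selection without blocking the compile thread; callers can
  // .get() the shared_future later, when the result is actually needed.
  std::shared_future<int> GetOrSelectAsync(const ConvConfig& cfg) {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = cache_.find(cfg);
    if (it != cache_.end()) { return it->second; }  // duplicate config: reuse the same search
    std::shared_future<int> fut =
        std::async(std::launch::async, [cfg] { return SelectBestConvAlgo(cfg); }).share();
    cache_.emplace(cfg, fut);
    return fut;
  }

 private:
  std::mutex mutex_;
  std::map<ConvConfig, std::shared_future<int>> cache_;
};
```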