chengtbf opened this issue 4 years ago (status: Open)
We can record how many times Conv2d algorithm selection is called during each compilation. I remember that we do have a cache for it. Log the shape info to figure out whether the cache is broken.
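For reference, a minimal instrumentation sketch (hypothetical helper names, not OneFlow's actual code): count how many times conv algorithm selection actually runs per input/weight shape during one compile. If the same shape key shows up more than once, the selection cache is probably not being hit.

```cpp
// Hypothetical instrumentation sketch: count algorithm-selection calls per shape key.
#include <cstdint>
#include <cstdio>
#include <map>
#include <sstream>
#include <string>
#include <vector>

std::map<std::string, int> g_conv_algo_select_count;

std::string ShapeKey(const std::vector<int64_t>& in_shape,
                     const std::vector<int64_t>& weight_shape) {
  std::ostringstream oss;
  for (int64_t d : in_shape) { oss << d << ","; }
  oss << "|";
  for (int64_t d : weight_shape) { oss << d << ","; }
  return oss.str();
}

// Call this at the point where the algorithm search actually runs (i.e. on a cache miss).
void LogConvAlgoSelect(const std::vector<int64_t>& in_shape,
                       const std::vector<int64_t>& weight_shape) {
  const std::string key = ShapeKey(in_shape, weight_shape);
  const int count = ++g_conv_algo_select_count[key];
  std::fprintf(stderr, "conv2d algorithm selection #%d for shape %s\n", count, key.c_str());
}
```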
There's no evidence yet that the long compile time is caused by the cuDNN algorithm inference of conv2d. I will run some tests to check whether the cuDNN algorithm inference works incorrectly or the cache of the inference result is broken.
Can you show the data on the time cost?
From the profiling logs, I found that the main factor slowing down the compile is the `InferTmpSize` of `batch_normalization`. It is called many times in `UserOp::InferBlobDescs` when a OneFlow job runs on multiple devices (or nodes). I will add a cache for the `InferTmpSize` of `batch_normalization`, just like the one in `cudnn_conv_util`.
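As a rough illustration of that plan, here is a minimal sketch in the spirit of the `cudnn_conv_util` cache; `BnTmpSizeKey` and `ComputeBnTmpSize` are made-up stand-ins for whatever the real tmp-size query depends on:

```cpp
// Minimal sketch, assuming a hypothetical ComputeBnTmpSize() that performs the expensive
// workspace query. Key the result on everything the query depends on, so repeated
// InferTmpSize calls during multi-device compilation become cheap map lookups.
#include <cstddef>
#include <cstdint>
#include <map>
#include <mutex>
#include <tuple>
#include <vector>

struct BnTmpSizeKey {
  std::vector<int64_t> in_shape;
  int32_t axis = 0;
  int data_type = 0;
  bool operator<(const BnTmpSizeKey& rhs) const {
    return std::tie(in_shape, axis, data_type)
           < std::tie(rhs.in_shape, rhs.axis, rhs.data_type);
  }
};

// Placeholder for the real query (e.g. asking cuDNN how much workspace batch_norm needs).
size_t ComputeBnTmpSize(const BnTmpSizeKey& key) { return key.in_shape.empty() ? 0 : 4096; }

size_t CachedBnInferTmpSize(const BnTmpSizeKey& key) {
  static std::map<BnTmpSizeKey, size_t> cache;
  static std::mutex mutex;
  std::lock_guard<std::mutex> lock(mutex);
  auto it = cache.find(key);
  if (it != cache.end()) { return it->second; }  // cache hit: no expensive query
  const size_t size = ComputeBnTmpSize(key);     // cache miss: pay the cost once per key
  cache.emplace(key, size);
  return size;
}
```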
I collected statistics on the `InferTmpSize` time of every conv op (including `conv_data_grad` and `conv_filter_grad`). The overall results are as follows:
| config | total compile time | total InferTmpSize time of conv ops | proportion |
|---|---|---|---|
| 1n1c | 38.790431 | 3.85044 | 9.926% |
| 1n4c | 84.266890 | 3.764833 | 4.467% |
| 2n8c | 149.072857 | 4.318920 | 2.897% |
Note: it is not abnormal that the total `InferTmpSize` time with 1n4c is lower than with 1n1c, because other workloads on our development servers can affect the measurements.
From the proportion of the total `InferTmpSize` time in the total compile time, we can see that the `InferTmpSize` of conv ops is not the main factor behind the growth of compile time.
The detailed statistics (too large to list fully in this comment) also show that the first `InferTmpSize` call of a conv op takes several orders of magnitude longer than the subsequent calls for the same op, which indicates that the cache mechanism for the conv ops' `InferTmpSize` works correctly.
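For context, per-op numbers like the ones above can be gathered with a simple timing wrapper such as the hypothetical one below (not the actual profiling code used here):

```cpp
// Rough sketch: accumulate wall time per op name so the cost of the first InferTmpSize
// call can be compared against subsequent calls for the same op.
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <functional>
#include <string>
#include <unordered_map>

struct TmpSizeStat {
  int calls = 0;
  double total_ms = 0.0;
};

std::unordered_map<std::string, TmpSizeStat> g_tmp_size_stats;

size_t TimedInferTmpSize(const std::string& op_name, const std::function<size_t()>& infer_fn) {
  const auto start = std::chrono::steady_clock::now();
  const size_t size = infer_fn();  // the real InferTmpSize call being measured
  const auto end = std::chrono::steady_clock::now();
  const double ms = std::chrono::duration<double, std::milli>(end - start).count();
  TmpSizeStat& stat = g_tmp_size_stats[op_name];
  stat.calls += 1;
  stat.total_ms += ms;
  std::fprintf(stderr, "InferTmpSize of %s, call #%d: %.3f ms (total %.3f ms)\n",
               op_name.c_str(), stat.calls, ms, stat.total_ms);
  return size;
}
```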
Why does the op graph need `InferTmpSize`? It should be used only when building the task graph for the kernels. The op graph only needs to infer the output blob descs, not the tmp blob descs.
`OpGraph::Init -> OpGraph::InferLogicalBlobDesc -> OpGraph::InferOpNodeLogicalBlobDesc -> Operator::InferBlobDescsIf -> UserOp::InferBlobDescs -> kernel_reg_val->infer_tmp_size_fn(&infer_ctx)`

This is the call stack from `OpGraph::Init` to `InferTmpSize`. OneFlow constructs an `OpGraph` for each pass function in `LazyJobBuildAndInferCtx::Complete` and for each `WithOpGraphAndMutJobBuilder` in `JobCompleter::Complete`, so `InferTmpSize` is called repeatedly many times.

You mean that `OpGraph::InferOpNodeLogicalBlobDesc` should call `InferOutBlobDescsIf` instead of `InferBlobDescsIf`? @chengtbf
Yes, `LogicalBlobDesc` doesn't need the kernel tmp size.
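To make the agreed direction concrete, here is a hedged sketch with simplified stand-in types (not the real OneFlow interfaces): output-desc inference is separated from tmp-size inference, so op-graph construction only does shape propagation and never touches the kernel's `infer_tmp_size_fn`, which stays on the task-graph / kernel build path.

```cpp
// Hedged sketch of the proposed split; all names here are simplified stand-ins.
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

struct BlobDesc {
  std::vector<int64_t> shape;
};

class OpSketch {
 public:
  explicit OpSketch(std::function<size_t()> infer_tmp_size_fn)
      : infer_tmp_size_fn_(std::move(infer_tmp_size_fn)) {}

  // Cheap: shape propagation only. This is all the op graph needs for logical blob descs.
  BlobDesc InferOutBlobDescs(const BlobDesc& in) const { return in; }

  // Expensive: also asks the kernel for its tmp buffer size (e.g. a cuDNN workspace query).
  size_t InferTmpSize() const { return infer_tmp_size_fn_(); }

 private:
  std::function<size_t()> infer_tmp_size_fn_;
};

// Op-graph construction path: only output descs are inferred; InferTmpSize is never called here.
BlobDesc BuildOpGraphNode(const OpSketch& op, const BlobDesc& in) {
  return op.InferOutBlobDescs(in);
}
```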
Running ResNet-50 on 2 nodes with 16 GPUs takes about 15 minutes of compile time, while running BERT-base on 4 nodes with 32 GPUs only takes 5 minutes. We find that the Conv2d op's try-run to select the best algorithm takes a very long time when the number of GPUs is large.
Can we speed up the conv op's algorithm selection? By caching, or by multi-threading + asynchronous execution.
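As a starting point for discussion, here is a hedged sketch combining both ideas with hypothetical types (not OneFlow's actual `cudnn_conv_util`): cache the selected algorithm per conv configuration, and run the try-run selection asynchronously so duplicate conv configs share a single search instead of blocking compilation once per op.

```cpp
// Hedged sketch: configuration-keyed cache plus asynchronous algorithm selection.
#include <cstdint>
#include <future>
#include <map>
#include <mutex>
#include <tuple>
#include <vector>

struct ConvConfig {
  std::vector<int64_t> in_shape, weight_shape;
  std::vector<int32_t> stride, padding, dilation;
  bool operator<(const ConvConfig& rhs) const {
    return std::tie(in_shape, weight_shape, stride, padding, dilation)
           < std::tie(rhs.in_shape, rhs.weight_shape, rhs.stride, rhs.padding, rhs.dilation);
  }
};

// Placeholder for the expensive "try run every algorithm and pick the best" step.
int SelectBestConvAlgo(const ConvConfig& cfg) { return static_cast<int>(cfg.stride.size()); }

class ConvAlgoCache {
 public:
  // Start (or reuse) a selection without blocking the compile thread; callers can
  // .get() the shared_future later, when the result is actually needed.
  std::shared_future<int> GetOrSelectAsync(const ConvConfig& cfg) {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = cache_.find(cfg);
    if (it != cache_.end()) { return it->second; }  // duplicate config: reuse the same search
    std::shared_future<int> fut =
        std::async(std::launch::async, [cfg] { return SelectBestConvAlgo(cfg); }).share();
    cache_.emplace(cfg, fut);
    return fut;
  }

 private:
  std::mutex mutex_;
  std::map<ConvConfig, std::shared_future<int>> cache_;
};
```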