PaddlePaddle / Paddle

PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (core framework of PaddlePaddle『飞桨』: high-performance single-machine and distributed training, and cross-platform deployment, for deep learning and machine learning)
http://www.paddlepaddle.org/
Apache License 2.0

MKLDNN: Fully Connected layer. #9197

Closed. mozga-intel closed this issue 6 years ago.

mozga-intel commented 6 years ago

I will use the Fully Connected (FC) layer as an example to describe this problem. While implementing the Fully Connected layer with the MKLDNN algorithm, I encountered a few difficulties. The current version of Paddle splits the fully connected layer into two operations, multiplication and addition, and these are the operations Paddle uses today. The MKLDNN version of the algorithm gives us the opportunity to combine these two operations into one. So, to kill two birds with one stone, I would have to write a new kernel for this layer, that is, a stand-alone version of the FC algorithm. However, when I implemented the new kernel, I ran into a few problems, which are quoted and answered in the replies below.
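For reference, here is a minimal NumPy sketch of the two computations being compared; it is illustrative only, not Paddle kernel code. The first function mirrors today's two-op composition (mul followed by elementwise addition), the second mirrors what a single fused FC kernel, such as the one MKLDNN provides, computes in one pass.

```python
import numpy as np

def fc_two_ops(x, w, b):
    # Today's composition in Paddle: a mul op followed by an elementwise add op.
    y = np.dot(x, w)  # mul: (batch, in_dim) x (in_dim, out_dim)
    return y + b      # elementwise_add: broadcast the bias over the batch

def fc_fused(x, w, b):
    # What a single fused FC kernel computes: conceptually one kernel
    # invocation instead of two, with no intermediate tensor to materialize.
    return np.dot(x, w) + b

x = np.random.rand(8, 32).astype(np.float32)   # batch of 8, 32 input features
w = np.random.rand(32, 16).astype(np.float32)  # 16 output features
b = np.random.rand(16).astype(np.float32)

assert np.allclose(fc_two_ops(x, w, b), fc_fused(x, w, b))
```

Both functions return identical results; the fused version only saves the intermediate tensor and the second kernel dispatch.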

Thank you.

dzhwinter commented 6 years ago
  • First of all, am I forced to implement three versions of the same algorithm, for CPU, GPU, and MKLDNN, in order to register the new MKLDNN op kernel?

No. You can just add a big FC operator and implement only the MKLDNN kernel.

  • Can I use the new FC kernel when I don't have full implementations of the FC kernels for the CPU and GPU places, but only two fake (placeholder) kernels for those places?

If you are familiar with Eigen Tensor, implementing a CPU/GPU FC kernel is similar to implementing the MKLDNN kernel. The user only calls the kernel through Python, so we can fall back to a combination of small ops (mul + addition) when MKLDNN is not available.
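Below is a sketch of this fallback at the level of Python program construction. The Block class and its methods are simplified stand-ins invented for illustration; the real fc in nn.py goes through Paddle's LayerHelper, whose API differs.

```python
class Block:
    """Toy stand-in for a Paddle program block, just enough to show how the
    Python layer could choose between one fused op and two small ops.
    This is not the real fluid API."""

    def __init__(self):
        self.ops = []
        self._count = 0

    def new_var(self):
        self._count += 1
        return "tmp_%d" % self._count

    def append_op(self, op_type, inputs, outputs):
        self.ops.append((op_type, inputs, outputs))


def build_fc(block, x, w, b, use_mkldnn):
    # MKLDNN path: one fused 'fc' op handled entirely by the MKLDNN kernel.
    if use_mkldnn:
        out = block.new_var()
        block.append_op("fc", inputs=[x, w, b], outputs=[out])
        return out
    # Fallback: the existing small-op composition every place already supports.
    tmp = block.new_var()
    block.append_op("mul", inputs=[x, w], outputs=[tmp])
    out = block.new_var()
    block.append_op("elementwise_add", inputs=[tmp, b], outputs=[out])
    return out


blk = Block()
build_fc(blk, "x", "w", "b", use_mkldnn=False)
print(blk.ops)  # [('mul', ['x', 'w'], ['tmp_1']), ('elementwise_add', ['tmp_1', 'b'], ['tmp_2'])]
```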

  • Also, what can I do to merge several of these algorithms into one? Should we remove the old version of the algorithm (multiplication and sum), replace it with the new algorithm (fully connected on MKLDNN), or is it off-limits to touch it, so that we need to add a new op kernel alongside the current solution?

I do not fully understand your point. The multiplication and sum operators are fundamental in algebra and are used everywhere. I think the FC kernel cannot replace these two operators; it is just a speed-up for when you want to perform the fully connected operation.

  • Can we have a special kernel for only one specific platform, i.e. MKLDNN, without needing to register new kernels for the other platforms, i.e. CPU (naive) and GPU?

Yes. See the first comment.

dzhwinter commented 6 years ago

There is one point I want to clarify: why I didn't implement the FC CPU/GPU kernel. Kernel fusion (https://arxiv.org/abs/1305.1183, https://www.tensorflow.org/performance/xla/jit) is a big topic, and combining small ops into a big one by hand is the old-fashioned way to do it. We can do some tricks to fuse batch normalization or the fully connected layer, but I think we need a general solution, because:

  1. You cannot write hand-fused kernels for every platform. For example, today we have more than 10 kinds of mobile chips. If you choose kernel fusion by hand, take the FC kernel and the batch norm kernel for example, we would have to implement 10 variants of each of them, plus the multiplication and addition operators (you need these two basic ops everywhere anyway, don't you?). But if you choose small operators, say the multiplication and addition operations, we only need to port those two.

  2. Kernel fusion by hand will lead to an explosion of op combinations. Take the FC kernel for example:
    fc kernel = mul + addition + activation, right? Then the general rule is New Kernel = Kernel A + Kernel B + ...: whenever combining the kernels on the right gains us some benefit, we generate a new kernel like the one on the left. You can imagine how many combinations we would end up with; going that way would be a disaster (a toy count follows this list).
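To make the counting argument in point 2 concrete, here is a toy Python estimate; the op list, chain length, and platform count are invented for illustration, and only the growth rate matters.

```python
from itertools import product

basic_ops = ["mul", "add", "relu", "tanh", "conv", "batch_norm"]  # invented basic ops
platforms = 10   # e.g. CPU, GPU, MKLDNN, and assorted mobile chips
max_chain = 3    # assume we hand-fuse chains of at most three ops

# Porting only the basic ops: one kernel per op per platform.
small_op_kernels = platforms * len(basic_ops)

# Hand-writing a fused kernel for every ordered chain "Kernel A + Kernel B + ..."
# of length 2 or 3, on every platform.
fused_chains = sum(len(list(product(basic_ops, repeat=k)))
                   for k in range(2, max_chain + 1))
fused_kernels = platforms * fused_chains

print(small_op_kernels)  # 60
print(fused_kernels)     # 2520, and it keeps growing with the chain length
```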

These two reasons forced the TensorFlow team to choose the XLA approach (https://www.tensorflow.org/performance/xla/). But AFAIK, it makes debugging a nightmare, because you cannot easily tell what happened in your code.

We will follow TVM or similar technology later. Currently the multi-node multi-GPU performance suffers, and I am focusing on that topic.

luotao1 commented 6 years ago

You can just add a big FC operator and implement only the MKLDNN kernel

I agree. You can add fc_mkldnn_op.cc and fc_mkldnn_op.h, and modify the fc method in nn.py.

mozga-intel commented 6 years ago

@luotao1, @dzhwinter, Thank you very much.