alnfedorov / lowbitdnn-project

Lowbit integer arithmetic to speed up CNN inference and training.

Primitives design #2

Open alnfedorov opened 5 years ago

alnfedorov commented 5 years ago

It's a well-known fact that many convolutions can be expressed as a direct matrix multiplication (Im2Col and more subtle ideas). The cuDNN white paper directly states that NVIDIA developers use precisely this approach. A bit of reverse engineering reveals that NVIDIA developed a set of architecture-specific matrix multiplication kernels tuned for specific data shapes, and many convolutions are mapped onto these basic blocks via heuristics. Perhaps it is a good idea to follow the same path and treat any convolution as a bunch of general matrix multiplication routines.
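For reference, a minimal NumPy sketch of the Im2Col idea (stride 1, no padding; real cuDNN/CUTLASS kernels avoid materializing the column matrix, but the mapping is the same):

```python
import numpy as np

def im2col(x, kh, kw):
    # Unroll every receptive field of x (C, H, W) into a column: (C*kh*kw, out_h*out_w).
    c, h, w = x.shape
    out_h, out_w = h - kh + 1, w - kw + 1
    cols = np.empty((c * kh * kw, out_h * out_w), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            cols[:, i * out_w + j] = x[:, i:i + kh, j:j + kw].ravel()
    return cols

def conv2d_as_gemm(x, weight):
    # weight: (out_channels, C, kh, kw); the whole convolution becomes one GEMM.
    oc, c, kh, kw = weight.shape
    out_h, out_w = x.shape[1] - kh + 1, x.shape[2] - kw + 1
    out = weight.reshape(oc, -1) @ im2col(x, kh, kw)   # (oc, out_h*out_w)
    return out.reshape(oc, out_h, out_w)
```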

On the other hand, it is fruitful to apply direct convolutions for the CPU. Thanks to the CUDA ideology, it is enough to implement convolutions just for the smallest data shape; further scaling is done by adjusting grid parameters, and additional tuning is possible by providing more specialized kernel implementations.

The question is, which is more flexible: a set of matrix multiplication kernels plus logic to turn them into convolutions, or convolution kernels plus logic to turn them into matrix multiplications (for RNNs and fully-connected layers)?

alnfedorov commented 5 years ago

In any case, this is going to involve a lot of boilerplate with subtle differences in the shared memory or grid configurations. That is okay for a prototype and a single micro-architecture, yet further development requires either abstracting the pipeline or using existing libraries, like CUTLASS.

Perhaps it's even better for performance to create JIT-compiled CUTLASS-based templates (like the Nervana convolutions).

alnfedorov commented 5 years ago

Other options are various domain-specific languages and deep learning compilers. The idea is to extend their primitives for low-bit inference and training:

  1. TVM - layers are written via a Python-based proxy language + autotuning (see the sketch at the end of this comment).
  2. Tensor Comprehensions (looks like it has been dropped) and PlaidML - new languages for custom deep learning layers. Not sure it would be simple to implement custom logic with custom data types; there is also no support for NVIDIA tensor cores.
  3. ngraph - PlaidML + vendor kernel calls + graph simplification. Didn't find any options to integrate new operations easily.
  4. glow - not that simple to grasp all the ideas. Similar to TVM; needs a closer look.
  5. halide - yet another language to simplify high-performance computational pipelines. Interesting, especially given the reported performance gains over PyTorch for some special-case filters, but it seems to lack support for vendor-specific features like tensor cores.

TVM and Glow require a closer look to better understand the pros and cons.

PlaidML, Halide, and Tensor Comprehensions are not suitable for my use case: I didn't find any way to use tensor cores on the NVIDIA Turing micro-architecture. The Intel nGraph framework is not an option either, as it mostly routes kernel calls to the supported backend libraries.
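To get a feel for the TVM option, a low-bit primitive can be declared directly in the Python proxy language with int8 inputs and int32 accumulation. A rough sketch (the `te` namespace is assumed; the exact API depends on the TVM version, and the schedule/tensor-core lowering is omitted):

```python
import tvm
from tvm import te

M = N = K = 1024
A = te.placeholder((M, K), dtype="int8", name="A")
B = te.placeholder((N, K), dtype="int8", name="B")
k = te.reduce_axis((0, K), name="k")

# int8 * int8 accumulated in int32 -- the pattern that dp4a / Turing tensor cores expect.
C = te.compute(
    (M, N),
    lambda i, j: te.sum(A[i, k].astype("int32") * B[j, k].astype("int32"), axis=k),
    name="C",
)

s = te.create_schedule(C.op)  # tiling, vectorization, tensorization would go here
```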

alnfedorov commented 5 years ago

It is clear that PyTorch (the target framework, but not the only one) is moving toward a hybrid computational graph, where JIT-traced subgraphs are compiled on the fly (TVM or Glow). Also, PyTorch is about to introduce first-tier support for quantization in general (same for TVM), which plays nicely with the idea of subgraph compilation. That supports the idea of implementing low-bit operations directly as TVM/Glow primitives.

The PyTorch quantization design document is here.

PyTorch->TVM project and PyTorch->Glow project.
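For context, the subgraphs those bridges consume come from ordinary tracing; a toy example (not quantization-specific), assuming the stock torch.jit API:

```python
import torch

# A stand-in for a subgraph that could be handed off to TVM/Glow.
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 8, kernel_size=3),
    torch.nn.ReLU(),
)

example = torch.randn(1, 3, 32, 32)
traced = torch.jit.trace(model, example)   # records the executed ops into a TorchScript graph
print(traced.graph)                        # the IR a compiler backend would consume
```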

alnfedorov commented 5 years ago

Choosing between TVM and Glow: TVM has better code generation support so far (tensorization + vectorization + CPU + NVIDIA GPU + ARM + etc.). As far as I understand, Glow uses independent graph transformations + vendor-specific backends that implement the basic linear algebra blocks. Hence, it is not that simple to fuse operators or make more advanced optimizations with Glow. What is even worse, Glow does not support CUDA as a backend, only OpenCL. And Glow cannot autotune a single algorithm for various scenarios, as TVM does (write once, tune many).

For now, between TVM and Glow I would choose TVM.
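A sketch of what "write once, tune many" looks like with autotvm (the task name and split knobs are illustrative; the API follows the official tuning tutorial and may differ across TVM versions):

```python
from tvm import te, autotvm

@autotvm.template("lowbit/matmul_int8")        # illustrative task name
def matmul_int8(M, N, K):
    A = te.placeholder((M, K), dtype="int8", name="A")
    B = te.placeholder((N, K), dtype="int8", name="B")
    k = te.reduce_axis((0, K), name="k")
    C = te.compute((M, N),
                   lambda i, j: te.sum(A[i, k].astype("int32") * B[j, k].astype("int32"), axis=k),
                   name="C")
    s = te.create_schedule(C.op)

    # The schedule is written once with tunable knobs; the tuner searches them per shape/target.
    cfg = autotvm.get_config()
    i, j = s[C].op.axis
    cfg.define_split("tile_i", i, num_outputs=2)
    cfg.define_split("tile_j", j, num_outputs=2)
    io, ii = cfg["tile_i"].apply(s, C, i)
    jo, ji = cfg["tile_j"].apply(s, C, j)
    s[C].reorder(io, jo, ii, ji)
    return s, [A, B, C]

task = autotvm.task.create("lowbit/matmul_int8", args=(1024, 1024, 1024), target="llvm")
tuner = autotvm.tuner.RandomTuner(task)
tuner.tune(n_trial=32,
           measure_option=autotvm.measure_option(builder="local",
                                                 runner=autotvm.LocalRunner(number=5)),
           callbacks=[autotvm.callback.log_to_file("matmul_int8.log")])
```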

alnfedorov commented 5 years ago

TVM or CUDA + CUTLASS? Simple question: definitely TVM, because of the autotuning and code generation opportunities. However, it is not that simple to transfer TVM functions to PyTorch. I either need to help with the implementation and integration of the whole quantization stack in TVM and PyTorch, or code-gen CUDA from the TVM function and wrap it as a PyTorch module. Either way, it is a good idea to write and optimize the TVM code first and then decide how to proceed.

The original problem remains: whether to express convolutions as matrix multiplications or not.
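A minimal sketch of the second route (code-gen CUDA from a TVM function and wrap it on the PyTorch side); the `te` namespace is assumed, and the wrapping step (e.g. via torch.utils.cpp_extension) is only hinted at:

```python
import tvm
from tvm import te

n = te.var("n")
A = te.placeholder((n,), dtype="int8", name="A")
B = te.placeholder((n,), dtype="int8", name="B")
C = te.compute((n,), lambda i: A[i].astype("int32") + B[i].astype("int32"), name="C")

s = te.create_schedule(C.op)
bx, tx = s[C].split(C.op.axis[0], factor=64)
s[C].bind(bx, te.thread_axis("blockIdx.x"))
s[C].bind(tx, te.thread_axis("threadIdx.x"))

func = tvm.build(s, [A, B, C], target="cuda", name="add_int8")
cuda_src = func.imported_modules[0].get_source()   # the generated CUDA kernel as plain source
print(cuda_src)  # this string could then be compiled into a PyTorch extension and exposed as a module
```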

alnfedorov commented 5 years ago

References:

  1. TVM tensor cores header issue
  2. TVM dp4a optimizations

alnfedorov commented 5 years ago

Problems found:

  1. To interact with each other, TVM and PyTorch use DLPack, and DLPack doesn't support quantization at the moment. Another obstacle for transparent quantization integration (see the sketch below).
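A small illustration of the obstacle, assuming the stock torch/tvm DLPack helpers:

```python
import torch
import tvm
from torch.utils.dlpack import to_dlpack

# Plain dtypes cross the boundary zero-copy...
t = torch.randn(2, 3)
arr = tvm.nd.from_dlpack(to_dlpack(t))

# ...but DLPack describes a dtype only as (type code, bits, lanes), so there is no
# place to carry quantization parameters (scale, zero-point); quantized tensors
# cannot be exchanged transparently.
```
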
alnfedorov commented 4 years ago

When implementing the backward pass w.r.t. the data, it is necessary to express the computation as a simple convolution (not a transposed one). There are many reasons for this; the simplest is that we can't express the result of a transposed convolution with the universal indexing rule (the primary TVM approach).
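A 1-D NumPy sketch of that point: for stride 1, the gradient w.r.t. the input is just a plain (valid) convolution of the zero-padded output gradient with the flipped kernel, so it fits the same indexing rule as the forward pass (strided cases additionally require dilating the gradient with zeros):

```python
import numpy as np

def conv1d_valid(x, w):
    # forward pass: cross-correlation, stride 1, no padding
    n, k = len(x), len(w)
    return np.array([np.dot(x[i:i + k], w) for i in range(n - k + 1)])

def grad_wrt_input(dy, w):
    # backward w.r.t. the data as a *forward* convolution: pad dy with (k - 1)
    # zeros on both sides and correlate with the flipped kernel.
    k = len(w)
    return conv1d_valid(np.pad(dy, k - 1), w[::-1])

# numerical check against the analytic gradient dL/dx[j] = sum_k dy[j - k] * w[k]
x, w = np.random.randn(8), np.random.randn(3)
dy = np.random.randn(len(x) - len(w) + 1)          # upstream gradient, same shape as y
ref = np.array([sum(dy[j - k] * w[k] for k in range(len(w)) if 0 <= j - k < len(dy))
                for j in range(len(x))])
assert np.allclose(grad_wrt_input(dy, w), ref)
```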