In any case, this is going to be a lot of boilerplate with only subtle differences in shared-memory or grid configurations. That is fine for a prototype targeting a single micro-architecture, but further development requires either abstracting the pipeline or using existing libraries such as CUTLASS.
Perhaps it is even better for performance to create JIT-compiled CUTLASS-based templates (like the Nervana convolutions).
Other options are various domain-specific languages and deep learning compilers. The idea is to extend their primitives for low-bit inference and training:
TVM and Glow require a closer look to better understand their pros and cons.
PlaidML, Halide, and TensorComprehensions are not suitable for my use case: I did not find any way to use Tensor Cores on the NVIDIA Turing micro-architecture. The Intel nGraph framework is not an option either, as it mostly routes kernel calls to supported backend libraries.
It is clear that PyTorch (the target framework, though not the only one) is moving toward a hybrid computational graph, where JIT-traced subgraphs are compiled on the fly (by TVM or Glow). PyTorch is also about to introduce first-class support for quantization in general (same for TVM), which plays nicely with the idea of compiling subgraphs. This supports the idea of implementing low-bit operations directly as TVM/Glow primitives.
The PyTorch quantization design document is here.
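A minimal sketch of the subgraph-tracing workflow mentioned above; the module and shapes are made up for illustration, but the traced graph is the kind of IR a compiler backend (TVM/Glow) would consume:

```python
import torch
import torch.nn as nn

# Hypothetical toy module; any nn.Module works the same way.
class TinyConvNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.conv(x))

model = TinyConvNet().eval()
example = torch.randn(1, 3, 32, 32)

# torch.jit.trace records the executed ops into a static subgraph;
# a compiler backend can then lower this graph to optimized kernels.
traced = torch.jit.trace(model, example)
print(traced.graph)
```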
Choosing between TVM and Glow: TVM has better code generation support so far (tensorization + vectorization + CPU + NVIDIA GPU + ARM + etc.). As far as I understand, Glow uses independent graph transformations plus vendor-specific backends that implement basic linear algebra blocks. Hence, it is not that simple to fuse operators or make more advanced optimizations with Glow. What is even worse, Glow does not support CUDA as a backend, only OpenCL. Glow also cannot autotune a single algorithm for various scenarios, as TVM does (write once, tune many).
For now, between TVM and Glow I would choose TVM.
TVM or CUDA + CUTLASS? A simple question: definitely TVM, because of the autotuning and code generation opportunities. However, it is not that simple to transfer TVM functions to PyTorch. I would need either to help with the implementation and integration of the whole quantization stack in TVM and PyTorch, or to code-generate CUDA code from a TVM function and wrap it as a PyTorch module. Either way, it is a good idea to write and optimize the TVM code first, and then decide how to proceed.
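A rough sketch of what the TVM side could look like for a single operator, using the generic `te` API from current TVM releases (the shapes, dtypes, and names are placeholders, not the actual kernels discussed here):

```python
import tvm
from tvm import te

# Hypothetical int8 GEMM shapes, just for illustration.
N, K = 1024, 512
A = te.placeholder((N, K), name="A", dtype="int8")
B = te.placeholder((K, N), name="B", dtype="int8")
k = te.reduce_axis((0, K), name="k")

# Declare the computation once; schedules and autotuning are applied
# separately ("write once, tune many").
C = te.compute(
    (N, N),
    lambda i, j: te.sum(A[i, k].astype("int32") * B[k, j].astype("int32"), axis=k),
    name="C",
)

s = te.create_schedule(C.op)
# Inspect the lowered IR; tvm.build(s, [A, B, C], target="cuda") would emit
# CUDA code that could then be wrapped as a PyTorch extension.
print(tvm.lower(s, [A, B, C], simple_mode=True))
```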
The original question remains: whether to express convolutions as matrix multiplication or not.
References:
Problems found:
When implementing the backward pass w.r.t. the data, it is necessary to express the computation as a simple convolution (not a transposed one). There are many reasons for this. The simplest is that we cannot express the result of a transposed convolution with the universal indexing rule (the primary TVM approach).
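For reference, a small PyTorch check of the underlying identity: the data gradient of a convolution is itself a plain convolution of the zero-padded output gradient with the flipped kernel (1-D, single channel, stride 1, no padding; shapes are arbitrary):

```python
import torch
import torch.nn.functional as F

# Arbitrary single-channel example: input length 8, kernel size 3, stride 1, no padding.
x = torch.randn(1, 1, 8, requires_grad=True)
w = torch.randn(1, 1, 3)

y = F.conv1d(x, w)            # PyTorch "convolution" is really cross-correlation
gy = torch.randn_like(y)      # some incoming gradient dL/dy
y.backward(gy)                # reference dL/dx computed by autograd

# The same gradient as a *plain* convolution: zero-pad dL/dy by (kernel_size - 1)
# on both sides and correlate it with the flipped kernel.
gx = F.conv1d(F.pad(gy, (2, 2)), w.flip(-1))

print(torch.allclose(gx, x.grad, atol=1e-6))  # expected: True
```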
It is well known that many convolutions can be expressed as a direct matrix multiplication (Im2Col and more subtle ideas). The cuDNN white paper directly states that NVIDIA developers use precisely this approach. A bit of reverse engineering reveals that NVIDIA developed a set of architecture-specific matrix multiplication kernels tuned for specific data shapes, and many convolutions are mapped onto these basic blocks via heuristics. Perhaps it is a good idea to follow the same path and treat any convolution as a bunch of general matrix multiplication routines.
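A minimal NumPy sketch of the Im2Col idea (stride 1, no padding, CHW layout; purely illustrative, not the kernels discussed above):

```python
import numpy as np

def im2col(x, kh, kw):
    """Unfold a (C, H, W) input into a (C*kh*kw, out_h*out_w) matrix; stride 1, no padding."""
    c, h, w = x.shape
    out_h, out_w = h - kh + 1, w - kw + 1
    cols = np.empty((c * kh * kw, out_h * out_w), dtype=x.dtype)
    idx = 0
    for ci in range(c):
        for i in range(kh):
            for j in range(kw):
                cols[idx] = x[ci, i:i + out_h, j:j + out_w].reshape(-1)
                idx += 1
    return cols

# Convolution as a single GEMM: weights (OC, C, KH, KW) flattened to (OC, C*KH*KW).
x = np.random.randn(3, 8, 8).astype(np.float32)
w = np.random.randn(4, 3, 3, 3).astype(np.float32)
y = w.reshape(4, -1) @ im2col(x, 3, 3)   # (OC, out_h*out_w)
y = y.reshape(4, 6, 6)                   # back to (OC, out_h, out_w)
print(y.shape)
```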
On the other hand, applying direct convolutions is fruitful for the CPU. Thanks to the CUDA ideology, it is enough to implement a convolution just for the smallest data shape: further scaling is done by adjusting grid parameters, and additional tuning is possible by providing more specialized kernel implementations.
The question is: what is more flexible? A bunch of matrix multiplication kernels plus logic to turn them into convolutions, or convolution kernels plus logic to turn them into matrix multiplications (for RNNs or fully-connected layers)?