Samsung / ONE

On-device Neural Engine

[Compiler] Native code generation #5440

Open binarman opened 3 years ago

binarman commented 3 years ago

Let's investigate code generation techniques in the neural network world.

Some NNs have subgraphs with lots of relatively small operators, which introduce a lot of runtime overhead. There are also cases where operators can be simplified and fused, which can lead to a performance improvement.

Some work in this direction has already been done in other projects, for example:

We can probably make use of these technologies to improve inference performance.

binarman commented 3 years ago

To get some understanding of the optimization possibilities, I've experimented a bit with a small subgraph from LPCNet. The model has a GRU layer that is transformed into a set of small operators.

The subgraph from the original network that I targeted (attached image: subgraph).

I rewrote this subgraph using primitive TensorFlow functions and evaluated its performance with ONERT. Then I reimplemented the graph in C++ in three variants:

  1. Naive implementation (just for reference)
  2. Naive implementation, but with the kernels from ONERT (I assume no loop fusion was applied by the compiler, though I did not inspect it carefully)
  3. Optimized implementation: manually fused and unrolled loops (see the sketch after this list)
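
To make the difference concrete, here is a minimal sketch (not the actual code from gru_experiment.zip; the operator chain and names are made up for illustration) of variant 1 versus variant 3 on a chain of small element-wise operators:

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical chain of small element-wise operators, similar to those a
// decomposed GRU cell produces: out = tanh(a * b + c) * d.

// Variant 1: naive, one loop (and one pass over memory) per operator.
void chain_naive(const float* a, const float* b, const float* c, const float* d,
                 float* tmp, float* out, std::size_t n)
{
  for (std::size_t i = 0; i < n; ++i) tmp[i] = a[i] * b[i];        // Mul
  for (std::size_t i = 0; i < n; ++i) tmp[i] = tmp[i] + c[i];      // Add
  for (std::size_t i = 0; i < n; ++i) tmp[i] = std::tanh(tmp[i]);  // Tanh
  for (std::size_t i = 0; i < n; ++i) out[i] = tmp[i] * d[i];      // Mul
}

// Variant 3: the same chain with the loops fused (and optionally unrolled),
// so every element is loaded and stored only once.
void chain_fused(const float* a, const float* b, const float* c, const float* d,
                 float* out, std::size_t n)
{
  for (std::size_t i = 0; i < n; ++i)
    out[i] = std::tanh(a[i] * b[i] + c[i]) * d[i];
}
```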

The performance results are as follows (100,000 runs, 100 warm-up runs):

  1. ONERT: ~3.05 microseconds (I used a modified nnpackage_run to remove the infrastructure overhead and get better precision)
  2. C++ with ONERT kernels: ~2.15 microseconds
  3. C++ optimized: ~1.7 microseconds

The optimized model has three clusters of fused layers. It is possible to use only two clusters, but there are not enough registers to hold all the needed data, so the three-cluster implementation performed better (attached image: schema_fuse2).

I think these results look promising, though this is just a small part of the network. Bigger layers, like FC, probably cannot offer such an improvement.

Related files: gru_experiment.zip

binarman commented 3 years ago

I tried to generate code using the Tiramisu infrastructure: tiramisu_experiemnts.zip. Due to its technical limitations, I broke the graph into separate loops (attached image: subgraph).

The generated code took ~2.5 microseconds (less than ONERT's 3 microseconds, but significantly more than the handmade version's 1.7 microseconds).

To understand why Tiramisu is slower, I wrote a new version of the manual computation. (I also noticed an error in the previous version and fixed it; it did not affect performance though.) The source code for this version is in manual_tiramisu_edition.cpp in gru_experiment.zip.

It turned out that this version is also faster than the previous one, achieving 1.62 microseconds.

Next I want to:

  1. Compare the Tiramisu-generated code and the manual implementation to understand why the performance differs
  2. Try Halide IR instead of Tiramisu (Tiramisu uses Halide itself, so I want to skip one layer of abstraction)
  3. Experiment with matrix multiplication (if we can speed up this operation, we can probably speed up a lot of networks)

chunseoklee commented 3 years ago

@binarman FYI, here is a survey of NN compilers: arXiv:2002.03794

binarman commented 3 years ago

@chunseoklee Thank you!

I've already tried TVM. Its performance on x86 is the best according to this article, but for some reason it was poor on arm64 (I tried several convolutional networks and arithmetic networks, like in the experiments above). Maybe I missed something and should redo the investigation; I did not use the auto-tuning features.

binarman commented 3 years ago

I've tried to recreate the optimized function using Halide and got a ~1.72 microsecond result. I think the additional time comes from safety checks that Halide performs and I do not, so we can assume Halide's performance is essentially the same as that of the manual code.

Related files: halide_arithmetics.zip
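
For reference, a fused element-wise cluster like this can be expressed in Halide roughly as follows (a sketch only, not the code from halide_arithmetics.zip; the operator chain is the same hypothetical one as above). Element-wise Funcs are inlined into their consumer by default, so Halide emits a single fused loop and the schedule only adds vectorization and unrolling:

```cpp
#include "Halide.h"
using namespace Halide;

int main() {
  // Four 1-D float inputs standing in for the tensors of the fused cluster.
  ImageParam a(Float(32), 1), b(Float(32), 1), c(Float(32), 1), d(Float(32), 1);

  Var i("i");

  // Hypothetical element-wise chain: out = tanh(a * b + c) * d.
  Func mul_add("mul_add"), out("out");
  mul_add(i) = a(i) * b(i) + c(i);
  out(i) = tanh(mul_add(i)) * d(i);

  // mul_add is inlined into out by default, so Halide emits one fused loop;
  // the schedule only adds vectorization and unrolling on top of that.
  out.vectorize(i, 4).unroll(i, 4);

  // JIT-compile for the host; an ahead-of-time compile_to_* call with a
  // Target such as "arm-64-android" produces code for the phone instead.
  out.compile_jit();
  return 0;
}
```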

Next I want to compare the onert and Halide BLAS implementations: https://github.com/halide/Halide/tree/master/apps/linear_algebra/src

I think I'll take the matrix sizes from Inception, MobileNet and LPCNet as a "target".

binarman commented 3 years ago

I tested FC layers of different sizes with a simple Halide schedule:

for output_dim:  <- unroll and vectorize this loop 4 times
  result[output_dim] = bias[output_dim];
  for input_dim:
    result[output_dim] += weights[output_dim][input_dim] * input[input_dim];
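
Expressed with the actual Halide API, that schedule could look roughly like this (a sketch assuming a weights buffer indexed as (input_dim, output_dim); this is not the exact generator used for the numbers below):

```cpp
#include "Halide.h"
using namespace Halide;

// result[o] = bias[o] + sum_i weights[o][i] * input[i], with the weights
// stored as a 2-D buffer indexed (input_dim, output_dim).
Func make_fc(ImageParam weights, ImageParam input, ImageParam bias)
{
  Var o("o");
  RDom k(0, input.dim(0).extent(), "k");

  Func result("result");
  result(o) = bias(o);
  result(o) += weights(k, o) * input(k);

  // Schedule: unroll and vectorize the output loop by 4, as in the
  // pseudocode above, for both the init and the accumulation stage.
  result.vectorize(o, 4);
  result.update().vectorize(o, 4).unroll(o, 4);
  return result;
}

int main()
{
  ImageParam weights(Float(32), 2), input(Float(32), 1), bias(Float(32), 1);
  Func fc = make_fc(weights, input, bias);
  fc.compile_jit();  // or an ahead-of-time compile_to_* call with an ARM target
  return 0;
}
```
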
All timings are in microseconds, one thread:

| weight size | onert | halide |
|-------------|-------|--------|
| 1001x1024   | 254   | 254    |
| 1001x1280   | 325   | 325    |
| 1001x2048   | 530   | 540    |
| 1152x384    | 96    | 67     |
| 1152x512    | 130   | 92     |
| 48x16       | 0.5   | 0.34   |
| 48x512      | 5.4   | 3.67   |
| 4x4000      | 3.8   | 2.43   |

Small FC layers gain a lot of speedup, while large ones do not; I suppose this is because the algorithm runs into memory limits or something similar. For now I want to focus on single-threaded computation, since that is what we probably need for low-end devices.

Next I want to prototype compilation in the ONE compiler and investigate whether this approach can speed up LPCNet and other recurrent networks.

glistening commented 3 years ago

> I tested FC layers of different sizes with a simple Halide schedule:

@binarman Could you please let me know which target device you tested on?

binarman commented 3 years ago

@glistening Sorry, I forgot to mention it. All measurements were done on a Galaxy S20 (Exynos), on the fastest core.

Note: I tweaked onert to run the model 1000 times for every time measurement, because the default scheme works in the millisecond range.
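
The measurement scheme is roughly the following (a simplified standalone sketch, not the actual nnpackage_run/onert patch; run_model() is a stand-in for one inference):

```cpp
#include <chrono>
#include <cmath>
#include <cstdio>

// Stand-in for one inference call; in the real setup this would be one
// onert / nnpackage_run execution of the prepared model.
static void run_model()
{
  volatile float sink = 0.f;
  for (int i = 0; i < 1000; ++i) sink += std::tanh(0.001f * i);
}

// Average time of a single run in microseconds. Warm-up runs are done first,
// and many runs are batched per measurement so the result stays well above
// the timer resolution when one run takes only a few microseconds.
static double measure_us(int warmup, int repeats)
{
  using clock = std::chrono::steady_clock;
  for (int i = 0; i < warmup; ++i) run_model();

  auto start = clock::now();
  for (int i = 0; i < repeats; ++i) run_model();
  auto end = clock::now();

  std::chrono::duration<double, std::micro> total = end - start;
  return total.count() / repeats;
}

int main()
{
  std::printf("%.3f us per run\n", measure_us(/*warmup=*/100, /*repeats=*/1000));
  return 0;
}
```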

chunseoklee commented 3 years ago

> I tested FC layers of different sizes with a simple Halide schedule: [...]

@binarman Q1) Was this experiment tested with an f32 model?

Q2) What is the batch size (of the FC)?

binarman commented 3 years ago

@chunseoklee 1) Yes, these were f32 models. 2) Just 1; I tested operations from the relevant models, like LPCNet.

During the experiments I tried to outperform the existing kernels and found that it is relatively simple if the input data is small or has some uncommon properties (like one dimension being large while the other is small). I did not test quantized operators, but I think the results would be similar.

By the way, I do not remember if I mentioned it somewhere, so here is my current idea (in POC #5836):

wateret commented 3 years ago

@binarman Have you tried building Halide or Tiramisu for Linux (Odroid XU4 Ubuntu)? And could you share the code you used to test the FC layer?

binarman commented 3 years ago

@wateret Halide and Tiramisu (with a little hacking) can emit code for the ARM architecture. I tested it on a Galaxy S20 with Exynos. The actual generator I used: https://github.com/binarman/Halide/blob/arm_experiments/apps/linear_algebra/src/blas_l2_generators.cpp#L271 (as you can see, it is pretty simple; the actual kernel is described by three lines, 291-293).

To build, you should build Halide first (you can use the <repo>/build/makeall.sh script). Then use the <repo>/apps/makeall.sh script to build the generators and the benchmark. The built benchmark app is located in <repo>/apps/build/linear_algebra/benchmarks. Check the contents of the mentioned scripts; you will need to change some paths (for example, for LLVM).
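
For anyone trying to reproduce the cross-build, the general mechanics of ahead-of-time ARM compilation in Halide look roughly like this (a generic sketch with a trivial kernel, not the blas_l2 generator itself; the target string is the stock Halide arm64 Android target):

```cpp
#include "Halide.h"
using namespace Halide;

int main()
{
  // A trivial element-wise pipeline; the real generator defines a BLAS
  // level-2 kernel, but the cross-compilation mechanics are the same.
  ImageParam x(Float(32), 1), y(Float(32), 1);
  Param<float> alpha("alpha");

  Var i("i");
  Func saxpy("saxpy");
  saxpy(i) = alpha * x(i) + y(i);
  saxpy.vectorize(i, 4);

  // Cross-compile on a PC into an object/header pair for arm64 Android,
  // which can then be linked into a benchmark that runs on the phone.
  Target target("arm-64-android");
  saxpy.compile_to_static_library("saxpy_arm64", {alpha, x, y}, "saxpy", target);
  return 0;
}
```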

wateret commented 3 years ago

@binarman Thank you for sharing!

wateret commented 3 years ago

> Tiramisu (with a little hacking)

Does that mean directly using the function `fct->gen_halide_obj(obj_filename, os, arch, bits)` rather than `tiramisu::codegen` in your tiramisu_experiment.zip?

binarman commented 3 years ago

@wateret If I recall correctly, the tiramisu::codegen method does not provide cross-build functionality. I wanted to run the code on an Android phone (and did not have any arm64 board at the time), so I used the available cross-compile method (PC -> ARM).