chunseoklee opened this issue 3 years ago
I'll take a look at tiramisu.
TVM has two ways to auto-tune kernels for a model. There are `conv2d` and `depthwise_conv2d` templates for ARM.

`(int8*int8) -> int32`:
NxKxM | TVM | TVM (Weight transform) | ONE |
---|---|---|---|
1x256x6144 | 840 | 510 | 392 |
1x192x6144 | 460 | 330 | 265 |
1x128x6144 | 350 | 220 | 150 |
8x1536x64 | 230 | 190 | 94 |
4x1000x64 | 80 | 60 | 36 |
@periannath Hi! FYI I started an implementation of a Halide-based compiler in #5836. You can see a basic FC implementation here: https://github.com/Samsung/ONE/blob/5fba53110a0b6aa763fe5d3c4f34d73226910b6f/compiler/luci-codegen/src/KernelBuilder.cpp#L312
Also note that I used the autoscheduling functionality of Halide for now (it shows reasonable performance).
@binarman Thanks! I'll look into it.
I found that Halide has three autoschedulers (Mullapudi2016, Li2018, and Adams2019): https://github.com/halide/Halide/releases/tag/v10.0.0

Which autoscheduler did you use?
> Which autoscheduler did you use?
I tried all of them, but with a different workload. 1) I used an "arithmetic" graph, not FC.
2) I did not try to parallelize computations yet. For now, my main goal is to optimize single-threaded computation. The rationale is the following: I want to speed up subgraphs with lots of small operators (I think this is not the best target for parallelization) and reduce runtime overhead for low-end devices (like microcontrollers, which often do not have multiple cores, so there is nothing to parallelize).
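For context, a minimal sketch of how one of the Halide 10 autoschedulers can be loaded and constrained to a single core might look like the following. The toy pipeline, the plugin file name, and the `MachineParams` values are illustrative assumptions, not the actual luci-codegen code:

```cpp
#include "Halide.h"

using namespace Halide;

int main() {
    // Toy pipeline standing in for a subgraph of small operators.
    Var x("x");
    Func f("f"), g("g");
    f(x) = cast<float>(x) * 2.0f;
    g(x) = f(x) + f(x + 1);

    // The autoschedulers need estimates of the output bounds.
    g.set_estimate(x, 0, 64);

    // Load one of the three autoscheduler plugins shipped with Halide 10
    // (the library file name/path differs per platform and build).
    load_plugin("libautoschedule_adams2019.so");
    Pipeline::set_default_autoscheduler_name("Adams2019");

    // MachineParams(parallelism, last-level cache size, balance):
    // parallelism = 1 models a single-threaded / single-core target.
    MachineParams single_core(1, 16 * 1024 * 1024, 40.0f);

    Pipeline p(g);
    p.auto_schedule(get_host_target(), single_core);  // let the autoscheduler pick a schedule

    Buffer<float> out = p.realize({64});
    return 0;
}
```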
I tried to run the model above on a Galaxy S20 on the fastest core; my results are:
Scheduler | Time in microseconds |
---|---|
None | ~7.80 |
Mullapudi2016 | ~2.01 |
Li2018 | ~500 |
Adams2019 | ~2.16 |
Li2018 is extremely slow because it emits parallelization/synchronization code despite the fact that I configured it with 1 available processor.
@periannath FYI: the compiler I am working on currently has no convenient switch to choose the scheduler (you need to change it in the code and recompile), but I hope to fix that soon.
I also want to note that all computations I have tested until now were in float. I am not sure it will work correctly with the FC layers described in this issue yet.
Could you tell me more about what the `int8*int8 -> int32` FC is? In particular, does it involve quantization of the output?
@periannath @wateret Sorry, another question about this FC =_= I want to test it using my draft.
Could you clarify where to make the casts (int8 -> int32)?
I can imagine two options:
1) `output += (int32)weights * (int32)input`
2) `output += (int32)(weights * input)`
@binarman Thank you for sharing your numbers. I tried to use autoscheduling on an Odroid-XU4, and Adams2019 works better for me.
> Could you tell me more about what the `int8*int8 -> int32` FC is? In particular, does it involve quantization of the output?
The hybrid FullyConnected kernel in TFLite takes `float` input and `int8` weights, and its result is `float`. The kernel quantizes the input to `int8` at inference time and does the `int8*int8` computation. The result is an `int32` tensor, which is then transformed back to `float`.

We're trying to generate a better kernel for the `(int8*int8) -> int32` GEMM in hybrid FC.
You might read @chunseoklee's analysis on hybrid FC (link).
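For readers unfamiliar with hybrid FC, here is a rough scalar sketch of the computation described above in plain C++. Symmetric per-row input quantization is assumed for simplicity; the real TFLite kernel also handles biases, zero points, and other details:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Hybrid FC sketch: float input, int8 weights, float output.
// output[n][m] = sum_k input[n][k] * weight[m][k] (weights are row-major [M][K]).
void hybrid_fc(const std::vector<float>& input, int N, int K,
               const std::vector<int8_t>& weight, int M, float weight_scale,
               std::vector<float>& output) {
    std::vector<int8_t> q_input(N * K);
    std::vector<float> input_scale(N);

    // 1) Quantize each input row to int8 at inference time (symmetric).
    for (int n = 0; n < N; ++n) {
        float max_abs = 0.f;
        for (int k = 0; k < K; ++k)
            max_abs = std::max(max_abs, std::fabs(input[n * K + k]));
        float scale = max_abs > 0.f ? max_abs / 127.f : 1.f;
        input_scale[n] = scale;
        for (int k = 0; k < K; ++k)
            q_input[n * K + k] =
                static_cast<int8_t>(std::lround(input[n * K + k] / scale));
    }

    // 2) (int8 * int8) -> int32 GEMM: this is the part we want to generate
    //    a fast kernel for.
    // 3) Transform the int32 accumulator back to float.
    output.assign(N * M, 0.f);
    for (int n = 0; n < N; ++n) {
        for (int m = 0; m < M; ++m) {
            int32_t acc = 0;
            for (int k = 0; k < K; ++k)
                acc += static_cast<int32_t>(q_input[n * K + k]) *
                       static_cast<int32_t>(weight[m * K + k]);
            output[n * M + m] = acc * input_scale[n] * weight_scale;
        }
    }
}
```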
> Could you clarify where to make the casts (int8 -> int32)? I can imagine two options:
> 1) `output += (int32)weights * (int32)input`
> 2) `output += (int32)(weights * input)`
~~AFAIK, option 2 is right but TVM can't support it. I used option 1 for TVM.~~
Updated: TFLite uses option 1 with the `int16` type. https://github.com/Samsung/ONE/issues/5951#issuecomment-785652647
`(int8*int8) -> int32`:

```
out(x, y) = 0
out(x, y) += in(x, r.x) * weight(y, r.x)
```

`out` should be initialized to the `bias` value, but I've initialized it to `0` for easier testing.

Adams2019 autoschedule:
autoscheduleNxKxM | ONE | Halide |
---|---|---|
1x256x6144 | 392 | 2067 |
1x192x6144 | 265 | 1493 |
1x128x6144 | 150 | 993 |
8x1536x64 | 94 | 1300 |
4x1000x64 | 36 | 421 |
NxKxM | ONE | Halide |
---|---|---|
1x256x6144 | 2619 | 9896 |
1x192x6144 | 2058 | 7160 |
1x128x6144 | 1426 | 4651 |
8x1536x64 | 544 | 6496 |
4x1000x64 | 187 | 2111 |
The `halide_set_num_threads` API is needed to set the number of threads that Halide uses.

@periannath First of all, thank you for your answers!
Second, I want to clarify this item:
> Halide produced better kernel for float GEMM #5440 (comment)
This result was achieved: 1) with manual scheduling, 2) with 1xKxM (N == 1) operations, and 3) on a Galaxy S20 phone (it has significantly newer hardware compared to the RPi3 and Odroid).
So I think I'll retest these operations in the new environment (probably today).
@periannath Sorry for disturbing you again, but I want to clarify two items:
1) I investigated the TFLite code and found only case 1 (`output += (int32)weights * (int32)input`). Could you check it again, please? This should have a performance impact on the generated code.
2) I noticed that you address Halide buffers in your code like it is done in C/C++. AFAIK Halide uses the reverse order of dimensions, so the first dimension is the fastest changing. For example, to represent `int a[N][M];` you should use `Halide::Buffer<int> a(M, N);`.
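To illustrate the point, here is a tiny standalone example of that index swap (the array contents are hypothetical):

```cpp
#include "Halide.h"
#include <cassert>

int main() {
    constexpr int N = 2, M = 3;
    int a_c[N][M] = {{0, 1, 2}, {10, 11, 12}};  // C/C++: last index is fastest

    // In Halide the first dimension is the fastest changing,
    // so a C array int a[N][M] is declared as Buffer<int>(M, N).
    Halide::Buffer<int> a(M, N);
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < M; ++j)
            a(j, i) = a_c[i][j];  // note the swapped index order

    assert(a(2, 1) == a_c[1][2]);  // both refer to the value 12
    return 0;
}
```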
@binarman
> 1) I investigated the TFLite code and found only case 1 (`output += (int32)weights * (int32)input`). Could you check it again, please?
AFAIK, TFLite neon kernel uses case 2 for hybrid FC. Here is the neon kernel code for hybrid FC.
The input and weight are loaded into `int8x16_t` neon registers. It multiplies the low bits into an `int16x8_t` neon register and does the same for the high bits. The result is accumulated into an `int32x4_t` neon register.

In the C++ kernel, TFLite uses code like `int output = (int8_t)weight * (int8_t)input;`. I thought it worked as case 2, but it didn't. Thanks to you, I've noticed that `int8_t` is treated as `int` by implicit type conversion (integer promotion): https://en.cppreference.com/w/c/language/conversion#Integer_promotions
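A small standalone snippet illustrating that promotion rule (not TFLite code): both `int8_t` operands are promoted to `int` before the multiplication, so the product is effectively computed as in case 1:

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    int8_t weight = 100;
    int8_t input = 100;

    // Integer promotion: both int8_t operands become int before '*',
    // so the product is 10000 rather than a wrapped-around int8 value.
    int output = (int8_t)weight * (int8_t)input;

    printf("%d\n", output);  // prints 10000
    return 0;
}
```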
In summary:
- neon kernel: `output += (int32)(weights * input)`
- C++ kernel: `output += (int32)weights * (int32)input`

If neon is supported on the target device, TFLite will use the neon kernel by default.
> 2) I noticed that you address Halide buffers in your code like it is done in C/C++. AFAIK Halide uses the reverse order of dimensions, so the first dimension is the fastest changing. For example, to represent `int a[N][M];` you should use `Halide::Buffer<int> a(M, N);`
Thanks! I'll measure the performance again after I've fixed it.
@periannath I checked the neon kernel you mentioned. It uses case 1 too. Please take a look:
```cpp
// load vector of 16 int8
const int8x16_t s1_8x16 = vld1q_s8((const int8_t*)(aligned_vec + col));
// load vector of 16 int8
const int8x16_t s2_8x16 = vld1q_s8((const int8_t*)(row_ptr + col));
// multiply two int8 vectors and store the result in a vector of int16
int16x8_t prod_16x8 = vmull_s8(vget_low_s8(s1_8x16), vget_low_s8(s2_8x16));
// multiply two int8 vectors and add the result to the previously computed vector of int16
prod_16x8 = vmlal_s8(prod_16x8, vget_high_s8(s1_8x16), vget_high_s8(s2_8x16));
// add pairs of elements in prod_16x8 and accumulate them into a 32-bit integer vector
dotprod_32x4 = vpadalq_s16(dotprod_32x4, prod_16x8);
```
You can test it with extracted code from kernel: arm_long_test.zip
P.S. This is because the kernel uses the "long" version of the instructions. You can see it by the additional "l" in the name: `vmlal_s8` vs `vmla_s8`, `vmull_s8` vs `vmul_s8`.
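For illustration, a minimal snippet showing the difference in result type between the "long" and plain variants (requires an ARM target with NEON enabled):

```cpp
#include <arm_neon.h>

int16x8_t widening_mul(int8x8_t a, int8x8_t b) {
    // "long" variant: int8 x int8 -> int16, the products never wrap around
    return vmull_s8(a, b);
}

int8x8_t wrapping_mul(int8x8_t a, int8x8_t b) {
    // plain variant: int8 x int8 -> int8, the products can wrap around
    return vmul_s8(a, b);
}
```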
@binarman I wasn't aware of the "long" version of the ARM intrinsics. Thanks for letting me know.

I've checked the neon kernel in detail. As you said, the kernel uses case 1, but with the `int16_t` type.
```cpp
// multiply two int8 vectors and store the result in a vector of int16
int16x8_t prod_16x8 = vmull_s8(vget_low_s8(s1_8x16), vget_low_s8(s2_8x16));
```
`int8_t` x `int8_t` => `int16_t`: the `input` and `weight` values are multiplied as the `int8_t` type and the result is stored in `int16_t` format (the product of two `int8_t` values always fits in `int16_t`, since |a*b| <= 128 * 128 = 16384 < 32767).

To mimic this computation, the Halide program should be changed as below.
```
out(y, x) = 0
out(y, x) += i16(in(r.x, x)) * i16(weight(r.x, y));
```
The generated assembly is:

```asm
9f0: f2c8ccaa vmull.s8 q14, d24, d26
9f4: e35c0c01 cmp ip, #256 ; 0x100
9f8: f2c98caa vmull.s8 q12, d25, d26
9fc: f2d001ac vaddw.s16 q8, q8, d28
a00: f2d661ad vaddw.s16 q11, q11, d29
a04: f2d221a9 vaddw.s16 q9, q9, d25
a08: f2d441a8 vaddw.s16 q10, q10, d24
```
- `vmull.s8` is used to multiply two `int8x8_t` neon registers and save its result into an `int16x8_t` register

With `int32_t` casts instead, the Halide program is:

```
out(y, x) = 0
out(y, x) += i32(in(r.x, x)) * i32(weight(r.x, y));
```
The generated assembly is:

```asm
9f0: f2c8aa38 vmovl.s8 q13, d24
9f4: e35c0c01 cmp ip, #256 ; 0x100
9f8: f2c88a39 vmovl.s8 q12, d25
9fc: ee8c1bb0 vdup.16 d28, r1
a00: f2da08ac vmlal.s16 q8, d26, d28
a04: f2db68ac vmlal.s16 q11, d27, d28
a08: f2d848ac vmlal.s16 q10, d24, d28
a0c: f2d928ac vmlal.s16 q9, d25, d28
```
- `vmovl.s8` is used to load the `int8x8_t` values into `int16x8_t` registers
- the `int16x8_t` registers are then multiplied (`vmlal.s16`) and the result is accumulated into an `int32x4_t` register

I hope there are no more errors in this analysis. 😅
@periannath Thank you for your efforts! I think now everything works well =)
Let's try to generate an optimized GEMM kernel for Hybrid FC (https://github.com/Samsung/ONE/issues/5950). Note that Hybrid FC uses an `(int8*int8) -> int32` GEMM kernel.
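As a starting point, a self-contained Halide sketch of such an `(int8*int8) -> int32` GEMM could look like the following. The function and file names, the shape, and the use of bound estimates are illustrative assumptions, not the actual luci-codegen code:

```cpp
#include "Halide.h"

using namespace Halide;

int main() {
    const int N = 4, K = 1000, M = 64;  // one of the shapes benchmarked above

    ImageParam in(Int(8), 2, "in");          // N x K input, addressed as in(k, n)
    ImageParam weight(Int(8), 2, "weight");  // M x K weights, addressed as weight(k, m)

    Var m("m"), n("n");
    RDom r(0, K, "r");

    // (int8 * int8) -> int32: widen both operands before the multiply-accumulate.
    Func out("out");
    out(m, n) = 0;
    out(m, n) += cast<int32_t>(in(r.x, n)) * cast<int32_t>(weight(r.x, m));

    // Bound estimates so an autoscheduler can be applied to this pipeline.
    in.set_estimates({{0, K}, {0, N}});
    weight.set_estimates({{0, K}, {0, M}});
    out.set_estimate(m, 0, M).set_estimate(n, 0, N);

    // Emit an object/header pair that a runtime could link against.
    out.compile_to_static_library("int8_gemm", {in, weight}, "int8_gemm");
    return 0;
}
```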