bluss / matrixmultiply

General matrix multiplication of f32 and f64 matrices in Rust. Supports matrices with general strides.
https://docs.rs/matrixmultiply/
Apache License 2.0

WIP: i32 gemm experiment #28

Open bluss opened 5 years ago

bluss commented 5 years ago

cc @SuperFluffy #24

A place to discuss this

SuperFluffy commented 5 years ago

I was trying the i32 kernels and noticed that, when optimizing for my CPU, which supports AVX2, I see quite significant performance losses compared with no target-specific optimization.

This is on my laptop, though, and the CPU might get downclocked quite badly when running AVX2 instructions. I am going to check the performance on Amazon EC2.

cargo bench i32

test layout_i32_032::ccc ... bench:       6,327 ns/iter (+/- 1,860)
test layout_i32_032::ccf ... bench:       6,409 ns/iter (+/- 346)
test layout_i32_032::cfc ... bench:       6,529 ns/iter (+/- 777)
test layout_i32_032::cff ... bench:       6,530 ns/iter (+/- 553)
test layout_i32_032::fcc ... bench:       6,037 ns/iter (+/- 278)
test layout_i32_032::fcf ... bench:       6,107 ns/iter (+/- 347)
test layout_i32_032::ffc ... bench:       6,331 ns/iter (+/- 387)
test layout_i32_032::fff ... bench:       6,250 ns/iter (+/- 553)
test mat_mul_i32::m004   ... bench:         123 ns/iter (+/- 10)
test mat_mul_i32::m006   ... bench:         218 ns/iter (+/- 28)
test mat_mul_i32::m008   ... bench:         235 ns/iter (+/- 45)
test mat_mul_i32::m012   ... bench:         672 ns/iter (+/- 126)
test mat_mul_i32::m016   ... bench:       1,064 ns/iter (+/- 125)
test mat_mul_i32::m032   ... bench:       6,461 ns/iter (+/- 397)
test mat_mul_i32::m064   ... bench:      44,446 ns/iter (+/- 4,438)
test mat_mul_i32::m127   ... bench:     323,990 ns/iter (+/- 59,059)
RUSTFLAGS="-C target-cpu=native" cargo bench i32

test layout_i32_032::ccc ... bench:       9,602 ns/iter (+/- 2,058)
test layout_i32_032::ccf ... bench:       9,091 ns/iter (+/- 1,389)
test layout_i32_032::cfc ... bench:       9,017 ns/iter (+/- 1,223)
test layout_i32_032::cff ... bench:       9,745 ns/iter (+/- 1,868)
test layout_i32_032::fcc ... bench:       8,460 ns/iter (+/- 836)
test layout_i32_032::fcf ... bench:       8,449 ns/iter (+/- 285)
test layout_i32_032::ffc ... bench:       8,714 ns/iter (+/- 501)
test layout_i32_032::fff ... bench:       9,798 ns/iter (+/- 1,896)
test mat_mul_i32::m004   ... bench:         111 ns/iter (+/- 12)
test mat_mul_i32::m006   ... bench:         213 ns/iter (+/- 23)
test mat_mul_i32::m008   ... bench:         194 ns/iter (+/- 8)
test mat_mul_i32::m012   ... bench:         500 ns/iter (+/- 38)
test mat_mul_i32::m016   ... bench:         735 ns/iter (+/- 91)
test mat_mul_i32::m032   ... bench:       8,724 ns/iter (+/- 449)
test mat_mul_i32::m064   ... bench:      56,474 ns/iter (+/- 2,740)
test mat_mul_i32::m127   ... bench:     344,463 ns/iter (+/- 7,465)
bluss commented 5 years ago

I can't reproduce the slowdown, likely because I don't have AVX2 on this machine, only AVX.

bluss commented 5 years ago

What needs to be answered is: does it make sense to provide this with wrapping semantics? Saturating semantics is not an alternative unless we find a SIMD implementation of it. The wrapping implementation is at least reasonably fast and already autovectorizes.
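
For concreteness, here is a minimal scalar sketch (illustrative names, not the crate's actual kernel) of the kind of wrapping i32 loop that LLVM can autovectorize:

/// Illustrative wrapping i32 multiply-accumulate; a is m x k, b is k x n,
/// c is m x n, all row-major. The inner loop over j is autovectorizable.
fn gemm_i32_wrapping(m: usize, k: usize, n: usize, a: &[i32], b: &[i32], c: &mut [i32]) {
    for i in 0..m {
        for p in 0..k {
            let aip = a[i * k + p];
            for j in 0..n {
                c[i * n + j] = c[i * n + j].wrapping_add(aip.wrapping_mul(b[p * n + j]));
            }
        }
    }
}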

SuperFluffy commented 5 years ago

Assuming you are fine with avx2, we have the following intrinsics available:

_mm256_add_epi8
_mm256_add_epi16
_mm256_add_epi32
_mm256_add_epi64
_mm256_adds_epi8
_mm256_adds_epi16
_mm256_adds_epu8
_mm256_adds_epu16

add is wrapping, adds is saturating; the epi variants operate on signed integers, epu on unsigned. I personally would prefer saturating semantics, because to my mind they are “less wrong”.
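
To make the wrapping/saturating distinction concrete, a small Rust sketch using core::arch (x86_64 with AVX2 assumed; the helper names are mine, not crate code):

use core::arch::x86_64::*;

/// Wrapping add of eight packed i32 lanes.
#[target_feature(enable = "avx2")]
unsafe fn add_i32x8_wrapping(a: __m256i, b: __m256i) -> __m256i {
    _mm256_add_epi32(a, b)
}

/// Saturating add of sixteen packed i16 lanes; note that AVX2 has no
/// saturating add for 32-bit lanes.
#[target_feature(enable = "avx2")]
unsafe fn add_i16x16_saturating(a: __m256i, b: __m256i) -> __m256i {
    _mm256_adds_epi16(a, b)
}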

For multiplication we have:

_mm256_mul_epi32
_mm256_mul_epu32
_mm256_mulhi_epi16
_mm256_mulhi_epu16
_mm256_mulhrs_epi16
_mm256_mullo_epi16
_mm256_mullo_epi32

In my mind, _mm256_mullo_epi16 and _mm256_mullo_epi32 are the most interesting, because the product of two i8 values fits into an i16, and the product of two i16 values fits into an i32. So if we restrict ourselves to i8 and i16 inputs, we are fine on the multiplication front. It's also the conservative thing to do, because that's what Intel and Google seem to be doing.

If we restrict ourselves to i8 inputs and widen them to i32, we can apply _mm256_mullo_epi32, which uses i64 intermediates and returns the lower half of each product as a truncated i32.

Since we know that the products only occupy the lower half of the i32 range, we can also handle fairly large matrices before the accumulators overflow.
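
A rough sketch of that widened path with core::arch intrinsics (x86_64 with AVX2 assumed; the function name is mine, not crate code):

use core::arch::x86_64::*;

/// Sign-extend eight i8 values from each pointer to i32 lanes, multiply,
/// and accumulate into `acc`. The products fit in 16 bits, so the i32
/// accumulators have plenty of headroom.
#[target_feature(enable = "avx2")]
unsafe fn mac_i8_as_i32(acc: __m256i, a: *const i8, b: *const i8) -> __m256i {
    // _mm_loadl_epi64 loads 8 bytes; _mm256_cvtepi8_epi32 sign-extends
    // them to eight i32 lanes.
    let va = _mm256_cvtepi8_epi32(_mm_loadl_epi64(a as *const __m128i));
    let vb = _mm256_cvtepi8_epi32(_mm_loadl_epi64(b as *const __m128i));
    _mm256_add_epi32(acc, _mm256_mullo_epi32(va, vb))
}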

So I think we should:

For my personal use case, I am only interested in multiplying integer matrices consisting of +/- 1, so I'd opt for performance and choose i8 * i8 -> i16.

Similar add and mul instructions are also available for the 128-bit _mm variants, using a potpourri of SSE, SSE2, SSSE3, and SSE4.1.

Given that we then know we won't have overflow during the multiplication phase, I'd prefer saturating semantics, because they feel less “wrong” to me.

mratsim commented 5 years ago

Hey!

I'm currently benchmarking Facebook's PyTorch Glow matrix multiplication (https://github.com/pytorch/glow/issues/1749#issuecomment-449846783) and I noticed that they have an i8 GEMM implementation here.

Regarding the slowdown: for integer GEMM I think it's normal. I still have to benchmark MKL's integer GEMM, but I don't expect much.

This is my single-threaded performance in Laser

Laser production implementation
Collected 10 samples in 9.853 seconds
Average time: 985.001 ms
Stddev  time: 4.657 ms
Min     time: 979.731 ms
Max     time: 997.174 ms
Perf:         14.371 GINTOP/s

My kernel config is the following:

template int32x8_muladd_unfused_avx2(a, b, c: m256i): m256i =
  mm256_add_epi32(mm256_mullo_epi32(a, b), c)

ukernel_generator(
      x86_AVX2,
      typ = int32,
      vectype = m256i,
      nb_scalars = 8,
      simd_setZero = mm256_setzero_si256,
      simd_broadcast_value = mm256_set1_epi32,
      simd_load_aligned = mm256_load_si256,
      simd_load_unaligned = mm256_loadu_si256,
      simd_store_unaligned = mm256_storeu_si256,
      simd_mul = mm256_mullo_epi32,
      simd_add = mm256_add_epi32,
      simd_fma = int32x8_muladd_unfused_avx2
    )

This is a generic kernel config; I just need to swap the types and the SIMD intrinsics called.
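
For readers following along in Rust, the unfused int32 multiply-add above maps to core::arch roughly like this (a sketch, AVX2 assumed):

use core::arch::x86_64::*;

/// Unfused multiply-add on eight i32 lanes, mirroring
/// int32x8_muladd_unfused_avx2 above.
#[target_feature(enable = "avx2")]
unsafe fn i32x8_muladd_unfused_avx2(a: __m256i, b: __m256i, c: __m256i) -> __m256i {
    _mm256_add_epi32(_mm256_mullo_epi32(a, b), c)
}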

It allows me to reach 98%~101% of OpenBLAS on float32 with this config:

ukernel_generator(
      x86_AVX_FMA,
      typ = float32,
      vectype = m256,
      nb_scalars = 8,
      simd_setZero = mm256_setzero_ps,
      simd_broadcast_value = mm256_set1_ps,
      simd_load_aligned = mm256_load_ps,
      simd_load_unaligned = mm256_loadu_ps,
      simd_store_unaligned = mm256_storeu_ps,
      simd_mul = mm256_mul_ps,
      simd_add = mm256_add_ps,
      simd_fma = mm256_fmadd_ps
    )

ukernel_generator(
      x86_AVX_FMA,
      typ = float64,
      vectype = m256d,
      nb_scalars = 4,
      simd_setZero = mm256_setzero_pd,
      simd_broadcast_value = mm256_set1_pd,
      simd_load_aligned = mm256_load_pd,
      simd_load_unaligned = mm256_loadu_pd,
      simd_store_unaligned = mm256_storeu_pd,
      simd_mul = mm256_mul_pd,
      simd_add = mm256_add_pd,
      simd_fma = mm256_fmadd_pd,
    )

So I think integer SIMD is just slow (latency bottleneck?).

For reference, due to the lack of mm256_mullo_epi64 in AVX2, I use a scalar fallback for int64 and can surprisingly achieve 10 GINTOP/s on my machine.

type Int64x2 = array[2, int64]

func setZero_int64_sse2_fallback(): Int64x2 {.inline.} =
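  # Nim zero-initializes `result`, so this returns [0, 0]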
  discard

template set1_int64_sse2_fallback(a: int64): Int64x2 =
  [a, a]

func load_int64_sse2_fallback(mem_addr: ptr int64): Int64x2 {.inline.}=
  let p = cast[ptr UncheckedArray[int64]](mem_addr)
  [p[0], p[1]]

func store_int64_sse2_fallback(mem_addr: ptr int64, a: Int64x2) {.inline.}=
  let p = cast[ptr UncheckedArray[int64]](mem_addr)
  p[0] = a[0]
  p[1] = a[1]

template add_int64_sse2_fallback(a, b: Int64x2): Int64x2 =
  [a[0] + b[0], a[1] + b[1]]

template mul_int64_sse2_fallback(a, b: Int64x2): Int64x2 =
  [a[0] * b[0], a[1] * b[1]]

template fma_int64_sse2_fallback(a, b, c: Int64x2): Int64x2 =
  [c[0] + a[0]*b[0], c[1] + a[1]*b[1]]

ukernel_generator(
      x86_SSE2,
      typ = int64,
      vectype = Int64x2,
      nb_scalars = 2,
      simd_setZero = setZero_int64_sse2_fallback,
      simd_broadcast_value = set1_int64_sse2_fallback,
      simd_load_aligned = load_int64_sse2_fallback,
      simd_load_unaligned = load_int64_sse2_fallback,
      simd_store_unaligned = store_int64_sse2_fallback,
      simd_mul = mul_int64_sse2_fallback,
      simd_add = add_int64_sse2_fallback,
      simd_fma = fma_int64_sse2_fallback
    )
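
The same two-lane int64 fallback idea, rendered as a Rust sketch (type and function names are mine; wrapping arithmetic assumed):

/// Two i64 lanes emulated in plain scalar code.
type I64x2 = [i64; 2];

fn mul_i64x2(a: I64x2, b: I64x2) -> I64x2 {
    [a[0].wrapping_mul(b[0]), a[1].wrapping_mul(b[1])]
}

fn add_i64x2(a: I64x2, b: I64x2) -> I64x2 {
    [a[0].wrapping_add(b[0]), a[1].wrapping_add(b[1])]
}

fn fma_i64x2(a: I64x2, b: I64x2, c: I64x2) -> I64x2 {
    add_i64x2(mul_i64x2(a, b), c)
}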

Edit: Facebook FBGEMM also has int8 quantized matrix multiplication, but I think it's also for a sparse case: https://github.com/pytorch/FBGEMM

SuperFluffy commented 5 years ago

@mratsim, have you stumbled upon Google's gemmlowp?

From what I understood, the main ingredient of their AVX2-based integer GEMM is a more sophisticated packing routine. Instead of calculating just the outer product between a column of the LHS matrix and a row of the RHS matrix, they perform an outer product plus a proper small matrix multiplication at the same time.

The reason for this is that AVX2 only gives you 16 packed vector registers, while the outer product between two registers of sign-extended 16-bit integers (sign-extended from 8 to 16 bit) requires 16×16/8 registers (or even 16×32/8 when accumulating in 32 bits). You thus run into the issue of permanent register spilling to get the data in and out.

Here is a schematic to make it clearer:

    RHS 0 1 2 3
LHS     4 5 6 7
0 8
1 9
2 a
3 b
4 c
5 d
6 e
7 f

0 8
1 9
2 a
3 b
4 c
5 d
6 e
7 f

0 8
1 9
2 a
3 b
4 c
5 d
6 e
7 f

They pack their memory in such a way that on the LHS you have 3 panels of 8×2 matrices adjacent. The RHS is packed such that you have panels of 2×4 matrices adjacent. You now perform the outer-product + matrix-multiplication between each of the 3 LHS panels and the RHS panel. This allows us to accumulate the results in 3 × 4 = 12 registers in total, which leaves 4 registers to load the panels from the LHS and RHS.
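
A rough scalar model (in Rust, with made-up names and i32 accumulators; not gemmlowp's actual code) of one k-step of that blocking, with three packed 8×2 LHS panels, one packed 2×4 RHS panel, and 3 × 4 = 12 accumulator "registers":

const MR: usize = 8;     // rows per LHS panel
const KR: usize = 2;     // depth handled per packed step
const NR: usize = 4;     // columns of the RHS panel
const PANELS: usize = 3; // LHS panels processed together

/// One k-step: each of the 12 accumulators models one 8-lane register
/// holding a column of an 8x4 result block.
fn kernel_step(
    lhs: &[[i32; MR * KR]; PANELS],  // each panel packed column-major (8x2)
    rhs: &[i32; KR * NR],            // packed row-major (2x4)
    acc: &mut [[i32; MR]; PANELS * NR],
) {
    for p in 0..PANELS {
        for j in 0..NR {
            for k in 0..KR {
                let b = rhs[k * NR + j];
                for i in 0..MR {
                    acc[p * NR + j][i] = acc[p * NR + j][i]
                        .wrapping_add(lhs[p][k * MR + i].wrapping_mul(b));
                }
            }
        }
    }
}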

ndattani commented 5 years ago

Hello! What is the status of this experiment? Are we now able to multiply two 32-bit integer matrices together? How does the speed compare to cblas_gemm_s16s16s32, which can only take 16-bit input matrices and outputs 32-bit matrices?

SuperFluffy commented 5 years ago

Hi @ndattani. This effort stalled because I realized that I didn't actually need integer matrices for my implementation of Al-Mohy and Higham's expm.

Also, reading some other integer GEMM implementations out there, I realized that it's quite a bit more involved than the floating-point version, because you need more complicated integer packing for the kernels to operate on.

So I am sorry to disappoint you on this. I currently don't have any capacity to implement it. :(

mratsim commented 5 years ago

int32 x int32 -> int32, i.e. without mixed-precision accumulation, is just a matter of replacing the float SIMD with integer SIMD, and is much easier than int16 x int16 -> int32 (or f16 x f16 -> f32 for that matter).

The main issue is having a fallback, as int32 SIMD multiplication requires SSE4.1, and int64 SIMD multiplication requires AVX-512 even when only using the SSE-width part of the code.
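
As a sketch of what such a fallback could look like on the Rust side (illustrative names; x86_64 assumed, with runtime detection of SSE4.1 for _mm_mullo_epi32):

use core::arch::x86_64::*;

/// Multiply four i32 lanes, using the SSE4.1 intrinsic when available and
/// wrapping scalar code otherwise.
fn mul_i32x4(a: [i32; 4], b: [i32; 4]) -> [i32; 4] {
    if is_x86_feature_detected!("sse4.1") {
        return unsafe { mul_i32x4_sse41(a, b) };
    }
    [
        a[0].wrapping_mul(b[0]),
        a[1].wrapping_mul(b[1]),
        a[2].wrapping_mul(b[2]),
        a[3].wrapping_mul(b[3]),
    ]
}

#[target_feature(enable = "sse4.1")]
unsafe fn mul_i32x4_sse41(a: [i32; 4], b: [i32; 4]) -> [i32; 4] {
    let va = _mm_loadu_si128(a.as_ptr() as *const __m128i);
    let vb = _mm_loadu_si128(b.as_ptr() as *const __m128i);
    let mut out = [0i32; 4];
    _mm_storeu_si128(out.as_mut_ptr() as *mut __m128i, _mm_mullo_epi32(va, vb));
    out
}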