OpenMathLib / OpenBLAS

OpenBLAS is an optimized BLAS library based on GotoBLAS2 1.13 BSD version.
http://www.openblas.net
BSD 3-Clause "New" or "Revised" License
6.26k stars 1.48k forks source link

Modifying GEMM for strided GEMM computation #2289

Open nitish2112 opened 4 years ago

nitish2112 commented 4 years ago

Hi,

We want to implement strided GEMM implementation where the dot product for each output element is computed using only even indexed elements from rows/columns of matrix A/B. Seems like the routine for GEMM is written in assembly? Is there a way to implement this without hacking the assembly? If assembly is the only way can you please help us with this?

Thanks, Nitish

martin-frbg commented 4 years ago

The implementation depends on the cpu (as specified in the respective KERNEL.<target> files under kernel/<architecture>) - most are written in assembly, but there are also "generic" C implementations (and C with AVX512 intrinsics in the case of the SKYLAKEX target) that you could reuse. Could be that Goto's algorithm for splitting the matrix into cache-sized chunks will mess with performance when you only need to access every n'th other element though.

nitish2112 commented 4 years ago

Thanks! Can you please point us to the C + AVX512 implementation/files?

PS: There are so many macros that I am lost :D

Thanks, Nitish

brada4 commented 4 years ago

It will not be single-value stride, you need too take at least cache line, 32 or 64 bytes with same alignment from memory at once into registers, or you get painful memory bus saturation by using 4-16 times of that bus for small crumbles. If that can be done in C in some generic way, it is likely to get close to assembly kernels and be quite good for brand new architecture targets when processed with new compilers...

wjc404 commented 4 years ago

Is it like this: ....for i = 0 to n-1 ........for j = 0 to m-1 ............for l = 0 to k-1 ................c[i][j] += a[l][j] b[i][l] (double)(l%2); For extracting elements with even k or n indexes, the solution is simple, if the matrix C is column-major.

brada4 commented 4 years ago

It gets quite some macros if you (make compiler) read 4..16 elements i.p.o single value per read....

martin-frbg commented 4 years ago

Thanks! Can you please point us to the C + AVX512 implementation/files?

Files are in kernel/x86_64, sgemm_kernel_16x4_skylakex.c and dgemm_kernel_4x8_skylakex_2.c for single and double precision real - for complex values the assembly files for Haswell, cgemm_kernel_8x2_haswell.S and zgemm_kernel_4x2_haswell.S are used. (As mentioned the KERNEL files there list the names of implementation files used for a particular cpu target).

There is also a plain-C CGEMM/ZGEMM kernel available as kernel/generic/zgemmkernel_2x2.c . If you need to switch to those "generic" kernels or write your own, the basic rules are the M x N in the kernel name (and coding) must match the xGEMM_UNROLL_M and N specified for that cpu in the toplevel param.h, with corresponding xINCOPY/xONCOPY helper functions to handle the chunks. (And one major gotcha is that the TRMM kernel implicitly uses the same copy routines as specified for GEMM, so must use the same MxN )

If you want this to be an addition to the code base rather than a quick hack to modify the supplied GEMM functions for your purpose, it gets a little more complicated as you would need to provide an interface function (in interface directory) that forwards through "the appropriate functions" in driver/level3 that perform the chunking and reassembling to/from cache-sized blocks and do the actual math kernel calls.

brada4 commented 4 years ago

worth mentioning that more measurements and less code change for good improvement can be done at those work dividers Martin mentioned in last paragraph. Certainly depends on your academic goals. Current algorithm is 10 years old n-row strides plus the code flow is assembly that avoids CPU ALU and cache controller stalls. Very accurate adjustment can gain 10% in performance on average, or by magnitude in corner cases.

nitish2112 commented 4 years ago

Files are in kernel/x86_64, sgemm_kernel_16x4_skylakex.c and dgemm_kernel_4x8_skylakex_2.c for single and double precision real - for complex values the assembly files for Haswell, cgemm_kernel_8x2_haswell.S and zgemm_kernel_4x2_haswell.S are used. (As mentioned the KERNEL files there list the names of implementation files used for a particular cpu target).

@martin-frbg We are only interested in sgemm and dgemm (not cgemm and zgemm). When we do make in OpenBLAS seems like sgemm_kernel_16x4_skylakex.c and dgemm_kernel_16x4_skylakex.c are not being compiled (we checked this by dumping the entire build trace). Also running the gemm code in benchmark/gemm.c we did not see a call to any function in sgemm_kernel_16x4_skylakex.c. Are we missing anything here?

Also, seems like sgemm_kernel_16x4_skylakex.c is for 16x4 matrix-multiply. Can you please also point us to the code that chunks the matrices into 16x4 tiles and calls the functions in sgemm_kernel_16x4_skylakex.c?

Thanks a lot for the help! -Nitish

brada4 commented 4 years ago

user calls into interface/ that calls into driver/ which then call or inline into kernel/ For example one shortcoming is that interface/ has no idea about kernel/ strides, so may split work suboptimally -say one input dimension is 2,5 strides, that gets split based on 4 CPUs into 0.6 of stride, and skip assembly altogether.

wjc404 commented 4 years ago

@nitish2112 I would like to help you, but I need a simple expression in C codes of the gemm function you want to implement (function parameters, body and return values).

martin-frbg commented 4 years ago

@nitish2112 most likely your compiler is too old to support AVX512, or you are compiling on a cpu that does not support AVX512. One clue is in the name of the library, does it have "skylakex" in it or just "haswell" ? (The other is that you would see -DNO_AVX512 being passed to individual compile steps in the makelog). It would certainly help to know more about what exactly you are trying to do, and if you are familiar with Kazushige Goto's paper explaining the GEMM partitioning algorithm (linked from the admittedly very sketchy "Developer manual" in the Wiki)