akielaries / openGPMP

Hardware Accelerated General Purpose Mathematics Package
https://akielaries.github.io/openGPMP/
MIT License

Intrinsic support #109

Open akielaries opened 9 months ago

akielaries commented 9 months ago

So far intrinsics are only used in mtx.cpp and vector.cpp. In the latter, look at the pieces of duplicated code and consider factoring them into functions. Notice the loops are blocked by a specific stride that accounts for register width and data type on each supported ISA; preprocessor defines or typedefs could probably be created for all of these "magic numbers", even though they are mostly intuitive. For example:

```cpp
#ifdef __AVX2__

// instruction set specific int (256-bit integer register)
typedef __m256i iss_int;

// instruction set specific iteration size (elements per register)

// signed 8-bit int
#define ISS_I8_ITER 32

// signed 16-bit int
#define ISS_I16_ITER 16

#elif defined(__AVX__)

// AVX without AVX2: integer ops are limited to 128-bit registers
typedef __m128i iss_int;

#endif
```

etc?

Overall there's a lot of conditional compilation in these two files, so clean it up as much as possible and reduce the duplication.

akielaries commented 9 months ago

This has been somewhat fixed where files exist for specific types and intrinsic ISAs.

Next, look into why the functions we have are so embarrassingly slow. Comparing our intrinsics-based functions against naive implementations with 3 nested loops sometimes shows no performance increase, and in some cases the naive function performs better. Beyond just blocking and stuffing registers with values, there have to be better ways to optimize this code.
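One likely reason a naive triple loop keeps up is memory access order rather than raw arithmetic. A minimal sketch (names hypothetical, not openGPMP's actual routines): reordering the classic ijk loop nest to ikj makes the innermost loop unit-stride over both B and C, and tiling the loops keeps the working set in cache. This often beats a poorly-blocked intrinsics kernel even without any SIMD:

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical sketch: cache-blocked, ikj-ordered GEMM computing C += A * B
// for row-major n x n matrices. Compared to the naive ijk order, the ikj
// order streams B and C row-wise, so the innermost loop is unit-stride for
// both and auto-vectorizes well; the BLK tiling keeps reused tiles in cache.
void gemm_blocked(const double *A, const double *B, double *C, std::size_t n) {
    const std::size_t BLK = 64; // tile edge; tune so three tiles fit in L1/L2
    for (std::size_t ii = 0; ii < n; ii += BLK)
        for (std::size_t kk = 0; kk < n; kk += BLK)
            for (std::size_t jj = 0; jj < n; jj += BLK)
                for (std::size_t i = ii; i < std::min(ii + BLK, n); ++i)
                    for (std::size_t k = kk; k < std::min(kk + BLK, n); ++k) {
                        const double a = A[i * n + k]; // scalar reused across j
                        for (std::size_t j = jj; j < std::min(jj + BLK, n); ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```

Profiling cache misses (e.g. with `perf stat`) on the current kernels would confirm whether this is the bottleneck.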

akielaries commented 9 months ago

The reason for this could be a few things. Cache alignment has only been checked on some functions, but it, and memory access patterns in general, must be a contributor. Here is the new plan for matrix/vector operations:

BY DEFAULT: routines are BLAS-inspired and use BLAS naming conventions (e.g. DGEMM = Double-precision GEneral Matrix-Matrix product). These will most likely be big enough to warrant their own files, where we will have some naming conventions of our own. We want to make sure there is support for arrays and vectors to start.
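To make the convention concrete, here is a hedged sketch of what a BLAS-style routine looks like; the signature and the naive reference body are illustrative only, not openGPMP's actual API. The leading letter encodes precision (S = single/float, D = double), and GEMM computes a general matrix-matrix product with scaling factors:

```cpp
#include <cstddef>

// Illustrative DGEMM-shaped routine: C = alpha * A * B + beta * C for
// row-major n x n double-precision matrices. A real implementation would
// also take leading dimensions and transpose flags as classic BLAS does.
void dgemm(double alpha, const double *A, const double *B,
           double beta, double *C, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j) {
            double acc = 0.0;
            for (std::size_t k = 0; k < n; ++k)
                acc += A[i * n + k] * B[k * n + j];
            C[i * n + j] = alpha * acc + beta * C[i * n + j];
        }
}
```

An SGEMM variant would have the same shape with `float` throughout.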

akielaries commented 9 months ago

There are double, float, and int implementations of the GEMM routines under the linalg/ module. Much of the code is reused, while some actually differs depending on the type. Look into eliminating this code duplication.
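One way to collapse the duplication, sketched here with hypothetical names: keep a single templated kernel for the parts of the loop nest that are type-independent, and expose thin BLAS-named wrappers per precision. Type-specific intrinsic paths can still specialize the template where they genuinely differ:

```cpp
#include <cstddef>

// One shared kernel: C += A * B for row-major n x n matrices of any
// arithmetic type. The loop nest lives in exactly one place.
template <typename T>
void gemm_kernel(const T *A, const T *B, T *C, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j) {
            T acc{}; // zero-initialized accumulator
            for (std::size_t k = 0; k < n; ++k)
                acc += A[i * n + k] * B[k * n + j];
            C[i * n + j] += acc;
        }
}

// Thin wrappers keep the BLAS-style names (illustrative, not the real API).
inline void dgemm(const double *A, const double *B, double *C, std::size_t n) {
    gemm_kernel(A, B, C, n);
}
inline void sgemm(const float *A, const float *B, float *C, std::size_t n) {
    gemm_kernel(A, B, C, n);
}
inline void igemm(const int *A, const int *B, int *C, std::size_t n) {
    gemm_kernel(A, B, C, n);
}
```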

akielaries commented 9 months ago

The SGEMM implementation for single precision (float) mismatches the naive implementation by quite a bit, causing the test cases to fail because results fall outside a 0.01 threshold.
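Part of this may be the test itself rather than the kernel: SIMD kernels accumulate partial sums in a different order than the serial naive loop (and the compiler may contract multiply-add pairs into FMAs), so single-precision results legitimately drift, and the drift grows with the magnitude of the entries. A fixed absolute threshold like 0.01 then fails on large values even when the kernel is correct. A sketch of a mixed absolute/relative comparison (function name and tolerances are assumptions, to be tuned):

```cpp
#include <algorithm>
#include <cmath>

// Compare two floats with a tolerance that scales with their magnitude:
// abs_tol guards comparisons near zero, rel_tol handles large values where
// a fixed absolute threshold would spuriously fail.
bool nearly_equal(float a, float b,
                  float rel_tol = 1e-4f, float abs_tol = 1e-6f) {
    const float scale = std::max(std::fabs(a), std::fabs(b));
    return std::fabs(a - b) <= std::max(abs_tol, rel_tol * scale);
}
```

If SGEMM still fails a relative check, the mismatch is a real bug (e.g. a wrong lane shuffle or a dropped remainder loop) rather than accumulated rounding.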