akielaries opened 9 months ago
This has been partially addressed: separate files now exist for specific types and intrinsic ISAs.
Next, look into why the functions we have are so embarrassingly slow. Comparing our intrinsics-based functions against naive implementations with three nested loops sometimes shows no performance increase, and in some cases the naive function performs better. Beyond just blocking and stuffing registers with values, there have to be better ways to optimize this code.
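For reference, a minimal sketch of the two approaches being compared (this is not the project's actual mtx.cpp code, and the block size of 64 is an assumption; the right value depends on cache size and element type):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

constexpr std::size_t BLOCK = 64; // assumed tile size, tune per cache/ISA

// Naive triple loop: walks B column-wise, so every k step of the inner
// loop touches a new cache line of B.
void gemm_naive(const std::vector<double>& A, const std::vector<double>& B,
                std::vector<double>& C, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (std::size_t k = 0; k < n; ++k)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
}

// Cache-blocked i-k-j order: streams rows of B so the innermost j loop
// is contiguous and vectorizable; C must be zero-initialized.
void gemm_blocked(const std::vector<double>& A, const std::vector<double>& B,
                  std::vector<double>& C, std::size_t n) {
    for (std::size_t ii = 0; ii < n; ii += BLOCK)
        for (std::size_t kk = 0; kk < n; kk += BLOCK)
            for (std::size_t jj = 0; jj < n; jj += BLOCK)
                for (std::size_t i = ii; i < std::min(ii + BLOCK, n); ++i)
                    for (std::size_t k = kk; k < std::min(kk + BLOCK, n); ++k) {
                        double a = A[i * n + k];
                        for (std::size_t j = jj; j < std::min(jj + BLOCK, n); ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```

If the blocked version does not beat this naive loop for large n, the intrinsics paths almost certainly have a memory-access problem rather than a compute problem.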
The reason could be a few things. Cache alignment has only been monitored on some functions, but it must be a contributor, as must memory access patterns in general. Here is the new home of the matrix/vector operations:
BY DEFAULT: routines are BLAS-inspired and use BLAS naming conventions (e.g. DGEMM = Double-precision GEneral Matrix-Matrix product). These will most likely be big enough for their own files, where we will add some naming conventions of our own. We want to make sure there is support for both arrays and vectors to start.
There are double, float, and int implementations of the GEMM routines under the linalg/
module. Much of the code is reused, while some genuinely differs by type. Look into this for eliminating code duplication.
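One possible direction for the duplication, sketched below: collapse the per-type GEMM copies into a single template. The name `gemm` and the flat row-major layout are assumptions for illustration, not the project's actual interface.

```cpp
#include <cstddef>
#include <vector>

// One generic kernel instead of three near-identical double/float/int
// copies; C is written, A and B are M x K and K x N, row-major.
template <typename T>
void gemm(const std::vector<T>& A, const std::vector<T>& B,
          std::vector<T>& C, std::size_t M, std::size_t N, std::size_t K) {
    for (std::size_t i = 0; i < M; ++i)
        for (std::size_t j = 0; j < N; ++j) {
            T sum{}; // zero for any arithmetic type
            for (std::size_t k = 0; k < K; ++k)
                sum += A[i * K + k] * B[k * N + j];
            C[i * N + j] = sum;
        }
}

// The genuinely type-specific parts (e.g. an AVX float path) can still
// live in explicit specializations without duplicating the generic code:
//   template <> void gemm<float>(...);
```

This keeps one source of truth for the loop structure while leaving room for per-type intrinsics paths.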
The SGEMM (single-precision, float) implementation mismatches the naive implementation by quite a bit, causing the test cases to fail because the results fall outside a 0.01 threshold.
So far intrinsics are only seen in
mtx.cpp
and vector.cpp
. In the latter, look at the pieces of duplicated code and possibly factor them into functions. Notice that loops are blocked by a specific number that takes register width and data type into account for each supported ISA; some preprocessor macros like defines, or even typedefs, could probably be created for all of these "magic numbers", though they are mostly intuitive. Overall there is a lot of conditional compilation in the two files, so make it as clean as possible with less duplication.
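A sketch of what naming those magic numbers could look like. The macro names, the ISA detection, and the idea of deriving lane counts from register width are assumptions about how mtx.cpp / vector.cpp could be cleaned up, not code that exists in the repository.

```cpp
#include <cstddef>

// Pick the SIMD register width once, from compiler-defined ISA macros,
// instead of repeating literal block sizes at every loop.
#if defined(__AVX512F__)
  #define SIMD_REG_BITS 512
#elif defined(__AVX2__) || defined(__AVX__)
  #define SIMD_REG_BITS 256
#elif defined(__SSE2__) || defined(__ARM_NEON)
  #define SIMD_REG_BITS 128
#else
  #define SIMD_REG_BITS 64 // scalar fallback
#endif

// Lanes per register for each element type, derived rather than
// hard-coded: with AVX2 this gives 8 for float and 4 for double.
template <typename T>
constexpr std::size_t simd_lanes = SIMD_REG_BITS / (8 * sizeof(T));

// A loop previously blocked by a literal such as 8 then becomes:
//   for (std::size_t i = 0; i + simd_lanes<float> <= n; i += simd_lanes<float>)
//       ...
```

Centralizing the widths this way also shrinks the per-ISA conditional compilation: only the `#if` ladder above varies, while the loop bodies stay shared.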