Test for arm64 & x86 runtime performance

adeeconometrics commented 5 months ago

Should see whether the refactored Matrix::operator= improves performance for larger matrices.

adeeconometrics commented 4 months ago

dev-lazymatrix implements a simpler internal storage based on std::vector. Some of the benefits found in this branch are:

Better compiler optimization with the clang loop vectorize directive, up to 690 GFLOPS on M2.
Allows a single standard Matrix implementation. Some points of contention in reverting to std::array for smaller matrices ($\mathbb{R}^{M \times N}$ with $M \times N < 256 \times 256$) are: (a) stack size varies per hardware and per type, so it is difficult to design for reliable performance and universal types; (b) the C++ standard does not impose a stack size requirement for the std::thread library, so it is difficult to conditionally adapt the $M, N$ size per hardware (the metadata is not available).
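To make the trade-off above concrete, here is a minimal sketch of a matrix backed by a flat std::vector with a clang loop hint on the element-wise loop. All names (`HeapMatrix`, `VECTORIZE_LOOP`) are hypothetical and not taken from the dev-lazymatrix branch. The hint is only emitted when compiling with clang, and other compilers see a plain loop.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical macro: expands to the clang loop hint only under clang.
#if defined(__clang__)
#define VECTORIZE_LOOP _Pragma("clang loop vectorize(enable) interleave(enable)")
#else
#define VECTORIZE_LOOP
#endif

// Hypothetical class, not LazyMat's actual Matrix.
// Storage lives on the heap, so the same type works for any M x N
// without depending on the platform's stack size.
template <typename T>
class HeapMatrix {
public:
    HeapMatrix(std::size_t rows, std::size_t cols, T value = T{})
        : rows_(rows), cols_(cols), data_(rows * cols, value) {}

    std::size_t rows() const { return rows_; }
    std::size_t cols() const { return cols_; }

    // Row-major element access.
    T& operator()(std::size_t i, std::size_t j) { return data_[i * cols_ + j]; }
    const T& operator()(std::size_t i, std::size_t j) const { return data_[i * cols_ + j]; }

    // Element-wise sum over the flat buffer.
    friend HeapMatrix operator+(const HeapMatrix& a, const HeapMatrix& b) {
        assert(a.rows_ == b.rows_ && a.cols_ == b.cols_);
        HeapMatrix out(a.rows_, a.cols_);
        const T* pa = a.data_.data();
        const T* pb = b.data_.data();
        T* po = out.data_.data();
        const std::size_t n = a.data_.size();
        VECTORIZE_LOOP
        for (std::size_t k = 0; k < n; ++k) {
            po[k] = pa[k] + pb[k];
        }
        return out;
    }

private:
    std::size_t rows_;
    std::size_t cols_;
    std::vector<T> data_;
};
```

A std::array<T, M * N> member of the same size would instead place the whole buffer inside the object, which is where the stack size concerns in (a) and (b) come from when the object is a local variable.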

[Screenshot, 2024-05-18 1:07:18 AM]

Note: the function evaluated was $$A \times B + A \cdot B \cdot (\sin(A) \times \cos(A) + B)$$
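For reference, a throughput figure in GFLOPS is usually derived as floating-point operations divided by elapsed seconds, scaled by 1e9. The thread does not say how the operations in the expression were counted or timed, so the sketch below is only one plausible way to do it. Both helper names are hypothetical, and `flops_per_element` is a parameter the benchmark author would have to choose.

```cpp
#include <chrono>
#include <cstddef>
#include <utility>

// Hypothetical helper: wall-clock time of one call to f, in seconds.
template <typename F>
double time_seconds(F&& f) {
    const auto start = std::chrono::steady_clock::now();
    std::forward<F>(f)();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Hypothetical helper: converts an operation count and a duration to GFLOPS.
// flops_per_element is an assumption supplied by the caller; the source
// does not state how operations such as sin and cos were counted.
inline double gflops(double flops_per_element, std::size_t elements, double seconds) {
    return flops_per_element * static_cast<double>(elements) / seconds / 1e9;
}
```

In practice the timed call would be repeated several times and the best or median run reported, since a single run is sensitive to cache state and frequency scaling.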

adeeconometrics commented 4 months ago

Avenues to explore:

[ ] GCC equivalent of the clang loop vectorize directive and the associated compiler optimizations for SIMD
[ ] See whether it is worth specializing with NEON/SSE intrinsics for special types
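As a starting point for both items, the sketch below shows two hypothetical helpers for an element-wise float add. `add_hinted` uses compiler loop hints: the clang directive, and `#pragma GCC ivdep` as one candidate on GCC. The GCC pragma is not an exact equivalent, since it only tells the compiler to assume there are no loop-carried dependencies. `add_f32` uses 128-bit NEON or SSE intrinsics directly, with a scalar tail. Neither function is taken from LazyMat.

```cpp
#include <cstddef>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#elif defined(__SSE__)
#include <xmmintrin.h>
#endif

// Hypothetical helper: relies on compiler hints only.
// The caller must ensure `out` does not partially overlap `a` or `b`,
// because the GCC hint asserts there are no loop-carried dependencies.
inline void add_hinted(const float* a, const float* b, float* out, std::size_t n) {
#if defined(__clang__)
#pragma clang loop vectorize(enable)
#elif defined(__GNUC__)
#pragma GCC ivdep
#endif
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}

// Hypothetical helper: explicit 4-wide float specialization.
inline void add_f32(const float* a, const float* b, float* out, std::size_t n) {
    std::size_t i = 0;
#if defined(__ARM_NEON)
    // arm64: process four floats per iteration with NEON.
    for (; i + 4 <= n; i += 4) {
        const float32x4_t va = vld1q_f32(a + i);
        const float32x4_t vb = vld1q_f32(b + i);
        vst1q_f32(out + i, vaddq_f32(va, vb));
    }
#elif defined(__SSE__)
    // x86: process four floats per iteration with unaligned SSE loads.
    for (; i + 4 <= n; i += 4) {
        const __m128 va = _mm_loadu_ps(a + i);
        const __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }
#endif
    // Scalar tail, and the full loop when neither instruction set is available.
    for (; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}
```

Benchmarking both variants against each other on the same inputs would show whether the hand-written path beats what the auto-vectorizer already produces, which is the question the second item raises.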

adeeconometrics / LazyMat