OpenMathLib / OpenBLAS

OpenBLAS is an optimized BLAS library based on GotoBLAS2 1.13 BSD version.
http://www.openblas.net
BSD 3-Clause "New" or "Revised" License

Additional Aarch64 features? #3251

Closed dnoan closed 3 years ago

dnoan commented 3 years ago

ARMv8 cores can implement optional features in addition to the "base" architecture. These features can be enabled in the compiler by setting the -march= flag. OpenBLAS sets only the base architecture, leaving the extra features disabled. The question is whether additional performance can be gained by enabling any of them:

CORTEXA53       -march=armv8-a+crc
CORTEXA57       -march=armv8-a+crc
CORTEXA72       -march=armv8-a+crc
CORTEXA73       -march=armv8-a+crc
THUNDERX        -march=armv8-a+crc+crypto
EMAG8180        -march=armv8-a+crc+crypto
FALKOR          -march=armv8-a+crc+crypto+rdma
THUNDERX2T99        -march=armv8.1-a+crypto
NEOVERSEN1      -march=armv8.2-a+fp16+rcpc+dotprod+profile 
TSV110          -march=armv8.2-a+crypto+fp16+aes+sha2
THUNDERX3T110       -march=armv8.3-a+crypto+rcpc+sm4+sha3+fp16fml

The above flags are for GCC. They work with Clang too, except for rdma; the corresponding Clang flag is rdm.

martin-frbg commented 3 years ago

Good question - so far I have believed crc to offer no advantage, at least relative to the (probable) hassle of adding compiler/version checks to see if the feature is actually known to the compiler. But I have to admit that I never actually benchmarked this, and I do not know that anyone has. Most of the others sound useful, and I may have misunderstood how the current combination of -march=(base architecture) -mtune=(specific cpu) acts compared to -march=architecture+features. (From my reading of the gcc docs, I assumed them to be identical.) As the optimized BLAS kernels tend(ed) to be written in assembly, any performance improvement would be expected to come from the interface/driver code and the LAPACK routines. (The fp16 here seems to be the IEEE format rather than the "bfloat" variant that has received some attention in the codebase lately.)

brada4 commented 3 years ago

Current ARM flags were optimized by an ARM employee.

What could be tested is whether each of these works and gains a speedup.

The only ARM device I have is an A53 called Raspberry; crc did not break anything, but all functions come out the same size, so I assume it is a no-op on all ISAs.

dnoan commented 3 years ago

Descriptions of the features can be found in the GCC documentation. See for example https://gcc.gnu.org/onlinedocs/gcc-11.1.0/gcc.pdf

I too have only Pi-class ARM devices. Maybe some of this can be tested on Amazon's Graviton2 - it uses Neoverse N1, if I remember correctly.

dnoan commented 3 years ago

Regarding -mtune=, the compilers behave differently:

martin-frbg commented 3 years ago

The Intel compiler is not available for arm64 AFAIK? You do seem to be right regarding -mtune - it applies the instruction cost tables of the given cpu but does not enable its specific feature set. This could be significant if e.g. the LAPACK code cannot be compiled to make use of SIMD instructions.

brada4 commented 3 years ago

I think this needs detailed measurements to complement what was once contributed by ARM. Time moves on, and there is a good chance the instruction scheduler can emit code that still runs on the oldest CPU imaginable while losing only 1% of speed there and gaining +20% on the very best server CPU of today. The reduced version for the generic case would be -march=armv8 -mtune=generic, i.e. actually trusting the compiler's instruction scheduling to better serve the mythical common case.

martin-frbg commented 3 years ago

I think nobody is suggesting to change "generic" ARMV8 here, but there are specific -march and -mtune directives for individual cpu targets in Makefile.arm64, where it may be safe to assume that any "related" model has at minimum the same feature set. OTOH I now see that while crypto is described as "unlocking advanced SIMD and fp instructions" in the gcc docs, it is stated a few lines later that both of these are "on by default for all possible values of -march and -mcpu". That would seem to leave rdma for Falkor (on by default for armv8.1-a, but Makefile.arm64 currently assigns armv8.1) and dotprod for Neoverse. (rcpc appears to enable the use of that particular kind of instruction in inline assembly, so it is almost certainly irrelevant, and there does not seem to be anything applicable to OpenBLAS among the various crypto cipher extensions.) Not looking so promising anymore, I think...

dnoan commented 3 years ago

Just to make a few things clear:

I got interested in this partially because of the following benchmarks done by ARM: https://developer.arm.com/tools-and-software/server-and-hpc/compile/arm-compiler-for-linux/resources/tutorials/benchmarks

They claim that ArmPL (which is based on OpenBLAS) is way faster than OpenBLAS on Graviton2 (Neoverse N1). Even more surprisingly, BLIS is also faster (for larger problems), and it seems that BLIS doesn't even have specific Neoverse N1 support. I thought this may be partly due to compiler optimizations. Note that ArmPL has a free version with Neoverse N1 optimizations: https://developer.arm.com/tools-and-software/server-and-hpc/downloads/arm-performance-libraries So this can be tested on Amazon.

brada4 commented 3 years ago

DYNAMIC_ARCH is no problem. Common code (including LAPACK) is built for the specified low-capability TARGET=; CPU-specific BLAS functions come in the form of a separate overload for each dynamic target.

martin-frbg commented 3 years ago

I doubt the difference lies in something as simple as compiler flags - and it should be particularly easy to beat OpenBLAS on the Neoverse, as we currently do not have any BLAS kernels written specifically for that cpu. What is there is simply the mix of available (generic and thunderx2) kernels that showed the best performance in initial tests. BLIS may have come up with a better algorithm for large problems, and/or a better implementation of multithreading (the original GotoBLAS was full of race conditions that apparently did not matter in practice on the hardware of the time; current OpenBLAS probably suffers from excessive locking overhead, where I did not - and probably still do not - know how to fix that properly).

AGSaidi commented 3 years ago

Looks like https://github.com/xianyi/OpenBLAS/pull/3270 fixes at least one issue. As I said in https://github.com/numpy/numpy/issues/18422, TravisCI and CircleCI both support running on Graviton. @martin-frbg happy to get you access to a system as well, but we really should enable one of these CI solutions, and if you're amenable we can send a PR.

Djip007 commented 3 years ago

If it helps, I note that the Cortex-A55 is an armv8.2-a CPU... and it can be found on the https://www.hardkernel.com/shop/odroid-c4/ board... or any board with an Amlogic S905X3 CPU...

brada4 commented 3 years ago

The CPUID part number is 0xd05 as per ARM. If you have the CPU, just add another value to the case () around A53/A55 (they are older cores; measurements are needed to confirm whether it is closer to generic ARMv8, A5x or A7x): https://github.com/xianyi/OpenBLAS/blob/a7627c5afd4b3e53533ac4e3e7a11d6e5ad80899/cpuid_arm64.c#L154 https://github.com/xianyi/OpenBLAS/blob/a7627c5afd4b3e53533ac4e3e7a11d6e5ad80899/driver/others/dynamic_arm64.c#L193

Djip007 commented 3 years ago

good point! can confirm: on OdroidC4 / Amlogic S905X3 / Cortex-A55

> cat /proc/cpuinfo 
processor   : 0
BogoMIPS    : 48.00
Features    : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp
CPU implementer : 0x41
CPU architecture: 8
CPU variant : 0x1
CPU part    : 0xd05
CPU revision    : 0

also tested with this code:

// g++-10 -O3 -march=armv8.2-a+fp16 fma_16.cpp
// g++-10 -O3 -march=native fma_16.cpp

#include <vector>
#include <algorithm>
#include <iostream>

#include <arm_neon.h>

float16_t my_rand() { return float16_t( double(rand()) / double(RAND_MAX) ); }
float16_t my_init() { return float16_t(42); }

int main(void) {
  srand( 11 );
  size_t sz = 80;

  std::vector<float16_t> flt1(sz);
  std::vector<float16_t> flt2(sz);
  std::vector<float16_t> flt3(sz);
  std::generate( flt1.begin() , flt1.end() , my_rand );
  std::generate( flt2.begin() , flt2.end() , my_rand );
  std::generate( flt3.begin() , flt3.end() , my_init );

  for ( size_t i = 0 ; i < sz ; i += 8 ) {
      float16x8_t f1 = vld1q_f16( &(flt1[i]) );
      float16x8_t f2 = vld1q_f16( &(flt2[i]) );
      float16x8_t f3 = vld1q_f16( &(flt3[i]) );
      f3 = vfmaq_f16( f3, f1 , f2);   // f3 = f3 + f1*f2 ... mind the argument order!
      vst1q_f16( &(flt3[i]) , f3 );
  }
  std::cout << float(flt1[0]) << std::endl;
  std::cout << float(flt2[0]) << std::endl;
  std::cout << float(flt3[0]) << std::endl;
  return 0;
}

=> fp16 fma is possible on it :sunglasses:

ps: I would like to have time to test/help... but I am not sure when I will find it... :crossed_fingers:

brada4 commented 3 years ago

=> but no fp16 FMA is instrumented in OpenBLAS....

martin-frbg commented 3 years ago

There is currently no framework for IEEE fp16 in OpenBLAS (what is there is the AI "brainfloat" variant thanks mainly to code contributions from IBM). This could certainly change if there is sufficient interest and time.

brada4 commented 3 years ago

I am typing up a PR to support the 0xd05 core mentioned, ARMv8 + arch flags.

martin-frbg commented 3 years ago

alright then, so I guess I have been talked into ordering a C4...

brada4 commented 3 years ago

@Djip007 can you test https://github.com/brada4/OpenBLAS/tree/A55 ? - it is based on A53, with cache values set to the minimum seen on the Wikipedia page and the old-compiler-support idea borrowed from the Neoverse target.

Djip007 commented 3 years ago

git clone on the Odroid C4... OK... the build with DYNAMIC_ARCH=1 ends with:

 OpenBLAS build complete. (BLAS CBLAS LAPACK LAPACKE)

  OS               ... Linux             
  Architecture     ... arm64               
  BINARY           ... 64bit                 
  C compiler       ... GCC  (cmd & version : cc (Ubuntu 10.2.0-13ubuntu1) 10.2.0)
  Fortran compiler ... GFORTRAN  (cmd & version : GNU Fortran (Ubuntu 10.2.0-13ubuntu1) 10.2.0)
  Library Name     ... libopenblasp-r0.3.15.dev.a (Multi-threading; Max num-threads is 4)
  Supporting multiple arm64 cpu models with minimum requirement for the common code being CORTEXA55

I can run more / different tests or give more hardware info... just ask!!!

some context:

> uname -a
Linux focal-minimal 5.11.0-odroid-arm64 #1 SMP PREEMPT Ubuntu 5.11.18-202105111802~groovy (2021-05-11) aarch64 aarch64 aarch64 GNU/Linux
> gcc --version
gcc (Ubuntu 10.2.0-13ubuntu1) 10.2.0

from https://www.hardkernel.com/shop/odroid-c4/

Amlogic S905X3 12nm Processor
L1 instruction cache: 32 KB, 4-way set associative (128 sets), 64 byte lines, shared by 1 processor
L1 data cache: 32 KB, 4-way set associative (128 sets), 64 byte lines, shared by 1 processor
L3 data cache: 512KB , 16-way set associative (512 sets), 64 byte lines, shared by 4 processors
Quad-Core Cortex-A55 (2.016GHz)
ARMv8-A architecture with Neon and Crypto extensions
Mali-G31 MP2 GPU with 4 x Execution Engines (650Mhz)

(lscpu doesn't report cache sizes...)

Djip007 commented 3 years ago

Last one for today... my bed is waiting for me ;)

odroid@focal-minimal:~/Developement/OpenBLAS/benchmark$ export OMP_NUM_THREADS=4
odroid@focal-minimal:~/Developement/OpenBLAS/benchmark$ ./sgemm.goto
From :   1  To : 200 Step=1 : Transa=N : Transb=N
          SIZE                   Flops             Time
 M=   1, N=   1, K=   1 :        0.02 MFlops   0.000087 sec
 M=   2, N=   2, K=   2 :        4.22 MFlops   0.000004 sec
 M=   3, N=   3, K=   3 :       17.28 MFlops   0.000003 sec
 M=   4, N=   4, K=   4 :       36.14 MFlops   0.000004 sec
 M=   5, N=   5, K=   5 :       65.21 MFlops   0.000004 sec
 M=   6, N=   6, K=   6 :      112.68 MFlops   0.000004 sec
 M=   7, N=   7, K=   7 :      140.69 MFlops   0.000005 sec
 M=   8, N=   8, K=   8 :      264.26 MFlops   0.000004 sec
 M=   9, N=   9, K=   9 :      296.52 MFlops   0.000005 sec
 M=  10, N=  10, K=  10 :      413.74 MFlops   0.000005 sec
 M=  11, N=  11, K=  11 :      484.00 MFlops   0.000006 sec
 M=  12, N=  12, K=  12 :      829.37 MFlops   0.000004 sec
 M=  13, N=  13, K=  13 :      769.80 MFlops   0.000006 sec
 M=  14, N=  14, K=  14 :      997.82 MFlops   0.000006 sec
 M=  15, N=  15, K=  15 :     1006.11 MFlops   0.000007 sec
 M=  16, N=  16, K=  16 :     1228.74 MFlops   0.000007 sec
 M=  17, N=  17, K=  17 :     1371.01 MFlops   0.000007 sec
 M=  18, N=  18, K=  18 :     1521.33 MFlops   0.000008 sec
 M=  19, N=  19, K=  19 :     1621.90 MFlops   0.000008 sec
 M=  20, N=  20, K=  20 :     2086.87 MFlops   0.000008 sec
 M=  21, N=  21, K=  21 :     1975.68 MFlops   0.000009 sec
 M=  22, N=  22, K=  22 :     2129.60 MFlops   0.000010 sec
 M=  23, N=  23, K=  23 :     2070.98 MFlops   0.000012 sec
 M=  24, N=  24, K=  24 :     2764.80 MFlops   0.000010 sec
 M=  25, N=  25, K=  25 :     2631.36 MFlops   0.000012 sec
 M=  26, N=  26, K=  26 :     2929.09 MFlops   0.000012 sec
 M=  27, N=  27, K=  27 :     2151.97 MFlops   0.000018 sec
 M=  28, N=  28, K=  28 :     3344.81 MFlops   0.000013 sec
 M=  29, N=  29, K=  29 :     3080.59 MFlops   0.000016 sec
 M=  30, N=  30, K=  30 :     3374.79 MFlops   0.000016 sec
 M=  31, N=  31, K=  31 :     3023.09 MFlops   0.000020 sec
 M=  32, N=  32, K=  32 :     1291.30 MFlops   0.000051 sec
 M=  33, N=  33, K=  33 :     1901.73 MFlops   0.000038 sec
 M=  34, N=  34, K=  34 :     4092.25 MFlops   0.000019 sec
 M=  35, N=  35, K=  35 :     3825.22 MFlops   0.000022 sec
 M=  36, N=  36, K=  36 :     4744.36 MFlops   0.000020 sec
 M=  37, N=  37, K=  37 :     4235.55 MFlops   0.000024 sec
 M=  38, N=  38, K=  38 :     1881.24 MFlops   0.000058 sec
 M=  39, N=  39, K=  39 :     4073.27 MFlops   0.000029 sec
 M=  40, N=  40, K=  40 :     2203.63 MFlops   0.000058 sec
 M=  41, N=  41, K=  41 :     4753.01 MFlops   0.000029 sec
 M=  42, N=  42, K=  42 :     4911.69 MFlops   0.000030 sec
 M=  43, N=  43, K=  43 :     4463.43 MFlops   0.000036 sec
 M=  44, N=  44, K=  44 :     5316.86 MFlops   0.000032 sec
 M=  45, N=  45, K=  45 :     4759.36 MFlops   0.000038 sec
 M=  46, N=  46, K=  46 :     4145.49 MFlops   0.000047 sec
 M=  47, N=  47, K=  47 :     4449.33 MFlops   0.000047 sec
 M=  48, N=  48, K=  48 :     2054.26 MFlops   0.000108 sec
 M=  49, N=  49, K=  49 :     5204.44 MFlops   0.000045 sec
 M=  50, N=  50, K=  50 :     5012.33 MFlops   0.000050 sec
 M=  51, N=  51, K=  51 :     2919.26 MFlops   0.000091 sec
 M=  52, N=  52, K=  52 :     5695.17 MFlops   0.000049 sec
 M=  53, N=  53, K=  53 :     5196.95 MFlops   0.000057 sec
 M=  54, N=  54, K=  54 :     5367.88 MFlops   0.000059 sec
 M=  55, N=  55, K=  55 :     4953.92 MFlops   0.000067 sec
 M=  56, N=  56, K=  56 :     5203.21 MFlops   0.000068 sec
 M=  57, N=  57, K=  57 :     5493.79 MFlops   0.000067 sec
 M=  58, N=  58, K=  58 :     5604.41 MFlops   0.000070 sec
 M=  59, N=  59, K=  59 :     5299.90 MFlops   0.000078 sec
 M=  60, N=  60, K=  60 :     6080.65 MFlops   0.000071 sec
 M=  61, N=  61, K=  61 :     3899.28 MFlops   0.000116 sec
 M=  62, N=  62, K=  62 :     5754.14 MFlops   0.000083 sec
 M=  63, N=  63, K=  63 :     5462.76 MFlops   0.000092 sec
 M=  64, N=  64, K=  64 :     4687.92 MFlops   0.000112 sec
 M=  65, N=  65, K=  65 :     1036.93 MFlops   0.000530 sec
 M=  66, N=  66, K=  66 :     9898.98 MFlops   0.000058 sec
 M=  67, N=  67, K=  67 :    13516.83 MFlops   0.000045 sec
 M=  68, N=  68, K=  68 :    12639.72 MFlops   0.000050 sec
 M=  69, N=  69, K=  69 :    14154.07 MFlops   0.000046 sec
 M=  70, N=  70, K=  70 :    14725.45 MFlops   0.000047 sec
 M=  71, N=  71, K=  71 :    15055.99 MFlops   0.000048 sec
 M=  72, N=  72, K=  72 :    15632.77 MFlops   0.000048 sec
 M=  73, N=  73, K=  73 :    13810.60 MFlops   0.000056 sec
 M=  74, N=  74, K=  74 :    14428.74 MFlops   0.000056 sec
 M=  75, N=  75, K=  75 :    14494.68 MFlops   0.000058 sec
 M=  76, N=  76, K=  76 :     9220.93 MFlops   0.000095 sec
 M=  77, N=  77, K=  77 :    15651.84 MFlops   0.000058 sec
 M=  78, N=  78, K=  78 :    14994.93 MFlops   0.000063 sec
 M=  79, N=  79, K=  79 :    16771.75 MFlops   0.000059 sec
 M=  80, N=  80, K=  80 :    17859.63 MFlops   0.000057 sec
 M=  81, N=  81, K=  81 :    15601.25 MFlops   0.000068 sec
 M=  82, N=  82, K=  82 :    15696.64 MFlops   0.000070 sec
 M=  83, N=  83, K=  83 :    15682.58 MFlops   0.000073 sec
 M=  84, N=  84, K=  84 :    15839.87 MFlops   0.000075 sec
 M=  85, N=  85, K=  85 :    12190.46 MFlops   0.000101 sec
 M=  86, N=  86, K=  86 :     9716.49 MFlops   0.000131 sec
 M=  87, N=  87, K=  87 :    15662.42 MFlops   0.000084 sec
 M=  88, N=  88, K=  88 :    14346.02 MFlops   0.000095 sec
 M=  89, N=  89, K=  89 :    13269.38 MFlops   0.000106 sec
 M=  90, N=  90, K=  90 :    10349.16 MFlops   0.000141 sec
 M=  91, N=  91, K=  91 :    15637.66 MFlops   0.000096 sec
 M=  92, N=  92, K=  92 :    14325.57 MFlops   0.000109 sec
 M=  93, N=  93, K=  93 :    17030.10 MFlops   0.000094 sec
 M=  94, N=  94, K=  94 :    17901.29 MFlops   0.000093 sec
 M=  95, N=  95, K=  95 :    18396.24 MFlops   0.000093 sec
 M=  96, N=  96, K=  96 :     9861.90 MFlops   0.000179 sec
 M=  97, N=  97, K=  97 :    11135.25 MFlops   0.000164 sec
 M=  98, N=  98, K=  98 :    16957.65 MFlops   0.000111 sec
 M=  99, N=  99, K=  99 :    12883.04 MFlops   0.000151 sec
 M= 100, N= 100, K= 100 :    12370.57 MFlops   0.000162 sec
 M= 101, N= 101, K= 101 :    12319.90 MFlops   0.000167 sec
 M= 102, N= 102, K= 102 :    12957.44 MFlops   0.000164 sec
 M= 103, N= 103, K= 103 :    16029.56 MFlops   0.000136 sec
 M= 104, N= 104, K= 104 :    18470.98 MFlops   0.000122 sec
 M= 105, N= 105, K= 105 :    16955.58 MFlops   0.000137 sec
 M= 106, N= 106, K= 106 :    17434.05 MFlops   0.000137 sec
 M= 107, N= 107, K= 107 :    17525.90 MFlops   0.000140 sec
 M= 108, N= 108, K= 108 :    18195.52 MFlops   0.000138 sec
 M= 109, N= 109, K= 109 :    12987.56 MFlops   0.000199 sec
 M= 110, N= 110, K= 110 :    16794.21 MFlops   0.000159 sec
 M= 111, N= 111, K= 111 :    18798.15 MFlops   0.000146 sec
 M= 112, N= 112, K= 112 :    12618.53 MFlops   0.000223 sec
 M= 113, N= 113, K= 113 :    11199.23 MFlops   0.000258 sec
 M= 114, N= 114, K= 114 :    12787.42 MFlops   0.000232 sec
 M= 115, N= 115, K= 115 :    17594.27 MFlops   0.000173 sec
 M= 116, N= 116, K= 116 :    18009.54 MFlops   0.000173 sec
 M= 117, N= 117, K= 117 :    12714.85 MFlops   0.000252 sec
 M= 118, N= 118, K= 118 :    10772.04 MFlops   0.000305 sec
 M= 119, N= 119, K= 119 :    12610.54 MFlops   0.000267 sec
 M= 120, N= 120, K= 120 :    13240.77 MFlops   0.000261 sec
 M= 121, N= 121, K= 121 :    11790.21 MFlops   0.000301 sec
 M= 122, N= 122, K= 122 :    16976.41 MFlops   0.000214 sec
 M= 123, N= 123, K= 123 :    17704.16 MFlops   0.000210 sec
 M= 124, N= 124, K= 124 :    13493.64 MFlops   0.000283 sec
 M= 125, N= 125, K= 125 :    15438.99 MFlops   0.000253 sec
 M= 126, N= 126, K= 126 :    19435.94 MFlops   0.000206 sec
 M= 127, N= 127, K= 127 :    13471.91 MFlops   0.000304 sec
 M= 128, N= 128, K= 128 :    19020.92 MFlops   0.000221 sec
 M= 129, N= 129, K= 129 :    13162.56 MFlops   0.000326 sec
 M= 130, N= 130, K= 130 :    15448.50 MFlops   0.000284 sec
 M= 131, N= 131, K= 131 :    15507.82 MFlops   0.000290 sec
 M= 132, N= 132, K= 132 :    15370.83 MFlops   0.000299 sec
 M= 133, N= 133, K= 133 :    18445.18 MFlops   0.000255 sec
 M= 134, N= 134, K= 134 :    15402.47 MFlops   0.000312 sec
 M= 135, N= 135, K= 135 :    20287.40 MFlops   0.000243 sec
 M= 136, N= 136, K= 136 :    14670.29 MFlops   0.000343 sec
 M= 137, N= 137, K= 137 :    14336.11 MFlops   0.000359 sec
 M= 138, N= 138, K= 138 :    19538.70 MFlops   0.000269 sec
 M= 139, N= 139, K= 139 :    14747.03 MFlops   0.000364 sec
 M= 140, N= 140, K= 140 :    15336.03 MFlops   0.000358 sec
 M= 141, N= 141, K= 141 :    16255.74 MFlops   0.000345 sec
 M= 142, N= 142, K= 142 :    15450.42 MFlops   0.000371 sec
 M= 143, N= 143, K= 143 :    16612.12 MFlops   0.000352 sec
 M= 144, N= 144, K= 144 :    16823.63 MFlops   0.000355 sec
 M= 145, N= 145, K= 145 :    14928.29 MFlops   0.000408 sec
 M= 146, N= 146, K= 146 :    15369.45 MFlops   0.000405 sec
 M= 147, N= 147, K= 147 :    15768.53 MFlops   0.000403 sec
 M= 148, N= 148, K= 148 :    15058.42 MFlops   0.000431 sec
 M= 149, N= 149, K= 149 :    15053.96 MFlops   0.000439 sec
 M= 150, N= 150, K= 150 :    14957.75 MFlops   0.000451 sec
 M= 151, N= 151, K= 151 :    15023.11 MFlops   0.000458 sec
 M= 152, N= 152, K= 152 :    12929.29 MFlops   0.000543 sec
 M= 153, N= 153, K= 153 :    14474.00 MFlops   0.000495 sec
 M= 154, N= 154, K= 154 :    14027.42 MFlops   0.000521 sec
 M= 155, N= 155, K= 155 :    14984.72 MFlops   0.000497 sec
 M= 156, N= 156, K= 156 :    15465.94 MFlops   0.000491 sec
 M= 157, N= 157, K= 157 :    15972.68 MFlops   0.000485 sec
 M= 158, N= 158, K= 158 :    16573.44 MFlops   0.000476 sec
 M= 159, N= 159, K= 159 :    16939.05 MFlops   0.000475 sec
 M= 160, N= 160, K= 160 :    17911.69 MFlops   0.000457 sec
 M= 161, N= 161, K= 161 :    14211.29 MFlops   0.000587 sec
 M= 162, N= 162, K= 162 :    16054.13 MFlops   0.000530 sec
 M= 163, N= 163, K= 163 :    16310.93 MFlops   0.000531 sec
 M= 164, N= 164, K= 164 :    16062.20 MFlops   0.000549 sec
 M= 165, N= 165, K= 165 :    15001.20 MFlops   0.000599 sec
 M= 166, N= 166, K= 166 :    15550.39 MFlops   0.000588 sec
 M= 167, N= 167, K= 167 :    15845.48 MFlops   0.000588 sec
 M= 168, N= 168, K= 168 :    16130.69 MFlops   0.000588 sec
 M= 169, N= 169, K= 169 :    14260.50 MFlops   0.000677 sec
 M= 170, N= 170, K= 170 :    14064.13 MFlops   0.000699 sec
 M= 171, N= 171, K= 171 :    14437.77 MFlops   0.000693 sec
 M= 172, N= 172, K= 172 :    14656.41 MFlops   0.000694 sec
 M= 173, N= 173, K= 173 :    13932.79 MFlops   0.000743 sec
 M= 174, N= 174, K= 174 :    16257.53 MFlops   0.000648 sec
 M= 175, N= 175, K= 175 :    16307.74 MFlops   0.000657 sec
 M= 176, N= 176, K= 176 :    15913.97 MFlops   0.000685 sec
 M= 177, N= 177, K= 177 :    14618.51 MFlops   0.000759 sec
 M= 178, N= 178, K= 178 :    14746.18 MFlops   0.000765 sec
 M= 179, N= 179, K= 179 :    14762.11 MFlops   0.000777 sec
 M= 180, N= 180, K= 180 :    14378.65 MFlops   0.000811 sec
 M= 181, N= 181, K= 181 :    15442.17 MFlops   0.000768 sec
 M= 182, N= 182, K= 182 :    15755.94 MFlops   0.000765 sec
 M= 183, N= 183, K= 183 :    15225.35 MFlops   0.000805 sec
 M= 184, N= 184, K= 184 :    16239.56 MFlops   0.000767 sec
 M= 185, N= 185, K= 185 :    15588.82 MFlops   0.000812 sec
 M= 186, N= 186, K= 186 :    15787.11 MFlops   0.000815 sec
 M= 187, N= 187, K= 187 :    16016.09 MFlops   0.000817 sec
 M= 188, N= 188, K= 188 :    16331.07 MFlops   0.000814 sec
 M= 189, N= 189, K= 189 :    18869.52 MFlops   0.000716 sec
 M= 190, N= 190, K= 190 :    17156.55 MFlops   0.000800 sec
 M= 191, N= 191, K= 191 :    17593.90 MFlops   0.000792 sec
 M= 192, N= 192, K= 192 :    20450.47 MFlops   0.000692 sec
 M= 193, N= 193, K= 193 :    15700.20 MFlops   0.000916 sec
 M= 194, N= 194, K= 194 :    16886.75 MFlops   0.000865 sec
 M= 195, N= 195, K= 195 :    16712.74 MFlops   0.000887 sec
 M= 196, N= 196, K= 196 :    16310.19 MFlops   0.000923 sec
 M= 197, N= 197, K= 197 :    16423.24 MFlops   0.000931 sec
 M= 198, N= 198, K= 198 :    17905.50 MFlops   0.000867 sec
 M= 199, N= 199, K= 199 :    16326.47 MFlops   0.000965 sec
 M= 200, N= 200, K= 200 :    18566.03 MFlops   0.000862 sec

brada4 commented 3 years ago

It is the smallest change possible, just to get through make without it complaining that the CPU is not detected, with only the 8.2 CFLAGS added on top of A53. That worked out, thank you for the test. The benchmark arguments are FROM TO STEP LOOPS - 1 200 1 10 by default, per the source. It is better to test higher values to see performance stabilize, e.g. 128 25600 128; for small matrices the library call overhead is disproportionately heavy.

brada4 commented 3 years ago

The pthread library (libXXXXp.so.N) prefers the OPENBLAS_NUM_THREADS variable, either way defaulting to all cores detected at runtime if no variable is set.

Djip007 commented 3 years ago

> It is smallest change possible, just to get through make without complaining that CPU is not detected, with only 8.2 CFLAGS added on top of A53. That worked out, thank you for test.

I see... as an example, during the build:

cc -O2 -DMAX_STACK_ALLOC=2048 -Wall -DF_INTERFACE_GFORT -fPIC -DDYNAMIC_ARCH -DSMP_SERVER -DNO_WARMUP -DMAX_CPU_NUMBER=4 -DMAX_PARALLEL_NUMBER=1 -DBUILD_SINGLE=1 -DBUILD_DOUBLE=1 -DBUILD_COMPLEX=1 -DBUILD_COMPLEX16=1 -DVERSION="0.3.15.dev" -march=armv8.2-a -mtune=cortex-a55 -UASMNAME -UASMFNAME -UNAME -UCNAME -UCHAR_NAME -UCHAR_CNAME -DASMNAME= -DASMFNAME=_ -DNAME=_ -DCNAME= -DCHAR_NAME="_" -DCHAR_CNAME="" -DNO_AFFINITY -I. -DHAVE_LAPACK_CONFIG_H -I../include -c -o lapacke_cgesvj.o lapacke_cgesvj.c

so yes... -march=armv8.2-a -mtune=cortex-a55 is used!

> Benchmark flags are FROM TO STEP LOOPS , from source - 1 200 1 10 by default, better to test higher values to see performance stabilize, e.g 128 25600 128, for small matrices the actual library call is disproportionately heavy.

Oh... that's what I was looking for:

./sgemm.goto 128 8192 128
From : 128  To : 8192 Step=128 : Transa=N : Transb=N
SIZE                   Flops             Time
M= 128, N= 128, K= 128 :     1031.74 MFlops   0.004065 sec
M= 256, N= 256, K= 256 :    18378.43 MFlops   0.001826 sec
M= 384, N= 384, K= 384 :    21049.99 MFlops   0.005380 sec
M= 512, N= 512, K= 512 :    21357.46 MFlops   0.012569 sec
M= 640, N= 640, K= 640 :    22184.38 MFlops   0.023633 sec
M= 768, N= 768, K= 768 :    22264.49 MFlops   0.040691 sec
M= 896, N= 896, K= 896 :    21289.78 MFlops   0.067575 sec
M=1024, N=1024, K=1024 :    17935.04 MFlops   0.119737 sec
M=1152, N=1152, K=1152 :    21167.13 MFlops   0.144453 sec
M=1280, N=1280, K=1280 :    21098.58 MFlops   0.198796 sec
M=1408, N=1408, K=1408 :    19357.51 MFlops   0.288396 sec
M=1536, N=1536, K=1536 :    19343.65 MFlops   0.374684 sec
M=1664, N=1664, K=1664 :    20310.36 MFlops   0.453704 sec
M=1792, N=1792, K=1792 :    20490.72 MFlops   0.561677 sec
M=1920, N=1920, K=1920 :    20139.90 MFlops   0.702872 sec
M=2048, N=2048, K=2048 :    18723.60 MFlops   0.917551 sec
M=2176, N=2176, K=2176 :    20026.33 MFlops   1.028976 sec
M=2304, N=2304, K=2304 :    20157.99 MFlops   1.213473 sec
M=2432, N=2432, K=2432 :    19163.93 MFlops   1.501192 sec
M=2560, N=2560, K=2560 :    18907.80 MFlops   1.774635 sec
M=2688, N=2688, K=2688 :    19686.54 MFlops   1.973097 sec
M=2816, N=2816, K=2816 :    19118.95 MFlops   2.335952 sec
M=2944, N=2944, K=2944 :    19484.95 MFlops   2.619052 sec
M=3072, N=3072, K=3072 :    18701.53 MFlops   3.100391 sec
M=3200, N=3200, K=3200 :    19414.86 MFlops   3.375559 sec
M=3328, N=3328, K=3328 :    19594.29 MFlops   3.762275 sec
M=3456, N=3456, K=3456 :    19075.85 MFlops   4.327801 sec
M=3584, N=3584, K=3584 :    18677.22 MFlops   4.929715 sec
M=3712, N=3712, K=3712 :    19331.88 MFlops   5.291513 sec
M=3840, N=3840, K=3840 :    19003.84 MFlops   5.959121 sec
M=3968, N=3968, K=3968 :    19109.40 MFlops   6.538798 sec
M=4096, N=4096, K=4096 :    18778.42 MFlops   7.318984 sec
M=4224, N=4224, K=4224 :    18665.08 MFlops   8.075544 sec
M=4352, N=4352, K=4352 :    19200.18 MFlops   8.586010 sec
M=4480, N=4480, K=4480 :    18945.48 MFlops   9.492017 sec
M=4608, N=4608, K=4608 :    18544.78 MFlops  10.552268 sec
M=4736, N=4736, K=4736 :    19122.88 MFlops  11.109940 sec
M=4864, N=4864, K=4864 :    18928.97 MFlops  12.158607 sec
M=4992, N=4992, K=4992 :    18915.15 MFlops  13.153581 sec
M=5120, N=5120, K=5120 :    18587.85 MFlops  14.441447 sec
M=5248, N=5248, K=5248 :    18706.29 MFlops  15.453394 sec
M=5376, N=5376, K=5376 :    19005.89 MFlops  16.350066 sec
M=5504, N=5504, K=5504 :    18798.98 MFlops  17.739079 sec
M=5632, N=5632, K=5632 :    18139.08 MFlops  19.697114 sec
M=5760, N=5760, K=5760 :    18912.99 MFlops  20.208645 sec
M=5888, N=5888, K=5888 :    18831.47 MFlops  21.679493 sec
M=6016, N=6016, K=6016 :    18734.62 MFlops  23.243884 sec
M=6144, N=6144, K=6144 :    18516.06 MFlops  25.051579 sec
M=6272, N=6272, K=6272 :    18690.58 MFlops  26.401302 sec
M=6400, N=6400, K=6400 :    18816.78 MFlops  27.862793 sec
M=6528, N=6528, K=6528 :    18772.51 MFlops  29.637947 sec
M=6656, N=6656, K=6656 :    18235.91 MFlops  32.340190 sec
M=6784, N=6784, K=6784 :    18739.55 MFlops  33.321792 sec
M=6912, N=6912, K=6912 :    18759.65 MFlops  35.205979 sec
M=7040, N=7040, K=7040 :    18353.57 MFlops  38.021342 sec
M=7168, N=7168, K=7168 :    18446.08 MFlops  39.931883 sec
M=7296, N=7296, K=7296 :    18648.52 MFlops  41.652405 sec
M=7424, N=7424, K=7424 :    18682.23 MFlops  43.804141 sec
M=7552, N=7552, K=7552 :    18649.29 MFlops  46.190609 sec
M=7680, N=7680, K=7680 :    18248.59 MFlops  49.646010 sec
M=7808, N=7808, K=7808 :    18606.54 MFlops  51.166263 sec
M=7936, N=7936, K=7936 :    18644.15 MFlops  53.615762 sec
M=8064, N=8064, K=8064 :    18391.70 MFlops  57.024267 sec
M=8192, N=8192, K=8192 :    18484.47 MFlops  59.482996 sec

(CPU clock is 2GHz 4 cores...)

brada4 commented 3 years ago

Just an explanation of the anomalous results at the start: very small samples also measure the library call latency (and see below); then anomalously high results pop up as long as the data set fits in the outermost cache; then it gets stable.


Could you retry with OPENBLAS_NUM_THREADS=1, just up to 128 2048 128, to see if the threading threshold is good for modern low-power CPUs....

Djip007 commented 3 years ago

yep:

#> export OPENBLAS_NUM_THREADS=1
#> ./sgemm.goto 128 2048 128
From : 128  To : 2048 Step=128 : Transa=N : Transb=N
          SIZE                   Flops             Time
 M= 128, N= 128, K= 128 :     5310.97 MFlops   0.000790 sec
 M= 256, N= 256, K= 256 :     7105.72 MFlops   0.004722 sec
 M= 384, N= 384, K= 384 :     6971.35 MFlops   0.016245 sec
 M= 512, N= 512, K= 512 :     7372.43 MFlops   0.036411 sec
 M= 640, N= 640, K= 640 :     7805.11 MFlops   0.067172 sec
 M= 768, N= 768, K= 768 :     7615.81 MFlops   0.118959 sec
 M= 896, N= 896, K= 896 :     7736.69 MFlops   0.185951 sec
 M=1024, N=1024, K=1024 :     7806.50 MFlops   0.275089 sec
 M=1152, N=1152, K=1152 :     7824.91 MFlops   0.390758 sec
 M=1280, N=1280, K=1280 :     7906.44 MFlops   0.530492 sec
 M=1408, N=1408, K=1408 :     7965.56 MFlops   0.700845 sec
 M=1536, N=1536, K=1536 :     7939.69 MFlops   0.912852 sec
 M=1664, N=1664, K=1664 :     8000.37 MFlops   1.151807 sec
 M=1792, N=1792, K=1792 :     7989.45 MFlops   1.440546 sec
 M=1920, N=1920, K=1920 :     8034.74 MFlops   1.761822 sec
 M=2048, N=2048, K=2048 :     7927.88 MFlops   2.167019 sec
#> export OPENBLAS_NUM_THREADS=2
#> ./sgemm.goto 128 2048 128
From : 128  To : 2048 Step=128 : Transa=N : Transb=N
          SIZE                   Flops             Time
 M= 128, N= 128, K= 128 :     8365.92 MFlops   0.000501 sec
 M= 256, N= 256, K= 256 :    13199.85 MFlops   0.002542 sec
 M= 384, N= 384, K= 384 :    13212.00 MFlops   0.008571 sec
 M= 512, N= 512, K= 512 :    14562.14 MFlops   0.018434 sec
 M= 640, N= 640, K= 640 :    15030.35 MFlops   0.034882 sec
 M= 768, N= 768, K= 768 :    14583.59 MFlops   0.062123 sec
 M= 896, N= 896, K= 896 :    15003.05 MFlops   0.095890 sec
 M=1024, N=1024, K=1024 :    15333.30 MFlops   0.140054 sec
 M=1152, N=1152, K=1152 :    15258.89 MFlops   0.200385 sec
 M=1280, N=1280, K=1280 :    15498.77 MFlops   0.270622 sec
 M=1408, N=1408, K=1408 :    15540.03 MFlops   0.359241 sec
 M=1536, N=1536, K=1536 :    15644.70 MFlops   0.463272 sec
 M=1664, N=1664, K=1664 :    15681.72 MFlops   0.587619 sec
 M=1792, N=1792, K=1792 :    15675.84 MFlops   0.734198 sec
 M=1920, N=1920, K=1920 :    15758.74 MFlops   0.898281 sec
 M=2048, N=2048, K=2048 :    15629.51 MFlops   1.099194 sec

note: that is bad for this CPU with no L2 cache... it may need some cache tweaking for 4 cores

#3279 adds a tracking issue for gemm perf on A53/A55 ...

martin-frbg commented 3 years ago

Closing due to the massive drift of this issue from Graviton2 to Cortex-A55; both have their own tickets now.