OpenMathLib / OpenBLAS

OpenBLAS is an optimized BLAS library based on GotoBLAS2 1.13 BSD version.
http://www.openblas.net
BSD 3-Clause "New" or "Revised" License

sgemm on arm A53/A55 #3279

Closed Djip007 closed 3 years ago

Djip007 commented 3 years ago

While looking for GEMM optimizations for the A53/A55 cores I found these (I have not tested them...):

> A53

https://github.com/google/gemmlowp/blob/master/standalone/neon-gemm-kernel-benchmark.cc#L4215

> A55

https://github.com/google/gemmlowp/blob/master/standalone/neon-gemm-kernel-benchmark.cc#L4375

(Just a reminder for future optimization... if it can help :thinking: )

A53: https://developer.arm.com/ip-products/processors/cortex-a/cortex-a53
L1 I-Cache / D-Cache | 8KB to 64KB / per core
L2 Cache | 128KB to 2MB / for all cores

A55: https://developer.arm.com/ip-products/processors/cortex-a/cortex-a55
L1 I-Cache / D-Cache | 16KB to 64KB / per core
L2 Cache | Optional, 64KB to 256KB / per core
L3 Cache | Optional, 512KB to 4MB / for all cores

(https://en.wikipedia.org/wiki/Comparison_of_ARMv8-A_cores)

Djip007 commented 3 years ago

As a benchmark starting point, on an ODROID_C4 with https://github.com/xianyi/OpenBLAS/pull/3278:

  OS               ... Linux             
  Architecture     ... arm64               
  BINARY           ... 64bit                 
  C compiler       ... GCC  (cmd & version : cc (Ubuntu 10.2.0-13ubuntu1) 10.2.0)
  Library Name     ... libopenblasp-r0.3.15.dev.a (Multi-threading; Max num-threads is 4)
  Supporting multiple arm64 cpu models with minimum requirement for the common
#>  export OPENBLAS_NUM_THREADS=1
#>  ./sgemm.goto 64 2048 64
From :  64  To : 2048 Step=64 : Transa=N : Transb=N
          SIZE                   Flops             Time
 M=  64, N=  64, K=  64 :      773.92 MFlops   0.000677 sec
 M= 128, N= 128, K= 128 :     7689.75 MFlops   0.000545 sec
 M= 192, N= 192, K= 192 :     8772.71 MFlops   0.001614 sec
 M= 256, N= 256, K= 256 :     8888.19 MFlops   0.003775 sec
 M= 320, N= 320, K= 320 :     8991.65 MFlops   0.007289 sec
 M= 384, N= 384, K= 384 :     9257.18 MFlops   0.012233 sec
 M= 448, N= 448, K= 448 :     9550.60 MFlops   0.018829 sec
 M= 512, N= 512, K= 512 :     9000.67 MFlops   0.029824 sec
 M= 576, N= 576, K= 576 :     9144.98 MFlops   0.041794 sec
 M= 640, N= 640, K= 640 :     9318.94 MFlops   0.056261 sec
 M= 704, N= 704, K= 704 :     9514.07 MFlops   0.073347 sec
 M= 768, N= 768, K= 768 :     9214.26 MFlops   0.098323 sec
 M= 832, N= 832, K= 832 :     9316.90 MFlops   0.123631 sec
 M= 896, N= 896, K= 896 :     9418.17 MFlops   0.152752 sec
 M= 960, N= 960, K= 960 :     9551.00 MFlops   0.185266 sec
 M=1024, N=1024, K=1024 :     9102.13 MFlops   0.235932 sec
 M=1088, N=1088, K=1088 :     9420.38 MFlops   0.273431 sec
 M=1152, N=1152, K=1152 :     9484.84 MFlops   0.322372 sec
 M=1216, N=1216, K=1216 :     9589.88 MFlops   0.374988 sec
 M=1280, N=1280, K=1280 :     9383.93 MFlops   0.446967 sec
 M=1344, N=1344, K=1344 :     9469.61 MFlops   0.512738 sec
 M=1408, N=1408, K=1408 :     9514.86 MFlops   0.586726 sec
 M=1472, N=1472, K=1472 :     9606.63 MFlops   0.664022 sec
 M=1536, N=1536, K=1536 :     9384.00 MFlops   0.772353 sec
 M=1600, N=1600, K=1600 :     9510.84 MFlops   0.861333 sec
 M=1664, N=1664, K=1664 :     9538.33 MFlops   0.966090 sec
 M=1728, N=1728, K=1728 :     9620.13 MFlops   1.072705 sec
 M=1792, N=1792, K=1792 :     9466.88 MFlops   1.215730 sec
 M=1856, N=1856, K=1856 :     9534.52 MFlops   1.341112 sec
 M=1920, N=1920, K=1920 :     9554.71 MFlops   1.481549 sec
 M=1984, N=1984, K=1984 :     9599.58 MFlops   1.627058 sec
 M=2048, N=2048, K=2048 :     8911.33 MFlops   1.927868 sec

#>  export OPENBLAS_NUM_THREADS=2
#>  ./sgemm.goto 64 2048 64
From :  64  To : 2048 Step=64 : Transa=N : Transb=N
          SIZE                   Flops             Time
 M=  64, N=  64, K=  64 :     2846.68 MFlops   0.000184 sec
 M= 128, N= 128, K= 128 :     8354.12 MFlops   0.000502 sec
 M= 192, N= 192, K= 192 :    13129.40 MFlops   0.001078 sec
 M= 256, N= 256, K= 256 :    14472.86 MFlops   0.002318 sec
 M= 320, N= 320, K= 320 :    16703.94 MFlops   0.003923 sec
 M= 384, N= 384, K= 384 :    17262.80 MFlops   0.006560 sec
 M= 448, N= 448, K= 448 :    17801.29 MFlops   0.010102 sec
 M= 512, N= 512, K= 512 :    17131.49 MFlops   0.015669 sec
 M= 576, N= 576, K= 576 :    17916.53 MFlops   0.021333 sec
 M= 640, N= 640, K= 640 :    18156.16 MFlops   0.028877 sec
 M= 704, N= 704, K= 704 :    18465.99 MFlops   0.037790 sec
 M= 768, N= 768, K= 768 :    18193.33 MFlops   0.049797 sec
 M= 832, N= 832, K= 832 :    18492.63 MFlops   0.062288 sec
 M= 896, N= 896, K= 896 :    18517.73 MFlops   0.077690 sec
 M= 960, N= 960, K= 960 :    18551.33 MFlops   0.095382 sec
 M=1024, N=1024, K=1024 :    17880.84 MFlops   0.120100 sec
 M=1088, N=1088, K=1088 :    18652.22 MFlops   0.138098 sec
 M=1152, N=1152, K=1152 :    18731.87 MFlops   0.163232 sec
 M=1216, N=1216, K=1216 :    18850.07 MFlops   0.190773 sec
 M=1280, N=1280, K=1280 :    18637.52 MFlops   0.225046 sec
 M=1344, N=1344, K=1344 :    18821.93 MFlops   0.257967 sec
 M=1408, N=1408, K=1408 :    18830.02 MFlops   0.296474 sec
 M=1472, N=1472, K=1472 :    18845.06 MFlops   0.338498 sec
 M=1536, N=1536, K=1536 :    18401.62 MFlops   0.393865 sec
 M=1600, N=1600, K=1600 :    18893.69 MFlops   0.433584 sec
 M=1664, N=1664, K=1664 :    18929.47 MFlops   0.486801 sec
 M=1728, N=1728, K=1728 :    19004.37 MFlops   0.543010 sec
 M=1792, N=1792, K=1792 :    18809.49 MFlops   0.611881 sec
 M=1856, N=1856, K=1856 :    18967.91 MFlops   0.674131 sec
 M=1920, N=1920, K=1920 :    18933.26 MFlops   0.747667 sec
 M=1984, N=1984, K=1984 :    18904.88 MFlops   0.826192 sec
 M=2048, N=2048, K=2048 :    17848.42 MFlops   0.962543 sec

#> export OPENBLAS_NUM_THREADS=4
#> ./sgemm.goto 64 2048 64
From :  64  To : 2048 Step=64 : Transa=N : Transb=N
          SIZE                   Flops             Time
 M=  64, N=  64, K=  64 :     3033.35 MFlops   0.000173 sec
 M= 128, N= 128, K= 128 :     4862.04 MFlops   0.000863 sec
 M= 192, N= 192, K= 192 :    17725.30 MFlops   0.000799 sec
 M= 256, N= 256, K= 256 :    20695.71 MFlops   0.001621 sec
 M= 320, N= 320, K= 320 :    26105.75 MFlops   0.002510 sec
 M= 384, N= 384, K= 384 :    27465.15 MFlops   0.004123 sec
 M= 448, N= 448, K= 448 :    28901.49 MFlops   0.006222 sec
 M= 512, N= 512, K= 512 :    28086.69 MFlops   0.009557 sec
 M= 576, N= 576, K= 576 :    29870.26 MFlops   0.012796 sec
 M= 640, N= 640, K= 640 :    29327.49 MFlops   0.017877 sec
 M= 704, N= 704, K= 704 :    27823.31 MFlops   0.025081 sec
 M= 768, N= 768, K= 768 :    24974.92 MFlops   0.036275 sec
 M= 832, N= 832, K= 832 :    26460.19 MFlops   0.043532 sec
 M= 896, N= 896, K= 896 :    25055.48 MFlops   0.057418 sec
 M= 960, N= 960, K= 960 :    24001.33 MFlops   0.073724 sec
 M=1024, N=1024, K=1024 :    22710.70 MFlops   0.094558 sec
 M=1088, N=1088, K=1088 :    32209.15 MFlops   0.079972 sec
 M=1152, N=1152, K=1152 :    31743.49 MFlops   0.096324 sec
 M=1216, N=1216, K=1216 :    31219.87 MFlops   0.115186 sec
 M=1280, N=1280, K=1280 :    29439.04 MFlops   0.142474 sec
 M=1344, N=1344, K=1344 :    29940.74 MFlops   0.162168 sec
 M=1408, N=1408, K=1408 :    29217.83 MFlops   0.191069 sec
 M=1472, N=1472, K=1472 :    28174.27 MFlops   0.226413 sec
 M=1536, N=1536, K=1536 :    26641.53 MFlops   0.272047 sec
 M=1600, N=1600, K=1600 :    27344.08 MFlops   0.299590 sec
 M=1664, N=1664, K=1664 :    26539.66 MFlops   0.347212 sec
 M=1728, N=1728, K=1728 :    25549.72 MFlops   0.403901 sec
 M=1792, N=1792, K=1792 :    24472.63 MFlops   0.470287 sec
 M=1856, N=1856, K=1856 :    25624.16 MFlops   0.499016 sec
 M=1920, N=1920, K=1920 :    25253.27 MFlops   0.560552 sec
 M=1984, N=1984, K=1984 :    24178.38 MFlops   0.645993 sec
 M=2048, N=2048, K=2048 :    23265.46 MFlops   0.738428 sec
brada4 commented 3 years ago

The C4 has zero L2 cache, thus the L3 acts as one. The 128 128 128 sample is indeed anomalous with all cores. Probably the same on A53 and most CPUs.
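
To put a number on the anomaly, here is a minimal sketch that computes speedup and parallel efficiency from a few MFlops figures copied from the runs above. The helper names `speedup` and `efficiency` are illustrative, not part of OpenBLAS or its benchmark suite.

```python
# MFlops figures copied from the sgemm.goto runs quoted above (ODROID-C4).
# Outer key: matrix size (M = N = K); inner key: OPENBLAS_NUM_THREADS.
MFLOPS = {
    128: {1: 7689.75, 2: 8354.12, 4: 4862.04},
    1024: {1: 9102.13, 2: 17880.84, 4: 22710.70},
    1088: {1: 9420.38, 2: 18652.22, 4: 32209.15},
}


def speedup(size, threads):
    """Throughput with `threads` threads relative to the single-thread run."""
    return MFLOPS[size][threads] / MFLOPS[size][1]


def efficiency(size, threads):
    """Speedup divided by thread count; 1.0 would be perfect scaling."""
    return speedup(size, threads) / threads


for size in sorted(MFLOPS):
    for threads in (2, 4):
        print(
            f"size={size:5d} threads={threads}: "
            f"speedup {speedup(size, threads):.2f}x, "
            f"efficiency {efficiency(size, threads):.0%}"
        )
```

With these figures the 128 case on 4 threads comes out slower than the single-thread run, while 1088 on 4 threads scales far better than 1024 does.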

brada4 commented 3 years ago

@Djip007 please check the official tree after the official pull request. The one you tested was incomplete, as @martin-frbg pointed out. You can patch your git checkout with the complete pull request: wget -O- https://github.com/xianyi/OpenBLAS/pull/3278.diff | patch

The one you tested compiled the A55 cores but never selected them dynamically; you were always at generic ARMv8 + fortran5 with 8.2A tuning.

It is not correct to file a single issue for 53/55. The 55 is the follow-up to the 53 marketing-wise, though it comes from the generation of the 75; probably more pessimality will be shared within the new generation than within marketing lines.

brada4 commented 3 years ago

Regarding cache tweaks: https://en.wikipedia.org/wiki/ARM_Cortex-A55 I posted an assumption based on the lowest values: L1I 32K, L1D 16K, L2 64K, L3 - whatever

You got:
L1 instruction cache: 32 KB, 4-way set associative (128 sets), 64 byte lines, shared by 1 processor
L1 data cache: 32 KB, 4-way set associative (128 sets), 64 byte lines, shared by 1 processor
L3 data cache: 512KB, 16-way set associative (512 sets), 64 byte lines, shared by 4 processors

That is, you got double the insn cache, and double the L3 cache (the next level after L1) per core. A similar 8-core construct will still keep my assumption correct.

One sunny day, with infinite developer hands, that will be selected dynamically. For now it is set to a minimal-configuration assumption, so that data does not spill to the (half-speed) next-level cache and (5-10x slower) main RAM during intense computation. It is something like 10% sub-optimal for the largest-cache server CPUs, but it certainly avoids a 10-20x slowdown on Raspberries.
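
The trade-off described above can be sketched as a simple fit check. This is only an illustration under stated assumptions: the block shapes and the 50% headroom factor are hypothetical and are not OpenBLAS's actual blocking parameters; only the 64 KiB and 512 KiB cache sizes come from this thread.

```python
KIB = 1024
FLOAT_BYTES = 4  # sgemm works on single-precision floats


def block_bytes(rows, cols, elem_bytes=FLOAT_BYTES):
    """Memory footprint of a dense rows x cols block."""
    return rows * cols * elem_bytes


def fits(rows, cols, cache_bytes, fraction=0.5):
    """True if the block uses at most `fraction` of the cache.

    The 0.5 headroom is an illustrative assumption (room for other data),
    not a value taken from OpenBLAS.
    """
    return block_bytes(rows, cols) <= cache_bytes * fraction


CONSERVATIVE_L2 = 64 * KIB   # minimal next-level cache assumed in the comment above
REPORTED_L3 = 512 * KIB      # L3 size reported for this board

SMALL_BLOCK = (64, 128)      # hypothetical block, 32 KiB
MEDIUM_BLOCK = (128, 128)    # hypothetical block, 64 KiB

for name, cache in (("64 KiB", CONSERVATIVE_L2), ("512 KiB", REPORTED_L3)):
    for rows, cols in (SMALL_BLOCK, MEDIUM_BLOCK):
        verdict = "fits" if fits(rows, cols, cache) else "spills"
        print(f"{rows}x{cols} block vs {name} cache: {verdict}")
```

A block sized for the larger cache spills on the minimal configuration, whereas a block sized for the minimal configuration fits everywhere. That is the reason for choosing the conservative numbers.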

Djip007 commented 3 years ago

next test with

#define ARCHITECTURE    "ARM64"
#define SUBARCHITECTURE "CORTEXA55"
#define SUBDIRNAME      "arm64"
#define ARCHCONFIG   "-DCORTEXA55 " \
       "-DL1_CODE_SIZE=32768 -DL1_CODE_LINESIZE=64 -DL1_CODE_ASSOCIATIVE=3 " \
       "-DL1_DATA_SIZE=32768 -DL1_DATA_LINESIZE=64 -DL1_DATA_ASSOCIATIVE=2 " \
       "-DL2_SIZE=131072 -DL2_LINESIZE=64 -DL2_ASSOCIATIVE=16 " \
       "-DDTB_DEFAULT_ENTRIES=64 -DDTB_SIZE=4096 " \
       "-DHAVE_VFPV4 -DHAVE_VFPV3 -DHAVE_VFP -DHAVE_NEON -DARMV8"
#define LIBNAME   "cortexa55"
#define CORENAME  "CORTEXA55"
odroid@focal-minimal:~/Developement/OpenBLAS/benchmark$ export OPENBLAS_NUM_THREADS=4
odroid@focal-minimal:~/Developement/OpenBLAS/benchmark$ ./sgemm.goto 128 2048 128
From : 128  To : 2048 Step=128 : Transa=N : Transb=N
          SIZE                   Flops             Time
 M= 128, N= 128, K= 128 :     4927.24 MFlops   0.000851 sec
 M= 256, N= 256, K= 256 :    19843.50 MFlops   0.001691 sec
 M= 384, N= 384, K= 384 :    27650.94 MFlops   0.004096 sec
 M= 512, N= 512, K= 512 :    28045.46 MFlops   0.009571 sec
 M= 640, N= 640, K= 640 :    29327.12 MFlops   0.017877 sec
 M= 768, N= 768, K= 768 :    24889.73 MFlops   0.036399 sec
 M= 896, N= 896, K= 896 :    25133.58 MFlops   0.057240 sec
 M=1024, N=1024, K=1024 :    22006.82 MFlops   0.097583 sec
 M=1152, N=1152, K=1152 :    32177.68 MFlops   0.095024 sec
 M=1280, N=1280, K=1280 :    29917.45 MFlops   0.140196 sec
 M=1408, N=1408, K=1408 :    29567.23 MFlops   0.188811 sec
 M=1536, N=1536, K=1536 :    26706.84 MFlops   0.271382 sec
 M=1664, N=1664, K=1664 :    26546.01 MFlops   0.347129 sec
 M=1792, N=1792, K=1792 :    24319.14 MFlops   0.473256 sec
 M=1920, N=1920, K=1920 :    24741.43 MFlops   0.572149 sec
 M=2048, N=2048, K=2048 :    23298.07 MFlops   0.737395 sec
odroid@focal-minimal:~/Developement/OpenBLAS/benchmark$ export OPENBLAS_NUM_THREADS=2
odroid@focal-minimal:~/Developement/OpenBLAS/benchmark$ ./sgemm.goto 128 2048 128
From : 128  To : 2048 Step=128 : Transa=N : Transb=N
          SIZE                   Flops             Time
 M= 128, N= 128, K= 128 :     9197.61 MFlops   0.000456 sec
 M= 256, N= 256, K= 256 :    13368.13 MFlops   0.002510 sec
 M= 384, N= 384, K= 384 :    17418.68 MFlops   0.006501 sec
 M= 512, N= 512, K= 512 :    17028.18 MFlops   0.015764 sec
 M= 640, N= 640, K= 640 :    18137.04 MFlops   0.028907 sec
 M= 768, N= 768, K= 768 :    17805.36 MFlops   0.050882 sec
 M= 896, N= 896, K= 896 :    18520.99 MFlops   0.077677 sec
 M=1024, N=1024, K=1024 :    17905.24 MFlops   0.119936 sec
 M=1152, N=1152, K=1152 :    18750.55 MFlops   0.163070 sec
 M=1280, N=1280, K=1280 :    18607.57 MFlops   0.225408 sec
 M=1408, N=1408, K=1408 :    18796.94 MFlops   0.296996 sec
 M=1536, N=1536, K=1536 :    18464.19 MFlops   0.392530 sec
 M=1664, N=1664, K=1664 :    18890.02 MFlops   0.487818 sec
 M=1792, N=1792, K=1792 :    18779.72 MFlops   0.612851 sec
 M=1920, N=1920, K=1920 :    18906.01 MFlops   0.748745 sec
 M=2048, N=2048, K=2048 :    17919.40 MFlops   0.958730 sec
odroid@focal-minimal:~/Developement/OpenBLAS/benchmark$ export OPENBLAS_NUM_THREADS=1
odroid@focal-minimal:~/Developement/OpenBLAS/benchmark$ ./sgemm.goto 128 2048 128
From : 128  To : 2048 Step=128 : Transa=N : Transb=N
          SIZE                   Flops             Time
 M= 128, N= 128, K= 128 :     5753.57 MFlops   0.000729 sec
 M= 256, N= 256, K= 256 :     8347.70 MFlops   0.004020 sec
 M= 384, N= 384, K= 384 :     8952.77 MFlops   0.012649 sec
 M= 512, N= 512, K= 512 :     8882.26 MFlops   0.030222 sec
 M= 640, N= 640, K= 640 :     9349.78 MFlops   0.056075 sec
 M= 768, N= 768, K= 768 :     9252.72 MFlops   0.097914 sec
 M= 896, N= 896, K= 896 :     9464.05 MFlops   0.152012 sec
 M=1024, N=1024, K=1024 :     9218.45 MFlops   0.232955 sec
 M=1152, N=1152, K=1152 :     9533.71 MFlops   0.320720 sec
 M=1280, N=1280, K=1280 :     9449.54 MFlops   0.443863 sec
 M=1408, N=1408, K=1408 :     9560.73 MFlops   0.583911 sec
 M=1536, N=1536, K=1536 :     9447.98 MFlops   0.767122 sec
 M=1664, N=1664, K=1664 :     9593.19 MFlops   0.960566 sec
 M=1792, N=1792, K=1792 :     9531.26 MFlops   1.207518 sec
 M=1920, N=1920, K=1920 :     9608.35 MFlops   1.473279 sec
 M=2048, N=2048, K=2048 :     9141.53 MFlops   1.879321 sec
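
For comparing runs like the ones above, a small parser can help. The sketch below assumes the line layout shown in this thread and checks that the reported figure is consistent with the usual GEMM count of 2*M*N*K floating-point operations. The names `parse_line` and `gemm_mflops` are illustrative.

```python
import re

# Matches result lines such as:
#  M=2048, N=2048, K=2048 :     9141.53 MFlops   1.879321 sec
LINE_RE = re.compile(
    r"M=\s*(\d+),\s*N=\s*(\d+),\s*K=\s*(\d+)\s*:"
    r"\s*([\d.]+)\s+MFlops\s+([\d.]+)\s+sec"
)


def parse_line(line):
    """Return a dict for a result line, or None for headers and prompts."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    m, n, k = (int(g) for g in match.group(1, 2, 3))
    return {
        "M": m,
        "N": n,
        "K": k,
        "mflops": float(match.group(4)),
        "sec": float(match.group(5)),
    }


def gemm_mflops(m, n, k, seconds):
    """2*M*N*K operations (one multiply and one add per term), in MFlops."""
    return 2.0 * m * n * k / seconds / 1e6


sample = " M=2048, N=2048, K=2048 :     9141.53 MFlops   1.879321 sec"
row = parse_line(sample)
recomputed = gemm_mflops(row["M"], row["N"], row["K"], row["sec"])
print(f"reported {row['mflops']:.2f} MFlops, recomputed {recomputed:.2f} MFlops")
```

For the small sizes the printed time has too few significant digits for an exact match, so the check is only meaningful on the larger sizes.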
Djip007 commented 3 years ago

> @Djip007 please check the official tree after the official pull request. The one you tested was incomplete, as @martin-frbg pointed out. You can patch your git checkout with the complete pull request: wget -O- https://github.com/xianyi/OpenBLAS/pull/3278.diff | patch
>
> The one you tested compiled the A55 cores but never selected them dynamically; you were always at generic ARMv8 + fortran5 with 8.2A tuning.

Well... I am not really sure I ran the correct test... For the last test (on your branch... without the last patch) I built with:

make  NO_LAPACK=1 TARGET=CORTEXA55
cd benchmark
make  NO_LAPACK=1 TARGET=CORTEXA55
export OPENBLAS_NUM_THREADS=[N]
./sgemm.goto 128 2048 128

if I understood the patch for multi mode correctly, I must test the merge request with:

make  NO_LAPACK=1 DYNAMIC_ARCH=1

???
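
Whichever build is used, the thread-count sweep itself can be scripted. A minimal sketch, assuming the benchmark is run from the `benchmark` directory as in the commands above. The helper names are illustrative, and because the thread does not say which stream the benchmark prints to, the sketch merges stderr into stdout.

```python
import os
import subprocess


def bench_command(start, stop, step, binary="./sgemm.goto"):
    """Argument list for one sweep: from `start` to `stop` in `step` increments."""
    return [binary, str(start), str(stop), str(step)]


def bench_env(threads, base_env=None):
    """Copy of the environment with OPENBLAS_NUM_THREADS set."""
    env = dict(os.environ if base_env is None else base_env)
    env["OPENBLAS_NUM_THREADS"] = str(threads)
    return env


def run_sweep(thread_counts=(1, 2, 4), start=128, stop=2048, step=128,
              cwd="benchmark"):
    """Run the benchmark once per thread count and return the raw text output."""
    results = {}
    for threads in thread_counts:
        done = subprocess.run(
            bench_command(start, stop, step),
            env=bench_env(threads),
            cwd=cwd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge both streams (see note above)
            text=True,
            check=True,
        )
        results[threads] = done.stdout
    return results
```

Calling `run_sweep()` from the OpenBLAS source root after building would return one block of output per thread count, ready to be compared side by side.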

Djip007 commented 3 years ago

> Regarding cache tweaks: https://en.wikipedia.org/wiki/ARM_Cortex-A55 I posted an assumption based on the lowest values: L1I 32K, L1D 16K, L2 64K, L3 - whatever
>
> You got:
> L1 instruction cache: 32 KB, 4-way set associative (128 sets), 64 byte lines, shared by 1 processor
> L1 data cache: 32 KB, 4-way set associative (128 sets), 64 byte lines, shared by 1 processor
> L3 data cache: 512KB, 16-way set associative (512 sets), 64 byte lines, shared by 4 processors
>
> That is, you got double the insn cache, and double the L3 cache (the next level after L1) per core. A similar 8-core construct will still keep my assumption correct.
>
> One sunny day, with infinite developer hands, that will be selected dynamically. For now it is set to a minimal-configuration assumption, so that data does not spill to the (half-speed) next-level cache and (5-10x slower) main RAM during intense computation. It is something like 10% sub-optimal for the largest-cache server CPUs, but it certainly avoids a 10-20x slowdown on Raspberries.

I think I got your point... and I am OK with the sizes. But I have a doubt about the other parameters... If an L2 cache exists, each core has all of its lines to itself... but the L3 is shared between all the cores, so its lines are shared (note: it is the same for the L2 of the A53). Couldn't the mediocre performance on 4 cores be due to this?
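
The question can be made concrete with a rough sketch. It assumes an even split of the shared cache between active threads, which is a simplification because real hardware does not partition a shared cache this strictly, and it uses a hypothetical 192 KiB per-thread working set. Only the 512 KiB L3 size comes from this thread. It illustrates the hypothesis and does not show that this is the actual cause.

```python
KIB = 1024


def per_thread_share(shared_cache_bytes, active_threads):
    """Even split of a shared cache between threads (a simplifying assumption)."""
    return shared_cache_bytes // active_threads


L3_BYTES = 512 * KIB      # shared L3 size reported for this board
WORKING_SET = 192 * KIB   # hypothetical per-thread working set

for threads in (1, 2, 4):
    share = per_thread_share(L3_BYTES, threads)
    verdict = "fits" if WORKING_SET <= share else "spills"
    print(f"{threads} thread(s): {share // KIB} KiB share each, working set {verdict}")
```

Under these assumptions, a working set that fits comfortably with one or two threads no longer fits once four threads compete for the same cache.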

brada4 commented 3 years ago

If you feel adventurous, the work divider is in interface/gemm.c. Nobody knows the optimal heuristic; such black holes exist for nearly every supported CPU type.

Djip007 commented 3 years ago

build of merge request is work in progress...

make  NO_LAPACK=1 TARGET=CORTEXA55
OpenBLAS build complete. (BLAS CBLAS)
  OS               ... Linux             
  Architecture     ... arm64               
  BINARY           ... 64bit                 
  C compiler       ... GCC  (cmd & version : cc (Ubuntu 10.2.0-13ubuntu1) 10.2.0)
  Fortran compiler ... GFORTRAN  (cmd & version : GNU Fortran (Ubuntu 10.2.0-13ubuntu1) 10.2.0)
  Library Name     ... libopenblas_cortexa55p-r0.3.15.dev.a (Multi-threading; Max num-threads is 4)
#>  export OPENBLAS_NUM_THREADS=1
#>  ./sgemm.goto 64 2048 64
From :  64  To : 2048 Step=64 : Transa=N : Transb=N
          SIZE                   Flops             Time
 M=  64, N=  64, K=  64 :      773.92 MFlops   0.000677 sec
 M= 128, N= 128, K= 128 :     7689.75 MFlops   0.000545 sec
 M= 192, N= 192, K= 192 :     8772.71 MFlops   0.001614 sec
 M= 256, N= 256, K= 256 :     8888.19 MFlops   0.003775 sec
 M= 320, N= 320, K= 320 :     8991.65 MFlops   0.007289 sec
 M= 384, N= 384, K= 384 :     9257.18 MFlops   0.012233 sec
 M= 448, N= 448, K= 448 :     9550.60 MFlops   0.018829 sec
 M= 512, N= 512, K= 512 :     9000.67 MFlops   0.029824 sec
 M= 576, N= 576, K= 576 :     9144.98 MFlops   0.041794 sec
 M= 640, N= 640, K= 640 :     9318.94 MFlops   0.056261 sec
 M= 704, N= 704, K= 704 :     9514.07 MFlops   0.073347 sec
 M= 768, N= 768, K= 768 :     9214.26 MFlops   0.098323 sec
 M= 832, N= 832, K= 832 :     9316.90 MFlops   0.123631 sec
 M= 896, N= 896, K= 896 :     9418.17 MFlops   0.152752 sec
 M= 960, N= 960, K= 960 :     9551.00 MFlops   0.185266 sec
 M=1024, N=1024, K=1024 :     9102.13 MFlops   0.235932 sec
 M=1088, N=1088, K=1088 :     9420.38 MFlops   0.273431 sec
 M=1152, N=1152, K=1152 :     9484.84 MFlops   0.322372 sec
 M=1216, N=1216, K=1216 :     9589.88 MFlops   0.374988 sec
 M=1280, N=1280, K=1280 :     9383.93 MFlops   0.446967 sec
 M=1344, N=1344, K=1344 :     9469.61 MFlops   0.512738 sec
 M=1408, N=1408, K=1408 :     9514.86 MFlops   0.586726 sec
 M=1472, N=1472, K=1472 :     9606.63 MFlops   0.664022 sec
 M=1536, N=1536, K=1536 :     9384.00 MFlops   0.772353 sec
 M=1600, N=1600, K=1600 :     9510.84 MFlops   0.861333 sec
 M=1664, N=1664, K=1664 :     9538.33 MFlops   0.966090 sec
 M=1728, N=1728, K=1728 :     9620.13 MFlops   1.072705 sec
 M=1792, N=1792, K=1792 :     9466.88 MFlops   1.215730 sec
 M=1856, N=1856, K=1856 :     9534.52 MFlops   1.341112 sec
 M=1920, N=1920, K=1920 :     9554.71 MFlops   1.481549 sec
 M=1984, N=1984, K=1984 :     9599.58 MFlops   1.627058 sec
 M=2048, N=2048, K=2048 :     8911.33 MFlops   1.927868 sec

#>  export OPENBLAS_NUM_THREADS=2
#>  ./sgemm.goto 64 2048 64
From :  64  To : 2048 Step=64 : Transa=N : Transb=N
          SIZE                   Flops             Time
 M=  64, N=  64, K=  64 :     2846.68 MFlops   0.000184 sec
 M= 128, N= 128, K= 128 :     8354.12 MFlops   0.000502 sec
 M= 192, N= 192, K= 192 :    13129.40 MFlops   0.001078 sec
 M= 256, N= 256, K= 256 :    14472.86 MFlops   0.002318 sec
 M= 320, N= 320, K= 320 :    16703.94 MFlops   0.003923 sec
 M= 384, N= 384, K= 384 :    17262.80 MFlops   0.006560 sec
 M= 448, N= 448, K= 448 :    17801.29 MFlops   0.010102 sec
 M= 512, N= 512, K= 512 :    17131.49 MFlops   0.015669 sec
 M= 576, N= 576, K= 576 :    17916.53 MFlops   0.021333 sec
 M= 640, N= 640, K= 640 :    18156.16 MFlops   0.028877 sec
 M= 704, N= 704, K= 704 :    18465.99 MFlops   0.037790 sec
 M= 768, N= 768, K= 768 :    18193.33 MFlops   0.049797 sec
 M= 832, N= 832, K= 832 :    18492.63 MFlops   0.062288 sec
 M= 896, N= 896, K= 896 :    18517.73 MFlops   0.077690 sec
 M= 960, N= 960, K= 960 :    18551.33 MFlops   0.095382 sec
 M=1024, N=1024, K=1024 :    17880.84 MFlops   0.120100 sec
 M=1088, N=1088, K=1088 :    18652.22 MFlops   0.138098 sec
 M=1152, N=1152, K=1152 :    18731.87 MFlops   0.163232 sec
 M=1216, N=1216, K=1216 :    18850.07 MFlops   0.190773 sec
 M=1280, N=1280, K=1280 :    18637.52 MFlops   0.225046 sec
 M=1344, N=1344, K=1344 :    18821.93 MFlops   0.257967 sec
 M=1408, N=1408, K=1408 :    18830.02 MFlops   0.296474 sec
 M=1472, N=1472, K=1472 :    18845.06 MFlops   0.338498 sec
 M=1536, N=1536, K=1536 :    18401.62 MFlops   0.393865 sec
 M=1600, N=1600, K=1600 :    18893.69 MFlops   0.433584 sec
 M=1664, N=1664, K=1664 :    18929.47 MFlops   0.486801 sec
 M=1728, N=1728, K=1728 :    19004.37 MFlops   0.543010 sec
 M=1792, N=1792, K=1792 :    18809.49 MFlops   0.611881 sec
 M=1856, N=1856, K=1856 :    18967.91 MFlops   0.674131 sec
 M=1920, N=1920, K=1920 :    18933.26 MFlops   0.747667 sec
 M=1984, N=1984, K=1984 :    18904.88 MFlops   0.826192 sec
 M=2048, N=2048, K=2048 :    17848.42 MFlops   0.962543 sec

#> export OPENBLAS_NUM_THREADS=4
#> ./sgemm.goto 64 2048 64
From :  64  To : 2048 Step=64 : Transa=N : Transb=N
          SIZE                   Flops             Time
 M=  64, N=  64, K=  64 :     3033.35 MFlops   0.000173 sec
 M= 128, N= 128, K= 128 :     4862.04 MFlops   0.000863 sec
 M= 192, N= 192, K= 192 :    17725.30 MFlops   0.000799 sec
 M= 256, N= 256, K= 256 :    20695.71 MFlops   0.001621 sec
 M= 320, N= 320, K= 320 :    26105.75 MFlops   0.002510 sec
 M= 384, N= 384, K= 384 :    27465.15 MFlops   0.004123 sec
 M= 448, N= 448, K= 448 :    28901.49 MFlops   0.006222 sec
 M= 512, N= 512, K= 512 :    28086.69 MFlops   0.009557 sec
 M= 576, N= 576, K= 576 :    29870.26 MFlops   0.012796 sec
 M= 640, N= 640, K= 640 :    29327.49 MFlops   0.017877 sec
 M= 704, N= 704, K= 704 :    27823.31 MFlops   0.025081 sec
 M= 768, N= 768, K= 768 :    24974.92 MFlops   0.036275 sec
 M= 832, N= 832, K= 832 :    26460.19 MFlops   0.043532 sec
 M= 896, N= 896, K= 896 :    25055.48 MFlops   0.057418 sec
 M= 960, N= 960, K= 960 :    24001.33 MFlops   0.073724 sec
 M=1024, N=1024, K=1024 :    22710.70 MFlops   0.094558 sec
 M=1088, N=1088, K=1088 :    32209.15 MFlops   0.079972 sec
 M=1152, N=1152, K=1152 :    31743.49 MFlops   0.096324 sec
 M=1216, N=1216, K=1216 :    31219.87 MFlops   0.115186 sec
 M=1280, N=1280, K=1280 :    29439.04 MFlops   0.142474 sec
 M=1344, N=1344, K=1344 :    29940.74 MFlops   0.162168 sec
 M=1408, N=1408, K=1408 :    29217.83 MFlops   0.191069 sec
 M=1472, N=1472, K=1472 :    28174.27 MFlops   0.226413 sec
 M=1536, N=1536, K=1536 :    26641.53 MFlops   0.272047 sec
 M=1600, N=1600, K=1600 :    27344.08 MFlops   0.299590 sec
 M=1664, N=1664, K=1664 :    26539.66 MFlops   0.347212 sec
 M=1728, N=1728, K=1728 :    25549.72 MFlops   0.403901 sec
 M=1792, N=1792, K=1792 :    24472.63 MFlops   0.470287 sec
 M=1856, N=1856, K=1856 :    25624.16 MFlops   0.499016 sec
 M=1920, N=1920, K=1920 :    25253.27 MFlops   0.560552 sec
 M=1984, N=1984, K=1984 :    24178.38 MFlops   0.645993 sec
 M=2048, N=2048, K=2048 :    23265.46 MFlops   0.738428 sec
#> make  NO_LAPACK=1 DYNAMIC_ARCH=1
OpenBLAS build complete. (BLAS CBLAS)
  OS               ... Linux             
  Architecture     ... arm64               
  BINARY           ... 64bit                 
  C compiler       ... GCC  (cmd & version : cc (Ubuntu 10.2.0-13ubuntu1) 10.2.0)
  Fortran compiler ... GFORTRAN  (cmd & version : GNU Fortran (Ubuntu 10.2.0-13ubuntu1) 10.2.0)
  Library Name     ... libopenblasp-r0.3.15.dev.a (Multi-threading; Max num-threads is 4)
  Supporting multiple arm64 cpu models with minimum requirement for the common code being CORTEXA55
odroid@focal-minimal:~/Developement/OpenBLAS/benchmark$ export OPENBLAS_NUM_THREADS=1
odroid@focal-minimal:~/Developement/OpenBLAS/benchmark$ ./sgemm.goto 64 2048 64
From :  64  To : 2048 Step=64 : Transa=N : Transb=N
          SIZE                   Flops             Time
 M=  64, N=  64, K=  64 :     2490.56 MFlops   0.000211 sec
 M= 128, N= 128, K= 128 :     7625.66 MFlops   0.000550 sec
 M= 192, N= 192, K= 192 :     8778.83 MFlops   0.001612 sec
 M= 256, N= 256, K= 256 :     8559.60 MFlops   0.003920 sec
 M= 320, N= 320, K= 320 :     8685.53 MFlops   0.007545 sec
 M= 384, N= 384, K= 384 :     8783.26 MFlops   0.012893 sec
 M= 448, N= 448, K= 448 :     9508.14 MFlops   0.018913 sec
 M= 512, N= 512, K= 512 :     9040.48 MFlops   0.029693 sec
 M= 576, N= 576, K= 576 :     9230.52 MFlops   0.041407 sec
 M= 640, N= 640, K= 640 :     9357.00 MFlops   0.056032 sec
 M= 704, N= 704, K= 704 :     9532.46 MFlops   0.073205 sec
 M= 768, N= 768, K= 768 :     9214.04 MFlops   0.098325 sec
 M= 832, N= 832, K= 832 :     9367.86 MFlops   0.122959 sec
 M= 896, N= 896, K= 896 :     9447.53 MFlops   0.152278 sec
 M= 960, N= 960, K= 960 :     9566.27 MFlops   0.184970 sec
 M=1024, N=1024, K=1024 :     9177.08 MFlops   0.234005 sec
 M=1088, N=1088, K=1088 :     9455.96 MFlops   0.272403 sec
 M=1152, N=1152, K=1152 :     9506.57 MFlops   0.321635 sec
 M=1216, N=1216, K=1216 :     9605.74 MFlops   0.374369 sec
 M=1280, N=1280, K=1280 :     9399.90 MFlops   0.446207 sec
 M=1344, N=1344, K=1344 :     9499.94 MFlops   0.511101 sec
 M=1408, N=1408, K=1408 :     9542.64 MFlops   0.585019 sec
 M=1472, N=1472, K=1472 :     9625.37 MFlops   0.662729 sec
 M=1536, N=1536, K=1536 :     9413.08 MFlops   0.769967 sec
 M=1600, N=1600, K=1600 :     9535.88 MFlops   0.859071 sec
 M=1664, N=1664, K=1664 :     9565.70 MFlops   0.963325 sec
 M=1728, N=1728, K=1728 :     9636.27 MFlops   1.070908 sec
 M=1792, N=1792, K=1792 :     9478.55 MFlops   1.214234 sec
 M=1856, N=1856, K=1856 :     9556.59 MFlops   1.338015 sec
 M=1920, N=1920, K=1920 :     9577.74 MFlops   1.477987 sec
 M=1984, N=1984, K=1984 :     9629.76 MFlops   1.621958 sec
 M=2048, N=2048, K=2048 :     9125.32 MFlops   1.882660 sec
odroid@focal-minimal:~/Developement/OpenBLAS/benchmark$ export OPENBLAS_NUM_THREADS=2
odroid@focal-minimal:~/Developement/OpenBLAS/benchmark$ ./sgemm.goto 64 2048 64
From :  64  To : 2048 Step=64 : Transa=N : Transb=N
          SIZE                   Flops             Time
 M=  64, N=  64, K=  64 :     2781.86 MFlops   0.000188 sec
 M= 128, N= 128, K= 128 :    12008.91 MFlops   0.000349 sec
 M= 192, N= 192, K= 192 :    14505.68 MFlops   0.000976 sec
 M= 256, N= 256, K= 256 :    15427.52 MFlops   0.002175 sec
 M= 320, N= 320, K= 320 :    17035.76 MFlops   0.003847 sec
 M= 384, N= 384, K= 384 :    17265.87 MFlops   0.006559 sec
 M= 448, N= 448, K= 448 :    17746.82 MFlops   0.010133 sec
 M= 512, N= 512, K= 512 :    17186.65 MFlops   0.015619 sec
 M= 576, N= 576, K= 576 :    17791.97 MFlops   0.021482 sec
 M= 640, N= 640, K= 640 :    18030.39 MFlops   0.029078 sec
 M= 704, N= 704, K= 704 :    18293.16 MFlops   0.038147 sec
 M= 768, N= 768, K= 768 :    18064.99 MFlops   0.050151 sec
 M= 832, N= 832, K= 832 :    18303.48 MFlops   0.062931 sec
 M= 896, N= 896, K= 896 :    18390.56 MFlops   0.078227 sec
 M= 960, N= 960, K= 960 :    18421.11 MFlops   0.096057 sec
 M=1024, N=1024, K=1024 :    17798.18 MFlops   0.120657 sec
 M=1088, N=1088, K=1088 :    18528.99 MFlops   0.139016 sec
 M=1152, N=1152, K=1152 :    18620.11 MFlops   0.164212 sec
 M=1216, N=1216, K=1216 :    18735.30 MFlops   0.191942 sec
 M=1280, N=1280, K=1280 :    18523.94 MFlops   0.226426 sec
 M=1344, N=1344, K=1344 :    18699.89 MFlops   0.259650 sec
 M=1408, N=1408, K=1408 :    18703.58 MFlops   0.298479 sec
 M=1472, N=1472, K=1472 :    18714.64 MFlops   0.340857 sec
 M=1536, N=1536, K=1536 :    18357.50 MFlops   0.394812 sec
 M=1600, N=1600, K=1600 :    18781.82 MFlops   0.436166 sec
 M=1664, N=1664, K=1664 :    18804.43 MFlops   0.490038 sec
 M=1728, N=1728, K=1728 :    18886.51 MFlops   0.546399 sec
 M=1792, N=1792, K=1792 :    18704.13 MFlops   0.615328 sec
 M=1856, N=1856, K=1856 :    18842.22 MFlops   0.678628 sec
 M=1920, N=1920, K=1920 :    18800.87 MFlops   0.752932 sec
 M=1984, N=1984, K=1984 :    18786.27 MFlops   0.831408 sec
 M=2048, N=2048, K=2048 :    17878.44 MFlops   0.960926 sec
odroid@focal-minimal:~/Developement/OpenBLAS/benchmark$ export OPENBLAS_NUM_THREADS=4
odroid@focal-minimal:~/Developement/OpenBLAS/benchmark$ ./sgemm.goto 64 2048 64
From :  64  To : 2048 Step=64 : Transa=N : Transb=N
          SIZE                   Flops             Time
 M=  64, N=  64, K=  64 :     2676.52 MFlops   0.000196 sec
 M= 128, N= 128, K= 128 :     4931.11 MFlops   0.000851 sec
 M= 192, N= 192, K= 192 :    18020.53 MFlops   0.000786 sec
 M= 256, N= 256, K= 256 :    20492.93 MFlops   0.001637 sec
 M= 320, N= 320, K= 320 :    26012.91 MFlops   0.002519 sec
 M= 384, N= 384, K= 384 :    27427.17 MFlops   0.004129 sec
 M= 448, N= 448, K= 448 :    28680.99 MFlops   0.006270 sec
 M= 512, N= 512, K= 512 :    27976.91 MFlops   0.009595 sec
 M= 576, N= 576, K= 576 :    29603.12 MFlops   0.012911 sec
 M= 640, N= 640, K= 640 :    29189.17 MFlops   0.017962 sec
 M= 704, N= 704, K= 704 :    25844.33 MFlops   0.027001 sec
 M= 768, N= 768, K= 768 :    24763.68 MFlops   0.036585 sec
 M= 832, N= 832, K= 832 :    26657.27 MFlops   0.043210 sec
 M= 896, N= 896, K= 896 :    25174.39 MFlops   0.057147 sec
 M= 960, N= 960, K= 960 :    23439.30 MFlops   0.075492 sec
 M=1024, N=1024, K=1024 :    22255.55 MFlops   0.096492 sec
 M=1088, N=1088, K=1088 :    32171.28 MFlops   0.080066 sec
 M=1152, N=1152, K=1152 :    31792.78 MFlops   0.096174 sec
 M=1216, N=1216, K=1216 :    31240.63 MFlops   0.115109 sec
 M=1280, N=1280, K=1280 :    29603.35 MFlops   0.141683 sec
 M=1344, N=1344, K=1344 :    30045.89 MFlops   0.161600 sec
 M=1408, N=1408, K=1408 :    29354.56 MFlops   0.190179 sec
 M=1472, N=1472, K=1472 :    28315.85 MFlops   0.225281 sec
 M=1536, N=1536, K=1536 :    26723.41 MFlops   0.271214 sec
 M=1600, N=1600, K=1600 :    27362.91 MFlops   0.299383 sec
 M=1664, N=1664, K=1664 :    26617.00 MFlops   0.346203 sec
 M=1728, N=1728, K=1728 :    25587.49 MFlops   0.403305 sec
 M=1792, N=1792, K=1792 :    24427.06 MFlops   0.471165 sec
 M=1856, N=1856, K=1856 :    25324.48 MFlops   0.504921 sec
 M=1920, N=1920, K=1920 :    24782.22 MFlops   0.571207 sec
 M=1984, N=1984, K=1984 :    24098.30 MFlops   0.648140 sec
 M=2048, N=2048, K=2048 :    22832.33 MFlops   0.752436 sec
brada4 commented 3 years ago

Slightly better than ARMv8 core as expected.

Djip007 commented 3 years ago

OK with the "min" cache size you chose! The A53 cache config, though... does not look like the "minimum"... :sunglasses: The minimum is L1: 8k, L2: 128k/4 (unified cache?) => 32k

Djip007 commented 3 years ago

> Slightly better than ARMv8 core as expected.

yep!

brada4 commented 3 years ago

We cannot change the A53. I just explained my rationale behind the low numbers here: to stay very conservative rather than get the maximum out of the biggest CPU.

Djip007 commented 3 years ago

> We cannot change the A53. I just explained my rationale behind the low numbers here: to stay very conservative rather than get the maximum out of the biggest CPU.

:+1: ... my test with a different cache size was just to see what I could get... and for now, with the correct dynamic selection... nothing ... :( ...

Djip007 commented 3 years ago

I put the A53 / A55 together because both are "in order" cores... The A73 / A75 are "out of order", so I thought that some optimizations, or at least the ideas behind them, could benefit both.

brada4 commented 3 years ago

No problem whatsoever. Thank you for the heads-up on the unknown CPUID.

martin-frbg commented 3 years ago

Note that #2618 already added a cpu-specific SGEMM kernel for A53 (which is now used for the A55 as well) to address the "dual issue fmla" limitations