Closed jfikar closed 3 months ago
Hi @jfikar,
The `DYNAMIC_ARCH` selection happens in https://github.com/OpenMathLib/OpenBLAS/blob/develop/driver/others/dynamic_arm64.c. There's nothing there to detect this part, so it makes sense that it falls back to the generic `ARMV8`, as that's safest.
Looking at https://github.com/OpenMathLib/OpenBLAS/blob/develop/param.h#L3308C2-L3308C4 and https://github.com/OpenMathLib/OpenBLAS/blob/develop/kernel/arm64/KERNEL.CORTEXX1, the `CORTEXX1` target is an alias for `CORTEXA57`, except without the assumptions around caches. Did you try the `CORTEXA57` target?
If that works, I suggest we add the correct part (I think it's `0xD0B`) as an alias for `CORTEXA57` in a similar way.
Autodetection is problematic on big.LITTLE systems, as the returned ID may depend on system load: if the system has been idle for a while, the little cores will get detected; if there has been any compute/compile load immediately prior to the check, it will pick up the big one(s). At runtime, you can use `export OPENBLAS_VERBOSE=2` to make it report what it selected, or `OPENBLAS_CORETYPE=CORTEXX1` to force selection of a particular target. (I guess you could try `NEOVERSEN1` as an existing variant, but this will probably suffer from wrong assumptions about the available cache too. `CORTEXA57` is the generic role model for Cortex-A, but I expect the performance difference to `ARMV8` to be minimal.)
`OPENBLAS_VERBOSE=2` works, thanks. On the Raspberry Pi 5 it gives `armv8` and on the Rock-5B it gives `cortexa55`. I could not get any other answer, though it may be possible. And I agree, autodetection is difficult on big.LITTLE.
These are the results for the armv8, cortexa57, and neoversen1 targets that were suggested:
armv8:

| Threads | GFlop/s RPi | ops/cycle/core RPi | GFlop/s Rock | ops/cycle/core Rock |
|---|---|---|---|---|
| 1 | 15.31 | 6.4 | 15.86 | 6.9 |
| 2 | 28.71 | 6.0 | 31.68 | 6.9 |
| 3 | 35.01 | 4.9 | 46.38 | 6.4 |
| 4 | 39.17 | 4.1 | 59.69 | 6.5 |
cortexa57:

| Threads | GFlop/s RPi | ops/cycle/core RPi | GFlop/s Rock | ops/cycle/core Rock |
|---|---|---|---|---|
| 1 | 15.33 | 6.4 | 16.02 | 7.0 |
| 2 | 28.60 | 6.0 | 31.67 | 6.9 |
| 3 | 34.90 | 4.8 | 46.73 | 6.8 |
| 4 | 39.01 | 4.1 | 59.31 | 6.4 |
neoversen1:

| Threads | GFlop/s RPi | ops/cycle/core RPi | GFlop/s Rock | ops/cycle/core Rock |
|---|---|---|---|---|
| 1 | 16.07 | 6.7 | 16.53 | 7.2 |
| 2 | 31.27 | 6.5 | 33.15 | 7.2 |
| 3 | 39.73 | 5.5 | 48.28 | 7.0 |
| 4 | 21.69 | 2.3 | 62.66 | 6.8 |
I also ran BLIS with the cortexa57 target:

| Threads | GFlop/s RPi | ops/cycle/core RPi | GFlop/s Rock | ops/cycle/core Rock |
|---|---|---|---|---|
| 1 | 16.53 | 6.9 | 16.21 | 7.0 |
| 2 | 31.48 | 6.6 | 32.46 | 7.1 |
| 3 | 43.96 | 6.1 | 46.75 | 6.8 |
| 4 | 50.28 | 5.2 | 60.12 | 6.5 |
The neoversen1 target is good on the Rock-5B Cortex-A76. On the Raspberry Pi 5 it is probably cache-limited again, like the cortexx1 target.
The cortexx1 target with reduced DGEMM_DEFAULT_P and DGEMM_DEFAULT_Q values performs like cortexa57 and armv8. It is not so drastically slowed down at 4 cores, but it is still underperforming, not even reaching 40 GFlop/s. BLIS gives 50 and 55 GFlop/s with the cortexa57 and firestorm targets.
It is strange that these two Cortex-A76 implementations perform so differently, when the only difference is 3 MB of L3 cache instead of 2 MB.
Thanks. That pronounced difference between the two boards is indeed strange - I'd be tempted to alias NeoverseN1 if it did not take such a plunge at 4 cores. Might be worth playing with the SWITCH_RATIO for NEOVERSEN1 in param.h (or introducing the same for A57/ARMV8) - or perhaps it is something else, like temperature/frequency management?
It's a fair point. I trust the frequencies on the Raspberry Pi 5 more. It has a tool, `vcgencmd get_throttled`, to check whether thermal throttling has occurred. It still shows 0x0, i.e. no throttling has occurred since the last restart. But I have measured a maximum of 78°C, which is close to the first thermal throttling limit of 80°C. So I'll repeat some of the measurements with an additional fan.
The Rock-5B needs an additional fan, as without it it hits 75°C and silently starts throttling. With the additional cooling it stays below 65°C.
SWITCH_RATIO is 8 for neoversen1 (for double), while for A57/armv8 it is 2, right? I can try changing it to 4 and 2.
Also, neoversen1 has DGEMM_DEFAULT_P=240 and DGEMM_DEFAULT_Q=320. That's more than the 160, 128 for A57/armv8, but less than the 256, 512 for cortexx1.
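In param.h style, the defaults being compared are (the `#if` structure here is illustrative; only the numbers are from this thread):

```c
#if defined(NEOVERSEN1)
#define SWITCH_RATIO      8
#define DGEMM_DEFAULT_P 240
#define DGEMM_DEFAULT_Q 320
#elif defined(CORTEXA57) || defined(ARMV8)
#define DGEMM_DEFAULT_P 160
#define DGEMM_DEFAULT_Q 128
#elif defined(CORTEXX1)
#define DGEMM_DEFAULT_P 256
#define DGEMM_DEFAULT_Q 512
#endif
```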
Here are the new results for the Raspberry Pi 5 with additional cooling. I checked that the temperature stayed below 65°C the whole time, far from the 80°C limit. The results are slightly better, but at most by 7%, so it still does not explain the drop with 4 threads.
I'm showing the results for neoversen1, RPi5:
| Threads | GFlop/s | ops/cycle/core | GFlop/s +extra cooling | ops/cycle/core +extra cooling |
|---|---|---|---|---|
| 1 | 16.07 | 6.7 | 16.14 | 6.7 |
| 2 | 31.27 | 6.5 | 31.32 | 6.5 |
| 3 | 39.73 | 5.5 | 40.68 | 5.7 |
| 4 | 21.69 | 2.3 | 23.12 | 2.4 |
Also, the good BLIS results at 4 threads suggest that it is probably not a thermal effect.
I only brought up the cooling issue because the performance with 4 threads was so different between the two boards. Does it react to changes in SWITCH_RATIO at all? (I do not think we have ever seen effects bigger than about 10 percent with poor choices for either SWITCH_RATIO or P,Q, and those would be more likely to depend on L2 than L3.)
I have new results. SWITCH_RATIO has no effect; I tried the default 8, then 4 and 2. The effect of reducing P and Q is similar to what we have seen before.
- Neoverse N1, P=240 Q=320 (default), 4 threads: 22.69 GFlop/s
- Neoverse N1, P=160 Q=128 (cortexa57), 4 threads: 39.75 GFlop/s

I still can't even get past 40 GFlop/s. How do you properly adjust the P and Q values?
Good question; it's a dark art (and probably lost knowledge). P*Q is supposed to be roughly equal to half the L2 cache for a start, but beyond that point it is probably down to benchmarking. I now wonder if changing (halving/doubling) DGEMM_DEFAULT_R has any influence (though it might hurt single-thread performance).
R=2048, 4096, or 8192 does not have much influence. However, it seems I have found a good combination of P and Q which performs well on both my A76s.
0: P=240 Q=320 (Neoverse N1 default)

| Threads | RPi5 GFlop/s R=2048 | RPi5 GFlop/s R=4096 | RPi5 GFlop/s R=8192 | Rock5 GFlop/s R=2048 | Rock5 GFlop/s R=4096 | Rock5 GFlop/s R=8192 |
|---|---|---|---|---|---|---|
| 1 | 15.99 | 16.06 | 16.18 | 17.15 | 17.19 | 17.21 |
| 2 | 30.71 | 31.12 | 31.12 | 33.39 | 33.42 | 33.43 |
| 3 | 38.88 | 41.02 | 41.32 | 48.47 | 48.49 | 48.55 |
| 4 | 22.80 | 22.33 | 22.17 | 62.69 | 62.54 | 62.89 |
1: P=120 Q=160 (Neoverse N1 divided by 2)

| Threads | RPi5 GFlop/s R=2048 | RPi5 GFlop/s R=4096 | RPi5 GFlop/s R=8192 | Rock5 GFlop/s R=2048 | Rock5 GFlop/s R=4096 | Rock5 GFlop/s R=8192 |
|---|---|---|---|---|---|---|
| 1 | 15.17 | 15.01 | 14.95 | 16.46 | 16.36 | 16.34 |
| 2 | 28.48 | 28.32 | 28.56 | 31.84 | 31.82 | 31.93 |
| 3 | 33.77 | 34.71 | 35.59 | 46.96 | 46.89 | 46.89 |
| 4 | 37.94 | 38.67 | 39.23 | 61.27 | 61.23 | 61.20 |
2: P=160 Q=128 (CortexA57 default)

| Threads | RPi5 GFlop/s R=2048 | RPi5 GFlop/s R=4096 | RPi5 GFlop/s R=8192 | Rock5 GFlop/s R=2048 | Rock5 GFlop/s R=4096 | Rock5 GFlop/s R=8192 |
|---|---|---|---|---|---|---|
| 1 | 15.42 | 15.29 | 15.24 | 16.62 | 16.55 | 16.40 |
| 2 | 28.91 | 28.74 | 28.79 | 32.35 | 32.36 | 32.35 |
| 3 | 34.75 | 35.95 | 37.05 | 47.67 | 47.66 | 47.66 |
| 4 | 37.68 | 39.53 | 41.23 | 62.20 | 62.20 | 62.23 |
3: P=80 Q=64 (CortexA57 divided by 2)

| Threads | RPi5 GFlop/s R=2048 | RPi5 GFlop/s R=4096 | RPi5 GFlop/s R=8192 | Rock5 GFlop/s R=2048 | Rock5 GFlop/s R=4096 | Rock5 GFlop/s R=8192 |
|---|---|---|---|---|---|---|
| 1 | 14.02 | 13.54 | 13.28 | 15.08 | 15.05 | 14.85 |
| 2 | 21.62 | 20.91 | 20.96 | 29.28 | 28.73 | 28.54 |
| 3 | 20.17 | 20.14 | 20.87 | 37.31 | 37.15 | 37.42 |
| 4 | 20.15 | 20.32 | 21.25 | 48.54 | 48.45 | 48.48 |
4: P=256 Q=512 (Cortex-X1 default)

| Threads | RPi5 GFlop/s R=2048 | RPi5 GFlop/s R=4096 | RPi5 GFlop/s R=8192 | Rock5 GFlop/s R=2048 | Rock5 GFlop/s R=4096 | Rock5 GFlop/s R=8192 |
|---|---|---|---|---|---|---|
| 1 | 15.61 | 15.69 | 15.73 | 16.91 | 16.79 | 16.80 |
| 2 | 30.19 | 30.39 | 30.69 | 32.91 | 32.72 | 32.96 |
| 3 | 39.00 | 39.47 | 39.89 | 47.73 | 47.90 | 47.87 |
| 4 | 21.25 | 20.26 | 20.77 | 61.80 | 61.72 | 61.95 |
5: P=128 Q=256 (Cortex-X1 divided by 2)

| Threads | RPi5 GFlop/s R=2048 | RPi5 GFlop/s R=4096 | RPi5 GFlop/s R=8192 | Rock5 GFlop/s R=2048 | Rock5 GFlop/s R=4096 | Rock5 GFlop/s R=8192 |
|---|---|---|---|---|---|---|
| 1 | 16.22 | 16.20 | 16.20 | 16.98 | 17.00 | 16.86 |
| 2 | 30.98 | 31.00 | 31.16 | 33.11 | 33.24 | 33.14 |
| 3 | 42.28 | 42.54 | 43.66 | 48.98 | 48.96 | 49.06 |
| 4 | 46.55 | 47.26 | 47.96 | 64.22 | 64.13 | 64.07 |
6: P=128 Q=160 (reversed 2)

| Threads | RPi5 GFlop/s R=2048 | RPi5 GFlop/s R=4096 | RPi5 GFlop/s R=8192 | Rock5 GFlop/s R=2048 | Rock5 GFlop/s R=4096 | Rock5 GFlop/s R=8192 |
|---|---|---|---|---|---|---|
| 1 | 15.25 | 15.09 | 15.04 | 16.54 | 16.28 | 16.42 |
| 2 | 28.08 | 28.22 | 28.10 | 31.63 | 31.80 | 31.77 |
| 3 | 33.93 | 34.47 | 35.56 | 46.36 | 45.96 | 46.00 |
| 4 | 37.26 | 37.91 | 39.59 | 59.07 | 59.07 | 59.07 |
7: P=160 Q=224 (between 0 and 1)

| Threads | RPi5 GFlop/s R=2048 | RPi5 GFlop/s R=4096 | RPi5 GFlop/s R=8192 | Rock5 GFlop/s R=2048 | Rock5 GFlop/s R=4096 | Rock5 GFlop/s R=8192 |
|---|---|---|---|---|---|---|
| 1 | 15.49 | 15.34 | 15.29 | 16.63 | 16.41 | 16.41 |
| 2 | 28.62 | 28.68 | 28.69 | 32.08 | 31.95 | 31.97 |
| 3 | 34.27 | 35.40 | 36.22 | 46.77 | 46.89 | 46.61 |
| 4 | 37.83 | 39.22 | 41.05 | 59.99 | 59.96 | 60.22 |
8: P=112 Q=96 (between 2 and 3)

| Threads | RPi5 GFlop/s R=2048 | RPi5 GFlop/s R=4096 | RPi5 GFlop/s R=8192 | Rock5 GFlop/s R=2048 | Rock5 GFlop/s R=4096 | Rock5 GFlop/s R=8192 |
|---|---|---|---|---|---|---|
| 1 | 14.87 | 14.57 | 14.41 | 15.98 | 15.91 | 15.72 |
| 2 | 26.02 | 25.98 | 26.19 | 30.97 | 30.94 | 30.81 |
| 3 | 27.23 | 27.77 | 29.13 | 45.32 | 45.09 | 45.15 |
| 4 | 27.77 | 28.28 | 29.19 | 59.00 | 58.91 | 58.96 |
9: P=176 Q=352 (between 4 and 5)

| Threads | RPi5 GFlop/s R=2048 | RPi5 GFlop/s R=4096 | RPi5 GFlop/s R=8192 | Rock5 GFlop/s R=2048 | Rock5 GFlop/s R=4096 | Rock5 GFlop/s R=8192 |
|---|---|---|---|---|---|---|
| 1 | 16.29 | 16.40 | 16.45 | 17.21 | 17.20 | 17.05 |
| 2 | 31.15 | 31.39 | 31.43 | 33.08 | 33.05 | 33.10 |
| 3 | 42.98 | 42.76 | 42.88 | 48.32 | 48.28 | 48.31 |
| 4 | 38.90 | 37.13 | 34.63 | 62.30 | 62.40 | 62.30 |
It looks like choice 5 is the fastest at 4 threads on both A76s. I have more benchmarks running, but so far choice 5 is the best.
Thank you very much for the extensive testing. Guess we can go with choice 5 for 0.3.27 and do some tweaks later if necessary (including to the S/C/Z GEMM parameters, which I guess will start out as half the CortexX1 values too). Interesting that GEMM_R has so little effect; this was the parameter I'd have thought most likely to be connected to L3 size...
Looks good. Only the latest P, Q, R parameters were benchmarked using the neoversen1 target and kernel, and you propose the cortexa57 kernel. Maybe the confusion is due to the optimal P and Q being half of those of cortexx1, which uses the cortexa57 kernel?
I can try the cortexa57 kernel as well, to see if it is better.
Makes no difference for the GEMM kernels, actually (except that N1 defines the SWITCH_RATIO parameter for level 3 BLAS while A57 doesn't, but I'm sceptical that it is significant); N1 only differs in its choices for a handful of mostly level 1 kernels.
OK. I believe I discovered what may cause the different behavior of the RPi5 and Rock-5B, despite both being Cortex-A76 at almost the same frequency.
It turns out the Rock-5B (and other RK3588 SBCs) has twice the RAM bandwidth (30 GB/s) of the RPi 5 (15 GB/s):
https://github.com/ThomasKaiser/sbc-bench/blob/master/Results.md
The Rock-5B is therefore less sensitive to the right choice of P and Q, as it suffers less when the problem no longer fits in the caches.
Ok, that might well be the case... I guess I'll merge as-is and things can be further improved later if/where necessary.
The new cortexa76 target seems to be a bit slower on both my machines than neoversen1 with P=128 and Q=256.
| Threads | GFlop/s A76 RPi | GFlop/s N1 RPi | GFlop/s A76 Rock5 | GFlop/s N1 Rock5 |
|---|---|---|---|---|
| 1 | 15.60 | 16.25 | 16.15 | 17.00 |
| 2 | 29.77 | 31.19 | 31.35 | 33.24 |
| 3 | 40.44 | 42.54 | 45.90 | 48.96 |
| 4 | 44.92 | 48.12 | 58.90 | 64.41 |
I don't know why.
I also tried DYNAMIC_ARCH, but it seems cortexa76 is not yet included:

```
Falling back to generic ARMV8 core
Core: armv8
```

Forcing `OPENBLAS_CORETYPE=CORTEXA76` gives:

```
Core not found: CORTEXA76
Falling back to generic ARMV8 core
```

This is on the RPi5, to avoid big.LITTLE problems.
Hmm, not sure why it would be slower - pretty sure I copied the correct data in #4597. (I do realize now that the N1 has a full set of optimized TRMM kernels and better copy kernels for SGEMM, but that should not affect DGEMM performance at all. Does it get faster for you if you copy KERNEL.VORTEX over KERNEL.CORTEXA76 (a quick hack to make it include the N1 kernels instead of A57)?)
I have not (yet?) included the new target in DYNAMIC_ARCH, as we had cut back on the number of dedicated arm64 targets just recently (#4389) - this cpu should probably fall back to either A57 or N1 rather than generic ARMV8, though.
It is strange: the faster results of the modified neoversen1 are from 0.3.26 and the cortexa76 results are from develop. If I use the neoversen1 or vortex kernel in develop, it is the same and corresponds to the slower numbers.
So it seems your PR is fine, but something happened between 0.3.26 and develop. Do you suspect a certain commit, or should I try bisecting?
The only relevant one should be #4585 (to cap the number of threads), which I merged only yesterday, but it is not supposed to trigger (and would make your results drastically worse across all targets if it went wrong). Apart from that, NeoverseN1 and Vortex lost their slightly faster DNRM2 due to #4595, but that should not matter for DGEMM.
And #4585 wasn't merged yet when you reported the slower numbers yesterday.
Going to bisect. There are a lot of commits between 0.3.26 and today. This is good for the project, but not so much for bisecting:

```
Bisecting: 212 revisions left to test after this (roughly 8 steps)
```
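For reference, such a bisect session between the last known-good release and develop might look like this (the tag/branch names are assumed to match the repository's conventions):

```shell
git bisect start
git bisect bad            # current develop checkout shows the slower numbers
git bisect good v0.3.26   # release that produced the faster numbers
# at each step: make clean && make -j4, run the benchmark, then mark the
# revision with `git bisect good` or `git bisect bad` until the culprit is found
```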
BTW, I saw that the arm64 DYNAMIC_ARCH build is slimmer than it used to be: for 0.3.25 the binary with statically linked OpenBLAS was 13 MB, while for 0.3.26 it is only 8.8 MB. Still, a single-target binary is only 300 kB.
Yes, it is not easy to find a balance between releasing so often that nobody has a chance to keep up, and releasing so late that the number of changes becomes daunting - two or three months between releases seemed to give the best tradeoff on average. Though of course it will vary once you add unexpected code contributions and unexpected events in real life...
I had a problem doing the bisect: I was getting only the lower numbers. Even for 0.3.26 I was not able to reproduce my own results in the big table for choice 5.
It turns out that I somehow messed up the compilation of OpenBLAS during the 0-9 tests (the very first one, 0 with R=2048, is fine though).
So what I did: compiled in a loop with different param.h files like this:

```
cp param.h.x.y param.h
make -j4
make PREFIX= install
```

It turns out that the resulting binary files are different if I do it correctly with `make clean`. Although the timestamp on param.h is updated by cp, so make should recompile all the necessary files automatically, shouldn't it? It does for a lot of files, but misses some.

```
cp param.h.x.y param.h
make clean
make -j4
make PREFIX= install
```

I see a couple of `*gemm*.o` files not updated by the first approach, even though some of them should have been recompiled:
```
$ find . -name '*gemm*.o' | xargs ls -l -t
...
-rw-r--r-- 1 rock rock 2032 Apr 7 04:06 ./kernel/zgemm_otcopy.o
-rw-r--r-- 1 rock rock 1584 Apr 7 04:06 ./kernel/zgemm_beta.o
-rw-r--r-- 1 rock rock 1952 Apr 7 04:06 ./kernel/zgemm_oncopy.o
-rw-r--r-- 1 rock rock 12592 Apr 7 04:06 ./kernel/zgemm_kernel_l.o
-rw-r--r-- 1 rock rock 12656 Apr 7 04:06 ./kernel/zgemm_kernel_b.o
-rw-r--r-- 1 rock rock 12656 Apr 7 04:06 ./kernel/zgemm_kernel_r.o
-rw-r--r-- 1 rock rock 12592 Apr 7 04:06 ./kernel/zgemm_kernel_n.o
-rw-r--r-- 1 rock rock 1584 Apr 7 04:06 ./kernel/cgemm_beta.o
-rw-r--r-- 1 rock rock 1904 Apr 7 04:06 ./kernel/cgemm_otcopy.o
-rw-r--r-- 1 rock rock 1848 Apr 7 04:06 ./kernel/cgemm_incopy.o
-rw-r--r-- 1 rock rock 1784 Apr 7 04:06 ./kernel/cgemm_itcopy.o
-rw-r--r-- 1 rock rock 2312 Apr 7 04:06 ./kernel/cgemm_oncopy.o
-rw-r--r-- 1 rock rock 14520 Apr 7 04:06 ./kernel/cgemm_kernel_b.o
-rw-r--r-- 1 rock rock 14392 Apr 7 04:06 ./kernel/cgemm_kernel_l.o
-rw-r--r-- 1 rock rock 14520 Apr 7 04:06 ./kernel/cgemm_kernel_r.o
-rw-r--r-- 1 rock rock 14392 Apr 7 04:06 ./kernel/cgemm_kernel_n.o
-rw-r--r-- 1 rock rock 1288 Apr 7 04:06 ./kernel/dgemm_otcopy.o
-rw-r--r-- 1 rock rock 1304 Apr 7 04:06 ./kernel/dgemm_oncopy.o
-rw-r--r-- 1 rock rock 1896 Apr 7 04:06 ./kernel/dgemm_itcopy.o
-rw-r--r-- 1 rock rock 1904 Apr 7 04:06 ./kernel/dgemm_incopy.o
-rw-r--r-- 1 rock rock 9192 Apr 7 04:06 ./kernel/dgemm_kernel.o
-rw-r--r-- 1 rock rock 1352 Apr 7 04:06 ./kernel/dgemm_beta.o
-rw-r--r-- 1 rock rock 1872 Apr 7 04:06 ./kernel/sgemm_otcopy.o
-rw-r--r-- 1 rock rock 3184 Apr 7 04:06 ./kernel/sgemm_incopy.o
-rw-r--r-- 1 rock rock 1344 Apr 7 04:06 ./kernel/sgemm_oncopy.o
-rw-r--r-- 1 rock rock 2344 Apr 7 04:06 ./kernel/sgemm_itcopy.o
-rw-r--r-- 1 rock rock 10232 Apr 7 04:06 ./kernel/sgemm_kernel.o
-rw-r--r-- 1 rock rock 1416 Apr 7 04:06 ./kernel/sgemm_beta.o
-rw-r--r-- 1 rock rock 2184 Apr 7 04:06 ./driver/level3/gemm_thread_variable.o
-rw-r--r-- 1 rock rock 2896 Apr 7 04:06 ./driver/level3/gemm_thread_mn.o
-rw-r--r-- 1 rock rock 1880 Apr 7 04:06 ./driver/level3/gemm_thread_n.o
-rw-r--r-- 1 rock rock 1880 Apr 7 04:06 ./driver/level3/gemm_thread_m.o
```
The good news is: there is no regression in the OpenBLAS code itself. I compiled it incorrectly by omitting `make clean` to save some time.
The bad news is that parameter choice 5 is not so good anymore. I'm running all the tests again now.
That's a bit unfortunate (mostly because of the time you spent on it), but it was "advertised" as initial support in the changelog and release announcement... With the recursive inclusions in the Makefiles and the parallel build, I think it has never been safe to rely on `make` rebuilding all the changed files; as far as I remember it was always `make clean; make`, even with the original GotoBLAS.
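Given that, a parameter sweep that cleans before every rebuild would look something like this (the per-test file naming and install prefix are illustrative):

```shell
for v in 0 1 2 3 4 5 6 7 8 9; do
    cp param.h.$v param.h    # hypothetical per-test copies of param.h
    make clean               # required: plain `make` misses some *gemm*.o files
    make -j4
    make PREFIX="$HOME/openblas-test-$v" install
done
```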
I've benchmarked DGEMM for P=118-152 in steps of 2, with Q=2*P. It looks like the best overall (taking 4 threads as the most important) on both A76s I have is P=122, Q=244. For some reason there is a significant drop on both A76s at the current setting, P=128 Q=256.
Now I'm running benchmarks for SGEMM, DGEMM, CGEMM, and ZGEMM (from the benchmark folder) with all combinations of the parameters P, Q = 64-512 in steps of 64, and I'll report the results later.
Hello, OpenBLAS 0.3.26 does not currently support the Cortex-A76. Is there a way to add it? I believe the popularity of the Cortex-A76 will increase due to the Raspberry Pi 5, Orange Pi 5, and Radxa Rock-5B (the latter two have big.LITTLE cores, A76+A55).
I did some testing though. The fastest target for HPL (dgemm) seems to be Cortex-X1. But it seems there is again a problem with the cache, as with the Raspberry Pi 4: the single-core performance is close to the theoretical 8 ops/cycle/core, but with increasing thread count it gets significantly worse.
Raspberry Pi5, 4x Cortex-A76 2.4GHz, 512KB L2 per core, 2MB shared L3, OpenBLAS-0.3.26, CORTEXX1, HPL N=28000, NB=240
This is with the official heatsink and active cooler. There is no thermal throttling detected.
There is no slowdown on the Rock-5B, which differs only in a slightly bigger L3 cache (and a slightly lower clock speed).
Rock-5B, testing just the big cluster: 4x Cortex-A76 2.3GHz, 512KB L2 per core, 3MB shared L3, OpenBLAS-0.3.26, CORTEXX1, HPL N=28000, NB=240
This is also with the official heatsink and active cooler, but the Rock-5B needs a bit more cooling (an extra fan), otherwise there is thermal throttling. Moreover, the frequencies are still reported as not being throttled.
I also tested the BLIS library from git. It looks like the firestorm target (meant for the Apple M1) is the fastest on the Cortex-A76. There is no significant slowdown.
Raspberry Pi5, Blis git, firestorm
Rock-5B, Blis git, firestorm
Going back to OpenBLAS, I looked in the param.h file, and CORTEXX1 seems to be handled as a CPU with a large cache even if it has fewer than 8 cores.
I tried adjusting DGEMM_DEFAULT_P and DGEMM_DEFAULT_Q to the lower values of 160 and 128. It seems to help a bit, but not completely. It hurts the Rock-5B though, not by much.
Raspberry Pi5, OpenBLAS, CORTEXX1, DGEMM_DEFAULT_P=160, DGEMM_DEFAULT_Q=128
Rock-5B, OpenBLAS, CORTEXX1, DGEMM_DEFAULT_P=160, DGEMM_DEFAULT_Q=128
Is there something else to try?
BTW, the auto-detection during the build of OpenBLAS chooses the target ARMV8 on the Raspberry Pi and CORTEXA55 on the Rock-5B (maybe because the little cores are A55 and core number 0 is one of them?). I guess the same happens if DYNAMIC_ARCH is selected? Is there a way to tell which of the targets is chosen when using DYNAMIC_ARCH?