Closed jfikar closed 3 months ago
Hi @jfikar,
The `DYNAMIC_ARCH` selection happens in https://github.com/OpenMathLib/OpenBLAS/blob/develop/driver/others/dynamic_arm64.c. There's nothing there to detect this part, so it makes sense that it falls back to the generic `ARMV8`, as that's safest.
Looking at https://github.com/OpenMathLib/OpenBLAS/blob/develop/param.h#L3308C2-L3308C4 and https://github.com/OpenMathLib/OpenBLAS/blob/develop/kernel/arm64/KERNEL.CORTEXX1, the `CORTEXX1` target is an alias for `CORTEXA57`, except without the assumptions around caches. Did you try the `CORTEXA57` target?
If that works, I suggest we add the correct part (I think it's `0xD0B`) as an alias for `CORTEXA57` in a similar way.
Autodetection is problematic on big.LITTLE systems, as the returned ID may depend on system load: if the system has been idle for a while, the little cores will get detected; if there has been any compute/compile load immediately prior to the check, it will pick up the big one(s). At runtime, you can use `export OPENBLAS_VERBOSE=2` to make it report what it selected, or `OPENBLAS_CORETYPE=CORTEXX1` to force selection of a particular target. (I guess you could try `NEOVERSEN1` as an existing variant, but this will probably suffer from wrong assumptions about the available cache too. `CORTEXA57` is the generic role model for Cortex-A, but I expect the performance difference to `ARMV8` to be minimal.)
`OPENBLAS_VERBOSE=2` works, thanks. On the Raspberry Pi 5 it gives `armv8` and on the Rock-5B it gives `cortexa55`. I could not get any other answer, though it may be possible. And I agree, autodetection is difficult on big.LITTLE.
These are the results for the armv8, cortexa57, and neoversen1 targets that were suggested:
armv8:

| Threads | GFlop/s RPi | ops/cycle/core RPi | GFlop/s Rock | ops/cycle/core Rock |
|---|---|---|---|---|
| 1 | 15.31 | 6.4 | 15.86 | 6.9 |
| 2 | 28.71 | 6.0 | 31.68 | 6.9 |
| 3 | 35.01 | 4.9 | 46.38 | 6.4 |
| 4 | 39.17 | 4.1 | 59.69 | 6.5 |
cortexa57:

| Threads | GFlop/s RPi | ops/cycle/core RPi | GFlop/s Rock | ops/cycle/core Rock |
|---|---|---|---|---|
| 1 | 15.33 | 6.4 | 16.02 | 7.0 |
| 2 | 28.60 | 6.0 | 31.67 | 6.9 |
| 3 | 34.90 | 4.8 | 46.73 | 6.8 |
| 4 | 39.01 | 4.1 | 59.31 | 6.4 |
neoversen1:

| Threads | GFlop/s RPi | ops/cycle/core RPi | GFlop/s Rock | ops/cycle/core Rock |
|---|---|---|---|---|
| 1 | 16.07 | 6.7 | 16.53 | 7.2 |
| 2 | 31.27 | 6.5 | 33.15 | 7.2 |
| 3 | 39.73 | 5.5 | 48.28 | 7.0 |
| 4 | 21.69 | 2.3 | 62.66 | 6.8 |
I also ran BLIS with the cortexa57 target:

| Threads | GFlop/s RPi | ops/cycle/core RPi | GFlop/s Rock | ops/cycle/core Rock |
|---|---|---|---|---|
| 1 | 16.53 | 6.9 | 16.21 | 7.0 |
| 2 | 31.48 | 6.6 | 32.46 | 7.1 |
| 3 | 43.96 | 6.1 | 46.75 | 6.8 |
| 4 | 50.28 | 5.2 | 60.12 | 6.5 |
The neoversen1 target is good on the Rock-5B Cortex-A76. On the Raspberry Pi 5 it is probably cache-limited again, like the cortexx1 target.
The cortexx1 target with reduced DGEMM_DEFAULT_P and DGEMM_DEFAULT_Q values performs like cortexa57 and armv8. It is not so drastically slowed down at 4 cores, but it is still underperforming, not even reaching 40 GFlop/s. BLIS gives 50 and 55 GFlop/s with the cortexa57 and firestorm targets.
It is strange that these two Cortex-A76 implementations perform so differently, when the only difference is 3 MB of L3 cache instead of 2 MB.
Thanks. That pronounced difference between the two boards is indeed strange - I'd be tempted to alias NeoverseN1 if it did not take such a plunge at 4 cores. Might be worth playing with the SWITCH_RATIO for NEOVERSEN1 in param.h (or introducing the same for A57/ARMV8) - or perhaps it is something else, like temperature/frequency management?
It's a fair point. I trust the frequencies on the Raspberry Pi 5 more. It has a tool, `vcgencmd get_throttled`, to check whether thermal throttling has occurred. It still shows 0x0, i.e. no throttling has occurred since the last restart. But I have measured a maximum of 78°C, which is close to the first thermal throttling limit of 80°C. So I'll repeat some of the measurements with an additional fan.
The Rock-5B needs an additional fan, as without it it hits 75°C and silently starts throttling. With the additional cooling it stays below 65°C.
SWITCH_RATIO is 8 for neoversen1 (for double), while for A57/armv8 it is 2, right? I can try changing it to 4 and 2.
Also, neoversen1 has DGEMM_DEFAULT_P=240 and DGEMM_DEFAULT_Q=320. That's more than the 160, 128 for A57/armv8, but less than the 256, 512 for cortexx1.
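In param.h style, the defaults being compared are (the `#if` structure here is illustrative; only the numbers are from this thread):

```c
#if defined(NEOVERSEN1)
#define SWITCH_RATIO      8
#define DGEMM_DEFAULT_P 240
#define DGEMM_DEFAULT_Q 320
#elif defined(CORTEXA57) || defined(ARMV8)
#define DGEMM_DEFAULT_P 160
#define DGEMM_DEFAULT_Q 128
#elif defined(CORTEXX1)
#define DGEMM_DEFAULT_P 256
#define DGEMM_DEFAULT_Q 512
#endif
```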
Here are the new results for the Raspberry Pi 5 with additional cooling. I checked that the temperature stayed below 65°C the whole time, far from the 80°C limit. The results are slightly better, but at most by 7%, so it still does not explain the drop with 4 threads.
I'm showing the results for neoversen1, RPi5:
| Threads | GFlop/s | ops/cycle/core | GFlop/s +extra cooling | ops/cycle/core +extra cooling |
|---|---|---|---|---|
| 1 | 16.07 | 6.7 | 16.14 | 6.7 |
| 2 | 31.27 | 6.5 | 31.32 | 6.5 |
| 3 | 39.73 | 5.5 | 40.68 | 5.7 |
| 4 | 21.69 | 2.3 | 23.12 | 2.4 |
Also, the good BLIS results at 4 threads suggest that it is probably not a thermal effect.
I only brought up the cooling issue because the performance with 4 threads was so different between the two boards. Does it react to changes in SWITCH_RATIO at all? (I do not think we have ever seen effects bigger than about 10 percent with poor choices for either SWITCH_RATIO or P,Q, and those would be more likely to depend on L2 than L3.)
I have new results. SWITCH_RATIO has no effect; I tried the default 8, then 4 and 2. The effect of reducing P and Q is similar to what we have seen before.
- Neoverse N1, P=240 Q=320 (default), 4 threads: 22.69 GFlop/s
- Neoverse N1, P=160 Q=128 (cortexa57), 4 threads: 39.75 GFlop/s

I still can't even get past 40 GFlop/s. How do you properly adjust the P and Q values?
Good question; it's a dark art (and probably lost knowledge). P*Q is supposed to be roughly equal to half the L2 cache for a start, but beyond that point it is probably down to benchmarking. I now wonder if changing (halving/doubling) DGEMM_DEFAULT_R has any influence (though it might hurt single-thread performance).
R=2048, 4096, or 8192 does not have much influence. However, it seems I have found a good combination of P and Q which performs well on both my A76s.
0: P=240 Q=320 (Neoverse N1 default)

| Threads | RPi5 GFlop/s R=2048 | RPi5 GFlop/s R=4096 | RPi5 GFlop/s R=8192 | Rock5 GFlop/s R=2048 | Rock5 GFlop/s R=4096 | Rock5 GFlop/s R=8192 |
|---|---|---|---|---|---|---|
| 1 | 15.99 | 16.06 | 16.18 | 17.15 | 17.19 | 17.21 |
| 2 | 30.71 | 31.12 | 31.12 | 33.39 | 33.42 | 33.43 |
| 3 | 38.88 | 41.02 | 41.32 | 48.47 | 48.49 | 48.55 |
| 4 | 22.80 | 22.33 | 22.17 | 62.69 | 62.54 | 62.89 |
1: P=120 Q=160 (Neoverse N1 divided by 2)

| Threads | RPi5 GFlop/s R=2048 | RPi5 GFlop/s R=4096 | RPi5 GFlop/s R=8192 | Rock5 GFlop/s R=2048 | Rock5 GFlop/s R=4096 | Rock5 GFlop/s R=8192 |
|---|---|---|---|---|---|---|
| 1 | 15.17 | 15.01 | 14.95 | 16.46 | 16.36 | 16.34 |
| 2 | 28.48 | 28.32 | 28.56 | 31.84 | 31.82 | 31.93 |
| 3 | 33.77 | 34.71 | 35.59 | 46.96 | 46.89 | 46.89 |
| 4 | 37.94 | 38.67 | 39.23 | 61.27 | 61.23 | 61.20 |
2: P=160 Q=128 (CortexA57 default)

| Threads | RPi5 GFlop/s R=2048 | RPi5 GFlop/s R=4096 | RPi5 GFlop/s R=8192 | Rock5 GFlop/s R=2048 | Rock5 GFlop/s R=4096 | Rock5 GFlop/s R=8192 |
|---|---|---|---|---|---|---|
| 1 | 15.42 | 15.29 | 15.24 | 16.62 | 16.55 | 16.40 |
| 2 | 28.91 | 28.74 | 28.79 | 32.35 | 32.36 | 32.35 |
| 3 | 34.75 | 35.95 | 37.05 | 47.67 | 47.66 | 47.66 |
| 4 | 37.68 | 39.53 | 41.23 | 62.20 | 62.20 | 62.23 |
3: P=80 Q=64 (CortexA57 divided by 2)

| Threads | RPi5 GFlop/s R=2048 | RPi5 GFlop/s R=4096 | RPi5 GFlop/s R=8192 | Rock5 GFlop/s R=2048 | Rock5 GFlop/s R=4096 | Rock5 GFlop/s R=8192 |
|---|---|---|---|---|---|---|
| 1 | 14.02 | 13.54 | 13.28 | 15.08 | 15.05 | 14.85 |
| 2 | 21.62 | 20.91 | 20.96 | 29.28 | 28.73 | 28.54 |
| 3 | 20.17 | 20.14 | 20.87 | 37.31 | 37.15 | 37.42 |
| 4 | 20.15 | 20.32 | 21.25 | 48.54 | 48.45 | 48.48 |
4: P=256 Q=512 (Cortex-X1 default)

| Threads | RPi5 GFlop/s R=2048 | RPi5 GFlop/s R=4096 | RPi5 GFlop/s R=8192 | Rock5 GFlop/s R=2048 | Rock5 GFlop/s R=4096 | Rock5 GFlop/s R=8192 |
|---|---|---|---|---|---|---|
| 1 | 15.61 | 15.69 | 15.73 | 16.91 | 16.79 | 16.80 |
| 2 | 30.19 | 30.39 | 30.69 | 32.91 | 32.72 | 32.96 |
| 3 | 39.00 | 39.47 | 39.89 | 47.73 | 47.90 | 47.87 |
| 4 | 21.25 | 20.26 | 20.77 | 61.80 | 61.72 | 61.95 |
5: P=128 Q=256 (Cortex-X1 divided by 2)

| Threads | RPi5 GFlop/s R=2048 | RPi5 GFlop/s R=4096 | RPi5 GFlop/s R=8192 | Rock5 GFlop/s R=2048 | Rock5 GFlop/s R=4096 | Rock5 GFlop/s R=8192 |
|---|---|---|---|---|---|---|
| 1 | 16.22 | 16.20 | 16.20 | 16.98 | 17.00 | 16.86 |
| 2 | 30.98 | 31.00 | 31.16 | 33.11 | 33.24 | 33.14 |
| 3 | 42.28 | 42.54 | 43.66 | 48.98 | 48.96 | 49.06 |
| 4 | 46.55 | 47.26 | 47.96 | 64.22 | 64.13 | 64.07 |
6: P=128 Q=160 (reversed 2)

| Threads | RPi5 GFlop/s R=2048 | RPi5 GFlop/s R=4096 | RPi5 GFlop/s R=8192 | Rock5 GFlop/s R=2048 | Rock5 GFlop/s R=4096 | Rock5 GFlop/s R=8192 |
|---|---|---|---|---|---|---|
| 1 | 15.25 | 15.09 | 15.04 | 16.54 | 16.28 | 16.42 |
| 2 | 28.08 | 28.22 | 28.10 | 31.63 | 31.80 | 31.77 |
| 3 | 33.93 | 34.47 | 35.56 | 46.36 | 45.96 | 46.00 |
| 4 | 37.26 | 37.91 | 39.59 | 59.07 | 59.07 | 59.07 |
7: P=160 Q=224 (between 0 and 1)

| Threads | RPi5 GFlop/s R=2048 | RPi5 GFlop/s R=4096 | RPi5 GFlop/s R=8192 | Rock5 GFlop/s R=2048 | Rock5 GFlop/s R=4096 | Rock5 GFlop/s R=8192 |
|---|---|---|---|---|---|---|
| 1 | 15.49 | 15.34 | 15.29 | 16.63 | 16.41 | 16.41 |
| 2 | 28.62 | 28.68 | 28.69 | 32.08 | 31.95 | 31.97 |
| 3 | 34.27 | 35.40 | 36.22 | 46.77 | 46.89 | 46.61 |
| 4 | 37.83 | 39.22 | 41.05 | 59.99 | 59.96 | 60.22 |
8: P=112 Q=96 (between 2 and 3)

| Threads | RPi5 GFlop/s R=2048 | RPi5 GFlop/s R=4096 | RPi5 GFlop/s R=8192 | Rock5 GFlop/s R=2048 | Rock5 GFlop/s R=4096 | Rock5 GFlop/s R=8192 |
|---|---|---|---|---|---|---|
| 1 | 14.87 | 14.57 | 14.41 | 15.98 | 15.91 | 15.72 |
| 2 | 26.02 | 25.98 | 26.19 | 30.97 | 30.94 | 30.81 |
| 3 | 27.23 | 27.77 | 29.13 | 45.32 | 45.09 | 45.15 |
| 4 | 27.77 | 28.28 | 29.19 | 59.00 | 58.91 | 58.96 |
9: P=176 Q=352 (between 4 and 5)

| Threads | RPi5 GFlop/s R=2048 | RPi5 GFlop/s R=4096 | RPi5 GFlop/s R=8192 | Rock5 GFlop/s R=2048 | Rock5 GFlop/s R=4096 | Rock5 GFlop/s R=8192 |
|---|---|---|---|---|---|---|
| 1 | 16.29 | 16.40 | 16.45 | 17.21 | 17.20 | 17.05 |
| 2 | 31.15 | 31.39 | 31.43 | 33.08 | 33.05 | 33.10 |
| 3 | 42.98 | 42.76 | 42.88 | 48.32 | 48.28 | 48.31 |
| 4 | 38.90 | 37.13 | 34.63 | 62.30 | 62.40 | 62.30 |
It looks like choice 5 is the fastest at 4 threads on both A76s. I have more benchmarks running, but so far choice 5 is the best.
Thank you very much for the extensive testing. Guess we can go with choice 5 for 0.3.27 and do some tweaks later if necessary (including to the S/C/Z GEMM parameters, which I guess will start out as half the CortexX1 values too). Interesting that GEMM_R has so little effect; this was the parameter I'd have thought most likely to be connected to L3 size...
Looks good. Only the latest P, Q, R parameters were benchmarked using the neoversen1 target and kernel, and you propose the cortexa57 kernel. Maybe the confusion is due to the optimal P and Q being half of those of cortexx1, which uses the cortexa57 kernel?
I can try the cortexa57 kernel as well, to see if it is better.
Makes no difference for the GEMM kernels, actually (except that N1 defines the SWITCH_RATIO parameter for level 3 BLAS while A57 doesn't, but I'm sceptical that it is significant); N1 only differs in its choices for a handful of mostly level 1 kernels.
OK. I believe I discovered what may cause the different behavior of the RPi5 and Rock-5B, despite both being Cortex-A76 at almost the same frequency.
It turns out the Rock-5B (and other RK3588 SBCs) has twice the RAM bandwidth (30 GB/s) of the RPi 5 (15 GB/s):
https://github.com/ThomasKaiser/sbc-bench/blob/master/Results.md
The Rock-5B is therefore less sensitive to the right choice of P and Q, as it suffers less when the problem no longer fits in the caches.
Ok, that might well be the case... I guess I'll merge as-is and things can be further improved later if/where necessary.
The new cortexa76 target seems to be a bit slower on both my machines than neoversen1 with P=128 and Q=256.
| Threads | GFlop/s A76 RPi | GFlop/s N1 RPi | GFlop/s A76 Rock5 | GFlop/s N1 Rock5 |
|---|---|---|---|---|
| 1 | 15.60 | 16.25 | 16.15 | 17.00 |
| 2 | 29.77 | 31.19 | 31.35 | 33.24 |
| 3 | 40.44 | 42.54 | 45.90 | 48.96 |
| 4 | 44.92 | 48.12 | 58.90 | 64.41 |
I don't know why.
I also tried DYNAMIC_ARCH, but it seems cortexa76 is not yet included:

```
Falling back to generic ARMV8 core
Core: armv8
```

Forcing `OPENBLAS_CORETYPE=CORTEXA76` gives:

```
Core not found: CORTEXA76
Falling back to generic ARMV8 core
```

This is on the RPi5, to avoid big.LITTLE problems.
Hmm, not sure why it would be slower - pretty sure I copied the correct data in #4597. (I do realize now that the N1 has a full set of optimized TRMM kernels and better copy kernels for SGEMM, but that should not affect DGEMM performance at all. Does it get faster for you if you copy KERNEL.VORTEX over KERNEL.CORTEXA76 (a quick hack to make it include the N1 kernels instead of A57)?)
I have not (yet?) included the new target in DYNAMIC_ARCH, as we had cut back on the number of dedicated arm64 targets just recently (#4389) - this cpu should probably fall back to either A57 or N1 rather than generic ARMV8, though.
It is strange: the faster results of the modified neoversen1 are from 0.3.26 and the cortexa76 results are from develop. If I use the neoversen1 or vortex kernel in develop, it is the same and corresponds to the slower numbers.
So it seems your PR is fine, but something happened between 0.3.26 and develop. Do you suspect a certain commit, or should I try bisecting?
The only relevant one should be #4585 (to cap the number of threads), which I merged only yesterday, but it is not supposed to trigger (and would make your results drastically worse across all targets if it went wrong). Apart from that, NeoverseN1 and Vortex lost their slightly faster DNRM2 due to #4595, but that should not matter for DGEMM.
And #4585 wasn't merged yet when you reported the slower numbers yesterday.
Going to bisect. There are a lot of commits between 0.3.26 and today. This is good for the project, but not so much for bisecting:

```
Bisecting: 212 revisions left to test after this (roughly 8 steps)
```
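For reference, such a bisect session between the last known-good release and develop might look like this (the tag/branch names are assumed to match the repository's conventions):

```shell
git bisect start
git bisect bad            # current develop checkout shows the slower numbers
git bisect good v0.3.26   # release that produced the faster numbers
# at each step: make clean && make -j4, run the benchmark, then mark the
# revision with `git bisect good` or `git bisect bad` until the culprit is found
```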
BTW, I saw that the arm64 DYNAMIC_ARCH build is slimmer than it used to be: for 0.3.25 the binary with statically linked OpenBLAS was 13 MB, while for 0.3.26 it is only 8.8 MB. Still, a single-target binary is only 300 kB.
Yes, it is not easy to find a balance between releasing so often that nobody has a chance to keep up, and releasing so late that the number of changes becomes daunting - two or three months between releases seemed to give the best tradeoff on average. Though of course it will vary once you add unexpected code contributions and unexpected events in real life...
I had a problem doing the bisect: I was getting only the lower numbers. Even for 0.3.26 I was not able to reproduce my own results in the big table for choice 5.
It turns out that I somehow messed up the compilation of OpenBLAS during the 0-9 tests (the very first one, 0 with R=2048, is fine though).
So what I did: compiled in a loop with different param.h files like this:

```
cp param.h.x.y param.h
make -j4
make PREFIX= install
```

It turns out that the resulting binary files are different if I do it correctly with `make clean`. Although the timestamp on param.h is updated by cp, so make should recompile all the necessary files automatically, shouldn't it? It does for a lot of files, but misses some.

```
cp param.h.x.y param.h
make clean
make -j4
make PREFIX= install
```

I see a couple of `*gemm*.o` files not updated by the first approach, even though some of them should have been recompiled:
```
$ find . -name '*gemm*.o' | xargs ls -l -t
...
-rw-r--r-- 1 rock rock 2032 Apr 7 04:06 ./kernel/zgemm_otcopy.o
-rw-r--r-- 1 rock rock 1584 Apr 7 04:06 ./kernel/zgemm_beta.o
-rw-r--r-- 1 rock rock 1952 Apr 7 04:06 ./kernel/zgemm_oncopy.o
-rw-r--r-- 1 rock rock 12592 Apr 7 04:06 ./kernel/zgemm_kernel_l.o
-rw-r--r-- 1 rock rock 12656 Apr 7 04:06 ./kernel/zgemm_kernel_b.o
-rw-r--r-- 1 rock rock 12656 Apr 7 04:06 ./kernel/zgemm_kernel_r.o
-rw-r--r-- 1 rock rock 12592 Apr 7 04:06 ./kernel/zgemm_kernel_n.o
-rw-r--r-- 1 rock rock 1584 Apr 7 04:06 ./kernel/cgemm_beta.o
-rw-r--r-- 1 rock rock 1904 Apr 7 04:06 ./kernel/cgemm_otcopy.o
-rw-r--r-- 1 rock rock 1848 Apr 7 04:06 ./kernel/cgemm_incopy.o
-rw-r--r-- 1 rock rock 1784 Apr 7 04:06 ./kernel/cgemm_itcopy.o
-rw-r--r-- 1 rock rock 2312 Apr 7 04:06 ./kernel/cgemm_oncopy.o
-rw-r--r-- 1 rock rock 14520 Apr 7 04:06 ./kernel/cgemm_kernel_b.o
-rw-r--r-- 1 rock rock 14392 Apr 7 04:06 ./kernel/cgemm_kernel_l.o
-rw-r--r-- 1 rock rock 14520 Apr 7 04:06 ./kernel/cgemm_kernel_r.o
-rw-r--r-- 1 rock rock 14392 Apr 7 04:06 ./kernel/cgemm_kernel_n.o
-rw-r--r-- 1 rock rock 1288 Apr 7 04:06 ./kernel/dgemm_otcopy.o
-rw-r--r-- 1 rock rock 1304 Apr 7 04:06 ./kernel/dgemm_oncopy.o
-rw-r--r-- 1 rock rock 1896 Apr 7 04:06 ./kernel/dgemm_itcopy.o
-rw-r--r-- 1 rock rock 1904 Apr 7 04:06 ./kernel/dgemm_incopy.o
-rw-r--r-- 1 rock rock 9192 Apr 7 04:06 ./kernel/dgemm_kernel.o
-rw-r--r-- 1 rock rock 1352 Apr 7 04:06 ./kernel/dgemm_beta.o
-rw-r--r-- 1 rock rock 1872 Apr 7 04:06 ./kernel/sgemm_otcopy.o
-rw-r--r-- 1 rock rock 3184 Apr 7 04:06 ./kernel/sgemm_incopy.o
-rw-r--r-- 1 rock rock 1344 Apr 7 04:06 ./kernel/sgemm_oncopy.o
-rw-r--r-- 1 rock rock 2344 Apr 7 04:06 ./kernel/sgemm_itcopy.o
-rw-r--r-- 1 rock rock 10232 Apr 7 04:06 ./kernel/sgemm_kernel.o
-rw-r--r-- 1 rock rock 1416 Apr 7 04:06 ./kernel/sgemm_beta.o
-rw-r--r-- 1 rock rock 2184 Apr 7 04:06 ./driver/level3/gemm_thread_variable.o
-rw-r--r-- 1 rock rock 2896 Apr 7 04:06 ./driver/level3/gemm_thread_mn.o
-rw-r--r-- 1 rock rock 1880 Apr 7 04:06 ./driver/level3/gemm_thread_n.o
-rw-r--r-- 1 rock rock 1880 Apr 7 04:06 ./driver/level3/gemm_thread_m.o
```
The good news is: there is no regression in the OpenBLAS code itself. I compiled it incorrectly by omitting `make clean` to save some time.
The bad news is that parameter choice 5 is not so good anymore. I'm running all the tests again now.
That's a bit unfortunate (mostly because of the time you spent on it), but it was "advertised" as initial support in the changelog and release announcement... With the recursive inclusions in the Makefiles and the parallel build, I think it has never been safe to rely on `make` rebuilding all the changed files; as far as I remember it was always `make clean; make`, even with the original GotoBLAS.
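Given that, a parameter sweep that cleans before every rebuild would look something like this (the per-test file naming and install prefix are illustrative):

```shell
for v in 0 1 2 3 4 5 6 7 8 9; do
    cp param.h.$v param.h    # hypothetical per-test copies of param.h
    make clean               # required: plain `make` misses some *gemm*.o files
    make -j4
    make PREFIX="$HOME/openblas-test-$v" install
done
```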
I've benchmarked DGEMM for P=118-152 in steps of 2, with Q=2*P. It looks like the best overall (taking 4 threads as the most important) on both A76s I have is P=122, Q=244. For some reason there is a significant drop on both A76s at the current setting, P=128 Q=256.
Now I'm running benchmarks for SGEMM, DGEMM, CGEMM, and ZGEMM (from the benchmark folder) with all combinations of the parameters P, Q = 64-512 in steps of 64, and I'll report the results later.
Hello, OpenBLAS 0.3.26 does not currently support the Cortex-A76. Is there a way to add it? I believe the popularity of the Cortex-A76 will increase due to the Raspberry Pi 5, Orange Pi 5, and Radxa Rock-5B (the latter two have big.LITTLE cores, A76+A55).
I did some testing though. The fastest target for HPL (dgemm) seems to be Cortex-X1. But it seems there is again a problem with the cache, as with the Raspberry Pi 4: the single-core performance is close to the theoretical 8 ops/cycle/core, but with increasing thread count it gets significantly worse.
Raspberry Pi5, 4x Cortex-A76 2.4GHz, 512KB L2 per core, 2MB shared L3, OpenBLAS-0.3.26, CORTEXX1, HPL N=28000, NB=240
This is with the official heatsink and active cooler. There is no thermal throttling detected.
There is no slowdown on the Rock-5B, which differs only in a slightly bigger L3 cache (and a slightly lower clock speed).
Rock-5B, testing just the big cluster: 4x Cortex-A76 2.3GHz, 512KB L2 per core, 3MB shared L3, OpenBLAS-0.3.26, CORTEXX1, HPL N=28000, NB=240
This is also with the official heatsink and active cooler, but the Rock-5B needs a bit more cooling (an extra fan), otherwise there is thermal throttling. Moreover, the frequencies are still reported as not being throttled.
I also tested the BLIS library from git. It looks like the firestorm target (meant for the Apple M1) is the fastest on the Cortex-A76. There is no significant slowdown.
Raspberry Pi5, Blis git, firestorm
Rock-5B, Blis git, firestorm
Going back to OpenBLAS, I looked in the param.h file, and CORTEXX1 seems to be handled as a CPU with a large cache even if it has fewer than 8 cores.
I tried adjusting DGEMM_DEFAULT_P and DGEMM_DEFAULT_Q to the lower values of 160 and 128. It seems to help a bit, but not completely. It hurts the Rock-5B though, not by much.
Raspberry Pi5, OpenBLAS, CORTEXX1, DGEMM_DEFAULT_P=160, DGEMM_DEFAULT_Q=128
Rock-5B, OpenBLAS, CORTEXX1, DGEMM_DEFAULT_P=160, DGEMM_DEFAULT_Q=128
Is there something else to try?
BTW, the auto-detection during the build of OpenBLAS chooses the target ARMV8 on the Raspberry Pi and CORTEXA55 on the Rock-5B (maybe because the little cores are A55 and core number 0 is one of them?). I guess the same happens if DYNAMIC_ARCH is selected? Is there a way to tell which of the targets is chosen when using DYNAMIC_ARCH?