mattip opened 1 month ago
The `THUNDERX3T110` target uses AdvSIMD only, whereas the `NEOVERSEV1` target on the AWS M7g can use SVE. Most of the SVE targets remap back to `NEOVERSEV1` at the moment, so removing that would be pretty bad for performance.
I remapped any common targets back together in https://github.com/OpenMathLib/OpenBLAS/pull/4389; unsure how to tell which targets are less used and could be removed 🤔
Also ref: https://github.com/OpenMathLib/OpenBLAS/blob/develop/Makefile.system#L686-L700
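Those lines gate which cores go into the `DYNAMIC_ARCH` dispatch list; schematically it does something like the following (a paraphrase, not an exact excerpt — the real target list and conditions are in the linked file):

```make
# Paraphrased sketch of the arm64 DYNAMIC_CORE selection in Makefile.system;
# SVE-capable targets are only added when the toolchain supports SVE.
ifeq ($(ARCH), arm64)
DYNAMIC_CORE = ARMV8 CORTEXA53 CORTEXA57 THUNDERX THUNDERX2T99 TSV110 EMAG8180 NEOVERSEN1
ifneq ($(NO_SVE), 1)
DYNAMIC_CORE += NEOVERSEV1 NEOVERSEN2 ARMV8SVE A64FX
endif
endif
```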
Thanks. Is `NEOVERSEV1` active when using GCC (like in the build here)?
BLAS-benchmarks runs on a c7g.large instance (https://aws.amazon.com/ec2/instance-types/c7g/) via https://github.com/OpenMathLib/BLAS-Benchmarks/blob/main/.cirun.yml. Would this be enough?
Also, does @czgdp1807's benchmarking machinery handle aarch64 architectures?
> Thanks. Is `NEOVERSEV1` active when using GCC (like in the build here)?
In manylinux2014 with GCC 10.2 you should get the SVE targets.
For certain toolchains, such as the `MACOSX_DEPLOYMENT_TARGET` builds, there isn't full support and it's disabled.
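To confirm at runtime which kernel set a given build actually selected, OpenBLAS exposes `openblas_get_corename()`; here is a minimal ctypes sketch (the library lookup is an assumption — point it at your actual `libopenblas` or `libscipy_openblas` shared object):

```python
# Sketch: query a loaded OpenBLAS for the kernel set it picked at runtime,
# via the openblas_get_corename() C API. Returns None if no library is found.
import ctypes
import ctypes.util

def openblas_corename(libpath=None):
    path = libpath or ctypes.util.find_library("openblas")
    if path is None:
        return None  # no system OpenBLAS; pass the wheel's .so path instead
    try:
        lib = ctypes.CDLL(path)
        lib.openblas_get_corename.restype = ctypes.c_char_p
        return lib.openblas_get_corename().decode()
    except (OSError, AttributeError):
        return None

print(openblas_corename())  # e.g. "neoversev1" on a Graviton3 SVE build
```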
> In manylinux2014 with GCC 10.2 you should get the SVE targets.
Cool, thanks
> BLAS-benchmarks runs on a c7g.large
That is Graviton3, so it should be as good as it gets.
> does @czgdp1807's benchmarking machinery handle aarch64 architectures?
I think so; you need to specify a different set of kernels. You can see which ones in `Makefile.system`, from this comment.
Ok, one benchmark: this is Linux on arm64 (not macOS), on a c7g.large machine on AWS:
```
{'arch': 'aarch64', 'cpu': '', 'machine': 'ip-172-31-6-241', 'num_cpu': '2', 'os': 'Linux 6.8.0-1009-aws', 'ram': '3899308', 'python': '3.12', 'Cython': '', 'build': '', 'packaging': ''}
```
`bench_linalg.Eindot.time_matmul_a_b`
| arch | mean | spread | perf_ratios |
|:--------------|---------:|---------:|--------------:|
| NEOVERSEV1 | 0.10003 | 0.000357 | 1 |
| ARMV8SVE | 0.106404 | 0.000465 | 1.06372 |
| CORTEXA73 | 0.122021 | 0.00047 | 1.21984 |
| ARMV8 | 0.12206 | 0.0002 | 1.22023 |
| CORTEXA710 | 0.122363 | 0.000195 | 1.22326 |
| TSV110 | 0.122464 | 0.000285 | 1.22427 |
| CORTEXA510 | 0.122549 | 0.000155 | 1.22512 |
| NEOVERSEN1 | 0.122552 | 0.00038 | 1.22515 |
| FALKOR | 0.122615 | 0.000345 | 1.22578 |
| CORTEXA72 | 0.122624 | 0.000125 | 1.22587 |
| A64FX | 0.122658 | 0.000415 | 1.22621 |
| CORTEXX2 | 0.122666 | 0.00016 | 1.22628 |
| EMAG8180 | 0.122683 | 0.00029 | 1.22645 |
| CORTEXA76 | 0.122714 | 0.000335 | 1.22676 |
| CORTEXX1 | 0.122719 | 0.00028 | 1.22682 |
| FT2000 | 0.122807 | 0.00014 | 1.2277 |
| CORTEXA57 | 0.122884 | 0.00027 | 1.22847 |
| VORTEX | 0.122974 | 0.00038 | 1.22937 |
| NEOVERSEN2 | 0.123136 | 0.00039 | 1.23099 |
| THUNDERX3T110 | 0.125751 | 0.000185 | 1.25713 |
| THUNDERX2T99 | 0.127315 | 0.00061 | 1.27276 |
| CORTEXA55 | 0.152537 | 0.000585 | 1.5249 |
| CORTEXA53 | 0.153044 | 0.000605 | 1.52998 |
| THUNDERX | 0.241916 | 0.00081 | 2.41843 |
The rest of the benchmarks are running; will see how different they look.
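The perf_ratios column above is just each target's mean normalized by the fastest one; a quick sanity check (values copied from the table):

```python
# Recompute the perf_ratios column: each target's mean runtime divided by
# the fastest target's mean (NEOVERSEV1 here). A subset of rows is shown.
means = {
    "NEOVERSEV1": 0.10003,
    "ARMV8SVE": 0.106404,
    "THUNDERX": 0.241916,
}
fastest = min(means.values())
ratios = {target: mean / fastest for target, mean in means.items()}
print(ratios)  # ARMV8SVE -> ~1.06372, THUNDERX -> ~2.41843
```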
It'd be good to test these on an `r8g` instance as well, as that has 128-bit SVE; with the `c7g` you have 256-bit SVE, so the SVE kernels can perform differently. It's also worth noting that the `A64FX` target would benefit from being run on that specific core, as that has 512-bit SVE and slightly different kernels.
@Mousius could you weigh in about a possible set of kernels that make sense? Over at #166 I suggested `ARMV8 CORTEXA57 NEOVERSEV1 THUNDERX`, but had to use `ARMV8 CORTEXA57 THUNDERX` on the EOL musllinux_1_1 build, since the gcc there (9.2) does not support SVE.
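For experimenting with kernel subsets like these, the build knob involved is OpenBLAS's `DYNAMIC_LIST`; a hypothetical build-settings fragment (verify that your OpenBLAS version honors `DYNAMIC_LIST` on arm64 — older releases only supported it on x86-64):

```make
# Hypothetical settings for a reduced dynamic-arch build; the target set
# matches the one suggested in #166.
DYNAMIC_ARCH = 1
DYNAMIC_LIST = ARMV8 CORTEXA57 NEOVERSEV1 THUNDERX
```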
Full bench suite on c7g: https://gist.github.com/ev-br/c1a35b386c90d8eaac484520d8256927
I've tried tweaking some constants in https://github.com/OpenMathLib/OpenBLAS/pull/4833; if we do this, we could potentially have just `ARMV8` and `ARMV8SVE` without losing too much 🤔
Do you mind benchmarking these changes @ev-br ?
TL;DR: not easily, sadly. Unless your changes are visible on codspeed, or will be visible on blas-benchmarks next Wednesday after your PR merges. Or if you have a suggestion for how to extend either the codspeed or blas-benchmarks setup to probe your changes.
There are two ways OpenBLAS benchmarks run currently:
- codspeed
- blas-benchmarks, driven by the `scipy-openblas32` weekly builds from the anaconda nightly bucket (https://github.com/OpenMathLib/BLAS-Benchmarks/blob/main/.github/workflows/run_cirun_graviton.yml#L97)

Both were set up as part of an STF project co-PI-ed by @martin-frbg and @rgommers. The AWS costs for the blas-benchmarks weekly runs are also picked up by Quansight (I believe).
I'm happy to help extend the set of benchmarks these two services run --- do you have suggestions for what would be useful to add? Large-scale restructurings I'm also happy to work on, but those will have to be cleared through Quansight first.
Neither of these has per-kernel granularity, though. My one-off per-kernel runs of https://github.com/MacPython/openblas-libs/issues/144#issuecomment-2254556649 and https://github.com/MacPython/openblas-libs/issues/170#issuecomment-2260529416 are a bit different: those are numpy benchmarks, and they also rely on scipy-openblas32 wheels. Also worth noting that these runs rely on benchmarking scripts by Matti and Gagan, developed as part of some other Quansight-funded effort, not sure which one.
I was only able to run these one-off experiments because a) Matti and Gagan had the benchmarking scripts, b) I have the AWS setup ready from the blas-benchmarks work, and c) Quansight basically shrugged off the costs of a couple of hours of CPU and engineering time. I'm definitely happy to evolve either set of benchmarks or set up some other strategy, once it's cleared with Quansight.
So possible concrete steps:
Easy ones:
Needs some design:
I think this is fairly low-prio? I'd move from Travis CI to Cirrus CI and be done with it, to address the CI problem. The gain in binary size is much more limited than for x86-64, and download numbers are way lower. So I don't think this is worth spending a lot of time on at the moment.
Hi @ev-br,
I meant the benchmarks in https://github.com/MacPython/openblas-libs/issues/170#issuecomment-2260529416 only 😸
If those one-shot benchmarks show the `ARMV8` target getting close enough to the `NEOVERSEN1` target, and the `ARMV8SVE` target getting close enough to the `NEOVERSEV1` target, then that's a good indication that it'll work for a number of modern cores.
@mattip is it easy to use the infra in this repo to build from my branch of OpenBLAS? It'd be easier than trying to recreate the build parameters you've used 😸
@rgommers understood, hopefully this minimal step is enough 😸
> I meant the benchmarks in #170 (comment) only
Yeah, a technical hurdle here is that numpy benchmarks need a python wheel, and I'm not sure how to generate one from a local OpenBLAS build.
We only do full wheels, but perhaps it would be sufficient to replace the libscipy-openblas in numpy.libs with your identically named own build, after installing the stock numpy wheel?
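A small helper to locate the bundled library for that swap (a sketch; the `numpy.libs` directory name is what auditwheel produces for Linux wheels — macOS wheels place the dylibs elsewhere):

```python
# Sketch: find the libscipy_openblas shared object bundled in an installed
# numpy Linux wheel, so it can be overwritten with a local build of the
# same name. Returns [] when numpy (or the bundled lib) isn't present.
import glob
import os

def bundled_openblas_paths():
    try:
        import numpy
    except ImportError:
        return []
    site_dir = os.path.dirname(os.path.dirname(numpy.__file__))
    return glob.glob(os.path.join(site_dir, "numpy.libs", "libscipy_openblas*"))
```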
I wonder if the problem with aarch64 builds on Travis CI is that we are running out of memory and the build process is killed (on manylinux/glibc). Travis has a 3GB limit. Similar to issue #144 and PR #166, we should benchmark aarch64 on a high-end aarch64 machine.
@ev-br is this something you could do? Is the AWS m7g instance (with a Graviton3 processor) advanced enough to use the `THUNDERX3T110` kernels, or is that targeting some other processor?