mattip opened 1 month ago
The `THUNDERX3T110` target uses AdvSIMD only, whereas the `NEOVERSEV1` target on the AWS M7g can use SVE. Most of the SVE targets remap back to `NEOVERSEV1` at the moment, so removing that would be pretty bad for performance.
I remapped any common targets back together in https://github.com/OpenMathLib/OpenBLAS/pull/4389; unsure how to tell which targets are less used and could be removed 🤔
Also ref: https://github.com/OpenMathLib/OpenBLAS/blob/develop/Makefile.system#L686-L700
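Those lines gate which cores go into the `DYNAMIC_ARCH` dispatch list; schematically it does something like the following (a paraphrase, not an exact excerpt — the real target list and conditions are in the linked file):

```make
# Paraphrased sketch of the arm64 DYNAMIC_CORE selection in Makefile.system;
# SVE-capable targets are only added when the toolchain supports SVE.
ifeq ($(ARCH), arm64)
DYNAMIC_CORE = ARMV8 CORTEXA53 CORTEXA57 THUNDERX THUNDERX2T99 TSV110 EMAG8180 NEOVERSEN1
ifneq ($(NO_SVE), 1)
DYNAMIC_CORE += NEOVERSEV1 NEOVERSEN2 ARMV8SVE A64FX
endif
endif
```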
Thanks. Is `NEOVERSEV1` active when using GCC (like in the build here)?
BLAS-benchmarks runs on a c7g.large instance (https://aws.amazon.com/ec2/instance-types/c7g/) via https://github.com/OpenMathLib/BLAS-Benchmarks/blob/main/.cirun.yml. Would this be enough?
Also, does @czgdp1807's benchmarking machinery handle aarch64 architectures?
> Thanks. Is `NEOVERSEV1` active when using GCC (like in the build here)?
In manylinux2014 with GCC 10.2 you should get the SVE targets.
For certain toolchains, such as the `MACOSX_DEPLOYMENT_TARGET` builds, there isn't full support and it's disabled.
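To confirm at runtime which kernel set a given build actually selected, OpenBLAS exposes `openblas_get_corename()`; here is a minimal ctypes sketch (the library lookup is an assumption — point it at your actual `libopenblas` or `libscipy_openblas` shared object):

```python
# Sketch: query a loaded OpenBLAS for the kernel set it picked at runtime,
# via the openblas_get_corename() C API. Returns None if no library is found.
import ctypes
import ctypes.util

def openblas_corename(libpath=None):
    path = libpath or ctypes.util.find_library("openblas")
    if path is None:
        return None  # no system OpenBLAS; pass the wheel's .so path instead
    try:
        lib = ctypes.CDLL(path)
        lib.openblas_get_corename.restype = ctypes.c_char_p
        return lib.openblas_get_corename().decode()
    except (OSError, AttributeError):
        return None

print(openblas_corename())  # e.g. "neoversev1" on a Graviton3 SVE build
```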
> In manylinux2014 with GCC 10.2 you should get the SVE targets.
Cool, thanks
> BLAS-benchmarks runs on a c7g.large
That is Graviton3, so it should be as good as it gets.
> does @czgdp1807's benchmarking machinery handle aarch64 architectures?
I think so; you need to specify a different set of kernels. You can see which ones in `Makefile.system`, from this comment.
Ok, one benchmark: this is Linux on arm64 (not macOS), on a c7g.large machine on AWS:
```
{'arch': 'aarch64', 'cpu': '', 'machine': 'ip-172-31-6-241', 'num_cpu': '2', 'os': 'Linux 6.8.0-1009-aws', 'ram': '3899308', 'python': '3.12', 'Cython': '', 'build': '', 'packaging': ''}
```
`bench_linalg.Eindot.time_matmul_a_b`
| arch | mean | spread | perf_ratios |
|:--------------|---------:|---------:|--------------:|
| NEOVERSEV1 | 0.10003 | 0.000357 | 1 |
| ARMV8SVE | 0.106404 | 0.000465 | 1.06372 |
| CORTEXA73 | 0.122021 | 0.00047 | 1.21984 |
| ARMV8 | 0.12206 | 0.0002 | 1.22023 |
| CORTEXA710 | 0.122363 | 0.000195 | 1.22326 |
| TSV110 | 0.122464 | 0.000285 | 1.22427 |
| CORTEXA510 | 0.122549 | 0.000155 | 1.22512 |
| NEOVERSEN1 | 0.122552 | 0.00038 | 1.22515 |
| FALKOR | 0.122615 | 0.000345 | 1.22578 |
| CORTEXA72 | 0.122624 | 0.000125 | 1.22587 |
| A64FX | 0.122658 | 0.000415 | 1.22621 |
| CORTEXX2 | 0.122666 | 0.00016 | 1.22628 |
| EMAG8180 | 0.122683 | 0.00029 | 1.22645 |
| CORTEXA76 | 0.122714 | 0.000335 | 1.22676 |
| CORTEXX1 | 0.122719 | 0.00028 | 1.22682 |
| FT2000 | 0.122807 | 0.00014 | 1.2277 |
| CORTEXA57 | 0.122884 | 0.00027 | 1.22847 |
| VORTEX | 0.122974 | 0.00038 | 1.22937 |
| NEOVERSEN2 | 0.123136 | 0.00039 | 1.23099 |
| THUNDERX3T110 | 0.125751 | 0.000185 | 1.25713 |
| THUNDERX2T99 | 0.127315 | 0.00061 | 1.27276 |
| CORTEXA55 | 0.152537 | 0.000585 | 1.5249 |
| CORTEXA53 | 0.153044 | 0.000605 | 1.52998 |
| THUNDERX | 0.241916 | 0.00081 | 2.41843 |
The rest of the benchmarks are running; will see how different they look.
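The perf_ratios column above is just each target's mean normalized by the fastest one; a quick sanity check (values copied from the table):

```python
# Recompute the perf_ratios column: each target's mean runtime divided by
# the fastest target's mean (NEOVERSEV1 here). A subset of rows is shown.
means = {
    "NEOVERSEV1": 0.10003,
    "ARMV8SVE": 0.106404,
    "THUNDERX": 0.241916,
}
fastest = min(means.values())
ratios = {target: mean / fastest for target, mean in means.items()}
print(ratios)  # ARMV8SVE -> ~1.06372, THUNDERX -> ~2.41843
```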
It'd be good to test these on an `r8g` instance as well, as that has 128-bit SVE; with the `c7g` you have 256-bit SVE, so the SVE kernels can perform differently. It's also worth noting that the `A64FX` target would benefit from being run on that specific core, as that has 512-bit SVE and slightly different kernels.
@Mousius could you weigh in about a possible set of kernels that make sense? Over at #166 I suggested `ARMV8 CORTEXA57 NEOVERSEV1 THUNDERX`, but had to use `ARMV8 CORTEXA57 THUNDERX` on the EOL musllinux_1_1 build, since the gcc there (9.2) does not support SVE.
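For experimenting with kernel subsets like these, the build knob involved is OpenBLAS's `DYNAMIC_LIST`; a hypothetical build-settings fragment (verify that your OpenBLAS version honors `DYNAMIC_LIST` on arm64 — older releases only supported it on x86-64):

```make
# Hypothetical settings for a reduced dynamic-arch build; the target set
# matches the one suggested in #166.
DYNAMIC_ARCH = 1
DYNAMIC_LIST = ARMV8 CORTEXA57 NEOVERSEV1 THUNDERX
```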
Full bench suite on c7g: https://gist.github.com/ev-br/c1a35b386c90d8eaac484520d8256927
I've tried tweaking some constants in https://github.com/OpenMathLib/OpenBLAS/pull/4833; if we do this, we could potentially have just `ARMV8` and `ARMV8SVE` without losing too much 🤔
Do you mind benchmarking these changes @ev-br ?
TL;DR: not easily, sadly. Unless your changes are visible on codspeed, or will be visible on blas-benchmarks next Wednesday after your PR merges. Or if you have a suggestion for how to extend either the codspeed or blas-benchmarks setup to probe your changes.
There are two ways OpenBLAS benchmarks run currently:
- codspeed
- blas-benchmarks, driven by the `scipy-openblas32` weekly builds from the anaconda nightly bucket (https://github.com/OpenMathLib/BLAS-Benchmarks/blob/main/.github/workflows/run_cirun_graviton.yml#L97)

Both were set up as part of an STF project co-PI-ed by @martin-frbg and @rgommers. The AWS costs for the blas-benchmarks weekly runs are also picked up by Quansight (I believe).
I'm happy to help extend the set of benchmarks these two services run --- do you have suggestions for what would be useful to add? Large-scale restructurings I'm also happy to work on, but those will have to be cleared through Quansight first.
Neither of these has per-kernel granularity, though. My one-off per-kernel runs of https://github.com/MacPython/openblas-libs/issues/144#issuecomment-2254556649 and https://github.com/MacPython/openblas-libs/issues/170#issuecomment-2260529416 are a bit different: those are numpy benchmarks, and they also rely on scipy-openblas32 wheels. Also worth noting that these runs rely on benchmarking scripts by Matti and Gagan, developed as part of some other Quansight-funded effort, not sure which one.
I was only able to run these one-off experiments because a) Matti and Gagan had the benchmarking scripts, b) I have the AWS setup ready from the blas-benchmarks work, and c) Quansight basically shrugged off the costs of a couple of hours of CPU and engineering time. I'm definitely happy to evolve either set of benchmarks or set up some other strategy, once it's cleared with Quansight.
So possible concrete steps:
Easy ones:
Needs some design:
I think this is fairly low-prio? I'd move from Travis CI to Cirrus CI and be done with it, to address the CI problem. The gain in binary size is much more limited than for x86-64, and download numbers are way lower. So I don't think this is worth spending a lot of time on at the moment.
Hi @ev-br,
I meant the benchmarks in https://github.com/MacPython/openblas-libs/issues/170#issuecomment-2260529416 only 😸
If those one-shot benchmarks show the `ARMV8` target getting close enough to the `NEOVERSEN1` target, and the `ARMV8SVE` target getting close enough to the `NEOVERSEV1` target, then that's a good indication that it'll work for a number of modern cores.
@mattip is it easy to use the infra in this repo to build from my branch of OpenBLAS? It'd be easier than trying to recreate the build parameters you've used 😸
@rgommers understood, hopefully this minimal step is enough 😸
> I meant the benchmarks in #170 (comment) only
Yeah, a technical hurdle here is that numpy benchmarks need a python wheel, and I'm not sure how to generate one from a local OpenBLAS build.
We only do full wheels, but perhaps it would be sufficient to replace the libscipy-openblas in numpy.libs with your identically named own build, after installing the stock numpy wheel?
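A small helper to locate the bundled library for that swap (a sketch; the `numpy.libs` directory name is what auditwheel produces for Linux wheels — macOS wheels place the dylibs elsewhere):

```python
# Sketch: find the libscipy_openblas shared object bundled in an installed
# numpy Linux wheel, so it can be overwritten with a local build of the
# same name. Returns [] when numpy (or the bundled lib) isn't present.
import glob
import os

def bundled_openblas_paths():
    try:
        import numpy
    except ImportError:
        return []
    site_dir = os.path.dirname(os.path.dirname(numpy.__file__))
    return glob.glob(os.path.join(site_dir, "numpy.libs", "libscipy_openblas*"))
```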
I wonder if the problem with aarch64 builds on Travis CI is that we are running out of memory and the build process is killed (on manylinux/glibc). Travis has a 3GB limit. Similar to issue #144 and PR #166, we should benchmark aarch64 on a high-end aarch64 machine.
@ev-br is this something you could do? Is the AWS m7g instance (with a Graviton3 processor) advanced enough to use the `THUNDERX3T110` kernels, or is that targeting some other processor?