Add Neoverse V2 and Armv9

AdhocMan commented 10 months ago

Add support for Armv9 and Neoverse v2, as required for the Nvidia Grace CPU.

alalazo commented 10 months ago

I'll follow up with a PR referencing this on the archspec side and check whether the tests pass. @fspiga Do you (or anybody from NVIDIA) want to double-check this?

fspiga commented 9 months ago

I have staged changes for similar enhancements on my local system and have been testing them for a few weeks.

For GCC/GNU, regarding the use of -mtune and -march, see Compiler flags across architectures: -march, -mtune, and -mcpu. As long as someone is not cross-compiling, we advise using just -mcpu from 12.2 onward. It is possible to compile anything with an older GCC, but we strongly advise using a modern version of the compiler. On any Grace platform, a modern software stack means a higher chance of better generated code and performance.

Re NVHPC, IIRC neoverse-v2 support in 23.3 was more like a beta. We advise starting from 23.5, where we added a first wave of bug fixes and enhancements. I will verify this.

Re CLANG, I am not sure that -msve-vector-bits=128 is necessary. -mcpu=neoverse-v2 should already imply 128-bit SIMD (it is a fixed uarch property). I will verify this.

It is worth adding the Arm HPC Compiler to this mix. Support for V2 starts from version 23.04.1. We can have someone from Arm Ltd double-check this.

In my working copy, anything below the recommended versions for Grace falls back to neoverse_n1 or armv8.4a. This is because, at least in my mind, whoever gets a Grace system will also have a modern Linux kernel (6.2+) and a recent OS, and will take the step of using modern compilers. Internally we are not testing for performance or optimal code generation with older GNU compilers on Grace. We know that, thanks to the Arm ecosystem, any generic armv8 binary with NEON enabled will run.
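The fallback rule described above can be sketched in a few lines of Python. This is only an illustration, not the code in the working copy: the table `RECOMMENDED_MIN` and the helpers `parse_version` and `pick_target` are hypothetical names, and the minimum versions are the ones quoted in this comment, which are still to be verified.

```python
# Hypothetical table: minimum compiler versions suggested in this comment
# for targeting Grace. The values are unverified and may change.
RECOMMENDED_MIN = {
    "gcc": (12, 2),
    "nvhpc": (23, 5),
    "arm": (23, 4, 1),
}


def parse_version(text):
    """Turn '23.04.1' into (23, 4, 1) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def pick_target(compiler, version, fallback="neoverse_n1"):
    """Return 'neoverse_v2' for a recommended compiler, else the fallback."""
    minimum = RECOMMENDED_MIN.get(compiler)
    if minimum is not None and parse_version(version) >= minimum:
        return "neoverse_v2"
    # Unknown compilers and older versions get the generic fallback target.
    return fallback


print(pick_target("gcc", "13.1.0"))
print(pick_target("nvhpc", "23.3"))
```

Keeping the thresholds in one table makes it easy to adjust them once the versions have been confirmed.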

AdhocMan commented 9 months ago

> For GCC/GNU, regarding the use of -mtune and -march, see Compiler flags across architectures: -march, -mtune, and -mcpu. As long as someone is not cross-compiling, we advise using just -mcpu from 12.2 onward. It is possible to compile anything with an older GCC, but we strongly advise using a modern version of the compiler. On any Grace platform, a modern software stack means a higher chance of better generated code and performance.

As far as I know, GCC only added support for -mcpu=neoverse-v2 in version 13. Am I mistaken, or are you suggesting a different CPU target for 12.2?

> Re NVHPC, IIRC neoverse-v2 support in 23.3 was more like a beta. We advise starting from 23.5, where we added a first wave of bug fixes and enhancements. I will verify this.

That would be great, thanks.

> Re CLANG, I am not sure that -msve-vector-bits=128 is necessary. -mcpu=neoverse-v2 should already imply 128-bit SIMD (it is a fixed uarch property). I will verify this.

The documentation of both GCC and CLANG states that the default is 'scalable'. I could not verify whether a fixed width is implied by the -mcpu option, so I've kept the flag where possible. I'll remove it if you can verify that it is redundant.

> It is worth adding the Arm HPC Compiler to this mix. Support for V2 starts from version 23.04.1. We can have someone from Arm Ltd double-check this.

According to the release notes, it was added in 23.04.0: https://developer.arm.com/documentation/107578/2304/?lang=en. I'll add the flags starting from that release.

> In my working copy, anything below the recommended versions for Grace falls back to neoverse_n1 or armv8.4a. This is because, at least in my mind, whoever gets a Grace system will also have a modern Linux kernel (6.2+) and a recent OS, and will take the step of using modern compilers. Internally we are not testing for performance or optimal code generation with older GNU compilers on Grace. We know that, thanks to the Arm ecosystem, any generic armv8 binary with NEON enabled will run.

I'd also expect a recent compiler on Grace systems for any production workload. But most other architectures specified here add tuning flags for older GCC versions as well, so to stay consistent I took the 'neoverse_n1' specification as a guideline. For these older versions I relied mainly on documentation, so these flags are not necessarily optimal; they are only an educated guess.
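To make the idea of per-version tuning flags concrete, here is a minimal sketch of a lookup over version-range entries, loosely modelled on the "versions"/"flags" pairs used for other targets. The entries in `GCC_ENTRIES`, the `low:high` range syntax as parsed here, and the helper names are assumptions for illustration only; they are not the flags proposed in this PR.

```python
# Hypothetical entries: each pairs a "low:high" version range with flags.
# An empty bound means the range is open on that side. Placeholder values.
GCC_ENTRIES = [
    {"versions": "13.0:", "flags": "-mcpu=neoverse-v2"},
    {"versions": "9.0:12.99", "flags": "-mcpu=neoverse-n1"},
]


def as_tuple(version):
    """Parse '12.3' into (12, 3, 0); pad so '13' and '13.0' compare equal."""
    parts = [int(p) for p in version.split(".")]
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def in_range(version, spec):
    """Check a version against a 'low:high' spec with optional bounds."""
    low, _, high = spec.partition(":")
    v = as_tuple(version)
    if low and v < as_tuple(low):
        return False
    if high and v > as_tuple(high):
        return False
    return True


def flags_for(version, entries):
    """Return the flags of the first matching entry, or None if none match."""
    for entry in entries:
        if in_range(version, entry["versions"]):
            return entry["flags"]
    return None


print(flags_for("13.2.0", GCC_ENTRIES))
print(flags_for("12.3.0", GCC_ENTRIES))
```

Returning `None` for unmatched versions leaves the decision about a generic fallback to the caller.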

willlovett-arm commented 9 months ago

Hi all,

(Simon - hi! I'm technology manager for the compiler teams at Arm).

> The documentation of both GCC and CLANG states that the default is 'scalable'. I could not verify whether a fixed width is implied by the -mcpu option, so I've kept the flag where possible. I'll remove it if you can verify that it is redundant.

Correct - this example https://godbolt.org/z/a7a3En6TT shows that in practice.

Can I check: do we have a good reason for enabling width-specific flags here? If it's for a good reason (e.g. we've explicitly written width-specific library code), then it's fine. If it's for performance reasons (e.g. the compiler gets to assume another thing, so it should be faster, right?), I'd be wary. We do almost all our optimization work on width-agnostic codegen. I'd go so far as to say: if you have any examples where you saw a speed advantage with width-specific code, please let me know, so we can put effort into fixing them.

I recognise that, in this particular case, this framework knows it's only ever building for neoverse-v2, so it's probably a moot point. But note that we (Arm) are keen to proliferate vector-length-agnostic code wherever possible, to reduce the impact of future vector length incompatibility.

Thanks!

Will.
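For readers unfamiliar with the term, the loop structure behind vector-length-agnostic code can be emulated in plain Python. This is a conceptual sketch only, not SVE code and not compiler output: `whilelt` and `vla_add` are hypothetical helpers, with `whilelt` named after the SVE predicate-generating instruction it imitates. The point is that the loop never hard-codes the number of lanes, so the same code gives the same result for any vector length.

```python
def whilelt(start, n, lanes):
    """Predicate with one flag per lane: True while start + lane < n."""
    return [start + lane < n for lane in range(lanes)]


def vla_add(a, b, lanes):
    """Add two equal-length lists, processing `lanes` elements per step."""
    n = len(a)
    out = [0] * n
    i = 0
    while i < n:
        pred = whilelt(i, n, lanes)
        for lane, active in enumerate(pred):
            if active:  # inactive lanes in the final step are skipped
                out[i + lane] = a[i + lane] + b[i + lane]
        i += lanes  # advance by whatever the vector length happens to be
    return out


a = list(range(10))
b = [10] * 10
# The result does not depend on the lane count chosen at run time.
for lanes in (2, 4, 8, 16):
    assert vla_add(a, b, lanes) == [x + 10 for x in a]
```

A width-specific variant would instead fix `lanes` at build time, which is the assumption that a flag such as -msve-vector-bits=128 asks the compiler to make.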

alalazo commented 9 months ago

@willlovett-arm @dslarm Can you suggest changesets to this PR (I assume they amount to removing -msve-vector-bits=128)? I'll accept those and get this PR merged. Then we could discuss if any improvement is needed wrt flags, but I'd like to get neoverse-v2 support in the next Spack release which is happening early this week...

AdhocMan commented 9 months ago

> Can I check: do we have a good reason for enabling width-specific flags here? If it's for a good reason (e.g. we've explicitly written width-specific library code), then it's fine. If it's for performance reasons (e.g. the compiler gets to assume another thing, so it should be faster, right?), I'd be wary. We do almost all our optimization work on width-agnostic codegen. I'd go so far as to say: if you have any examples where you saw a speed advantage with width-specific code, please let me know, so we can put effort into fixing them.
>
> I recognise that, in this particular case, this framework knows it's only ever building for neoverse-v2, so it's probably a moot point. But note that we (Arm) are keen to proliferate vector-length-agnostic code wherever possible, to reduce the impact of future vector length incompatibility.
>
> Thanks!
>
> Will.

Hi Will, thanks for providing feedback on this.

I think the fact that your compiler optimization work targets width-agnostic codegen is a convincing argument for removing the fixed SVE vector length. The intention was to give the compiler as many optimization opportunities as possible. One of our applications could benefit from a fixed vector width, but that should not translate into (potentially) worse optimization in general.

alalazo commented 9 months ago

Thanks @AdhocMan and everyone involved in the discussion!

archspec / archspec-json

Add Neoverse V2 and Armv9 #79