Add AArch64 extension handling and SVE/SVE2 feature detection

georges-arm commented 1 month ago

Refactor and add extensions for AArch64:

Add new helper functions in CommonDefARM.cpp and adjust call sites to mirror the existing x86 behaviour.
Amend the existing function names for the x86 extension handling to include "x86" in the name to distinguish from the new Arm cases.
Add new Arm extension enum ( ARM_VEXT ) values for the Arm Scalable Vector Extension (SVE) and SVE2 extensions.
Add Linux getauxval-based feature detection logic for the two new architecture features.
Amend the InitARM.cpp switch statements to continue to fall back to the Neon implementations for now.

adamjw24 commented 1 month ago

Are you planing to contribute some SVE/SVE2 code?

For now it seems a bit pointless. More so, because with the current build architecture its impossible to compile our x86 SIMD to SVE(2) using SIMDe. This makes the code basically placeholders, and we don't internally plan to use SVE or SVE2 anytime soon.

georges-arm commented 1 month ago

Are you planing to contribute some SVE/SVE2 code?

For now it seems a bit pointless. More so, because with the current build architecture its impossible to compile our x86 SIMD to SVE(2) using SIMDe. This makes the code basically placeholders, and we don't internally plan to use SVE or SVE2 anytime soon.

That's the plan yes, in particular a lot of the convolutions and SAD calculations can make use of the SVE-only 16-bit dot-product instructions: SDOT and UDOT.

I was aiming to put the first SVE patch and associated CMake changes up once there is a bit more Neon code to build on top of but figured the feature detection work could be a separate PR in the meantime. Let me know if you'd prefer me to combine this PR into a later one with the first SVE code instead.

adamjw24 commented 1 month ago

I don't quite know what to do about it. Maybe thats also because of the lack of understanding for the architectures of ARM.

So with x86 we have AVX2 and SSE4.1, with preference for AVX2 for CPUs supporting it, because its faster.

What about SVE, SVE2 and NEON? Would SVE be preferrable to NEON? Do all architectures supporting SVE also support NEON? What about SVE2?

Anyway, I'm thinking maybe the best way forward would be to keep the refactoring, and than put the SVE and SVE2 stuff in a macro that would for now not be enabled. So that its prepared but noone is distracted by SVE or SVE2 popping up in the --help, but basically not doing nothing. And than, when SVE code starts coming in, the macro can be either enabled or removed as enabled. What do you think?

georges-arm commented 1 month ago

I don't quite know what to do about it. Maybe thats also because of the lack of understanding for the architectures of ARM.

So for the 64-bit Arm architecture (aka AArch64 or Armv8-A) we have Neon (previously also called Advanced SIMD or ASIMD) mandatory from the start (v8.0). The Scalable Vector Extension (SVE) is introduced and is an optional extension from v8.2. SVE2 was then introduced as part of v9.0 (v9.0 is a strict superset of v8.5).

There are some good guides on and introductions to SVE available, for example Introduction to SVE. As a brief introduction, SVE provides:

Support for longer vector lengths as opposed to Neon which is a fixed 128-bit architecture. It's worth pointing out that unlike something like AVX512, in SVE the vector length may vary between different micro-architectures, for example the Neoverse N2 and Neoverse V2 cores have an SVE vector length of 128 bits but the Neoverse V1 has an SVE vector length of 256 bits. The HPC-focused Fujitsu A64FX core has an SVE vector length of 512 bits. This is enabled by adding several instructions that operate on the runtime vector length, for example INCB to increment a register by the vector length in bytes.
Support for pre-lane predicated vector load/store and data processing instructions (for example loading the first six lanes of a vector and zeroing the remainder or negating every-other lane of a vector, which is awkward in Neon).
A few new instructions like the 16-bit SDOT and UDOT instructions which I mentioned previously, as well as a few others.
Plus a few other things like gather/scatter and first-faulting loads for vectorising string routines, but these aren't really relevant to VVenC.

It's worth emphasising that even when the SVE vector length is 128-bits (the same as Neon) we still expect a small improvement over Neon due to the additional new instructions.

So with x86 we have AVX2 and SSE4.1, with preference for AVX2 for CPUs supporting it, because its faster.

What about SVE, SVE2 and NEON? Would SVE be preferrable to NEON? Do all architectures supporting SVE also support NEON? What about SVE2?

Neon remains mandatory and available to use even when SVE/SVE2 are available. This is similar to how the presence of AVX512 does not break existing SSE2 code etc. The idea would be to contribute SVE/SVE2 code only where it provides a performance improvement to do so, such that SVE kernels should always be preferred when it is possible to use them.

Anyway, I'm thinking maybe the best way forward would be to keep the refactoring, and than put the SVE and SVE2 stuff in a macro that would for now not be enabled. So that its prepared but noone is distracted by SVE or SVE2 popping up in the --help, but basically not doing nothing. And than, when SVE code starts coming in, the macro can be either enabled or removed as enabled. What do you think?

Options I can think of:

1) Leave as is. Slightly confusing since --help will say SVE or SVE2 on Neoverse V1 and later machines, but otherwise not actually an error. 2) Keep the refactoring in this PR but move the SVE detection/enablement part to a later PR (either (a) alongside initial SVE kernels or (b) as a separate thing). 3) Keep the refactoring + SVE detection/enablement in this PR but add some logic to the CMake to allow enabling/disabling it, default to disabled for now, with the intention to later enable it. 4) Same as (3) but default to enabled from the start.

I don't have a strong preference between those options, I'll be happy as long as we end up with SVE/SVE2 being enabled by default once we have some kernels that can actually use it. Having a CMake option to control compilation of newer architecture extensions might actually be useful for debugging, I was originally just trying to avoid doing too much in this PR.

Let me know your preference or if any other clarifications are needed and I'll adjust the PR as needed. Thanks!

adamjw24 commented 1 month ago

Thanks for the clarification.

First things first, I'd say lets go with 3. So basically introduce macro, similar to SIMD_ENABLED in TypeDef.h, which can than be controlled through CMake. Something like SUPPORT_ARM_SVE. For the it should be per default disabled, since it cannot be broadly used.

This confuses me as well tho, since you mentioned in #431 the kernels would not share much code between SVE and NEON. And now it sounds like SVE to NEON is like SSE4.2 to SSE3.0, basically just some additional instructions within the same framework. Or is the intrinsics syntax fully divergent between SVE and NEON?

georges-arm commented 1 month ago

First things first, I'd say lets go with 3. So basically introduce macro, similar to SIMD_ENABLED in TypeDef.h, which can than be controlled through CMake. Something like SUPPORT_ARM_SVE. For the it should be per default disabled, since it cannot be broadly used.

Ack!

This confuses me as well tho, since you mentioned in #431 the kernels would not share much code between SVE and NEON. And now it sounds like SVE to NEON is like SSE4.2 to SSE3.0, basically just some additional instructions within the same framework. Or is the intrinsics syntax fully divergent between SVE and NEON?

The Neon vector registers v0 - v31 and SVE vector registers z0 - z31 overlap: the lowest 128-bits of each SVE vector register zN is the same as the 128-bit Neon register vN, but the SVE vector registers may be larger. This is similar to the x86 register overlap introduced by xmm0 vs ymm0 vs zmm0.

Since the register sizes differ there are a different set of intrinsics to use here. For example the prototypes to add two vectors of uint32_t together:

Neon: uint32x4_t vaddq_u32(uint32x4_t a, uint32x4_t b); SVE: svuint32_t svadd_u32_x(svbool_t p, svuint32_t a, svuint32_t b); (The SVE intrinsics also take a predicate to mask certain operations, but the _x here indicates that the compiler that we do not care about the value of the unpredicated lanes and it is free to discard it)

To your point about code reuse between Neon and SVE: yes there are some cases where it will be beneficial to reuse the bulk of the existing Neon code and simply use the new SVE instructions by operating on the lowest 128-bits of the SVE registers, taking advantage of how the registers overlap. For these cases we would probably re-introduce headers as we need them to expose some common helper functions to avoid the duplication.

adamjw24 commented 1 month ago

Understand, thanks!

Makes me think we could structure the x86 intrinsics better as well. But well, problem for another day. Lets get this MR merged first.

georges-arm commented 1 month ago

Added new CMake flags VVENC_ENABLE_ARM_SIMD_SVE and VVENC_ENABLE_ARM_SIMD_SVE2 to control the individual new feature flags, such that we can add more incrementally in the future for any further new extensions. Also added a bit of logic such that you can't accidentally end up in a situation where e.g. SVE2 is enabled without Neon, since these combinations don't make sense.

# Configured with cmake ...
$ ./vvenc-build-neon/bin/vvencFFapp --help | head -n1
vvencFFapp: VVenC, the Fraunhofer H.266/VVC Encoder, version 1.12.0 [Linux][clang 19.1.0][64 bit][SIMD=NEON]

# Configured with cmake ... -DVVENC_ENABLE_ARM_SIMD_SVE=1
$ ./vvenc-build-sve/bin/vvencFFapp --help | head -n1
vvencFFapp: VVenC, the Fraunhofer H.266/VVC Encoder, version 1.12.0 [Linux][clang 19.1.0][64 bit][SIMD=SVE]

# Configured with cmake ... -DVVENC_ENABLE_ARM_SIMD_SVE=1 -DVVENC_ENABLE_ARM_SIMD_SVE2=1
$ ./vvenc-build-sve2/bin/vvencFFapp --help | head -n1
vvencFFapp: VVenC, the Fraunhofer H.266/VVC Encoder, version 1.12.0 [Linux][clang 19.1.0][64 bit][SIMD=SVE2]

fraunhoferhhi / vvenc

Add AArch64 extension handling and SVE/SVE2 feature detection #443