Closed georges-arm closed 1 month ago
Are you planning to contribute some SVE/SVE2 code?
For now it seems a bit pointless, all the more so because with the current build architecture it's impossible to compile our x86 SIMD to SVE(2) using SIMDe. This makes the code basically placeholders, and we don't internally plan to use SVE or SVE2 anytime soon.
That's the plan yes, in particular a lot of the convolutions and SAD calculations can make use of the SVE-only 16-bit dot-product instructions: SDOT and UDOT.
I was aiming to put the first SVE patch and associated CMake changes up once there is a bit more Neon code to build on top of but figured the feature detection work could be a separate PR in the meantime. Let me know if you'd prefer me to combine this PR into a later one with the first SVE code instead.
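For readers unfamiliar with the 16-bit dot-product form mentioned above, here is a rough scalar model of what it computes: each 64-bit accumulator lane adds the sum of four adjacent 16-bit products. The function name `sdot16_model` and the use of `std::vector` are assumptions made purely for illustration; this is not vvenc code and says nothing about how the eventual SVE kernels will be written.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar reference model (hypothetical helper, not vvenc code) of a signed
// 16-bit dot product: every 64-bit accumulator lane gains the sum of four
// adjacent 16-bit x 16-bit products.
std::vector<int64_t> sdot16_model(std::vector<int64_t> acc,
                                  const std::vector<int16_t>& a,
                                  const std::vector<int16_t>& b)
{
  // Four 16-bit inputs feed each 64-bit lane.
  assert(a.size() == 4 * acc.size() && b.size() == a.size());
  for (std::size_t lane = 0; lane < acc.size(); ++lane)
  {
    for (std::size_t k = 0; k < 4; ++k)
    {
      acc[lane] += int64_t(a[4 * lane + k]) * int64_t(b[4 * lane + k]);
    }
  }
  return acc;
}
```

A convolution or SAD-style reduction over 16-bit samples maps naturally onto this shape, since the widening multiply and the horizontal add of four products happen in one step.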
I don't quite know what to do about it. Maybe that's also because of my lack of understanding of the Arm architectures.
So with x86 we have AVX2 and SSE4.1, with a preference for AVX2 on CPUs supporting it, because it's faster.
What about SVE, SVE2 and NEON? Would SVE be preferable to NEON? Do all architectures supporting SVE also support NEON? What about SVE2?
Anyway, I'm thinking maybe the best way forward would be to keep the refactoring, and then put the SVE and SVE2 stuff behind a macro that would not be enabled for now. That way it's prepared, but no one is distracted by SVE or SVE2 popping up in the --help while basically not doing anything. And then, when SVE code starts coming in, the macro can either be switched on or removed entirely (leaving the code enabled). What do you think?
So for the 64-bit Arm architecture (aka AArch64 or Armv8-A) we have Neon (previously also called Advanced SIMD or ASIMD), which has been mandatory from the start (v8.0). The Scalable Vector Extension (SVE) was introduced as an optional extension from v8.2. SVE2 was then introduced as part of v9.0 (v9.0 is a strict superset of v8.5).
There are some good guides on and introductions to SVE available, for example Introduction to SVE. As a brief introduction, SVE provides:
INCB to increment a register by the vector length in bytes.
It's worth emphasising that even when the SVE vector length is 128 bits (the same as Neon) we still expect a small improvement over Neon due to the additional new instructions.
Neon remains mandatory and available to use even when SVE/SVE2 are available. This is similar to how the presence of AVX512 does not break existing SSE2 code etc. The idea would be to contribute SVE/SVE2 code only where it provides a performance improvement to do so, such that SVE kernels should always be preferred when it is possible to use them.
Options I can think of:
1) Leave as is. Slightly confusing since --help will say SVE or SVE2 on Neoverse V1 and later machines, but otherwise not actually an error.
2) Keep the refactoring in this PR but move the SVE detection/enablement part to a later PR (either (a) alongside initial SVE kernels or (b) as a separate thing).
3) Keep the refactoring + SVE detection/enablement in this PR but add some logic to the CMake to allow enabling/disabling it, default to disabled for now, with the intention to later enable it.
4) Same as (3) but default to enabled from the start.
I don't have a strong preference between those options, I'll be happy as long as we end up with SVE/SVE2 being enabled by default once we have some kernels that can actually use it. Having a CMake option to control compilation of newer architecture extensions might actually be useful for debugging, I was originally just trying to avoid doing too much in this PR.
Let me know your preference or if any other clarifications are needed and I'll adjust the PR as needed. Thanks!
Thanks for the clarification.
First things first, I'd say let's go with 3. So basically introduce a macro, similar to SIMD_ENABLED in TypeDef.h, which can then be controlled through CMake. Something like SUPPORT_ARM_SVE. For the time being it should be disabled by default, since it cannot be broadly used.
This confuses me as well tho, since you mentioned in #431 the kernels would not share much code between SVE and NEON. And now it sounds like SVE to NEON is like SSE4.2 to SSE3.0, basically just some additional instructions within the same framework. Or is the intrinsics syntax fully divergent between SVE and NEON?
Ack!
The Neon vector registers v0 - v31 and SVE vector registers z0 - z31 overlap: the lowest 128 bits of each SVE vector register zN are the same as the 128-bit Neon register vN, but the SVE vector registers may be larger. This is similar to the x86 register overlap introduced by xmm0 vs ymm0 vs zmm0.
Since the register sizes differ, there is a different set of intrinsics to use here. For example, the prototypes to add two vectors of uint32_t together:
Neon: uint32x4_t vaddq_u32(uint32x4_t a, uint32x4_t b);
SVE: svuint32_t svadd_u32_x(svbool_t p, svuint32_t a, svuint32_t b);
(The SVE intrinsics also take a predicate to mask certain operations, but the _x here indicates to the compiler that we do not care about the values of the inactive lanes and it is free to discard them.)
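To make the difference between the two intrinsic families more concrete, here is a minimal sketch of an element-wise add written both ways. The helper name `add_u32` is hypothetical and this is not vvenc code. It selects the path at compile time through the `__ARM_FEATURE_SVE` and `__ARM_NEON` macros, whereas the PR under discussion is about runtime detection, and it falls back to plain scalar code on other targets.

```cpp
#include <cstddef>
#include <cstdint>

#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
#elif defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Hypothetical helper (not vvenc code): dst[i] = a[i] + b[i] for n elements.
void add_u32(uint32_t* dst, const uint32_t* a, const uint32_t* b, std::size_t n)
{
  std::size_t i = 0;
#if defined(__ARM_FEATURE_SVE)
  // Vector-length-agnostic loop: svcntw() is the number of 32-bit lanes in
  // one SVE vector, and the svwhilelt predicate masks off lanes past n,
  // so no separate tail loop is needed.
  for (; i < n; i += svcntw())
  {
    svbool_t pg = svwhilelt_b32_u64(uint64_t(i), uint64_t(n));
    svuint32_t va = svld1_u32(pg, a + i);
    svuint32_t vb = svld1_u32(pg, b + i);
    svst1_u32(pg, dst + i, svadd_u32_x(pg, va, vb));
  }
#elif defined(__ARM_NEON)
  // Fixed 128-bit vectors: four uint32_t lanes per iteration.
  for (; i + 4 <= n; i += 4)
  {
    vst1q_u32(dst + i, vaddq_u32(vld1q_u32(a + i), vld1q_u32(b + i)));
  }
#endif
  // Scalar tail (Neon leftovers) or full fallback on other targets.
  for (; i < n; ++i)
  {
    dst[i] = a[i] + b[i];
  }
}
```

The Neon loop hard-codes the lane count, while the SVE loop asks the hardware for it, which is why the same SVE binary can run unchanged on implementations with different vector lengths.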
To your point about code reuse between Neon and SVE: yes there are some cases where it will be beneficial to reuse the bulk of the existing Neon code and simply use the new SVE instructions by operating on the lowest 128-bits of the SVE registers, taking advantage of how the registers overlap. For these cases we would probably re-introduce headers as we need them to expose some common helper functions to avoid the duplication.
Understand, thanks!
Makes me think we could structure the x86 intrinsics better as well. But well, problem for another day. Let's get this MR merged first.
Added new CMake flags VVENC_ENABLE_ARM_SIMD_SVE and VVENC_ENABLE_ARM_SIMD_SVE2 to control the individual new feature flags, such that we can add more incrementally in the future for any further new extensions. Also added a bit of logic such that you can't accidentally end up in a situation where e.g. SVE2 is enabled without Neon, since these combinations don't make sense.
# Configured with cmake ...
$ ./vvenc-build-neon/bin/vvencFFapp --help | head -n1
vvencFFapp: VVenC, the Fraunhofer H.266/VVC Encoder, version 1.12.0 [Linux][clang 19.1.0][64 bit][SIMD=NEON]
# Configured with cmake ... -DVVENC_ENABLE_ARM_SIMD_SVE=1
$ ./vvenc-build-sve/bin/vvencFFapp --help | head -n1
vvencFFapp: VVenC, the Fraunhofer H.266/VVC Encoder, version 1.12.0 [Linux][clang 19.1.0][64 bit][SIMD=SVE]
# Configured with cmake ... -DVVENC_ENABLE_ARM_SIMD_SVE=1 -DVVENC_ENABLE_ARM_SIMD_SVE2=1
$ ./vvenc-build-sve2/bin/vvencFFapp --help | head -n1
vvencFFapp: VVenC, the Fraunhofer H.266/VVC Encoder, version 1.12.0 [Linux][clang 19.1.0][64 bit][SIMD=SVE2]
Refactor and add extensions for AArch64:
- Add new helper functions in CommonDefARM.cpp and adjust call sites to mirror the existing x86 behaviour.
- Amend the existing function names for the x86 extension handling to include "x86" in the name to distinguish from the new Arm cases.
- Add new Arm extension enum (ARM_VEXT) values for the Arm Scalable Vector Extension (SVE) and SVE2 extensions.
- Add Linux getauxval-based feature detection logic for the two new architecture features.
- Amend the InitARM.cpp switch statements to continue to fall back to the Neon implementations for now.
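As an illustration of the getauxval-based detection described above, the following is a minimal sketch, not the code from this PR. The names `ArmExtSketch`, `pick_ext` and `detect_ext` are hypothetical (the PR's real enum is ARM_VEXT), and the two `enable` parameters stand in for the build-time switches; how the actual CMake flags reach the code is an assumption here. The bit positions mirror the Linux AArch64 `HWCAP_ASIMD`, `HWCAP_SVE` and `HWCAP2_SVE2` definitions and are hard-coded only so that the sketch compiles on other platforms.

```cpp
#include <cstdint>

#if defined(__linux__) && defined(__aarch64__)
#include <sys/auxv.h>  // getauxval, AT_HWCAP, AT_HWCAP2
#endif

// Hypothetical enum for illustration; the PR's real enum is ARM_VEXT.
enum class ArmExtSketch { Scalar, Neon, Sve, Sve2 };

// Linux AArch64 hwcap bits, mirroring HWCAP_ASIMD, HWCAP_SVE (AT_HWCAP)
// and HWCAP2_SVE2 (AT_HWCAP2) from <asm/hwcap.h>.
constexpr unsigned long kHwcapAsimd = 1UL << 1;
constexpr unsigned long kHwcapSve   = 1UL << 22;
constexpr unsigned long kHwcap2Sve2 = 1UL << 1;

// Pure decision logic: pick the highest extension that is both reported by
// the kernel and enabled at build time. SVE2 is only considered when SVE is
// usable, and SVE only when Neon is present.
ArmExtSketch pick_ext(unsigned long hwcap, unsigned long hwcap2,
                      bool enableSve, bool enableSve2)
{
  if (!(hwcap & kHwcapAsimd)) return ArmExtSketch::Scalar;
  ArmExtSketch ext = ArmExtSketch::Neon;
  if (enableSve && (hwcap & kHwcapSve))
  {
    ext = ArmExtSketch::Sve;
    if (enableSve2 && (hwcap2 & kHwcap2Sve2)) ext = ArmExtSketch::Sve2;
  }
  return ext;
}

// Runtime wrapper: queries the kernel on Linux/AArch64 and reports Scalar
// elsewhere, since this sketch has no detection path for other platforms.
ArmExtSketch detect_ext(bool enableSve, bool enableSve2)
{
#if defined(__linux__) && defined(__aarch64__)
  return pick_ext(getauxval(AT_HWCAP), getauxval(AT_HWCAP2),
                  enableSve, enableSve2);
#else
  (void)enableSve;
  (void)enableSve2;
  return ArmExtSketch::Scalar;
#endif
}
```

Keeping the decision in a pure function separate from the `getauxval` call makes the fallback behaviour easy to check with synthetic hwcap values, for example confirming that a machine reporting SVE2 still resolves to Neon when SVE is disabled at build time.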