Discussion: ARM SVE Extensions

dotnet / runtime

Discussion: ARM SVE Extensions #13781

Open tannergooding opened 5 years ago

tannergooding commented 5 years ago

Forked off from: https://github.com/dotnet/coreclr/pull/23899#issuecomment-551881728

The .NET Runtime may eventually want to support the ARM SVE Extensions. These types have some interesting characteristics that may be worth further discussion.

tannergooding commented 5 years ago

@TamarChristinaArm:

It would seem that if one were doing interop with vector types, one would be better off with the fixed-size HW intrinsic types, and if one wants to interop with Vector<T> one should convert to/from fixed-size HW intrinsic types.

Sorry, I haven't been following the discussion much, but wouldn't this cause a problem in the future with something like SVE, where the user really doesn't know the size of the Vector?

It's probably not relevant for now but just wondering.

tannergooding commented 5 years ago

@tannergooding:

Please correct me if I'm wrong, but my understanding of SVE is that there are a number of restrictions on them including treating them as incomplete types in higher level languages (and not allowing sizeof or use in arrays, etc). Additionally, the size can be changed at execution time by modifying ZCR_ELx.LEN. However, the receiver/return value is always going to be the "sizeless" type and the instructions themselves are always the same, regardless of the "actual size"; is that correct?

I believe that differs from x86 (and therefore Vector<T>, it being cross-platform) in that on x86 you have different instructions, different types, and even different registers (even if some are subsets of others) for Vector128<T> vs Vector256<T>.

If/when we expose SVE, I would imagine it would need to be an ARM specific type (under S.R.I.Arm) rather than a cross-platform type (under S.R.Intrinsics, like Vector64<T>, Vector128<T>, and Vector256<T>) and there would need to be very specialized handling to ensure it always goes through the appropriate instructions for access, etc.

tannergooding commented 5 years ago

@TamarChristinaArm:

Please correct me if I'm wrong, but my understanding of SVE is that there are a number of restrictions on them including treating them as incomplete types in higher level languages (and not allowing sizeof or use in arrays, etc).

That's correct, the SVE ACLE types we have are incomplete, sizeless and definite.

Additionally, the size can be changed at execution time by modifying ZCR_ELx.LEN

Not by anything user-mode. While technically the kernel is allowed to change it, the expectation is that it won't do this after the process has started. You can have processes with different VLs on the same system though.

However, the receiver/return value is always going to be the "sizeless" type and the instructions themselves are always the same, regardless of the "actual size"; is that correct?

Yeah, "sizeless" of a specific type. So the way I saw it was that Vector<T> would represent all the SVE sizeless types, e.g. svint8_t would be Vector<Int8>. As in, you don't know how large the vector is but you do know its element size.

SVE also allows compiling code for a specific VL, in which case in ACLE you are then allowed to cast between the incomplete and a known complete type.

I believe that differs from x86 (and therefore Vector<T>, it being cross-platform) in that on x86 you have different instructions, different types, and even different registers (even if some are subsets of others) for Vector128<T> vs Vector256<T>.

Ah, wait, I think I'm missing something here.. Is Vector<T> not an actual type but a name for the "grouping" of all Vector*<T> types? But yes, for SVE you wouldn't have different functions or registers for different VLs.

If/when we expose SVE, I would imagine it would need to be an ARM specific type (under S.R.I.Arm) rather than a cross-platform type (under S.R.Intrinsics, like Vector64<T>, Vector128<T>, and Vector256<T>) and there would need to be very specialized handling to ensure it always goes through the appropriate instructions for access, etc.

Hmm, perhaps... that said, there are ISAs other than SVE which have the same notion of VL-agnostic types. So I wouldn't necessarily say this type would need to be SVE specific.

tannergooding commented 5 years ago

Ah, wait, I think I'm missing something here.. Is Vector<T> not an actual type

No, System.Numerics.Vector<T> is an actual type and is variable sized. Its size is determined (based on the current hardware) and fixed at process startup. Users can then determine the size using sizeof or Unsafe.SizeOf<Vector<T>>() and the number of elements via Vector<T>.Count. Due to it being a fixed size for the process, it can be used in arrays, with pointer arithmetic, etc.

While technically the kernel is allowed to change it, the expectation is that it won't do this after the process has started.

This seems to be the troublesome bit. If the size can change, it's probably something that needs to be handled. It doesn't sound like something we could handle or efficiently check and so restricting the type from use in other operations would seem like the sensible thing to do.

As in, you don't know how large the vector is but you do know its element size.

Can user code not query the current size? How is user code meant to know if it is safe to use for an array x elements in length? I would guess this is why the expectation is it won't change for the current process, since that would break any checks user-code had already made.

SVE also allows compiling code for a specific VL, in which case in ACLE you are then allowed to cast between the incomplete and a known complete type.

Could you elaborate on how this works in conjunction with the maximum size and that size being allowed to change? What happens if the user explicitly targets 256-bit vectors and the machine hardware is set to 512-bit mode? What happens if the user targets 256-bit and the machine is set to (or is changed to) 128-bit mode?

So I wouldn't necessarily say this type would need to be SVE specific.

Right, I was initially thinking ARM specific (not SVE specific) rather than being a "shared" type like Vector128<T> (which works on x86 as well). The troublesome part is architecture specific behaviors.

CarolEidt commented 5 years ago

While technically the kernel is allowed to change it, the expectation is that it won't do this after the process has started.

This seems to be the troublesome bit. If the size can change, it's probably something that needs to be handled. It doesn't sound like something we could handle or efficiently check and so restricting the type from use in other operations would seem like the sensible thing to do.

I'm not sure I understand in what way this would need to be handled. I think we should assert that it will not be changed within a process. If we can't ensure that (and verify it), then it seems like these types would be extremely problematic to use.

TamarChristinaArm commented 5 years ago

While technically the kernel is allowed to change it, the expectation is that it won't do this after the process has started.

This seems to be the troublesome bit. If the size can change, it's probably something that needs to be handled. It doesn't sound like something we could handle or efficiently check and so restricting the type from use in other operations would seem like the sensible thing to do.

I'm not sure I understand in what way this would need to be handled. I think we should assert that it will not be changed within a process. If we can't ensure that (and verify it), then it seems like these types would be extremely problematic to use.

Changing the VL won't be something any sane implementation would do without the request having come from something in the application itself. You have the prctl extensions which would allow a program to request a change in VL, but this comes with a lot of risks if you do it after the program has been running for a while.

You could have spilled vector registers for instance and now you'd load them up incorrectly. Technically changing the VL at runtime is undefined behaviour and at least the Linux kernel won't do so without being told to.
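For reference, Linux exposes the prctl extensions mentioned above through the PR_SVE_GET_VL and PR_SVE_SET_VL options. A minimal sketch of querying the calling thread's vector length might look like the following. The function name query_sve_vl_bytes is made up for illustration, and the preprocessor guards are an assumption added so the snippet still builds where the headers know nothing about SVE (it simply reports -1 there).

```c
#include <assert.h>
#if defined(__linux__)
#include <sys/prctl.h>
#endif

/* Returns the calling thread's SVE vector length in bytes, or -1 when the
   kernel or CPU does not report one (for example, on non-SVE hardware).
   The function name is hypothetical; only the prctl options are real. */
int query_sve_vl_bytes(void)
{
#if defined(__linux__) && defined(PR_SVE_GET_VL) && defined(PR_SVE_VL_LEN_MASK)
    int ret = prctl(PR_SVE_GET_VL);
    if (ret < 0)
        return -1;                        /* SVE not available to this task */
    int vl = ret & PR_SVE_VL_LEN_MASK;    /* low bits hold the length in bytes */
    assert(vl % 16 == 0);                 /* SVE lengths are multiples of 128 bits */
    return vl;
#else
    return -1;                            /* headers without SVE support */
#endif
}
```

A runtime could call something like this once at startup and treat the result as fixed, which matches the expectation described above that the VL does not change while the process is running.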

tannergooding commented 5 years ago

I'm not sure I understand in what way this would need to be handled.

Sorry, by "handled" I meant that we need to either explicitly fail or say it's UB (which might result in program corruption, etc).

If we can't ensure that (and verify it), then it seems like these types would be extremely problematic to use.

Right, and I don't think we can actively ensure that it works that way. It would likely be prohibitively expensive to repeatedly check the size/etc.

CarolEidt commented 5 years ago

Right, and I don't think we can actively ensure that it works that way. It would likely be prohibitively expensive to repeatedly check the size/etc.

That might be something that could be done periodically in a checked runtime?

tannergooding commented 5 years ago

That might be something that could be done periodically in a checked runtime?

Right, but I don't think it would ever be available in production.

TamarChristinaArm commented 5 years ago

Ah, wait, I think I'm missing something here.. Is Vector<T> not an actual type

No, System.Numerics.Vector<T> is an actual type and is variable sized. Its size is determined (based on the current hardware) and fixed at process startup. Users can then determine the size using sizeof or Unsafe.SizeOf<Vector<T>>() and the number of elements via Vector<T>.Count. Due to it being a fixed size for the process, it can be used in arrays, with pointer arithmetic, etc.

If the prctl extensions aren't exposed to users then you can make this assumption, though all bets are off with P/Invoke, since technically a native library can do anything behind the JIT's back.

Can user code not query the current size? How is user code meant to know if it is safe to use for an array x elements in length? I would guess this is why the expectation is it won't change for the current process, since that would break any checks user-code had already made.

Users shouldn't be writing any VL specific code at all. Typically the user would want to do an operation but shouldn't care whether it's done as a 256-bit or 512-bit vector. In ACLE you can't store them in arrays; that's one of the limitations. With a JIT I think you can probably allow this since you don't need to know the VL, you just need to know how many bytes to store/read, which you can do using RDVL or ADDVL etc.

The same way you'd have to do if you have to spill vectors. You don't actually need to know the VL (unless I'm missing something about the implementation).
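To make the spilling point concrete, here is a rough model in plain C of addressing vector slots in units of the vector length. It does not use SVE instructions: the vl_bytes parameter stands in for the value RDVL would produce at run time, and the function names spill_slot_addr and spill_vector are invented for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Address of vector spill slot `slot` in a frame area starting at `base`.
   `vl_bytes` stands in for the value RDVL would return at run time. */
unsigned char *spill_slot_addr(unsigned char *base, size_t slot, size_t vl_bytes)
{
    assert(vl_bytes >= 16 && vl_bytes % 16 == 0); /* multiples of 128 bits */
    return base + slot * vl_bytes;                /* offset scales with the VL */
}

/* Spill one vector's worth of bytes. The copy length comes from the
   run-time VL, not from a compile-time constant. */
void spill_vector(unsigned char *base, size_t slot,
                  const unsigned char *reg, size_t vl_bytes)
{
    memcpy(spill_slot_addr(base, slot, vl_bytes), reg, vl_bytes);
}
```

The generated code only ever says "slot 2", and the hardware-reported length turns that into a byte offset, so the same code works unchanged for any VL.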

SVE also allows compiling code for a specific VL, in which case in ACLE you are then allowed to cast between the incomplete and a known complete type.

Could you elaborate on how this works in conjunction with the maximum size and that size being allowed to change? What happens if the user explicitly targets 256-bit vectors and the machine hardware is set to 512-bit mode? What happens if the user targets 256-bit and the machine is set to (or is changed to) 128-bit mode?

It doesn't work. It'll either segfault or silently produce the wrong result. It's only meant to be used to allow the compiler to generate some more efficient sequences, but then you must only run it on that VL and there are no checks performed.
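Since no check is performed for you, code built for a fixed VL would have to guard itself if it wanted to fail cleanly. The following is only a sketch of such a guard: COMPILED_VL_BITS and fixed_vl_code_is_safe are hypothetical names, and runtime_vl_bytes stands in for whatever the hardware reports (for example via RDVL).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical: the vector length, in bits, this code was compiled for. */
#define COMPILED_VL_BITS 256

/* Returns 1 when fixed-VL code may run, 0 otherwise. `runtime_vl_bytes`
   stands in for the length the hardware reports at run time. */
int fixed_vl_code_is_safe(size_t runtime_vl_bytes)
{
    assert(runtime_vl_bytes % 16 == 0);  /* SVE lengths are multiples of 128 bits */
    /* Both a larger and a smaller runtime VL are rejected: either way the
       compiled-in size assumptions no longer hold. */
    return runtime_vl_bytes * 8 == COMPILED_VL_BITS;
}
```

With this guard, code targeting 256 bits would refuse to run on both the 512-bit and the 128-bit configurations asked about above, instead of crashing or producing wrong results.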

The common use case is HPC where you know the exact hardware you're going to be running on and so can target it directly.

tannergooding commented 5 years ago

With a JIT I think you can probably allow this since you don't need to know the VL

I think for most algorithms, users need to know how many elements they are processing. That is, if you have an array of T, you can only process up to array.Length - (array.Length % SVE<T>.Count) elements before you need to fall back to handle trailing elements (as it isn't safe to read the remaining elements into a vector).

So the average user code looks something like this: https://source.dot.net/#System.Private.CoreLib/shared/System/SpanHelpers.Byte.cs,470, where you loop processing Count elements at a time and then handle the trailing elements (when you have less than Count elements remaining) separately.
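The shape of that pattern, a full-width main loop followed by a scalar tail, can be modelled in plain C as below. This is not the linked CoreLib code: add_fixed_width is an invented example, and its count parameter merely plays the role of Vector<T>.Count.

```c
#include <assert.h>
#include <stddef.h>

/* Adds `delta` to every byte in `data`. `count` models Vector<T>.Count,
   the number of lanes handled per full-width step. Returns how many
   elements had to go through the trailing scalar loop. */
size_t add_fixed_width(unsigned char *data, size_t len, size_t count,
                       unsigned char delta)
{
    assert(count > 0);
    size_t i = 0;
    size_t vector_end = len - (len % count); /* elements covered by full vectors */

    /* Main loop: each pass stands for one full-width vector operation. */
    for (; i < vector_end; i += count)
        for (size_t lane = 0; lane < count; lane++)
            data[i + lane] = (unsigned char)(data[i + lane] + delta);

    /* Tail: fewer than `count` elements remain, so a full-width load
       would read past the end of the array. */
    size_t tail = len - i;
    for (; i < len; i++)
        data[i] = (unsigned char)(data[i] + delta);
    return tail;
}
```

For a 40-byte array and count = 32, one full-width step covers 32 elements and the remaining 8 go through the scalar tail.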

TamarChristinaArm commented 5 years ago

With a JIT I think you can probably allow this since you don't need to know the VL

I think for most algorithms, users need to know how many elements they are processing. That is, if you have an array of T, you can only process up to array.Length - (array.Length % SVE<T>.Count) elements before you need to fall back to handle trailing elements (as it isn't safe to read the remaining elements into a vector).

No, that's the wrong way to think about VLA code. In your example here the load will return to you how many elements it has loaded, and those are safe to work on.

So the average user code looks something like this: https://source.dot.net/#System.Private.CoreLib/shared/System/SpanHelpers.Byte.cs,470, where you loop processing Count elements at a time and then handle the trailing elements (when you have less than Count elements remaining) separately.

The way it works in SVE is that you would set up a predicate register that keeps track of how many lanes are active in the SVE vector. In the SVE ACLE these are the svbool_t types.

Your "while" loop sets this up using an SVE while instruction (like whilelt, "while less than"), or any other instruction that can construct or update the predicate, and the hardware takes care of executing the right number of operations.

So with SVE you wouldn't need a scalar loop at all. Say your array is 40 bytes long and your VL is 256 bits. In the first iteration of the loop you will process 32 bytes and the predicate will have all bits set (i.e. all lanes active).

In the second iteration the predicate will only be partially set: enough to process the remaining 8 bytes of data.

So you don't need to know the VL, nor do you need a trailing scalar loop. The only thing users need to know is what they've always known: the termination criteria for their loop.
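The 40-byte example can be traced with a small scalar model of the predicate mechanism. This is plain C rather than real SVE code: whilelt_model only imitates what a WHILELT-style instruction computes for byte lanes, and both function names, as well as the vl_bytes parameter, are assumptions made for this sketch.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LANES 256 /* 2048-bit architectural maximum VL, in byte lanes */

/* Models WHILELT for byte lanes: lane k is active when i + k < n.
   Fills `pred` and returns the number of active lanes. */
size_t whilelt_model(size_t i, size_t n, size_t lanes, unsigned char *pred)
{
    size_t active = 0;
    for (size_t k = 0; k < lanes; k++) {
        pred[k] = (unsigned char)(i + k < n);
        active += pred[k];
    }
    return active;
}

/* Adds `delta` to each byte using predicate-governed steps only.
   There is no scalar tail. Returns the number of loop iterations. */
size_t add_predicated(unsigned char *data, size_t n, size_t vl_bytes,
                      unsigned char delta)
{
    assert(vl_bytes > 0 && vl_bytes <= MAX_LANES);
    unsigned char pred[MAX_LANES];
    size_t iterations = 0;
    /* The loop continues while the predicate has at least one active lane. */
    for (size_t i = 0; whilelt_model(i, n, vl_bytes, pred) > 0; i += vl_bytes) {
        for (size_t k = 0; k < vl_bytes; k++)
            if (pred[k]) /* inactive lanes neither load nor store */
                data[i + k] = (unsigned char)(data[i + k] + delta);
        iterations++;
    }
    return iterations;
}
```

With n = 40 and vl_bytes = 32, the first iteration has 32 active lanes and the second has 8, after which the predicate is empty and the loop ends. Nothing in the loop body depends on the value of vl_bytes beyond the step size, which on real hardware would come from an instruction rather than a parameter.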

This white-paper gives a short introduction of what kind of codegen SVE should do compared to e.g. NEON https://developer.arm.com/-/media/Arm%20Developer%20Community/Images/White%20Paper%20and%20Webinar%20Images/HPC%20White%20Papers/a-sneak-peek-into-sve-and-vla-programming.pdf?revision=5abd0d7b-e853-4e96-931b-4d18b2273813