tannergooding opened 5 years ago
@TamarChristinaArm:
It would seem that if one were doing interop with vector types, one would be better off with the fixed-size HW intrinsic types, and if one wants to interop with Vector<T> one should convert to/from fixed-size HW intrinsic types.

Sorry, I haven't been following the discussion much, but wouldn't this cause a problem in the future with something like SVE, where the user really doesn't know the size of the Vector? It's probably not relevant for now, but just wondering.
@tannergooding:
Please correct me if I'm wrong, but my understanding of SVE is that there are a number of restrictions on them, including treating them as incomplete types in higher-level languages (and not allowing sizeof or use in arrays, etc.). Additionally, the size can be changed at execution time by modifying ZCR_ELx.LEN.
However, the receiver/return value is always going to be the "sizeless" type and the instructions themselves are always the same, regardless of the "actual size"; is that correct?

I believe that differs from x86 (and therefore Vector<T>, it being cross-platform) in that on x86 you have different instructions, different types, and even different registers (even if some are subsets of others) for Vector128<T> vs Vector256<T>.

If/when we expose SVE, I would imagine it would need to be an ARM-specific type (under S.R.I.Arm) rather than a cross-platform type (under S.R.Intrinsics, like Vector64<T>, Vector128<T>, and Vector256<T>), and there would need to be very specialized handling to ensure it always goes through the appropriate instructions for access, etc.
@TamarChristinaArm:
Please correct me if I'm wrong, but my understanding of SVE is that there are a number of restrictions on them including treating them as incomplete types in higher level languages (and not allowing sizeof or use in arrays, etc).
That's correct, the SVE ACLE types we have are incomplete, sizeless and definite.
Additionally, the size can be changed at execution time by modifying ZCR_ELx.LEN
Not by anything user-mode. While technically the kernel is allowed to change it, the expectation is that it won't do this after the process has started. You can have processes with different VLs on the same system though.
However, the receiver/return value is always going to be the "sizeless" type and the instructions themselves are always the same, regardless of the "actual size"; is that correct?
Yeah, "sizeless" of a specific type. So the way I saw it was that Vector<T> would represent all the SVE sizeless types, e.g. svint8_t would be Vector<Int8>. As in, you don't know how large the vector is, but you do know its element size.

SVE also allows compiling code for a specific VL, in which case in ACLE you are then allowed to cast between the incomplete and a known complete type.
I believe that differs from x86 (and therefore Vector<T>, it being cross-platform) in that on x86 you have different instructions, different types, and even different registers (even if some are subsets of others) for Vector128<T> vs Vector256<T>.

Ah, wait, I think I'm missing something here.. Is Vector<T> not an actual type but a name for the "grouping" of all Vector*<T> types? But yes, for SVE you wouldn't have different functions or registers for different VLs.

If/when we expose SVE, I would imagine it would need to be an ARM specific type (under S.R.I.Arm) rather than a cross-platform type (under S.R.Intrinsics, like Vector64<T>, Vector128<T>, and Vector256<T>) and there would need to be very specialized handling to ensure it always goes through the appropriate instructions for access, etc.

Hmm, perhaps... that said, there are other ISAs besides SVE which have the same notion of VL-agnostic types. So I wouldn't necessarily say this type would need to be SVE specific.
Ah, wait, I think I'm missing something here.. Is Vector<T> not an actual type

No, System.Numerics.Vector<T> is an actual type and is variable sized. Its size is determined (based on the current hardware) and fixed at process startup. Users can then determine the size using sizeof or Unsafe.SizeOf<Vector<T>>() and the number of elements via Vector<T>.Count. Due to it being a fixed size for the process, it can be used in arrays, with pointer arithmetic, etc.
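[Editor's note: the startup-fixed sizing described above can be modeled in plain C. This is a sketch, not the actual .NET runtime; the byte width below is an assumed stand-in for whatever the hardware reports at process start.]

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the width the runtime detects once at process startup
   (e.g. 16 bytes for 128-bit SIMD, 32 bytes for 256-bit).
   Assumed value, for illustration only. */
enum { VECTOR_BYTE_WIDTH = 32 };

/* The analogue of Vector<T>.Count: total bytes / element size. */
static size_t vector_count(size_t element_size) {
    return VECTOR_BYTE_WIDTH / element_size;
}

/* Because the size is fixed for the whole process, arrays of vectors and
   pointer arithmetic over them are well-defined: stepping by
   vector_count(sizeof(int32_t)) elements advances exactly one vector. */
static const int32_t *next_vector(const int32_t *p) {
    return p + vector_count(sizeof(int32_t));
}
```

Under these assumptions, `vector_count(sizeof(int32_t))` plays the role of `Vector<int>.Count`: queried once, then treated as a constant for the life of the process.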
While technically the kernel is allowed to change it, the expectation is that it won't do this after the process has started.
This seems to be the troublesome bit. If the size can change, it's probably something that needs to be handled. It doesn't sound like something we could handle or efficiently check, and so restricting the type from use in other operations would seem like the sensible thing to do.
As in, you don't know how large the vector is but you do know it's element size.
Can user code not query the current size? How is user code meant to know if it is safe to use for an array x elements in length? I would guess this is why the expectation is it won't change for the current process, since that would break any checks user code had already made.
SVE also allows compiling code for a specific VL, in which case in ACLE you are then allowed to cast between the incomplete and a known complete type.
Could you elaborate on how this works in conjunction with the maximum size and that size being allowed to change? What happens if the user explicitly targets 256-bit vectors and the machine hardware is set to 512-bit mode? What happens if the user targets 256-bit and the machine is set to (or is changed to) 128-bit mode?
So I wouldn't necessarily say this type would need to be SVE specific.
Right, I was initially thinking ARM specific (not SVE specific) rather than being a "shared" type like Vector128<T> (which works on x86 as well). The troublesome part is architecture-specific behaviors.
While technically the kernel is allowed to change it, the expectation is that it won't do this after the process has started.
This seems to be the troublesome bit. If the size can change, its probably something that needs to be handled. It doesn't sound like something we could handle or efficiently check and so restricting the type from use in other operations would seem like the sensible thing to do.
I'm not sure I understand in what way this would need to be handled. I think we should assert that it will not be changed within a process. If we can't ensure that (and verify it), then it seems like these types would be extremely problematic to use.
Changing the VL won't be something any sane implementation would do without the request having come from something in the application itself. You have the prctl extensions, which would allow a program to request a change in VL, but this comes with a lot of risks if you do it after the program has been running for a while. You could have spilled vector registers, for instance, and now you'd load them up incorrectly. Technically, changing the VL at runtime is undefined behaviour, and at least the Linux kernel won't do so without being told to.
I'm not sure I understand in what way this would need to be handled.
Sorry, by "handled" I meant that we need to either explicitly fail or say it's UB (which might result in program corruption, etc.).
If we can't ensure that (and verify it), then it seems like these types would be extremely problematic to use.
Right, and I don't think we can actively ensure that it works that way. It would likely be prohibitively expensive to repeatedly check the size/etc.
Right, and I don't think we can actively ensure that it works that way. It would likely be prohibitively expensive to repeatedly check the size/etc.
That might be something that could be done periodically in a checked runtime?
That might be something that could be done periodically in a checked runtime?
Right, but I don't think it would ever be available in production.
Ah, wait, I think I'm missing something here.. Is Vector<T> not an actual type

No, System.Numerics.Vector<T> is an actual type and is variable sized. Its size is determined (based on the current hardware) and fixed at process startup. Users can then determine the size using sizeof or Unsafe.SizeOf<Vector<T>>() and the number of elements via Vector<T>.Count. Due to it being a fixed size for the process, it can be used in arrays, with pointer arithmetic, etc.
If the prctl extensions aren't exposed to users then you can make this assumption, though all bets are off with P/Invoke, since technically a native library can do anything behind the JIT's back.
Can user code not query the current size? How is user code meant to know if it is safe to use for an array x elements in length? I would guess this is why the expectation is it won't change for the current process, since that would break any checks user code had already made.
Users shouldn't be writing any VL-specific code at all. Typically the user would want to do an operation but shouldn't care whether it's done as a 256-bit or 512-bit vector. In ACLE you can't store them in arrays; that's one of the limitations. With a JIT I think you can probably allow this, since you don't need to know the VL; you just need to know how many bytes to store/read, which you can do using RDVL or ADDVL etc. The same way you'd have to if you have to spill vectors. You don't actually need to know the VL (unless I'm missing something about the implementation).
SVE also allows compiling code for a specific VL, in which case in ACLE you are then allowed to cast between the incomplete and a known complete type.
Could you elaborate on how this works in conjunction with the maximum size and that size being allowed to change? What happens if the user explicitly targets 256-bit vectors and the machine hardware is set to 512-bit mode? What happens if the user targets 256-bit and the machine is set to (or is changed to) 128-bit mode?
It doesn't work; it'll either segfault or silently produce the wrong result. It's only meant to be used to allow the compiler to generate some more efficient sequences, but then you must only run it on that VL, and there are no checks performed.
The common use case is HPC where you know the exact hardware you're going to be running on and so can target it directly.
With a JIT I think you can probably allow this since you don't need to know the VL
I think for most algorithms, users need to know how many elements they are processing. That is, if you have an array of T, you can only process up to array.Length % SVE<T>.Count elements before you need to fall back to handle trailing elements (as it isn't safe to read the remaining elements into a vector).
So the average user code looks something like this: https://source.dot.net/#System.Private.CoreLib/shared/System/SpanHelpers.Byte.cs,470, where you loop processing Count elements at a time and then handle the trailing elements (when you have less than Count elements remaining) separately.
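[Editor's note: the pattern described here, processing Count elements per iteration and then falling back to a scalar loop for the remainder, can be sketched in plain C. The inner per-lane loop stands in for one vector operation; COUNT is an assumed stand-in for Vector<byte>.Count.]

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed vector element count, standing in for Vector<byte>.Count. */
#define COUNT 8

/* Sum an array COUNT elements at a time, then handle the trailing
   (length % COUNT) elements with a scalar loop. */
uint64_t sum_bytes(const uint8_t *data, size_t length) {
    uint64_t total = 0;
    size_t i = 0;

    /* Main "vectorized" loop: full COUNT-element chunks only. */
    for (; i + COUNT <= length; i += COUNT) {
        for (size_t lane = 0; lane < COUNT; lane++) /* models one vector op */
            total += data[i + lane];
    }

    /* Trailing elements: fewer than COUNT remain, handled scalar. */
    for (; i < length; i++)
        total += data[i];

    return total;
}
```

This is the fixed-width shape of the SpanHelpers loop: it only works because COUNT is known and constant for the process.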
With a JIT I think you can probably allow this since you don't need to know the VL

I think for most algorithms, users need to know how many elements they are processing. That is, if you have an array of T, you can only process up to array.Length % SVE<T>.Count elements before you need to fall back to handle trailing elements (as it isn't safe to read the remaining elements into a vector).
No, that's the wrong way to think about VLA code. In your example here, the load will return to you how many elements it has loaded, and that is safe to work on.
So the average user code looks something like this: https://source.dot.net/#System.Private.CoreLib/shared/System/SpanHelpers.Byte.cs,470, where you loop processing Count elements at a time and then handle the trailing elements (when you have less than Count elements remaining) separately.
The way it works in SVE is that you would set up a predicate register that keeps track of how many lanes are active in the SVE vector. In the SVE ACLE these are the svbool_t types.

Your "while" loop sets this up using an SVE while instruction (like whilelt, "while less than"), or any other instruction that can construct or update the predicate, and the hardware takes care of executing the right number of operations.
So with SVE you wouldn't need a scalar loop at all. Say your array is 40 bytes long and your VL is 256 bits. In the first iteration of the loop you will do 32 bytes and the predicate will have all bits set (i.e. all lanes active). In the second iteration the predicate will only be partially set: enough to process the remaining 8 bytes of data.
So you don't need to know the VL, nor need a trailing scalar loop. The only thing users need to know is what they've always known: their termination criteria for the loop.
This white paper gives a short introduction to the kind of codegen SVE should produce compared to e.g. NEON: https://developer.arm.com/-/media/Arm%20Developer%20Community/Images/White%20Paper%20and%20Webinar%20Images/HPC%20White%20Papers/a-sneak-peek-into-sve-and-vla-programming.pdf?revision=5abd0d7b-e853-4e96-931b-4d18b2273813
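[Editor's note: the predicate-driven loop described above can be sketched in plain C. The active-lane predicate that whilelt would construct is modeled here as a per-iteration active count of min(VL, remaining), so the same loop handles the final partial vector with no scalar tail. This is a sketch of the control flow, not actual SVE intrinsics; VL_BYTES is an assumed vector length.]

```c
#include <stddef.h>
#include <stdint.h>

#define VL_BYTES 32 /* assumed 256-bit vector length, for illustration */

/* Sum bytes SVE-style: one loop, no scalar tail. Each iteration the
   "predicate" (modeled as `active`) enables min(VL_BYTES, remaining)
   lanes, the way a WHILELT instruction would construct it from the
   index and the loop bound. */
uint64_t sum_bytes_vla(const uint8_t *data, size_t length) {
    uint64_t total = 0;
    for (size_t i = 0; i < length; i += VL_BYTES) {
        size_t remaining = length - i;
        size_t active = remaining < VL_BYTES ? remaining : VL_BYTES;
        for (size_t lane = 0; lane < active; lane++) /* active lanes only */
            total += data[i + lane];
    }
    return total;
}
```

With a 40-byte array and VL_BYTES of 32 this reproduces the walkthrough above: the first iteration runs with all 32 lanes active, the second with only 8. Note the loop body never references VL_BYTES except through the predicate construction, which is what makes the code VL-agnostic.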
Forked off from: https://github.com/dotnet/coreclr/pull/23899#issuecomment-551881728
The .NET Runtime may eventually want to support the ARM SVE Extensions. These types have some interesting characteristics that may be worth further discussion.