Open penzn opened 4 years ago
Developers made do quite successfully without it on native hardware, why is it a must for Wasm?
IMHO, I think it is a must for flexible vectors (much less for WASM SIMD in general). If you know the size of your SIMD register
I am surprised to hear it described as a must; I haven't yet seen an application that really required it. We do know the size of the register, right? We have to have some function that tells us the loop increment, which is (by definition) the register size.
We do know the size of the register, right? We have to have some function that tells us the loop increment, which is (by definition) the register size.
There seems to be a misconception here: we do not know, as developers, the SIMD width. We only know how to get it at runtime via a specific instruction (or global).
For instance, one way to emulate scatter (or gather for that matter) is to implement a full in-register transposition. This means that, at some point, you need as many registers as there are lanes in each register to store the data. If you transpose floats in SSE, you need 4 registers. With AVX2, 8 registers, and so on. So you cannot do this trick if you don't know at code time (or compile time) the size of the registers.
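To make the register-count argument concrete, here is a minimal sketch in portable C++. It models a register as a `std::array` rather than using real intrinsics; `Reg`, `transpose`, and `transpose_in_place` are hypothetical names introduced only for illustration. The point it shows is that the lane count `N` has to be a template (compile-time) parameter, because it fixes both the size of each register and the number of registers.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for one SIMD register with N float lanes.
template <std::size_t N>
using Reg = std::array<float, N>;

// Transpose an N x N tile held entirely in "registers".
// The tile occupies N registers of N lanes each, so N must be a
// compile-time constant: 4 for SSE floats, 8 for AVX2 floats, and so on.
template <std::size_t N>
std::array<Reg<N>, N> transpose(const std::array<Reg<N>, N>& rows) {
    std::array<Reg<N>, N> cols{};
    for (std::size_t r = 0; r < N; ++r)
        for (std::size_t c = 0; c < N; ++c)
            cols[c][r] = rows[r][c];
    return cols;
}

// Load N rows from a row-major tile, transpose them, and store them back.
template <std::size_t N>
void transpose_in_place(float* tile) {
    assert(tile != nullptr);
    std::array<Reg<N>, N> rows{};
    for (std::size_t r = 0; r < N; ++r)
        for (std::size_t c = 0; c < N; ++c)
            rows[r][c] = tile[r * N + c];
    const std::array<Reg<N>, N> cols = transpose<N>(rows);
    for (std::size_t r = 0; r < N; ++r)
        for (std::size_t c = 0; c < N; ++c)
            tile[r * N + c] = cols[r][c];
}
```

A call such as `transpose_in_place<4>(tile)` compiles, whereas there is no way to write `transpose_in_place<width>(tile)` when `width` is only available at runtime.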
Even the extract pattern for scatter would be problematic, as the extract index would most likely be an immediate (and certainly is on most architectures). So here, either you completely unroll the loop at compile time and check for each index that it is less than the actual width, or we make the extract take runtime indices and hope the code generator will see that the index is actually a compile-time constant...
Neither solution sounds appealing.
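The first option (full unrolling with a per-lane check against the runtime width) could look roughly like the sketch below. It is plain C++14, not real SIMD: `kMaxLanes` is an assumed upper bound on the lane count, and `Vec`, `Idx`, `extract`, and `scatter` are hypothetical names. The lane number is a template argument to mirror an immediate operand.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>

constexpr std::size_t kMaxLanes = 8;  // assumed upper bound, not from any spec
using Vec = std::array<float, kMaxLanes>;
using Idx = std::array<std::uint32_t, kMaxLanes>;

// The lane is a template argument, mirroring an immediate extract index.
template <std::size_t Lane>
float extract(const Vec& v) {
    return std::get<Lane>(v);
}

// Fully unrolled over every possible lane; each store is guarded by a
// comparison of the compile-time lane number with the runtime lane count.
template <std::size_t... Lanes>
void scatter_impl(float* base, const Idx& idx, const Vec& v,
                  std::size_t lanes, std::index_sequence<Lanes...>) {
    const int unused[] = {
        (Lanes < lanes ? (base[idx[Lanes]] = extract<Lanes>(v), 0) : 0)...};
    (void)unused;
}

// Emulated scatter: base[idx[i]] = v[i] for every i below `lanes`.
void scatter(float* base, const Idx& idx, const Vec& v, std::size_t lanes) {
    assert(base != nullptr && lanes <= kMaxLanes);
    scatter_impl(base, idx, v, lanes, std::make_index_sequence<kMaxLanes>{});
}
```

The cost is visible in the shape of the code: one guarded store is emitted per possible lane, regardless of how many lanes the hardware actually has.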
Yes, to be clear: having the function could allow us to compare the runtime value against a small set of candidates, and use the corresponding code pregenerated for each. Which raises an interesting question: is there some abstraction we can provide that allows developers to know that SVE will always have n*128 bits, and x86 will have {1,2,4}x128? RISC-V V has no such limitation, but if the function returns something the app doesn't expect (e.g. 16K bits) then the app can fall back to some codepath that doesn't do in-register transposition.
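A rough sketch of that dispatch idea, in portable C++: the runtime lane count is compared against a few pregenerated candidates (4, 8, and 16 float lanes, i.e. {1,2,4}x128 bits), and anything unexpected goes to a generic path that does no in-register transposition. All names here (`transpose_fixed`, `transpose_generic`, `transpose_dispatch`, `Path`) are hypothetical, and the `lanes` argument stands in for whatever the width query would return.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Generic fallback: element-by-element, works for any lane count.
void transpose_generic(const float* src, float* dst, std::size_t n) {
    for (std::size_t r = 0; r < n; ++r)
        for (std::size_t c = 0; c < n; ++c)
            dst[c * n + r] = src[r * n + c];
}

// Pregenerated path: N is fixed at compile time, so the tile can be
// modelled as N "registers" of N lanes each.
template <std::size_t N>
void transpose_fixed(const float* src, float* dst) {
    std::array<std::array<float, N>, N> regs{};
    for (std::size_t r = 0; r < N; ++r)
        for (std::size_t c = 0; c < N; ++c)
            regs[r][c] = src[r * N + c];
    for (std::size_t r = 0; r < N; ++r)
        for (std::size_t c = 0; c < N; ++c)
            dst[c * N + r] = regs[r][c];
}

enum class Path { Fixed4, Fixed8, Fixed16, Generic };

// `lanes` is a runtime value; only the listed candidates get the
// specialised code, everything else takes the fallback.
Path transpose_dispatch(const float* src, float* dst, std::size_t lanes) {
    assert(src != nullptr && dst != nullptr && src != dst);
    switch (lanes) {
        case 4:  transpose_fixed<4>(src, dst);  return Path::Fixed4;
        case 8:  transpose_fixed<8>(src, dst);  return Path::Fixed8;
        case 16: transpose_fixed<16>(src, dst); return Path::Fixed16;
        default: transpose_generic(src, dst, lanes); return Path::Generic;
    }
}
```

The set of candidates is a design choice of the application; the sketch only assumes that the width query exists and that unexpected values are rare enough for a slower fallback to be acceptable.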
For Arm and x86 ISAs it would be perfectly legal to say that maximum width is always a multiple of 128 bits, though I am not sure how that would map to RiscV.
According to the RISC-V V spec (https://riscv.github.io/documents/riscv-v-spec/riscv-v-spec.pdf#_implementation_defined_constant_parameters), the maximum width should be a power of 2 larger than or equal to 32 bits. (EDIT: I got confused in a previous version of this message)
Only SVE does not require that maximum width should be a power of 2.
There's also SimpleV, a WIP extension on OpenPower that guarantees availability of any vector length from 1 to 64 (not limited to powers of 2, so e.g. 35 is a valid vector length), and allows (like RISC-V V) the length to be set dynamically. It supports gather-load, scatter-store, and gather register-to-register moves.
Thanks @programmerjake, I was not aware of SimpleV. To me, the interesting point is the guarantee that vector length of 64 is available. So on SimpleV, we can force the vl to a power of 2 if required.
Also, you mention "gather register-to-register moves". While in hardware it makes sense to group it with gather loads, from the software point of view such a connection is not required, and the terminology used is more shuffle/swizzle.
Some additional comments on SimpleV on Libre-SOC's mailing list: http://lists.libre-soc.org/pipermail/libre-soc-dev/2021-April/002318.html
Thanks @programmerjake, I was not aware of SimpleV. To me, the interesting point is the guarantee that vector length of 64 is available.
Yup! We basically picked 64 as the max since the general purpose integer registers are 64-bits wide allowing 1 predicate bit per vector element.
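As an illustration of why 64 elements fit a 64-bit register so neatly, here is a small C++ sketch of a predicate mask with one bit per vector element. It is not SimpleV code; `kMaxVl`, `predicate_for`, and `lane_active` are hypothetical names used only to show the bit layout.

```cpp
#include <cassert>
#include <cstdint>

constexpr unsigned kMaxVl = 64;  // one predicate bit per element of a 64-bit register

// Mask with the low `vl` bits set: bit i set means element i is active.
std::uint64_t predicate_for(unsigned vl) {
    assert(vl <= kMaxVl);
    // Shifting a 64-bit value by 64 is undefined, so handle vl == 64 apart.
    return vl == kMaxVl ? ~std::uint64_t{0}
                        : (std::uint64_t{1} << vl) - 1;
}

// True if the given element is enabled by the predicate.
bool lane_active(std::uint64_t pred, unsigned lane) {
    assert(lane < kMaxVl);
    return ((pred >> lane) & 1u) != 0;
}
```

With this layout, a length such as 35 is expressed as a mask with 35 low bits set, and no wider mask type is ever needed.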
So on SimpleV, we can force the vl to a power of 2 if required.
Yup, though if you're forcing VL to be bigger than necessary just so it's a power of 2, it will probably run slower, since it's implemented using a hardware-level loop over vector elements.
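To put a number on that, the sketch below rounds a requested length up to the next power of 2 and reports how many extra elements the loop would then cover. It assumes, as a simplification of the comment above, that cost grows with the number of elements; `kMaxVl`, `round_up_pow2`, and `wasted_elements` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kMaxVl = 64;  // maximum vector length discussed above

// Smallest power of 2 that is greater than or equal to n.
std::size_t round_up_pow2(std::size_t n) {
    assert(n >= 1 && n <= kMaxVl);
    std::size_t p = 1;
    while (p < n) p <<= 1;
    return p;
}

// Elements the element-wise loop would process without being needed.
std::size_t wasted_elements(std::size_t n) {
    return round_up_pow2(n) - n;
}
```

For a requested length of 35, the rounded length is 64, so 29 of the 64 elements would be processed for nothing.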
Also, you mention "gather register-to-register moves". While in hardware it makes sense to group it with gather loads, from the software point of view such a connection is not required, and the terminology used is more shuffle/swizzle.
Yup!
As @Maratyszcza and @lemaitre point out in #7, we should consider scatter and gather operations. This is an issue to track that.
Potential topics to discuss: