bill-myers opened this issue 6 years ago
Hi,
Thanks for filing this issue! I'm always looking for ways to improve faster's user-facing API. Here are some quick thoughts on some of the recommendations:
> It abuses the `Iterator` interface, since the next() method only iterates over the elements in the first part, whose length is a multiple of the SIMD width, and then you are supposed to call end() for the final part. This is a wrong use of Iterator, since next() is supposed to return all items.
I'd argue that this is a good thing, as it forces users of the "manual" API to unroll their partial-vector behaviour. I'm concerned that something like `Partial<T>` as described would encourage a lot of needless branching with little benefit to the end user.
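For context, the shape being discussed here is the classic main-loop-plus-tail split. A minimal sketch in plain std Rust (not faster's API), using `slice::chunks_exact`:

```rust
// Plain-std sketch of the "full vectors, then explicit tail" shape that
// the manual next()/end() split enforces. Not faster's API.
fn sum(data: &[f32]) -> f32 {
    const LANES: usize = 8; // e.g. one AVX f32 vector
    let mut chunks = data.chunks_exact(LANES);
    let mut acc = [0.0f32; LANES]; // stands in for a SIMD accumulator
    for chunk in &mut chunks {
        for i in 0..LANES {
            acc[i] += chunk[i]; // full-width, branch-free hot loop
        }
    }
    // The tail is handled separately and explicitly, as with end():
    acc.iter().sum::<f32>() + chunks.remainder().iter().sum::<f32>()
}
```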
> Zip is very limited since it only supports vectors of the same length, and also cannot zip e.g. u8x16 and u32x4 in a way that results in 4 u8s and 4 u32s at once
While I can imagine this would be a reasonable pattern in a more sophisticated vector ISA, I can't think of why one would want to do this with, say, AVX. Do you have an algorithm in mind which could make use of this, or a motivation for using this instead of upcasting the u8s before zipping them?
> The SIMDZipped traits are redundant, and can be removed by just implementing Packed and Packable for tuples and using the normal SIMD traits
I remember this definitely not making sense when I added the zipping API, but I think our internal abstractions have improved since then. We can probably do this.
> The SIMDIterator interface has a bunch of methods like `vector_pos()`, `vector_len()`, `scalar_len()`, etc. that are inappropriate for a SIMD version of Iterator, since position/length is not a concept valid for general iterators (only for slice iterators)
SIMD iterators are usually slice iterators, are they not? I see no harm in exposing this, because there is no situation in which we don't know it.
> Implement slice iterators so that they return a partial vector to align the iterator, then full vectors until the final partial vector, so that aligned loads are used
This is a very good idea, but I wonder if it would cause x86 perf to regress significantly for smaller collections, as their misaligned load penalty is usually minuscule.
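To make the idea concrete, here is a rough sketch of that prologue/body/epilogue layout using std's `slice::align_to`; the `Block` wrapper is an assumption for illustration, not part of faster:

```rust
// Sketch of the proposed aligned layout via align_to.
#[repr(C, align(32))]
struct Block([f32; 8]); // 32-byte aligned, sized like one AVX register

fn sum_aligned(data: &[f32]) -> f32 {
    // Safety: Block is repr(C) over [f32; 8] with no padding, so viewing
    // a 32-byte-aligned run of f32s as Blocks is sound.
    let (head, middle, tail) = unsafe { data.align_to::<Block>() };
    let mut acc = [0.0f32; 8];
    for Block(b) in middle {
        for i in 0..8 {
            acc[i] += b[i]; // guaranteed-aligned, full-width body
        }
    }
    // Misaligned head and partial tail are handled as scalars here; the
    // proposal above would use partial vectors instead.
    head.iter().chain(tail).sum::<f32>() + acc.iter().sum::<f32>()
}
```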
> Remove the SIMDArray and UnsafeIterator traits
The SIMDArray abstraction is used to good effect in our strided iterators, so I'm hesitant to remove it. I'm in the process of removing UnsafeIterator, but I have to nail down a few perf regressions on SSE systems before pushing it upstream.
> Use specialization
I'd like to wait until specialization is fully stable before I start using it in core parts of the library.
> I'd argue that this is a good thing, as it forces users of the "manual" API to unroll their partial-vector behaviour. I'm concerned that something like `Partial<T>` as described would encourage a lot of needless branching with little benefit to the end user.
Well, for map() operations there is no need to branch: you just apply the operation to the partial vector and pass the element count through.
For reduce operations (e.g. summing all elements) and for collection there could indeed be some inefficiency, depending on how `Partial<T>` is represented.
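To illustrate the branch-free map case, here is a minimal sketch assuming a hypothetical `Partial<V>` represented as a full-width vector plus a valid-lane count:

```rust
// Hypothetical Partial: a full-width vector plus the number of lanes
// that actually hold data.
#[derive(Copy, Clone)]
struct Partial<V> {
    vec: V,
    len: usize, // valid lanes, <= the lane count of V
}

impl<V> Partial<V> {
    // No branch: the operation runs on all lanes (garbage lanes included)
    // and the element count is passed through unchanged.
    fn map<W>(self, f: impl FnOnce(V) -> W) -> Partial<W> {
        Partial { vec: f(self.vec), len: self.len }
    }
}
```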
However, there is a more general problem: needing a branch on every element to test whether we are at the end of the iterator, which happens both in the current design and with `Partial`. The solution is to have, potentially in addition to `Partial`-returning iterators, iterators that return a "chunk" type, sketched below, consisting of 2^N vectors and an item count (this could either be a `Partial<[VecType; 16]>`, extending `Partial` and `Packed` to support arrays, or an ad-hoc type). The chunk type would then have map() and reduce() functions that special-case the "full" case and call the provided closure in an unrolled loop, using a normal loop otherwise (in the reduce case an extra closure to deal with the final partial chunk would probably be needed).
Obviously, the slice iterator adapter would also perform a single check for whether chunk_size elements are available and then load them all in an unrolled loop. This design could also allow aligning chunks to the cache line size instead of just the vector size, although I think that's probably useless, and in fact harmful, since it might cause zip() to have extra overhead for realigning.
So you would do something like `iter.simd_chunk_iterator().map(|x| x.map(|y| my_func(y)))` or `iter.simd_chunk_iterator().reduce(|a, x| x.reduce(a, |y| my_func_on_vec(a, y), |y| my_func_on_partial(a, y)))`, and possibly have simd_map() and simd_reduce() helper methods that reduce this to a single call, avoiding the inner map/reduce calls.
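Here is a rough sketch of such a chunk type; the names and representation are illustrative, not a settled design:

```rust
// 2^N full vectors plus a count of valid scalar elements.
struct Chunk<V, const N: usize, const LANES: usize> {
    vecs: [V; N],
    len: usize, // valid scalars; equals N * LANES when the chunk is full
}

impl<V, const N: usize, const LANES: usize> Chunk<V, N, LANES> {
    fn map<W>(self, f: impl FnMut(V) -> W) -> Chunk<W, N, LANES> {
        Chunk { vecs: self.vecs.map(f), len: self.len }
    }

    fn reduce<A>(
        self,
        mut acc: A,
        full: impl Fn(A, V) -> A,
        partial: impl Fn(A, V, usize) -> A, // extra closure for the tail
    ) -> A {
        if self.len == N * LANES {
            // Hot path: every vector is full; this fixed-length loop can
            // be fully unrolled by the compiler.
            for v in self.vecs {
                acc = full(acc, v);
            }
        } else {
            // Cold path: walk vectors until the valid elements run out.
            let mut remaining = self.len;
            for v in self.vecs {
                if remaining >= LANES {
                    acc = full(acc, v);
                    remaining -= LANES;
                } else if remaining > 0 {
                    acc = partial(acc, v, remaining);
                    remaining = 0;
                }
            }
        }
        acc
    }
}
```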
> > Zip is very limited since it only supports vectors of the same length, and also cannot zip e.g. u8x16 and u32x4 in a way that results in 4 u8s and 4 u32s at once
>
> While I can imagine this would be a reasonable pattern in a more sophisticated vector ISA, I can't think of why one would want to do this with, say, AVX. Do you have an algorithm in mind which could make use of this, or a motivation for using this instead of upcasting the u8s before zipping them?
Well, in faster you don't know how many elements are in a `u8s` or `f32s` type, and there's no fixed relationship between them, because e.g. `u8s` changes between AVX1-only and AVX2 targets while `f32s` doesn't (or at least it would, if faster had proper support for AVX1-only CPUs). So I don't think there is a way to "upcast" them without specifically providing a helper to do so, and a way to determine the type of the result. What is probably needed is a `WidthType` associated type in `Packable` that is a lightweight "type-level integer"; one could then write something like `VectorTypeWithWidth<u8, MinWidth<u8s::WidthType, f32s::WidthType>::WidthType>` to get a vector of u8 with the minimum of the widths of `u8s` and `f32s`.
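A sketch of the `WidthType` idea using the typenum crate; every name below is a hypothetical stand-in, not faster's actual API:

```rust
use typenum::{Min, Unsigned, U4, U16};

// Hypothetical: every packable scalar exposes its native vector width
// as a lightweight type-level integer.
trait Packable {
    type WidthType: Unsigned;
}

struct U8s;  // stand-ins for faster's width-polymorphic u8s / f32s
struct F32s;

impl Packable for U8s  { type WidthType = U16; } // e.g. u8x16 on AVX2
impl Packable for F32s { type WidthType = U4;  } // e.g. f32x4

// The common width for zipping two packed types.
type MinWidth<A, B> =
    <<A as Packable>::WidthType as Min<<B as Packable>::WidthType>>::Output;

fn main() {
    // Zipping u8s with f32s would then proceed 4 elements at a time.
    assert_eq!(<MinWidth<U8s, F32s> as Unsigned>::USIZE, 4);
}
```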
Regarding zip(), it's necessary to zip the same number of elements together so that the SIMD width is completely abstracted. So if you are zipping a u8x16 and an f32x4, you somehow need to convert the u8x16 iterator into one returning either a bunch of u8x4s or a bunch of partial u8x16s; hence it makes sense for zip() to apply such an adapter itself, so everything "just works" regardless of actual vector sizes.
This might also mean adding support for "reduced vectors", e.g. a type that contains an f32x4 but for which only 2 elements are valid (this would differ from `Partial` in that the number of valid elements is a compile-time constant). Using f32x2 could be an option, but would probably lead to suboptimal code if it's not natively supported.
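A small sketch of such a reduced vector, using a const generic parameter for the valid-lane count (hypothetical names):

```rust
// A full hardware vector in which only the first LANES lanes are
// meaningful; unlike Partial, the count is a compile-time constant.
#[derive(Copy, Clone)]
struct Reduced<V, const LANES: usize> {
    vec: V, // e.g. an f32x4 with LANES = 2 valid elements
}

impl<V, const LANES: usize> Reduced<V, LANES> {
    // Full-width ops remain cheap; validity lives only in the type, so
    // there is no runtime bookkeeping at all.
    fn map<W>(self, f: impl FnOnce(V) -> W) -> Reduced<W, LANES> {
        Reduced { vec: f(self.vec) }
    }
}
```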
> > The SIMDIterator interface has a bunch of methods like `vector_pos()`, `vector_len()`, `scalar_len()`, etc. that are inappropriate for a SIMD version of Iterator, since position/length is not a concept valid for general iterators (only for slice iterators)
>
> SIMD iterators are usually slice iterators, are they not? I see no harm in exposing this, because there is no situation in which we don't know it.
Well, you can convert any iterator returning a scalar type into a SIMD iterator (by reading scalars and packing them into a vector), so SIMD iterators are just as generic as normal iterators, and there doesn't seem to be any reason to privilege slice iterators.
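A sketch of that adapter, buffering scalars from an arbitrary `Iterator` into an array that stands in for a vector (hypothetical, not faster's API):

```rust
// Pulls up to LANES scalars from any iterator; returns the buffer and
// the number of valid elements (a "partial" when the source runs dry).
fn next_vector<I, const LANES: usize>(iter: &mut I) -> Option<([I::Item; LANES], usize)>
where
    I: Iterator,
    I::Item: Copy + Default,
{
    let mut buf = [<I::Item as Default>::default(); LANES];
    let mut n = 0;
    while n < LANES {
        match iter.next() {
            Some(x) => {
                buf[n] = x;
                n += 1;
            }
            None => break,
        }
    }
    if n == 0 { None } else { Some((buf, n)) }
}

fn main() {
    // Works for any scalar source, not just slices:
    let mut it = (0u32..10).map(|x| x * x);
    while let Some((vec, len)) = next_vector::<_, 4>(&mut it) {
        println!("{:?} ({} valid)", vec, len);
    }
}
```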
Obviously an extra SIMD "random access"/"index" iterator trait can, and probably should, be provided, although I'm not sure it's good to make adapters like map/zip/etc. preserve it as well: normal iterators don't do that, and it might not be very useful anyway, since it's probably bad to recompute values when random accesses reference the same index multiple times, so collecting is probably better.
> I'd like to wait until specialization is fully stable before I start using it in core parts of the library.
I think a design without specialization should be possible, although it would probably require the user to explicitly call a function to e.g. re-adapt a SIMD iterator that Iterator::map has been called on, so that it supports the SIMD iterator interface again, rather than just having a blanket impl of SIMDIterator for every Iterator.
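A sketch of that no-specialization route, an explicit bridge in the spirit of rayon's `par_bridge()`; all names are hypothetical:

```rust
// Wrapper that opts a plain scalar iterator back into the SIMD world by
// (conceptually) buffering scalars into vectors, as sketched above.
struct SimdBridge<I>(I);

trait SimdBridgeExt: Iterator + Sized {
    // Explicit re-adaptation, called by the user after plain Iterator
    // combinators, instead of a blanket `impl SIMDIterator for I` that
    // would need specialization to coexist with concrete SIMD impls.
    fn simd_bridge(self) -> SimdBridge<Self> {
        SimdBridge(self)
    }
}
impl<I: Iterator> SimdBridgeExt for I {}

fn main() {
    // Iterator::map erases the SIMD interface; the bridge restores it.
    let bridged = (0u32..100).map(|x| x * 2).simd_bridge();
    let _ = bridged;
}
```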