AdamNiederer / faster

SIMD for humans
Mozilla Public License 2.0

Redesign with a better design #31

Open bill-myers opened 6 years ago

bill-myers commented 6 years ago

The current design of "faster" has several big flaws that should be addressed, including:

  1. It makes no effort to use aligned loads and no effort to support an aligned Vec, which is detrimental to performance on several architectures
  2. It abuses the Iterator interface: next() only iterates over the leading portion of the data whose length is a multiple of the SIMD width, and you are then supposed to call end() for the final part. This is a misuse of Iterator, since next() is supposed to return all items
  3. Zip is very limited, since it only supports vectors of the same length, and also cannot zip e.g. u8x16 and u32x4 in a way that yields 4 u8s and 4 u32s at once
  4. A "default" element is required just to create SIMD iterators, which is probably not the best design
  5. The SIMDIterator interface has a bunch of methods like "vector_pos()", "vector_len()", "scalar_len()", etc. that are inappropriate for a SIMD version of Iterator, since position/length is not a valid concept for general iterators (only for slice iterators)
  6. It has no support for multiple vector sizes
  7. The SIMDZipped traits are redundant, and can be removed by implementing Packed and Packable for tuples and using the normal SIMD traits

Here is a better design:

  1. Introduce a Partial<T> type that represents a partially filled SIMD vector, containing a vector and the number of elements that are valid (and that supports iteration, map(), and and_then())
  2. Implement Packed and Packable for tuples and remove the SIMDZipped* traits in favor of just using the normal traits with tuple types
  3. Implement Iterator with Item = Partial<T> for SIMD iterators, so that the next() interface returns all items properly
  4. Implement slice iterators so they first return a partial vector to align the iterator, then return full vectors until the final partial vector, so that aligned loads are used
  5. Replace the current SIMDIterator with a new SIMDIterator that provides a next_n(n: usize) -> Partial<T> method, returning a partial SIMD vector filled with exactly n elements (or fewer if the end of the iterator is reached), and a size_hint() in terms of scalars
  6. Remove the SIMDArray and UnsafeIterator traits
  7. Add helpers that can reduce the SIMD size of an iterator, and that can realign the iterator
  8. Add a SIMDExactSizeIterator that specifies that an exact size in terms of scalars is valid for the iterator
  9. Use specialization to implement SIMDIterator and SIMDExactSizeIterator for any Iterator<Item = Partial<T>>, while still allowing the implementation to be overridden for things like slice iterators
  10. Provide a SIMDVec that is like Vec but guarantees sufficient alignment for allocations (alternatively, change Vec to always align to SIMD alignment, although this might be undesirable)
  11. Support all vector sizes, including smaller ones when a bigger size is supported
  12. Add a version of simd_iter() that allows specifying the vector size, and add a macro that instantiates a code block for each SIMD instruction set the platform supports, using runtime feature detection to dispatch (an alternative to the preferred approach of compiling the binary once per SIMD instruction set, which is better overall but which some people may dislike due to code size issues)
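To make point 1 concrete, here is a rough sketch of what Partial<T> could look like. All names are illustrative, and a fixed-size array stands in for a real SIMD register:

```rust
// Hypothetical Partial<T>: a vector plus a count of valid leading lanes.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Partial<V> {
    vec: V,
    len: usize, // number of leading lanes that hold valid data
}

impl<T: Copy, const N: usize> Partial<[T; N]> {
    fn new(vec: [T; N], len: usize) -> Self {
        assert!(len <= N);
        Partial { vec, len }
    }

    // map() applies the vectorized closure to the whole register; the
    // valid-lane count passes through unchanged, so no branch is needed.
    fn map<U: Copy, F: Fn([T; N]) -> [U; N]>(self, f: F) -> Partial<[U; N]> {
        Partial { vec: f(self.vec), len: self.len }
    }

    // Iterate only over the valid lanes (useful when draining the tail).
    fn iter(&self) -> impl Iterator<Item = T> + '_ {
        self.vec.iter().copied().take(self.len)
    }
}

fn main() {
    // A slice of 7 scalars would split into one full vector plus this
    // partial one; the last lane is padding.
    let tail = Partial::new([5, 6, 7, 0], 3);
    let doubled = tail.map(|v| v.map(|x| x * 2));
    let valid: Vec<i32> = doubled.iter().collect();
    println!("{:?}", valid); // prints "[10, 12, 14]"
}
```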

With this design:

  1. map(), etc. from Iterator can be used directly on SIMD iterators.
  2. zip() can be implemented by first automatically reducing the vector size to the smallest width, then calling next() on the first iterator (or next_n(), when implementing zip's own next_n()), and then calling next_n() on the other iterators with the number of elements the first one returned
  3. A more sophisticated version of zip() could find the most frequent alignment among all iterators, align all the others with next_n(), and then continue zipping with next_n(WIDTH) on all iterators. It's not clear whether this is much more useful than the simpler version that just follows the first iterator, though.
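The simpler zip() scheme from point 2 could look roughly like this (names are hypothetical, and a Vec of scalars stands in for a partial SIMD vector):

```rust
// Hypothetical SIMD iterator trait exposing next_n(), as proposed above.
trait SimdIterator {
    type Scalar: Copy;
    // Return up to n scalars, or None when the iterator is exhausted.
    fn next_n(&mut self, n: usize) -> Option<Vec<Self::Scalar>>;
}

struct SliceIter<'a, T> {
    rest: &'a [T],
}

impl<'a, T: Copy> SimdIterator for SliceIter<'a, T> {
    type Scalar = T;
    fn next_n(&mut self, n: usize) -> Option<Vec<T>> {
        if self.rest.is_empty() {
            return None;
        }
        let take = n.min(self.rest.len());
        let (head, tail) = self.rest.split_at(take);
        self.rest = tail;
        Some(head.to_vec())
    }
}

// One zip step: drive with the first iterator, then ask the second for
// exactly as many elements as the first one produced.
fn zip_step<A: SimdIterator, B: SimdIterator>(
    a: &mut A,
    b: &mut B,
    width: usize,
) -> Option<(Vec<A::Scalar>, Vec<B::Scalar>)> {
    let pa = a.next_n(width)?;
    let pb = b.next_n(pa.len())?;
    Some((pa, pb))
}

fn main() {
    let xs = [1u8, 2, 3, 4, 5];
    let ys = [10u32, 20, 30, 40, 50];
    let mut a = SliceIter { rest: &xs };
    let mut b = SliceIter { rest: &ys };
    // Emits a full step of 4, then a partial step of 1.
    while let Some((pa, pb)) = zip_step(&mut a, &mut b, 4) {
        println!("{:?} {:?}", pa, pb);
    }
}
```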
AdamNiederer commented 6 years ago

Hi,

Thanks for filing this issue! I'm always looking for ways to improve faster's user-facing API. Here are some quick thoughts on some of the recommendations:

It abuses the Iterator interface, since the next() method only iterates the element in the first part that is a multiple of the SIMD width, and then you are supposed to call end() for the final part. This is a wrong use of Iterator since next() is supposed to return all items

I'd argue that this is a good thing, as it forces users of the "manual" API to unroll their partial-vector behaviour. I'm concerned that something like Partial<T> as described would encourage a lot of needless branching with little benefit to the end user.

Zip is very limited since it only supports vectors of the same length, and also cannot zip e.g. u8x16 and u32x4 in a way that results in 4 u8s and 4 u32s at once

While I can imagine this would be a reasonable pattern in a more sophisticated vector ISA, I can't think of why one would want to do this with, say, AVX. Do you have an algorithm in mind which could make use of this, or a motivation for using this instead of upcasting the u8s before zipping them?

The SIMDZipped traits are redundant, and can be removed by just implementing Packed and Packable for tuples and using the normal SIMD traits

I remember this definitely not making sense when I added the zipping API, but I think our internal abstractions have improved since then. We can probably do this.

The SIMDIterator interface has a bunch of methods like "vector_pos()", "vector_len()", "scalar_len()", etc. that are inappropriate for a SIMD version of Iterator, since the position/length is not a concept valid for general iterators (only for slice iterators)

SIMD iterators are usually slice iterators, are they not? I see no harm in exposing this, because there is no situation in which we don't know it.

Implement slice iterators so they return a partial vector to align the iterator, then return full vectors until the final partial vector, and so that aligned loads are used

This is a very good idea, but I wonder if it would cause x86 perf to regress significantly for smaller collections, as their misaligned load penalty is usually minuscule.

Remove the SIMDArray and UnsafeIterator traits

The SIMDArray abstraction is used to good effect in our strided iterators, so I'm hesitant to remove it. I'm in the process of removing UnsafeIterator, but I have to nail down a few perf regressions on SSE systems before pushing it upstream.

Use specialization

I'd like to wait until specialization is fully stable before I start using it in core parts of the library.

bill-myers commented 6 years ago

I'd argue that this is a good thing, as it forces users of the "manual" API to unroll their partial-vector behaviour. I'm concerned that something like Partial as described would encourage a lot of needless branching with little benefit to the end user.

Well, for map() operations there is no need to branch: you just do the operation on the partial vector and pass the number of valid elements through.

For reduce operations (e.g. summing all elements) and for collection there could indeed be an inefficiency, although if Partial is represented as a Full/Partial enum, the iterator will return Option<Partial>, and Rust's data-layout optimizations should coalesce the Option and Partial discriminant bytes, allowing a test for Some(Full(_)) with a single comparison (though LLVM needs to somehow be told that this is the most likely case).
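The Full/Partial split might look like this (a sketch with assumed names; an array stands in for a SIMD register, and the niche-coalescing claim is left to the compiler):

```rust
// Hypothetical enum: a full register, or a register with a valid-lane count.
enum Chunk<V> {
    Full(V),
    Partial(V, usize),
}

fn sum(mut it: impl Iterator<Item = Chunk<[i32; 4]>>) -> i32 {
    let mut acc = 0;
    while let Some(chunk) = it.next() {
        match chunk {
            // The hot path: a full register, no lane masking required.
            Chunk::Full(v) => acc += v.iter().sum::<i32>(),
            // The cold path, hit at most once at the end of the data.
            Chunk::Partial(v, n) => acc += v[..n].iter().sum::<i32>(),
        }
    }
    acc
}

fn main() {
    let chunks = vec![
        Chunk::Full([1, 2, 3, 4]),
        Chunk::Partial([5, 6, 0, 0], 2), // only the first 2 lanes are valid
    ];
    println!("{}", sum(chunks.into_iter())); // prints "21"
}
```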

However there is a more general problem: needing a branch on every element to test whether we are at the end of the iterator, which happens both in the current design and with Partial. The solution is to have, potentially in addition to Partial-returning iterators, iterators that return a "chunk" type consisting of 2^N vectors and an item count (either a Partial<[VecType; 16]>, extending Partial and Packed to support arrays, or an ad-hoc type). Chunks would have map() and reduce functions that special-case the "full" case, calling the provided closure in an unrolled loop, and fall back to a normal loop otherwise (in the reduce case an extra closure would probably be needed to handle the final partial chunk).

Obviously the slice iterator adapter would also do a single check that chunk_size elements are available, and then load them all in an unrolled loop. This design could also allow cacheline-aligning the chunks instead of just vector-aligning them, although I think that's probably useless and in fact harmful, since it might cause zip() to incur extra realignment overhead.

So you would write something like `iter.simd_chunk_iterator().map(|x| x.map(|y| my_func(y)))` or `iter.simd_chunk_iterator().reduce(|a, x| x.reduce(a, |y| my_func_on_vec(a, y), |y| my_func_on_partial(a, y)))`, and possibly provide simd_map() and simd_reduce() helper methods that reduce this to a single call, avoiding the inner map/reduce calls.
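A minimal sketch of the chunked reduce, under the same assumptions as before (illustrative names, arrays standing in for registers, a 4-vector chunk instead of 16):

```rust
// Hypothetical chunk: 4 "registers" of 4 lanes, plus a valid-scalar count.
struct Chunk {
    vecs: [[i32; 4]; 4],
    len: usize, // valid scalars in this chunk, <= 16
}

impl Chunk {
    // Two closures, as described above: one for the fast full-chunk path,
    // one to handle the trailing partial chunk.
    fn reduce<F, P>(&self, mut acc: i32, on_vec: F, on_partial: P) -> i32
    where
        F: Fn(i32, [i32; 4]) -> i32,
        P: Fn(i32, &[i32]) -> i32,
    {
        if self.len == 16 {
            // Full chunk: a fixed-trip loop the compiler can unroll.
            for v in &self.vecs {
                acc = on_vec(acc, *v);
            }
            acc
        } else {
            // Final partial chunk: fall back to the scalar closure.
            let flat: Vec<i32> = self.vecs.iter().flatten().copied().collect();
            on_partial(acc, &flat[..self.len])
        }
    }
}

fn main() {
    let full = Chunk { vecs: [[1; 4]; 4], len: 16 };
    let tail = Chunk { vecs: [[2, 2, 0, 0], [0; 4], [0; 4], [0; 4]], len: 2 };
    let on_vec = |acc: i32, v: [i32; 4]| acc + v.iter().sum::<i32>();
    let on_tail = |acc: i32, s: &[i32]| acc + s.iter().sum::<i32>();
    let total = tail.reduce(full.reduce(0, on_vec, on_tail), on_vec, on_tail);
    println!("{}", total); // prints "20"
}
```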

Zip is very limited since it only supports vectors of the same length, and also cannot zip e.g. u8x16 and u32x4 in a way that results in 4 u8s and 4 u32s at once

While I can imagine this would be a reasonable pattern in a more sophisticated vector ISA, I can't think of why one would want to do this with, say, AVX. Do you have an algorithm in mind which could make use of this, or a motivation for using this instead of upcasting the u8s before zipping them?

Well, in faster you don't know how many elements are in a u8s or f32s type (and there's no relationship between them, because e.g. u8s changes between AVX1-only and AVX2 targets while f32s doesn't - or at least it would, if faster had proper support for AVX1-only CPUs), so I don't think there is a way to "upcast" them without specifically providing a helper to do so (and to determine the type of the result). What is probably needed is a WidthType associated type on Packable that is a lightweight "type-level integer", so that something like VectorTypeWithWidth<u8, MinWidth<u8s::WidthType, f32s::WidthType>::WidthType> yields a vector of u8 with the minimum width of u8s and f32s.
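The width bookkeeping can be sketched with associated consts standing in for the proposed type-level integers (all names and the example widths are assumptions, not faster's real API):

```rust
// Hypothetical Packable: each scalar type knows its native vector width.
trait Packable {
    const WIDTH: usize; // lanes of the "native" vector for this scalar
}

// On a hypothetical AVX2-like target: 32 u8 lanes, but only 8 f32 lanes.
struct U8s;
struct F32s;
impl Packable for U8s {
    const WIDTH: usize = 32;
}
impl Packable for F32s {
    const WIDTH: usize = 8;
}

// The common zip width is the minimum of the two native widths.
const fn min_width(a: usize, b: usize) -> usize {
    if a < b { a } else { b }
}

fn main() {
    let w = min_width(U8s::WIDTH, F32s::WIDTH);
    println!("{}", w); // prints "8": zip u8s with f32s eight at a time
}
```

A true type-level encoding (as the comment proposes) would let the *result vector type* depend on the minimum width, which associated consts alone cannot express on stable Rust.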

Regarding zip(), it's necessary to zip the same number of elements together so that the SIMD width is completely abstracted. So if you are zipping a u8x16 and an f32x4, you somehow need to convert the u8x16 iterator into one returning either a bunch of u8x4s or a bunch of partial u8x16s, hence it makes sense for zip() to apply such an adapter itself, so that everything "just works" regardless of actual vector sizes.

This might also mean adding support for "reduced vectors", e.g. a type that contains an f32x4 but for which only 2 elements are valid (differing from Partial in that the number of valid elements is a compile-time constant) - using f32x2 could be an option, but would probably lead to suboptimal code where it isn't natively supported.

The SIMDIterator interface has a bunch of methods like "vector_pos()", "vector_len()", "scalar_len()", etc. that are inappropriate for a SIMD version of Iterator, since the position/length is not a concept valid for general iterators (only for slice iterators)

SIMD iterators are usually slice iterators, are they not? I see no harm in exposing this, because there is no situation in which we don't know it.

Well, you can convert any iterator returning a scalar type into a SIMD iterator (by reading scalars and packing them into a vector), so SIMD iterators are just as generic as normal iterators, and there doesn't seem to be any reason to privilege slice iterators.
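Packing an arbitrary scalar iterator into partial vectors might look like this (a sketch with assumed names; the array again stands in for a SIMD register):

```rust
// Hypothetical adapter: packs any scalar iterator into partial vectors.
struct PackIter<I> {
    inner: I,
}

impl<I: Iterator<Item = i32>> Iterator for PackIter<I> {
    // A partial vector: the register plus the count of valid lanes.
    type Item = ([i32; 4], usize);

    fn next(&mut self) -> Option<Self::Item> {
        let mut v = [0i32; 4];
        let mut n = 0;
        while n < 4 {
            match self.inner.next() {
                Some(x) => {
                    v[n] = x;
                    n += 1;
                }
                None => break,
            }
        }
        if n == 0 { None } else { Some((v, n)) }
    }
}

fn main() {
    // The source is a computed range, not a slice, so there is no
    // meaningful position/length to expose.
    let packed: Vec<_> = PackIter { inner: (1..=6).map(|x| x * x) }.collect();
    println!("{:?}", packed);
}
```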

Obviously an extra SIMD "random access"/"index" iterator trait can and probably should be provided, although I'm not sure it's a good idea to make adapters like map/zip/etc. preserve it, since normal iterators don't, and it might not be very useful anyway (recomputing values when random accesses hit the same index multiple times is probably bad, so collecting first is probably better).

I'd like to wait until specialization is fully stable before I start using it in core parts of the library.

I think a design without specialization should be possible, although it would probably require the user to explicitly call a function to re-adapt a SIMD iterator after Iterator::map has been called on it, so that it again supports the SIMD iterator interface, rather than relying on a blanket impl of SIMDIterator for Iterator.
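The explicit re-adapter could be sketched like this (every name here is hypothetical; scalars are buffered in a Vec purely for illustration):

```rust
// Hypothetical wrapper: re-exposes next_n() over any iterator of
// (register, valid-lane count) partials, without needing specialization.
struct Simd<I> {
    inner: I,
    buf: Vec<i32>, // scalars carried over between next_n() calls
}

impl<I: Iterator<Item = ([i32; 4], usize)>> Simd<I> {
    fn new(inner: I) -> Self {
        Simd { inner, buf: Vec::new() }
    }

    // next_n() in scalar terms, refilling from the wrapped iterator.
    fn next_n(&mut self, n: usize) -> Option<Vec<i32>> {
        while self.buf.len() < n {
            match self.inner.next() {
                Some((v, len)) => self.buf.extend_from_slice(&v[..len]),
                None => break,
            }
        }
        if self.buf.is_empty() {
            return None;
        }
        let take = n.min(self.buf.len());
        Some(self.buf.drain(..take).collect())
    }
}

fn main() {
    let partials = vec![([1, 2, 3, 4], 4), ([5, 6, 0, 0], 2)];
    // After an Iterator::map over the partials, the result is a plain
    // Iterator again; Simd::new() re-adapts it explicitly.
    let mut it = Simd::new(partials.into_iter());
    println!("{:?}", it.next_n(3)); // prints "Some([1, 2, 3])"
    println!("{:?}", it.next_n(3)); // prints "Some([4, 5, 6])"
    println!("{:?}", it.next_n(3)); // prints "None"
}
```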