ExpHP opened 6 years ago
That's definitely on the roadmap. I was working on something similar around the 0.4.0 release, actually. As a stopgap, have you tried using the standard `map` and `reduce` and handling any remaining scalars with the original function? It's not super pretty, but it should do approximately what you want.
Ah! I find it kind of surprising that the SIMD iterators implement `Iterator` (though given the circumstances, it is certainly convenient!).
Eventually I managed to write this:
```rust
pub fn dot<K: AsKetRef>(self, other: K) -> Rect {
    use ::faster::prelude::*;

    let other = other.as_ket_ref();
    assert_eq!(self.real.len(), other.real.len());
    assert_eq!(self.real.len(), other.imag.len());
    assert_eq!(self.real.len(), self.imag.len());

    let zero = (f64s(0.0), f64s(0.0));
    let add = |(ar, ai): (f64s, f64s), (br, bi): (f64s, f64s)| {
        (ar + br, ai + bi)
    };

    // Performs `a.conj * b` on simd vectors
    let simd_map_func = |(ar, ai, br, bi): (f64s, f64s, f64s, f64s)| {
        let real = ar * br + ai * bi;
        let imag = ar * bi - ai * br;
        (real, imag)
    };

    let mut iter = (
        // (the defaults of zero here may show up in the unaligned remainder,
        //  where they will harmlessly "contribute" to the sum)
        self.real.simd_iter(f64s(0.0)),
        self.imag.simd_iter(f64s(0.0)),
        other.real.simd_iter(f64s(0.0)),
        other.imag.simd_iter(f64s(0.0)),
    ).zip();

    let partial_sums = {
        // deal with aligned elements
        iter.by_ref()
            .map(simd_map_func)
            .fold(zero, add)
    };

    // deal with unaligned remainder
    let leftovers = iter.end().map(|(vs, _)| vs).map(simd_map_func).unwrap_or(zero);

    let (real, imag) = add(partial_sums, leftovers);
    let (real, imag) = (real.sum(), imag.sum());
    Rect { real, imag }
}
```
It is pretty ugly (considering the original code was once 5 lines!), but it does gain back that much-needed performance:
Change in runtime compared to the original:

| benchmark | faster (two pass) | faster (one pass) |
|-----------|-------------------|-------------------|
| dot/16    | -35.183% | -46.340% |
| dot/64    | -35.610% | -47.443% |
| dot/256   | -25.074% | -42.206% |
| dot/1024  | -30.914% | -47.855% |
| dot/4096  | -29.428% | -36.277% |
| dot/8192  | -19.152% | -41.920% |
| dot/16384 | -10.937% | -43.468% |
| dot/32768 | -6.077%  | -43.786% |
| dot/65536 | -5.440%  | -41.627% |
I am excited to hear that something to this effect is planned! I'm curious, does the design include a trait that could be implemented by external types besides tuples? E.g. I'm picturing making `Rect` generic and having my map function return a `Rect<f64s>`.
(That said, I also recall from earlier design attempts of my own that this sort of "structure-of-arrays" style trait is difficult to design, so I'm really just wondering whether you managed to crack that nut :wink:)
Eugh, sincerest apologies for abandoning that feature. Faster is really good at computations which spit out a single array right now, and the regular iterator functions are a stopgap until I figure out an intuitive way to do n outputs.
It's possible that returning a user-definable trait might be the missing piece to the puzzle, though. I haven't approached it from that angle before and that should let me shift the burden of interpreting the data coming in and out of the closures to the user.
Fair warning: More or less every time I've tried something like it (not for SIMD obviously, but really just any sort of "SoA <-> AoS" conversion), I've found myself journeying down a rabbit hole of design, as I'd keep finding more and more things that needed trait methods with no easy way to derive them or define them in terms of others.
(But don't let that stop you! It could be that you have a few pieces of the puzzle that I lacked. :slightly_smiling_face:)
I prototyped an idea using an HList-like type to reduce boilerplate in custom user impls:
https://github.com/ExpHP/rust-zip-simd-idea
An example of a custom user type can be seen in the tests of `packed.rs`. (Tuples can also use the same machinery; I just haven't written a macro to do it yet.) Unfortunately, the trait the user needs to implement currently has an astounding 9 associated items. (And to think I call this a reduction of boilerplate!!!)
My approach was to make the `Packed` trait itself support custom types. I'm not sure this was a good idea, due to some ugly typelevel-programming hacks necessary to make it work. There's more info in the README.
Unfortunately, I stopped once I got to the iterators as I was having trouble deciding how to separate the functionality there.
It's like I said before: designing a feature like this is a terrible time and energy sink. Ideally one might dream of it leading to a more orthogonal and cleanly-separated API, but if you don't draw the lines correctly, you end up needing to patch it up with typelevel programming and the like, which ends up making it more confusing and difficult to learn.
But maybe it will give you some ideas.
I have a function which looks vaguely like this:
Converting it to use `faster` requires two passes over the arrays; I am unable to produce both `real` and `imag` in one pass because `simd_map` requires the function output to be a single vector:

So is it faster? Well, actually, yes! It is plenty faster... up to a point:
Yikes! Once we hit 16384 elements there is almost no speedup!
I suspect it is because at this point memory has become the bottleneck, and most of what was gained by using SIMD was lost by making two passes over the arrays. It would be nice to have an API that allowed this to be done in one pass, by allowing a mapping function to return a tuple (producing a new `PackedZippedIterator` or similar).