mapping a zipped iterator to produce tuples

ExpHP commented 6 years ago

I have a function which looks vaguely like this:

struct Rect { real: f64, imag: f64 }
struct KetRef<'a> { real: &'a [f64], imag: &'a [f64] }

impl<'a> KetRef<'a> {
    pub fn dot(self, other: KetRef) -> Rect {
        assert_eq!(self.real.len(), other.real.len());
        assert_eq!(self.real.len(), other.imag.len());
        assert_eq!(self.real.len(), self.imag.len());
        zip!(self.real, self.imag, other.real, other.imag)
            .map(|(ar, ai, br, bi)| {
                let real = ar * br + ai * bi;
                let imag = ar * bi - ai * br;
                Rect { real, imag }
            })
            .fold(Rect::zero(), |a,b| a + b)
    }
}

Converting it to use faster requires two passes over the arrays; I am unable to produce both real and imag in one pass because simd_map requires the function output to be a single vector:

pub fn dot<K: AsKetRef>(self, other: K) -> Rect {
    use ::faster::prelude::*;

    let other = other.as_ket_ref();
    assert_eq!(self.real.len(), other.real.len());
    assert_eq!(self.real.len(), other.imag.len());
    assert_eq!(self.real.len(), self.imag.len());

    let real = (
        self.real.simd_iter(f64s(0.0)),
        self.imag.simd_iter(f64s(0.0)),
        other.real.simd_iter(f64s(0.0)),
        other.imag.simd_iter(f64s(0.0)),
    ).zip().simd_map(|(ar, ai, br, bi)| {
        ar * br + ai * bi
    }).simd_reduce(f64s(0.0), |acc, v| acc + v).sum();

    let imag = (
        self.real.simd_iter(f64s(0.0)),
        self.imag.simd_iter(f64s(0.0)),
        other.real.simd_iter(f64s(0.0)),
        other.imag.simd_iter(f64s(0.0)),
    ).zip().simd_map(|(ar, ai, br, bi)| {
        ar * bi - ai * br
    }).simd_reduce(f64s(0.0), |acc, v| acc + v).sum();

    Rect { real, imag }
}

So is it faster? Well, actually, yes! It is plenty faster... up to a point:

Change in run-time for different ket lengths
dot/16              change: -33.973%
dot/64              change: -29.575%
dot/256             change: -26.762%
dot/1024            change: -34.054%
dot/4096            change: -36.297%
dot/16384           change: -7.3379%

Yikes! Once we hit 16384 elements there is almost no speedup!

I suspect it is because at this point, memory has become the bottleneck, and most of what was gained by using SIMD was lost by making two passes over the arrays. It would be nice to have an API that allowed this do be done in one pass by allowing a mapping function to return a tuple (producing a new PackedZippedIterator or similar).

AdamNiederer commented 6 years ago

That's definitely on the roadmap. I was working on something similar around the 0.4.0 release, actually. As a stopgap, have you tried using the standard map and reduce and handling any remaining scalars with the original function? It's not super pretty, but it should do approximately what you want.

ExpHP commented 6 years ago

Ah! I think that's kind of surprising how the SIMD iterators implement Iterator. (though given the circumstances, it is certainly convenient!)

Eventually I managed to write this:

pub fn dot<K: AsKetRef>(self, other: K) -> Rect {
    use ::faster::prelude::*;

    let other = other.as_ket_ref();
    assert_eq!(self.real.len(), other.real.len());
    assert_eq!(self.real.len(), other.imag.len());
    assert_eq!(self.real.len(), self.imag.len());

    let zero = (f64s(0.0), f64s(0.0));
    let add = |(ar, ai): (f64s, f64s), (br, bi): (f64s, f64s)| {
        (ar + br, ai + bi)
    };
    // Performs `a.conj * b` on simd vectors
    let simd_map_func = |(ar, ai, br, bi): (f64s, f64s, f64s, f64s)| {
        let real = ar * br + ai * bi;
        let imag = ar * bi - ai * br;
        (real, imag)
    };

    let mut iter = (
        // (the defaults of zero here may show up in the unaligned remainder,
        //  where they will harmlessly "contribute" to the sum)
        self.real.simd_iter(f64s(0.0)),
        self.imag.simd_iter(f64s(0.0)),
        other.real.simd_iter(f64s(0.0)),
        other.imag.simd_iter(f64s(0.0)),
    ).zip();

    let partial_sums = {
        // deal with aligned elements
        iter.by_ref()
            .map(simd_map_func)
            .fold(zero, add)
    };
    // deal with unaligned remainder
    let leftovers = iter.end().map(|(vs, _)| vs).map(simd_map_func).unwrap_or(zero);
    let (real, imag) = add(partial_sums, leftovers);
    let (real, imag) = (real.sum(), imag.sum());

    Rect { real, imag }
}

It is pretty ugly (considering the original code was once 5 lines!), but it does gain back that much needed performance:

 benchmark    change in runtime compared to original
              faster (two pass)     faster (one pass)
dot/16            -35.183%              -46.340%
dot/64            -35.610%              -47.443%
dot/256           -25.074%              -42.206%
dot/1024          -30.914%              -47.855%
dot/4096          -29.428%              -36.277%
dot/8192          -19.152%              -41.920%
dot/16384         -10.937%              -43.468%
dot/32768          -6.077%              -43.786%
dot/65536          -5.440%              -41.627%

I am excited to hear that something to this effect is planned! I'm curious, does the design include a trait that could be implemented by external types besides tuples? e.g. I'm picturing making Rect generic and having my map function return a Rect<f64s>.

(That said, I also recall from earlier design attempts of my own that these sort of "structure-of-array" style traits are difficult to design, so I'm really just wondering whether you managed to crack that nut :wink:)

AdamNiederer commented 6 years ago

Eugh, sincerest apologies for abandoning that feature. Faster is really good at computations which spit out a single array right now, and the regular iterator functions are a stopgap until I figure out an intuitive way to do n outputs.

It's possible that returning a user-definable trait might be the missing piece to the puzzle, though. I haven't approached it from that angle before and that should let me shift the burden of interpreting the data coming in and out of the closures to the user.

ExpHP commented 6 years ago

It's possible that returning a user-definable trait might be the missing piece to the puzzle, though. I haven't approached it from that angle before and that should let me shift the burden of interpreting the data coming in and out of the closures to the user.

Fair warning: More or less every time I've tried something like it (not for SIMD obviously, but really just any sort of "SoA <-> AoS" conversion), I've found myself journeying down a rabbit hole of design, as I'd keep finding more and more things that needed trait methods with no easy way to derive them or define them in terms of others.

(But don't let that stop you! It could be that you have a few pieces of the puzzle that I lacked. :slightly_smiling_face:)

ExpHP commented 6 years ago

I prototyped an idea using an HList-like type to reduce boilerplate in custom user impls:

https://github.com/ExpHP/rust-zip-simd-idea

An example of a custom user type can be seen in the tests of packed.rs. (tuples can also use the same machinery, I just haven't written a macro to do it yet). Unfortunately the trait the user needs to implement currently has an astounding 9 associated items. (and to think I call this a reduction of boilerplate!!!)

My approach was to make the Packed trait itself support custom types. I'm not sure this was a good idea due to some ugly typelevel-programming hacks necessary to make it work. There's more info in the README.

Unfortunately, I stopped once I got to the iterators as I was having trouble deciding how to separate the functionality there.

It's like I said before; designing a feature like this is a terrible time and energy sink. Ideally one might dream of it leading to a more orthogonal and cleanly-separated API, but if you don't draw the lines correctly, you end up needing to patch it up with typelevel programming and etc. that end up making it more confusing and difficult to learn.

But maybe it will give you some ideas.

AdamNiederer / faster

mapping a zipped iterator to produce tuples #40