TeXitoi / benchmarksgame-rs

The Computer Language Benchmarks Game: Rust implementations

spectralnorm: rayon and de-vectorize #58

Closed cristicbz closed 7 years ago

cristicbz commented 7 years ago

I got rid of all the f64x2 and usizex2 structs, since the code wouldn't vectorize no matter how hard I tried. Unlike #51, I also completely got rid of chunking, letting rayon do all the work via par_iter_mut.

This speeds up the code a little, from 1.6s to 1.4s on my laptop.
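The chunk-free structure can be sketched as below. This is a minimal, hypothetical reduction, not the submitted code: `mult_av` and the closure names are made up for illustration, and the sketch is sequential so it runs with only the standard library; with `use rayon::prelude::*;` in scope, the `iter_mut()` becomes `par_iter_mut()` and rayon distributes the rows across threads with no manual chunking.

```rust
// Sequential sketch of the per-row loop. With rayon, `iter_mut()` below
// becomes `par_iter_mut()` and each output element is computed in parallel.
fn mult_av<F>(v: &[f64], out: &mut [f64], a: F)
    where F: Fn(usize, usize) -> f64 + Sync
{
    out.iter_mut().enumerate().for_each(|(i, slot)| {
        // Each output element is an independent dot product -- exactly the
        // independent-work shape rayon parallelizes well.
        *slot = v.iter().enumerate().map(|(j, &x)| a(i, j) * x).sum();
    });
}

fn main() {
    // A(i, j) = 1 / ((i + j)(i + j + 1)/2 + i + 1), as in spectralnorm.
    let a = |i: usize, j: usize| 1.0 / (((i + j) * (i + j + 1) / 2 + i + 1) as f64);
    let v = vec![1.0; 4];
    let mut out = vec![0.0; 4];
    mult_av(&v, &mut out, a);
    println!("{:.6}", out[0]);
}
```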

llogiq commented 7 years ago

I'd like to note that the failure to vectorize is a regression from earlier rust versions. OTOH, SIMD for stable is on the horizon, so once we have that, we can port to it.

TeXitoi commented 7 years ago

@llogiq any news on SIMD? Is there an issue to follow?

mbrubeck commented 7 years ago

The most recent work on stabilizing SIMD has been in an internals discussion. burntsushi hopes to publish a draft RFC soon.

TeXitoi commented 7 years ago

@cristicbz do you want me to take care of submitting this bench?

cristicbz commented 7 years ago

@TeXitoi I was wondering whether this is a good idea: it's a little faster, and it's cleaner, but I wanted to see if we could fiddle with the code to convince LLVM to vectorize it again. I tried a little at the time but didn't get anywhere, and I've been pretty busy for a while, so I haven't had a chance to look at it properly :(

TeXitoi commented 7 years ago

OK, I'll leave it as is for the moment, but I'll feel free to submit it in a few days/weeks if there are no changes (or if I find some time to try to autovectorize it).

mbrubeck commented 7 years ago

Tung Duong managed to get a version that auto-vectorizes, but had to jump through some hoops:

https://alioth.debian.org/tracker/index.php?func=detail&aid=315668&group_id=100815&atid=413122

When I build the final single-file submission on my computer with either Rust 1.16.0 or the current nightly, it does not provide any speedup versus master. Not sure if it's because of a hardware difference or if I'm missing something...

TeXitoi commented 7 years ago

Since this version is already better than before, maybe the optimization was already active in this version?

cristicbz commented 7 years ago

I managed to get this to vectorize by using [f64; 2] and the div_and_add trick (with #[inline(never)]). It now runs at a speed similar to the C implementation's. Please have a look and let me know if you have any comments before I submit this.
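For reference, the shape of that trick can be sketched as follows. This is a hypothetical reduction, not the submitted code: the idea is that pulling the divide-and-accumulate step into a tiny #[inline(never)] function over [f64; 2] gives LLVM a small, isolated loop body that it lowers to packed SIMD divides, instead of refusing to vectorize the larger surrounding loop.

```rust
// Hypothetical sketch of the `[f64; 2]` + `div_and_add` trick.
// #[inline(never)] keeps this function out of the caller's loop, so LLVM
// sees a self-contained two-lane body it can turn into a packed divide/add.
#[inline(never)]
fn div_and_add(numer: [f64; 2], denom: [f64; 2], mut acc: [f64; 2]) -> [f64; 2] {
    // Element-wise over both lanes; the fixed trip count of 2 maps
    // directly onto a 128-bit SIMD register.
    for i in 0..2 {
        acc[i] += numer[i] / denom[i];
    }
    acc
}

fn main() {
    let acc = div_and_add([1.0, 1.0], [2.0, 4.0], [0.0, 0.0]);
    println!("{:?}", acc);
}
```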

I also tried adapting this version to use the simd crate (on nightly) to get an idea of a lower bound, giving this:

```rust
#![feature(cfg_target_feature)]

extern crate simd;
extern crate rayon;

use rayon::prelude::*;
#[cfg(target_feature = "sse2")]
use simd::x86::sse2::f64x2;
#[cfg(target_arch = "aarch64")]
use simd::aarch64::neon::f64x2;
#[cfg(target_feature = "sse2")]
use simd::x86::sse2::u64x2;
#[cfg(target_arch = "aarch64")]
use simd::aarch64::neon::u64x2;

fn main() {
    let n = std::env::args().nth(1)
        .and_then(|n| n.parse().ok())
        .unwrap_or(100);
    let answer = spectralnorm(n);
    println!("{:.9}", answer);
}

fn spectralnorm(n: usize) -> f64 {
    assert!(n % 2 == 0, "only even lengths are accepted");
    let mut u = vec![f64x2::splat(1.0); n / 2];
    let mut v = vec![f64x2::splat(0.0); n / 2];
    let mut tmp = vec![f64x2::splat(0.0); n / 2];
    for _ in 0..10 {
        mult_AtAv(&u, &mut v, &mut tmp);
        mult_AtAv(&v, &mut u, &mut tmp);
    }
    (dot(&u, &v) / dot(&v, &v)).sqrt()
}

fn mult_AtAv(v: &[f64x2], out: &mut [f64x2], tmp: &mut [f64x2]) {
    mult(v, tmp, A);
    mult(tmp, out, |i, j| A(j, i));
}

fn mult<F>(v: &[f64x2], out: &mut [f64x2], a: F)
    where F: Fn(u64x2, u64x2) -> f64x2 + Sync
{
    out.par_iter_mut().enumerate().for_each(|(i, slot)| {
        let i = 2 * i as u64;
        let i0 = u64x2::splat(i);
        let i1 = u64x2::splat(i + 1);
        let slots = v.iter().enumerate().map(|(j, &x)| {
            let j = u64x2::new(2 * j as u64, 2 * j as u64 + 1);
            (x * a(i0, j), x * a(i1, j))
        }).fold((f64x2::splat(0.0), f64x2::splat(0.0)),
                |s, x| (s.0 + x.0, s.1 + x.1));
        *slot = f64x2::new(slots.0.extract(0) + slots.0.extract(1),
                           slots.1.extract(0) + slots.1.extract(1));
    });
}

fn A(i: u64x2, j: u64x2) -> f64x2 {
    let s = i + j;
    let r = ((s * (s + u64x2::splat(1)) >> 1u64) + i + u64x2::splat(1)).to_f64();
    f64x2::splat(1.0) / r
}

fn dot(v: &[f64x2], u: &[f64x2]) -> f64 {
    let r = u.iter()
        .zip(v)
        .map(|(&x, &y)| x * y)
        .fold(f64x2::splat(0.0), |s, x| s + x);
    r.extract(0) + r.extract(1)
}
```

On my computer, this version runs at the same speed as the one in the latest commit. Sadly, just adding the following dud implementations doesn't work:

```rust
use std::ops::{Add, Shr, Mul, Div, Sub};

#[derive(Copy, Clone, Debug, Default)]
pub struct f64x2(f64, f64);

#[derive(Copy, Clone, Debug, Default)]
pub struct u64x2(u64, u64);

impl f64x2 {
    pub fn new(x: f64, y: f64) -> f64x2 { f64x2(x, y) }
    pub fn splat(x: f64) -> f64x2 { f64x2(x, x) }
    pub fn extract(&self, i: u8) -> f64 {
        match i {
            0 => self.0,
            1 => self.1,
            _ => unreachable!(),
        }
    }
}

impl Add for f64x2 {
    type Output = Self;
    fn add(self, rhs: Self) -> Self { f64x2(self.0 + rhs.0, self.1 + rhs.1) }
}

impl Sub for f64x2 {
    type Output = Self;
    fn sub(self, rhs: Self) -> Self { f64x2(self.0 - rhs.0, self.1 - rhs.1) }
}

impl Mul for f64x2 {
    type Output = Self;
    fn mul(self, rhs: Self) -> Self { f64x2(self.0 * rhs.0, self.1 * rhs.1) }
}

impl Div for f64x2 {
    type Output = Self;
    fn div(self, rhs: Self) -> Self { f64x2(self.0 / rhs.0, self.1 / rhs.1) }
}

impl u64x2 {
    pub fn new(x: u64, y: u64) -> u64x2 { u64x2(x, y) }
    pub fn splat(x: u64) -> u64x2 { u64x2(x, x) }
    pub fn to_f64(self) -> f64x2 { f64x2(self.0 as f64, self.1 as f64) }
}

impl Add for u64x2 {
    type Output = Self;
    fn add(self, rhs: Self) -> Self { u64x2(self.0 + rhs.0, self.1 + rhs.1) }
}

impl Sub for u64x2 {
    type Output = Self;
    fn sub(self, rhs: Self) -> Self { u64x2(self.0 - rhs.0, self.1 - rhs.1) }
}

impl Mul for u64x2 {
    type Output = Self;
    fn mul(self, rhs: Self) -> Self { u64x2(self.0 * rhs.0, self.1 * rhs.1) }
}

impl Div for u64x2 {
    type Output = Self;
    fn div(self, rhs: Self) -> Self { u64x2(self.0 / rhs.0, self.1 / rhs.1) }
}

// `Shr` needs an explicit `u64` right-hand side (the trait's default is `Self`).
impl Shr<u64> for u64x2 {
    type Output = Self;
    fn shr(self, rhs: u64) -> Self { u64x2(self.0 >> rhs, self.1 >> rhs) }
}
```

brings the runtime back to 1.4s. So that's annoying.

cristicbz commented 7 years ago

Sorry @TeXitoi I took a while to get around to this. So, I added Tung Duong to the contributors and also added a couple of comments and changed the names of the functions to get rid of the #![allow(non_snake_case)] thing. I created a submission with this code: https://alioth.debian.org/tracker/index.php?func=detail&aid=315699&group_id=100815&atid=413122