Closed by cristicbz 7 years ago
I'd like to note that the failure to vectorize is a regression from earlier Rust versions. OTOH, SIMD on stable is on the horizon, so once we have that, we can port to it.
@llogiq any news of SIMD? Maybe an issue to follow?
The most recent work on stabilizing SIMD has been in an internals discussion. burntsushi hopes to publish a draft RFC soon.
@cristicbz do you want me to take care of submitting this bench?
@TeXitoi I was wondering if this is a good idea; it's a little bit faster, and it is cleaner. But I wanted to see if we can't fiddle with the code to convince LLVM to vectorize it again. I tried a little bit at the time, but didn't get anywhere. I've been pretty busy for a while so I haven't had a chance to look at it properly :(
OK, I'll leave it as is for the moment, but I'll feel free to submit it in a few days/weeks if there are no changes (or if I find some time to try to get it to autovectorize).
Tung Duong managed to get a version that auto-vectorizes, but had to jump through some hoops:
https://alioth.debian.org/tracker/index.php?func=detail&aid=315668&group_id=100815&atid=413122
When I build the final single-file submission on my computer with either Rust 1.16.0 or the current nightly, it does not provide any speedup versus master. Not sure if it's because of a hardware difference or if I'm missing something...
As this version is better than before, maybe the optimization was also active with this version?
I managed to get this to vectorize by using [f64; 2] and the div_and_add trick (with #[inline(never)]). It now runs at a similar speed to the C implementation. Please have a look and let me know if you have any comments before I submit this.
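For readers who have not seen the trick, here is a minimal sketch of its general shape. It is not the code from the submission: the signature of div_and_add, the F64x2 alias, and the sum_of_quotients driver are all assumptions made for illustration. Whether LLVM actually emits packed instructions for it depends on the compiler version and target, so check the generated assembly rather than taking it on faith.

```rust
/// Two f64 lanes handled together. A plain array, not a SIMD type
/// (the alias name is hypothetical).
type F64x2 = [f64; 2];

/// Hypothetical helper: lane-wise `acc[i] += num[i] / den[i]`.
/// `#[inline(never)]` keeps it as a standalone function, so the optimizer
/// sees the two lanes in isolation instead of inside a larger loop body.
#[inline(never)]
fn div_and_add(acc: &mut F64x2, num: F64x2, den: F64x2) {
    acc[0] += num[0] / den[0];
    acc[1] += num[1] / den[1];
}

/// Illustrative driver: sums `num[i] / den[i]`, two elements per step.
/// Assumes both slices have the same, even length to keep the sketch short.
fn sum_of_quotients(num: &[f64], den: &[f64]) -> f64 {
    assert_eq!(num.len(), den.len());
    assert_eq!(num.len() % 2, 0);
    let mut acc: F64x2 = [0.0; 2];
    for (n, d) in num.chunks_exact(2).zip(den.chunks_exact(2)) {
        div_and_add(&mut acc, [n[0], n[1]], [d[0], d[1]]);
    }
    // Combine the two lanes only once, at the end.
    acc[0] + acc[1]
}
```

Keeping two independent accumulators and merging them only at the end is what leaves room for a packed divide and add; a single running sum would force the additions to happen in order.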
I also tried adapting this version to use the simd crate (on nightly) to get an idea of a lower bound, giving this:
On my computer, this version runs at the same speed as the one in the latest commit. Sadly, just adding the following dud implementations doesn't work:
brings the runtime back to 1.4s. So that's annoying.
Sorry @TeXitoi I took a while to get around to this. So, I added Tung Duong to the contributors and also added a couple of comments and changed the names of the functions to get rid of the #![allow(non_snake_case)] thing. I created a submission with this code: https://alioth.debian.org/tracker/index.php?func=detail&aid=315699&group_id=100815&atid=413122
I got rid of all the f64x2 and usizex2 structs since the code wouldn't vectorize no matter how hard I tried. Unlike #51, I also completely got rid of chunking, letting rayon do all the work, using par_iter_mut. This speeds up the code a little bit, from 1.6s to 1.4s on my laptop.