AdamNiederer / faster

SIMD for humans

Surprising benchmark numbers #51

Open ralfbiedert opened 6 years ago

ralfbiedert commented 6 years ago

While working on #47 I noticed what look like performance regressions in cargo bench, in particular in functions like map_simd and map_scalar, but also in quite a few others.

test tests::map_scalar                                ... bench:       2,022 ns/iter (+/- 264)
test tests::map_simd                                  ... bench:       6,898 ns/iter (+/- 392)

However, comparing #49 to the commit before the refactoring, the numbers are mostly unchanged.

I then assumed it was related to unfortunate default feature flags on my machine, but playing with avx2 and sse4.1 didn't have any effect either. I also have a first implementation of #48, and it looks like no fallbacks are emitted for map_simd. (I tried to cross-check that with radare2, but had some problems locating the right symbol / disassembly for the benchmarks.) Lastly, the functions map_scalar and map_simd differ a bit, but even when I make them equivalent (e.g., sqrt vs. rsqrt) the difference remains.

Running on rustc 1.29.0-nightly (9fd3d7899 2018-07-07), MBP 2015, i7-5557U.

Update: I linked my SVM library against the latest faster version and I don't see these problems in 'production':

csvm_predict_sv1024_attr1024_problems1 ... bench:     232,109 ns/iter (+/- 20,808) [faster AVX2]
csvm_predict_sv1024_attr1024_problems1 ... bench:     942,925 ns/iter (+/- 64,156) [scalar]

Update 2: This seems to be related to specific intrinsics. When I dissect the benchmark, I get:

test tests::map_scalar                                ... bench:         558 ns/iter (+/- 55) [without .abs()]
test tests::map_scalar                                ... bench:         556 ns/iter (+/- 33) [with .abs()]
test tests::map_simd                                  ... bench:         144 ns/iter (+/- 17) [without .abs()]
test tests::map_simd                                  ... bench:         883 ns/iter (+/- 64) [with .abs()]

I now think that each intrinsic should have its own benchmark, e.g. intrinsic_abs_scalar, intrinsic_abs_simd, ...
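For illustration, a minimal sketch of what such a pair could look like, assuming the nightly test bencher and the simd_iter / simd_map / scalar_collect shape from faster's README of that era; the benchmark names, buffer size, and exact iterator signatures are assumptions and may differ between faster versions:

```rust
#![feature(test)]
extern crate test;
extern crate faster;

use faster::*;
use test::{black_box, Bencher};

// Scalar baseline for a single intrinsic.
#[bench]
fn intrinsic_abs_scalar(b: &mut Bencher) {
    let data = vec![-1.5f32; 1024];
    b.iter(|| {
        black_box(&data).iter().map(|x| x.abs()).collect::<Vec<f32>>()
    });
}

// SIMD counterpart for the same intrinsic. The iterator calls follow the
// shape of faster's README at the time; exact signatures vary by release.
#[bench]
fn intrinsic_abs_simd(b: &mut Bencher) {
    let data = vec![-1.5f32; 1024];
    b.iter(|| {
        black_box(&data[..]).simd_iter(f32s(0.0))
            .simd_map(|v| v.abs())
            .scalar_collect()
    });
}
```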

Update 3: ... oh boy. I think that, by some "arcane magic", Rust imports and prefers std::simd::f32x4 and friends over faster's types and methods.

So when you do my_f32s.abs(), it calls std::simd::f32x4::abs, not faster::arch::current::intrin::abs.

The reason I think that's the problem: you can now call my_f32s.sqrte(), which isn't implemented in faster but is in std::simd.

What's more annoying is that the compiler doesn't warn about the collision, and that std::simd is actually slower than "vanilla" Rust here.
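To make the silent collision concrete, here is a small self-contained sketch (toy types, not faster's or std::simd's real ones): when an inherent method and a trait method share a name, the compiler emits no warning, the inherent one wins under method syntax, and only generic code still reaches the trait impl.

```rust
#[derive(Copy, Clone)]
struct F32x4Like(f32);

impl F32x4Like {
    // Stands in for std::simd::f32x4's built-in abs().
    fn abs(self) -> f32 {
        self.0.abs()
    }
}

trait IntrinAbs {
    // Stands in for faster's intrinsic trait.
    fn abs(self) -> f32;
}

impl IntrinAbs for F32x4Like {
    fn abs(self) -> f32 {
        f32::NAN // sentinel so we can tell which impl ran
    }
}

// Inside a generic function only the trait bound is visible,
// so abs() here resolves to the trait impl.
fn via_trait<T: IntrinAbs>(v: T) -> f32 {
    v.abs()
}

fn main() {
    let v = F32x4Like(-2.0);
    assert_eq!(v.abs(), 2.0);       // method syntax: the inherent method wins, no warning
    assert!(via_trait(v).is_nan()); // generic context: the trait impl runs
}
```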

TODO:

Update 4: One more thing makes sense now ... I sometimes got "use of unstable library feature 'stdsimd'" errors in test cases and didn't understand why. Probably because those are the places where the std::simd built-ins were being used.

ralfbiedert commented 6 years ago

Unless I'm mistaken, this doesn't look like an easy fix: std::simd types such as f32x4 seem to implement functions like abs() as inherent methods. That means we can't easily re-implement them in traits, since method resolution picks the inherent std::simd variant by default.
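A sketch of why that is: the same rule already applies to plain f32, whose inherent abs() shadows any same-named trait method, and the only way to reach the trait impl at a call site is fully qualified syntax, which is exactly what faster's fluent style wants to avoid. The trait below is illustrative, not faster's real one.

```rust
trait TraitAbs {
    fn abs(self) -> Self;
}

impl TraitAbs for f32 {
    fn abs(self) -> f32 {
        if self < 0.0 { -self } else { self }
    }
}

fn main() {
    let x = -1.5f32;
    let _ = x.abs();                   // resolves to the inherent f32::abs, silently
    let _ = TraitAbs::abs(x);          // only fully qualified syntax reaches the trait impl
    let _ = <f32 as TraitAbs>::abs(x); // equivalent UFCS form
}
```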

The solutions I see:

Will put this on hold until it's discussed further (but I will PR some improvements made in the process).

AdamNiederer commented 6 years ago

Hm, I do want faster to seamlessly integrate with std::simd (as they have much more manpower than I do), but I can definitely see these name conflicts becoming a long-term problem as std::simd adds features - especially because I likely won't know about them until rustc starts warning me.

A good argument for moving away from those types would be that we could implement std::simd's traits on our types, but that'd be a manual process and I'd likely have more work to do to get it compiling on stable.
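For what it's worth, a minimal sketch of that direction, with core::ops::Add standing in for the std traits that would be implemented for interop; all names and the [f32; 4] backing store are illustrative assumptions, not a proposed implementation:

```rust
use std::ops::Add;

// A type owned by faster: its inherent methods can no longer be shadowed
// by std::simd built-ins, while std traits are still implemented for interop.
#[derive(Copy, Clone, Debug, PartialEq)]
struct F32s([f32; 4]);

impl F32s {
    // Inherent abs() on the library's own type.
    fn abs(self) -> F32s {
        F32s([self.0[0].abs(), self.0[1].abs(), self.0[2].abs(), self.0[3].abs()])
    }
}

impl Add for F32s {
    type Output = F32s;
    fn add(self, rhs: F32s) -> F32s {
        F32s([
            self.0[0] + rhs.0[0],
            self.0[1] + rhs.0[1],
            self.0[2] + rhs.0[2],
            self.0[3] + rhs.0[3],
        ])
    }
}

fn main() {
    let v = F32s([-1.0, 2.0, -3.0, 4.0]);
    assert_eq!(v.abs() + v, F32s([0.0, 4.0, 0.0, 8.0]));
}
```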