AdamNiederer / faster

SIMD for humans
Mozilla Public License 2.0
1.56k stars 51 forks source link

Optimize zip #21

Closed Osveron closed 6 years ago

Osveron commented 6 years ago

I have added check for length of the iterators on zip creation, so it panics with an information that zipped iterator need to have the same size, because currently it panicked for different reasons in debug and release mode. I have optimized zip by making it only check the first iterator, if it returns Some then all the zipped iterators should return Some or if it returns None all others should return that too.

I have also added debug assertions in unsafe load and store functions, and make use of them in IntoScalar functions.

AdamNiederer commented 6 years ago

The assertions look good, but I'm not seeing any perf differences on the benches I have. Are there any scenarios where you've seen a speedup?

Osveron/zip
test tests::zip_nop_scalar        ... bench:         972 ns/iter (+/- 3)
test tests::zip_nop_simd          ... bench:         430 ns/iter (+/- 0)
test tests::zip_scalar            ... bench:       1,200 ns/iter (+/- 54)
test tests::zip_simd              ... bench:         771 ns/iter (+/- 7)

AdamNiederer/master
test tests::zip_nop_scalar        ... bench:         971 ns/iter (+/- 32)
test tests::zip_nop_simd          ... bench:         430 ns/iter (+/- 0)
test tests::zip_scalar            ... bench:       1,164 ns/iter (+/- 2)
test tests::zip_simd              ... bench:         746 ns/iter (+/- 1)
Osveron commented 6 years ago

Strange, I get nice improvement on my platform in:

 tests::determinant2_simd      779              663                     -116  -14.89%   x 1.17
 tests::determinant3_simd      929              828                     -101  -10.87%   x 1.12
 tests::zip_nop_simd           792              486                     -306  -38.64%   x 1.63

and in my application that only zips two vectors I get another 0.8 sec improvement.

AdamNiederer commented 6 years ago

I've re-benched on my desktop, and it looks like the results are much more pronounced on SSE (and on machines which stay at the same frequency throughout the whole suite).

Osveron/zip
test tests::base100_enc_simd      ... bench:         319 ns/iter (+/- 3)
test tests::determinant2_simd     ... bench:         275 ns/iter (+/- 4)
test tests::determinant3_simd     ... bench:         286 ns/iter (+/- 1)
test tests::zip_nop_simd          ... bench:         257 ns/iter (+/- 0)
test tests::zip_simd              ... bench:       1,157 ns/iter (+/- 5)

AdamNiederer/master
test tests::base100_enc_simd      ... bench:         328 ns/iter (+/- 1)
test tests::determinant2_simd     ... bench:         486 ns/iter (+/- 3)
test tests::determinant3_simd     ... bench:         521 ns/iter (+/- 3)
test tests::zip_nop_simd          ... bench:         437 ns/iter (+/- 3)
test tests::zip_simd              ... bench:       1,194 ns/iter (+/- 9)
Osveron commented 6 years ago

Is it possible to zip together PackedStripe with normal iterator over slice? Because if not then I will send additional optimization.

AdamNiederer commented 6 years ago

Yeah, that should currently be possible. I'm going to emulate SIMDArray in PackedStripe soon-ish using gathers and strided loads, if that's what the optimization is based on.

Osveron commented 6 years ago

I have pushed the optimization to zip branch in my fork, so You can take a look. It basically makes only the first iterator keep track of the position. It gives bigger speed up than I anticipated:

 tests::determinant2_simd      667              486                     -181  -27.14%   x 1.37
 tests::determinant3_simd      825              471                     -354  -42.91%   x 1.75
 tests::zip_nop_simd           486              426                      -60  -12.35%   x 1.14
 tests::zip_simd               1,422            1,156                   -266  -18.71%   x 1.23

but since PackedStripe and other iterators keep track of the position differently, then if you zip the two together it would possibly lead to undefined behavior.

AdamNiederer commented 6 years ago

Okay, I'll see if I can get PackedStripe to virtualize its position somehow. Those are definitely compelling benchmarks, even if I have to special-case something.

AdamNiederer commented 6 years ago

I've incorporated your patch along with a small refactor. It looks like my refactor regresses performance slightly in comparison to the patch, but it does let you mix iterator types together seamlessly.