Osveron closed this 6 years ago
The assertions look good, but I'm not seeing any perf differences on the benches I have. Are there any scenarios where you've seen a speedup?
Osveron/zip
test tests::zip_nop_scalar ... bench: 972 ns/iter (+/- 3)
test tests::zip_nop_simd ... bench: 430 ns/iter (+/- 0)
test tests::zip_scalar ... bench: 1,200 ns/iter (+/- 54)
test tests::zip_simd ... bench: 771 ns/iter (+/- 7)
AdamNiederer/master
test tests::zip_nop_scalar ... bench: 971 ns/iter (+/- 32)
test tests::zip_nop_simd ... bench: 430 ns/iter (+/- 0)
test tests::zip_scalar ... bench: 1,164 ns/iter (+/- 2)
test tests::zip_simd ... bench: 746 ns/iter (+/- 1)
Strange, I get a nice improvement on my platform:
tests::determinant2_simd 779 663 -116 -14.89% x 1.17
tests::determinant3_simd 929 828 -101 -10.87% x 1.12
tests::zip_nop_simd 792 486 -306 -38.64% x 1.63
and in my application that only zips two vectors I get another 0.8 sec improvement.
I've re-benched on my desktop, and it looks like the results are much more pronounced on SSE (and on machines which stay at the same frequency throughout the whole suite).
Osveron/zip
test tests::base100_enc_simd ... bench: 319 ns/iter (+/- 3)
test tests::determinant2_simd ... bench: 275 ns/iter (+/- 4)
test tests::determinant3_simd ... bench: 286 ns/iter (+/- 1)
test tests::zip_nop_simd ... bench: 257 ns/iter (+/- 0)
test tests::zip_simd ... bench: 1,157 ns/iter (+/- 5)
AdamNiederer/master
test tests::base100_enc_simd ... bench: 328 ns/iter (+/- 1)
test tests::determinant2_simd ... bench: 486 ns/iter (+/- 3)
test tests::determinant3_simd ... bench: 521 ns/iter (+/- 3)
test tests::zip_nop_simd ... bench: 437 ns/iter (+/- 3)
test tests::zip_simd ... bench: 1,194 ns/iter (+/- 9)
Is it possible to zip a PackedStripe together with a normal iterator over a slice? If not, I'll send an additional optimization.
Yeah, that should currently be possible. I'm going to emulate SIMDArray in PackedStripe soon-ish using gathers and strided loads, if that's what the optimization is based on.
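As a rough illustration of the strided-access idea (hypothetical names, not the crate's actual PackedStripe API), a stripe can be modeled as an iterator that walks a slice with a fixed stride rather than element by element:

```rust
// Hypothetical sketch of a strided "stripe" view over a slice.
// Not the crate's real PackedStripe; names are illustrative only.
struct Stripe<'a, T> {
    data: &'a [T],
    pos: usize,    // current index into the underlying slice
    stride: usize, // distance between consecutive yielded elements
}

impl<'a, T: Copy> Iterator for Stripe<'a, T> {
    type Item = T;
    fn next(&mut self) -> Option<T> {
        // `get` returns None once `pos` runs past the end.
        let item = self.data.get(self.pos).copied();
        self.pos += self.stride;
        item
    }
}

fn main() {
    let v = [0, 1, 2, 3, 4, 5, 6, 7];
    // Every second element, starting at index 0.
    let evens: Vec<i32> = Stripe { data: &v, pos: 0, stride: 2 }.collect();
    println!("{:?}", evens); // [0, 2, 4, 6]
}
```

A hardware gather or strided load would fetch several such elements into one SIMD register at once, rather than scalar-by-scalar as above.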
I have pushed the optimization to the zip branch in my fork, so you can take a look. It basically makes only the first iterator keep track of the position. It gives a bigger speedup than I anticipated:
tests::determinant2_simd 667 486 -181 -27.14% x 1.37
tests::determinant3_simd 825 471 -354 -42.91% x 1.75
tests::zip_nop_simd 486 426 -60 -12.35% x 1.14
tests::zip_simd 1,422 1,156 -266 -18.71% x 1.23
but since PackedStripe and the other iterators keep track of their positions differently, zipping the two together could lead to undefined behavior.
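A minimal sketch of the single-position idea (hypothetical types, not the fork's actual code): the zip itself owns one shared position and indexes every underlying slice with it, instead of advancing a separate counter per iterator. This only stays sound if all inputs advance identically, which is exactly why mixing in a differently-stepping PackedStripe would break it:

```rust
// Hypothetical sketch: a zip over two slices that maintains a single
// shared position instead of one counter per input.
struct Zip2<'a, T> {
    a: &'a [T],
    b: &'a [T],
    pos: usize, // the only position counter for both inputs
}

impl<'a, T: Copy> Zip2<'a, T> {
    fn new(a: &'a [T], b: &'a [T]) -> Self {
        // Equal lengths mean a bounds check against `a` alone
        // is enough to cover `b` as well.
        assert_eq!(a.len(), b.len(), "zipped iterators must have the same size");
        Zip2 { a, b, pos: 0 }
    }
}

impl<'a, T: Copy> Iterator for Zip2<'a, T> {
    type Item = (T, T);
    fn next(&mut self) -> Option<(T, T)> {
        // Only the first input is checked; `new` guarantees the
        // second has the same length.
        if self.pos < self.a.len() {
            let i = self.pos;
            self.pos += 1;
            Some((self.a[i], self.b[i]))
        } else {
            None
        }
    }
}

fn main() {
    let x = [1, 2, 3];
    let y = [4, 5, 6];
    let sums: Vec<i32> = Zip2::new(&x, &y).map(|(a, b)| a + b).collect();
    println!("{:?}", sums); // [5, 7, 9]
}
```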
Okay, I'll see if I can get PackedStripe to virtualize its position somehow. Those are definitely compelling benchmarks, even if I have to special-case something.
I've incorporated your patch along with a small refactor. It looks like my refactor regresses performance slightly in comparison to the patch, but it does let you mix iterator types together seamlessly.
I have added a check on the lengths of the iterators at zip creation, so it panics with a message saying that zipped iterators need to have the same size; previously it panicked for different reasons in debug and release mode. I have optimized zip by making it only check the first iterator: if it returns `Some`, then all the zipped iterators should return `Some`, and if it returns `None`, all the others should return `None` too. I have also added debug assertions in the unsafe load and store functions, and made use of them in the IntoScalar functions.
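The debug-assertion pattern for the unsafe load/store helpers can be sketched like this (hypothetical function, not the crate's real signature): the bounds check runs in debug builds and test suites, but compiles away in release benchmarks, so the hot path stays unchanged:

```rust
// Hypothetical stand-in for an unsafe load helper guarded by a
// debug assertion; the check disappears in release builds.
unsafe fn load_unchecked(data: &[i32], idx: usize) -> i32 {
    // Catches out-of-bounds indices in debug mode only.
    debug_assert!(idx < data.len(), "unsafe load out of bounds");
    *data.get_unchecked(idx)
}

fn main() {
    let v = [10, 20, 30];
    // Safe here: the caller upholds `idx < v.len()`.
    let x = unsafe { load_unchecked(&v, 2) };
    println!("{}", x); // 30
}
```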