UnlawfulMonad closed this issue 6 years ago
If I understand correctly, the benchmarks suffer if you pass these structures by value, but maybe one can tweak that with optimization modes, etc. In particular, passing by value permits the recipient to mutate the value, so stack-space optimizations might inhibit doing a pass-by-reference behind the scenes. I'd also think that making the operators `#[inline(always)]` wrappers around call-by-reference functions should fix this, but maybe not.
When we implemented the arithmetic ops for `T` and not for `&T`, we had a lot of nested redundant copies: first, copying the LHS and RHS points into the function (each point is 160 bytes), plus copying the LHS and RHS of every field element operation inside (each field element is 40 bytes). Implementing the ops on `&T` did make a difference in cutting out needless copies. However, it's obviously worse ergonomically, as I'm sure you've noticed.
I didn't think that it would be possible to work around this, but it seems like @burdges' suggestion might work, at least on that toy example.
So, to answer your original question, @UnlawfulMonad, if there were implementations of the arithmetic ops that didn't require borrows but were just as fast, we would be happy to take that PR.
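For concreteness, the pattern under discussion might be sketched like this (a toy `Point` type standing in for the real 160-byte curve points; this is not the actual dalek code): the borrow impl does the work, and the by-value operator is a thin `#[inline(always)]` wrapper around it, so the optimizer can elide the moves.

```rust
use std::ops::Add;

// Toy stand-in for a large point type; a real curve point is ~160 bytes.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Point([u64; 20]);

// The borrow implementation does the actual work, so no copies are
// forced on the caller.
impl<'a, 'b> Add<&'b Point> for &'a Point {
    type Output = Point;
    fn add(self, rhs: &'b Point) -> Point {
        let mut out = [0u64; 20];
        for i in 0..20 {
            out[i] = self.0[i].wrapping_add(rhs.0[i]);
        }
        Point(out)
    }
}

// The by-value operator delegates to the borrow version; with
// #[inline(always)] rustc should collapse it into the same code.
impl Add<Point> for Point {
    type Output = Point;
    #[inline(always)]
    fn add(self, rhs: Point) -> Point {
        &self + &rhs
    }
}

fn main() {
    let a = Point([1; 20]);
    let b = Point([2; 20]);
    // Both syntaxes give the same result.
    assert_eq!(&a + &b, a + b);
}
```

Whether the wrapper actually compiles away is exactly what the benchmarks below are probing.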
Hi @UnlawfulMonad!
As @hdevalence said, we'd be stoked to take a patch for this, provided it doesn't slow anything down by introducing needless copies everywhere. If you want to implement this, my suggestion for the steps would be:

1. Mark the by-value operator implementations `#[inline(always)]`. (Bonus points for adding `#[inline(always)]` to all the operator implementations!)
2. Take some code in ed25519-dalek or dalek-rangeproofs that is doing what I call an "Eye of Sauron" (i.e. `&(&(&(&( ))))`, like `&(&(&s1 * &P1) + &(s2 * &P2)) + &(s3 * &P3)` or something similarly evil looking), change it to your non-borrow syntax, and compare, at various rustc optimisation levels, the assembly generated from the original code to the assembly for the new syntax, to ensure there are no extra copies. https://rust.godbolt.org is really handy for this. (If you're not comfortable with assembly, feel free to paste links to the two versions of the code in godbolt, and we'll review it.)

@UnlawfulMonad Would you like me to assign you to this issue?
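To make that comparison concrete, here is a self-contained toy (hypothetical `Scalar`/`Point` types, not the dalek ones) that could be pasted into godbolt: the goal is for both expressions below to generate the same assembly once the by-value wrappers are inlined.

```rust
use std::ops::{Add, Mul};

#[derive(Clone, Copy, Debug, PartialEq)]
struct Scalar(i64);
#[derive(Clone, Copy, Debug, PartialEq)]
struct Point(i64);

// Canonical borrow impls.
impl<'a, 'b> Mul<&'b Point> for &'a Scalar {
    type Output = Point;
    fn mul(self, p: &'b Point) -> Point { Point(self.0 * p.0) }
}
impl<'a, 'b> Add<&'b Point> for &'a Point {
    type Output = Point;
    fn add(self, q: &'b Point) -> Point { Point(self.0 + q.0) }
}

// By-value wrappers, as the proposed patch would generate.
impl Mul<Point> for Scalar {
    type Output = Point;
    #[inline(always)]
    fn mul(self, p: Point) -> Point { &self * &p }
}
impl Add<Point> for Point {
    type Output = Point;
    #[inline(always)]
    fn add(self, q: Point) -> Point { &self + &q }
}

fn main() {
    let (s1, s2, s3) = (Scalar(2), Scalar(3), Scalar(5));
    let (p1, p2, p3) = (Point(7), Point(11), Point(13));

    // The "Eye of Sauron" borrow syntax:
    let r1 = &(&(&s1 * &p1) + &(&s2 * &p2)) + &(&s3 * &p3);
    // The by-value syntax under test:
    let r2 = s1 * p1 + s2 * p2 + s3 * p3;

    assert_eq!(r1, r2);
}
```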
Hi @isislovecruft and @hdevalence!
Thanks for the suggestions. I'll get on this next week and update this issue with my progress. I had assumed that inlining would already occur if the call-by-value variants simply wrapped the call-by-reference versions.
Sidenote: Eye of Sauron is a great name for those things. :)
@UnlawfulMonad Awesome, thanks! It appears that if we add `#[inline(always)]` to the borrow versions as well, then the by-value versions inline directly into the borrow-version code (i.e. rustc is basically adding the Eye of Sauron for you).
(Also I used to work at the FreeGeek in Portland! It is an excellent project!)
@isislovecruft Is code bloat not a concern at that point? `#[inline(always)]` forces rustc/LLVM to inline. Another option would be to leave the Eyes of Sauron in the library code but give users the option to use either impl (overgeneralization warning), since in practice the overhead in user code should be negligible. I'm going to test simply adding `#[inline]` attributes and compare the overhead vs `#[inline(always)]`-ing all the operators.
Sorry for the lack of progress. It's been a hectic couple of weeks.
(And that's awesome! Agreed, it's an awesome project!)
Alright. Here are my results:
All readings with rustc version `1.19.0-nightly (148e91714 2017-06-08)`.
Baseline from commit 97d69740c:
test curve::bench::add_extended_and_affine_niels_output_completed ... bench: 212 ns/iter (+/- 87)
test curve::bench::add_extended_and_affine_niels_output_extended ... bench: 480 ns/iter (+/- 136)
test curve::bench::add_extended_and_projective_niels_output_completed ... bench: 275 ns/iter (+/- 6)
test curve::bench::add_extended_and_projective_niels_output_extended ... bench: 525 ns/iter (+/- 23)
test curve::bench::basepoint_mult ... bench: 35,181 ns/iter (+/- 12,430)
test curve::bench::bench_select_precomputed_point ... bench: 1,032 ns/iter (+/- 50)
test curve::bench::edwards_compress ... bench: 12,023 ns/iter (+/- 1,532)
test curve::bench::edwards_decompress ... bench: 12,837 ns/iter (+/- 2,573)
test curve::bench::extended_double_output_extended ... bench: 486 ns/iter (+/- 100)
test curve::bench::mult_by_cofactor ... bench: 1,316 ns/iter (+/- 251)
test curve::bench::projective_double_output_completed ... bench: 226 ns/iter (+/- 118)
test curve::bench::scalar_mult ... bench: 152,515 ns/iter (+/- 2,688)
test curve::bench::vartime::bench_double_scalar_mult_basepoint ... bench: 159,619 ns/iter (+/- 2,291)
test curve::bench::vartime::ten_fold_scalar_mult ... bench: 399,000 ns/iter (+/- 7,540)
test field::bench::fieldelement_a_inv ... bench: 12,200 ns/iter (+/- 1,889)
test field::bench::fieldelement_a_mul_a ... bench: 67 ns/iter (+/- 1)
test field::bench::fieldelement_a_sq ... bench: 48 ns/iter (+/- 1)
test scalar::bench::invert ... bench: 67,871 ns/iter (+/- 803)
test scalar::bench::scalar_multiply_add ... bench: 270 ns/iter (+/- 68)
test scalar::bench::scalar_random ... bench: 1,115 ns/iter (+/- 101)
test scalar::bench::scalar_unpacked_multiply_add ... bench: 199 ns/iter (+/- 12)
With `#[inline]` on all operators (commit 6cdae940 on the `callbyvalue` branch of my fork):
test curve::bench::add_extended_and_affine_niels_output_completed ... bench: 220 ns/iter (+/- 9)
test curve::bench::add_extended_and_affine_niels_output_extended ... bench: 493 ns/iter (+/- 28)
test curve::bench::add_extended_and_projective_niels_output_completed ... bench: 287 ns/iter (+/- 23)
test curve::bench::add_extended_and_projective_niels_output_completed_by_value ... bench: 299 ns/iter (+/- 9)
test curve::bench::add_extended_and_projective_niels_output_extended ... bench: 558 ns/iter (+/- 72)
test curve::bench::add_extended_and_projective_niels_output_extended_by_value ... bench: 593 ns/iter (+/- 33)
test curve::bench::basepoint_mult ... bench: 36,890 ns/iter (+/- 6,329)
test curve::bench::bench_select_precomputed_point ... bench: 891 ns/iter (+/- 60)
test curve::bench::edwards_compress ... bench: 12,197 ns/iter (+/- 2,828)
test curve::bench::edwards_decompress ... bench: 12,710 ns/iter (+/- 1,029)
test curve::bench::extended_double_output_extended ... bench: 493 ns/iter (+/- 14)
test curve::bench::mult_by_cofactor ... bench: 1,318 ns/iter (+/- 96)
test curve::bench::projective_double_output_completed ... bench: 220 ns/iter (+/- 10)
test curve::bench::scalar_mult ... bench: 156,671 ns/iter (+/- 10,735)
test curve::bench::scalar_mult_by_value ... bench: 163,941 ns/iter (+/- 13,653)
test curve::bench::vartime::bench_double_scalar_mult_basepoint ... bench: 160,486 ns/iter (+/- 10,722)
test curve::bench::vartime::ten_fold_scalar_mult ... bench: 402,349 ns/iter (+/- 16,962)
test field::bench::fieldelement_a_inv ... bench: 11,891 ns/iter (+/- 2,288)
test field::bench::fieldelement_a_mul_a ... bench: 65 ns/iter (+/- 2)
test field::bench::fieldelement_a_mul_a_by_value ... bench: 67 ns/iter (+/- 6)
test field::bench::fieldelement_a_sq ... bench: 47 ns/iter (+/- 3)
test scalar::bench::invert ... bench: 65,925 ns/iter (+/- 3,157)
test scalar::bench::scalar_multiply_add ... bench: 257 ns/iter (+/- 12)
test scalar::bench::scalar_random ... bench: 1,135 ns/iter (+/- 61)
test scalar::bench::scalar_unpacked_multiply_add ... bench: 195 ns/iter (+/- 23)
With `#[inline(always)]` on all operators (commit 156c92d on the `callbyvalue` branch of my fork):
test curve::bench::add_extended_and_affine_niels_output_completed ... bench: 208 ns/iter (+/- 6)
test curve::bench::add_extended_and_affine_niels_output_extended ... bench: 488 ns/iter (+/- 19)
test curve::bench::add_extended_and_projective_niels_output_completed ... bench: 278 ns/iter (+/- 9)
test curve::bench::add_extended_and_projective_niels_output_completed_by_value ... bench: 295 ns/iter (+/- 11)
test curve::bench::add_extended_and_projective_niels_output_extended ... bench: 550 ns/iter (+/- 20)
test curve::bench::add_extended_and_projective_niels_output_extended_by_value ... bench: 565 ns/iter (+/- 16)
test curve::bench::basepoint_mult ... bench: 36,467 ns/iter (+/- 1,810)
test curve::bench::bench_select_precomputed_point ... bench: 882 ns/iter (+/- 36)
test curve::bench::edwards_compress ... bench: 12,003 ns/iter (+/- 1,698)
test curve::bench::edwards_decompress ... bench: 12,630 ns/iter (+/- 1,812)
test curve::bench::extended_double_output_extended ... bench: 496 ns/iter (+/- 32)
test curve::bench::mult_by_cofactor ... bench: 1,331 ns/iter (+/- 32)
test curve::bench::projective_double_output_completed ... bench: 217 ns/iter (+/- 8)
test curve::bench::scalar_mult ... bench: 155,447 ns/iter (+/- 5,160)
test curve::bench::scalar_mult_by_value ... bench: 156,946 ns/iter (+/- 6,148)
test curve::bench::vartime::bench_double_scalar_mult_basepoint ... bench: 157,789 ns/iter (+/- 9,171)
test curve::bench::vartime::ten_fold_scalar_mult ... bench: 405,442 ns/iter (+/- 19,138)
test field::bench::fieldelement_a_inv ... bench: 11,806 ns/iter (+/- 1,706)
test field::bench::fieldelement_a_mul_a ... bench: 66 ns/iter (+/- 3)
test field::bench::fieldelement_a_mul_a_by_value ... bench: 67 ns/iter (+/- 3)
test field::bench::fieldelement_a_sq ... bench: 47 ns/iter (+/- 4)
test scalar::bench::invert ... bench: 72,665 ns/iter (+/- 1,737)
test scalar::bench::scalar_multiply_add ... bench: 267 ns/iter (+/- 53)
test scalar::bench::scalar_random ... bench: 1,130 ns/iter (+/- 195)
test scalar::bench::scalar_unpacked_multiply_add ... bench: 216 ns/iter (+/- 40)
Isolating a couple of the tests we get:
Baseline:
test curve::bench::scalar_mult ... bench: 152,515 ns/iter (+/- 2,688)
test curve::bench::vartime::bench_double_scalar_mult_basepoint ... bench: 159,619 ns/iter (+/- 2,291)
test curve::bench::vartime::ten_fold_scalar_mult ... bench: 399,000 ns/iter (+/- 7,540)
test field::bench::fieldelement_a_inv ... bench: 12,200 ns/iter (+/- 1,889)
test field::bench::fieldelement_a_mul_a ... bench: 67 ns/iter (+/- 1)
test field::bench::fieldelement_a_sq ... bench: 48 ns/iter (+/- 1)
With `#[inline]`:
test curve::bench::scalar_mult ... bench: 156,671 ns/iter (+/- 10,735)
test curve::bench::scalar_mult_by_value ... bench: 163,941 ns/iter (+/- 13,653)
test curve::bench::vartime::bench_double_scalar_mult_basepoint ... bench: 160,486 ns/iter (+/- 10,722)
test curve::bench::vartime::ten_fold_scalar_mult ... bench: 402,349 ns/iter (+/- 16,962)
test field::bench::fieldelement_a_inv ... bench: 11,891 ns/iter (+/- 2,288)
test field::bench::fieldelement_a_mul_a ... bench: 65 ns/iter (+/- 2)
test field::bench::fieldelement_a_mul_a_by_value ... bench: 67 ns/iter (+/- 6)
test field::bench::fieldelement_a_sq ... bench: 47 ns/iter (+/- 3)
With `#[inline(always)]`:
test curve::bench::scalar_mult ... bench: 155,447 ns/iter (+/- 5,160)
test curve::bench::scalar_mult_by_value ... bench: 156,946 ns/iter (+/- 6,148)
test curve::bench::vartime::bench_double_scalar_mult_basepoint ... bench: 157,789 ns/iter (+/- 9,171)
test curve::bench::vartime::ten_fold_scalar_mult ... bench: 405,442 ns/iter (+/- 19,138)
test field::bench::fieldelement_a_inv ... bench: 11,806 ns/iter (+/- 1,706)
test field::bench::fieldelement_a_mul_a ... bench: 66 ns/iter (+/- 3)
test field::bench::fieldelement_a_mul_a_by_value ... bench: 67 ns/iter (+/- 3)
test field::bench::fieldelement_a_sq ... bench: 47 ns/iter (+/- 4)
Note: the `by_value` variants of the tests are modified only to use the call-by-value operators; they don't change anything else. I didn't add `by_value` variants for any tests that broke just by removing the ampersands.
My assembly-fu isn't at the point where I can tell exactly what changed between runs, but a cursory look suggests that it's very similar, if not identical. Are these acceptable results?
I started looking at this using cargo benchcmp on a machine with Turbo Boost disabled, so that the timings are actually consistent. I'm not entirely clear on what's going on, but hopefully I can get a clearer picture on the weekend or next week (I just wanted to note that I didn't forget about it).
I compared the existing implementation with two variants of this trick: first, just having `#[inline(always)]` on the move/copy operators without `#[inline]` on all the other methods; second, having `#[inline(always)]` on the move/copy operators with the additional `#[inline]`s.
Roughly speaking, the results were a slight across-the-board decline in performance for the first option, which ends up inserting a bunch of extra copies. For the second option, there's a slight increase in performance for some functions and some microbenchmarks, but a decrease in performance for the inversion used in decompression. (I believe this is because it inlines all ~260 field operations.)
What I suspect is happening here is that the extra inline annotations interact with the inlining cost model in unexpected and not always positive ways. So while, on a single example, using `#[inline(always)]` on the move/copy operator works out, when it's done across the board there are unexpected interactions. It almost works, but I think a cleaner solution would be to make arithmetic operators in Rust work the way the `.` operator does.
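For context on that comparison: method-call syntax auto-borrows and auto-dereferences the receiver, while operator syntax only uses an impl for the exact operand types, so `&T` operands need their own impls (hence all the explicit ampersands). A small illustration (toy `Fe` type, not the dalek field element):

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
struct Fe(i64);

impl Fe {
    // A method taking &self is callable on both Fe and &Fe,
    // because `.` inserts the borrow/deref automatically.
    fn double(&self) -> Fe { Fe(self.0 * 2) }
}

impl std::ops::Add for Fe {
    type Output = Fe;
    fn add(self, rhs: Fe) -> Fe { Fe(self.0 + rhs.0) }
}

fn main() {
    let x = Fe(21);
    let r = &x;
    assert_eq!(x.double(), r.double()); // `.` auto-borrows/derefs the receiver
    let _ = x + x; // OK: Add is implemented for Fe by value
    // let _ = r + r; // ERROR: no Add impl for &Fe; `+` does no autoref
}
```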
However, for point operations, which are what people who use the `-dalek` crates want to work with, the cost of a copy is tiny compared to the cost of the operation. For those operations, it doesn't matter whether someone uses the move operator or the borrow operator. So I think it makes sense to keep the operator-generating macros for those, at least until we can fix Rust.
Hmm, I'm interested in the idea of arithmetic operators behaving like the `.` operator. If there hasn't already been discussion on this topic, can I steal the idea and see if I can propose it as a (Rust) RFC?
There is more serious discussion on that topic now in https://github.com/rust-lang/rfcs/pull/2147
Just to circle back on this: the refactoring in #86 more cleanly separates the internal API from the external API. I think the fix for this issue is to just implement operators for both `T` and `&T` for all public types `T`, using something like @UnlawfulMonad's macro idea.
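The macro idea might look something like this sketch (a hypothetical `define_add_variants!` over a toy type; the actual dalek macros may differ): write the canonical impl on references, then generate the three remaining operand combinations as thin delegating wrappers.

```rust
use std::ops::Add;

#[derive(Clone, Copy, Debug, PartialEq)]
struct ExtendedToy(i64); // toy stand-in for a public point type

// Canonical impl on references does the actual work.
impl<'a, 'b> Add<&'b ExtendedToy> for &'a ExtendedToy {
    type Output = ExtendedToy;
    fn add(self, rhs: &'b ExtendedToy) -> ExtendedToy {
        ExtendedToy(self.0 + rhs.0)
    }
}

// Hypothetical macro generating the T + T, T + &T, and &T + T
// variants as wrappers that delegate to the &T + &T impl.
macro_rules! define_add_variants {
    ($lhs:ty, $rhs:ty, $out:ty) => {
        impl<'b> Add<&'b $rhs> for $lhs {
            type Output = $out;
            fn add(self, rhs: &'b $rhs) -> $out { &self + rhs }
        }
        impl<'a> Add<$rhs> for &'a $lhs {
            type Output = $out;
            fn add(self, rhs: $rhs) -> $out { self + &rhs }
        }
        impl Add<$rhs> for $lhs {
            type Output = $out;
            fn add(self, rhs: $rhs) -> $out { &self + &rhs }
        }
    };
}

define_add_variants!(ExtendedToy, ExtendedToy, ExtendedToy);

fn main() {
    let (p, q) = (ExtendedToy(1), ExtendedToy(2));
    // All four operand combinations now work and agree.
    assert_eq!(p + q, &p + &q);
    assert_eq!(p + &q, &p + q);
}
```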
Closing this issue since it's fixed in #101.
I've noticed that operators on `ExtendedPoint` and the like are only implemented on `&'a ExtendedPoint`. Is there any reason a PR that also implemented them for the concrete types wouldn't be accepted? The types are already `Copy`, and as far as I know the compiler is better at reasoning about the code when there isn't explicit memory (i.e. a reference) involved.