See also: #67
Is there a reason it wasn't completed/merged? I see activity stopped a year ago. We would actually like to help move this forward.
The author of #67 (@pdogr) became unresponsive. The PR is also out of date and in need of a rebase, and it contained undesirable breaking changes which have open comment threads.
I am willing to pick up that work if you'll have time to review it. I'll have a look at what he's done, and either continue the effort or start from scratch.
Sounds good
Diving into this again, we have a few insights:

- Karatsuba is indeed a good fit for `mul_wide`'s purposes. This is true taking into account that basic operations are on `Limb`, which is 64-bit, and not on bits. Since the `Uint`s used are small multiples of `Limb` (i.e. in the hundreds and not thousands or more), Karatsuba (i.e. setting k = 2 for Toom-Cook) is optimal. (This means that the optimization we discussed in the newly released article by my colleague is irrelevant here.)
- We had an error in our original calculation: we counted bits rather than `Limb` size, which got us to multiplying big-ints composed of thousands of limbs (e.g. `U4096` being 4096 bits), but when we account for `Limb` size we get to singles or dozens (e.g. `U4096` being 64 `Limb`s). This means that the optimization is much smaller than we originally thought (e.g. ~2x improvement for 4096-bit multiplication.)
- This also de-motivates us a little, since the improvement is not as big as we originally thought, so this goes back to our backlog, but perhaps we could find time to do this in the next few months.
- Implementing this should now arrive at results on-par with `rust-gmp` instead of exceeding it, which makes sense (again, due to our error in calculation.)
- We still don't have an explanation regarding `Residue`, or Montgomery in general: the fact that `rem` is (an order of magnitude) faster than `montgomery_reduction`, and, as far as I understand it, could replace it without changing correctness, remains unexplained.

As I said before, `mul_montgomery_form` does a `mul_wide` followed by a `montgomery_reduction`. If instead we would do a `mul_wide` followed by a `rem`, we could gain a ~2x factor very easily. Why not do it? And what are the benefits of using Montgomery in the first place, if working over the integers is faster? I must be missing something, but thought it would be better to ask here.
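For context, the usual answer to the correctness half of this question, assuming the standard Montgomery invariant $\bar{x} = xR \bmod n$ with radix $R$: the reduction step is precisely what keeps products in Montgomery form, so `rem` is not a drop-in replacement.

$$
\bar{a}\cdot\bar{b} = abR^2, \qquad \mathrm{montgomery\_reduction}(abR^2) \equiv abR^2 \cdot R^{-1} \equiv (ab)R \equiv \overline{ab} \pmod{n},
$$

$$
\mathrm{rem}(abR^2,\ n) \equiv abR^2 \pmod{n},
$$

and $abR^2 \not\equiv abR \pmod{n}$ in general, so swapping in `rem` yields a result that is no longer the Montgomery form of $ab$.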
> And what are the benefits of using Montgomery in the first place, if working over the integers is faster?

Montgomery form is typically useful when performing many operations which are faster in Montgomery form before converting back into canonical form, so the overhead of the conversions in/out of Montgomery form is outweighed by the savings.

There are often better algorithms if you're performing a single operation.
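To make that amortization argument concrete, here is a toy, variable-time Montgomery sketch on a single u64 limb; this is not crypto-bigint's API, and the modulus and names are made up. The two conversions are paid once, while each multiplication in between needs only a `mul_wide`-style product plus one reduction, with no division.

```rust
// Toy Montgomery arithmetic with R = 2^64 on a single u64 limb.
const N: u64 = (1u64 << 61) - 1; // odd toy modulus, kept < 2^63 so the
                                 // intermediate sum in `redc` fits in u128

/// n' = -N^{-1} mod 2^64, via Newton iteration (requires N odd).
fn n_prime(n: u64) -> u64 {
    let mut inv: u64 = 1;
    for _ in 0..6 {
        // Each step doubles the number of correct low bits of N^{-1}.
        inv = inv.wrapping_mul(2u64.wrapping_sub(n.wrapping_mul(inv)));
    }
    inv.wrapping_neg()
}

/// Montgomery reduction (REDC): for t < N * 2^64, returns t * 2^{-64} mod N.
fn redc(t: u128, np: u64) -> u64 {
    let m = (t as u64).wrapping_mul(np);
    let r = ((t + m as u128 * N as u128) >> 64) as u64;
    if r >= N { r - N } else { r }
}

fn main() {
    let np = n_prime(N);
    let r = ((1u128 << 64) % N as u128) as u64; // R mod N
    let r2 = ((r as u128 * r as u128) % N as u128) as u64; // R^2 mod N

    let to_mont = |x: u64| redc(x as u128 * r2 as u128, np); // x -> xR mod N
    let from_mont = |x: u64| redc(x as u128, np); // xR mod N -> x

    // Convert once, multiply many times, convert back once.
    let three = to_mont(3);
    let mut acc = to_mont(7);
    for _ in 0..100 {
        acc = redc(acc as u128 * three as u128, np); // one mul + one REDC
    }
    let result = from_mont(acc);

    // Cross-check against naive modular arithmetic.
    let mut expect = 7u64;
    for _ in 0..100 {
        expect = ((expect as u128 * 3) % N as u128) as u64;
    }
    assert_eq!(result, expect);
}
```

In `pow()`-style workloads the chain of in-form multiplications is long, which is exactly the setting where Montgomery wins despite the per-operation reduction.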
I see the logic, it's just that in practice (at least for multiplication) that's not the case, as there is a Montgomery reduction happening in every operation in the current code.
I think that's unavoidable without lazy reductions?
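In case it helps the discussion, here is a hypothetical sketch of what lazy reduction can look like in the simplest setting (illustrative names, not crypto-bigint code): defer the modular reduction across a batch of operations and reduce once at the end.

```rust
/// Eager: one modular reduction per product.
fn dot_mod_eager(a: &[u64], b: &[u64], n: u64) -> u64 {
    a.iter().zip(b).fold(0u64, |acc, (&x, &y)| {
        let p = (x as u128 * y as u128) % n as u128; // reduce each product
        ((acc as u128 + p) % n as u128) as u64
    })
}

/// Lazy: one reduction total. Assumes n < 2^32 and inputs already reduced,
/// so each product fits in 64 bits and the u128 accumulator cannot overflow
/// for any realistic input length.
fn dot_mod_lazy(a: &[u64], b: &[u64], n: u64) -> u64 {
    debug_assert!(n < (1u64 << 32));
    let acc: u128 = a.iter().zip(b).map(|(&x, &y)| x as u128 * y as u128).sum();
    (acc % n as u128) as u64
}

fn main() {
    let n = 4_294_967_291u64; // a prime just below 2^32
    let (a, b) = ([123u64, 456, 789], [111u64, 222, 333]);
    assert_eq!(dot_mod_eager(&a, &b, n), dot_mod_lazy(&a, &b, n));
}
```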
After rewriting and re-running my benchmarks I of course found out you are right, and the error was on my side. Sorry for that.

In that case, it seems that in order to optimize `mul_montgomery_form()` we are left solely with implementing Karatsuba, from which we can expect a ~25% improvement per our previous calculation (a 2x improvement of `mul_wide()`, which accounts for about half of the time `mul_montgomery_form()` takes.)
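As a sanity check on the 25% figure, this is just Amdahl's law applied to the numbers above: if `mul_wide()` accounts for a fraction $f$ of `mul_montgomery_form()`'s runtime and is sped up by a factor $s$, then

$$
T_{\text{new}} = \left((1-f) + \frac{f}{s}\right) T, \qquad f \approx 0.5,\ s = 2 \;\Rightarrow\; T_{\text{new}} \approx 0.75\,T,
$$

i.e. about a 25% reduction in runtime (a ~1.33x speedup).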
I should probably note we do have a Barrett reduction implementation here, but it's specialized to P-256's limb count: https://github.com/RustCrypto/elliptic-curves/blob/d1ea1a3/p256/src/arithmetic/scalar/scalar64.rs
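For readers following along, here is a minimal, variable-time sketch of the Barrett idea for a single-limb modulus; it is not the linked P-256 code, and the constants are illustrative. The precomputed `mu` turns each reduction into two wide multiplications plus at most two subtractions, instead of a division.

```rust
/// mu = floor(2^64 / n); u64::MAX / n gives the same value whenever n is
/// not a power of two (the moduli of interest are odd).
fn barrett_mu(n: u64) -> u64 {
    u64::MAX / n
}

/// Reduce x mod n for 1 < n < 2^32 (so x = a * b with a, b < n fits a u64).
fn barrett_reduce(x: u64, n: u64, mu: u64) -> u64 {
    let q = ((x as u128 * mu as u128) >> 64) as u64; // q <= floor(x / n)
    let mut r = x - q * n;                           // r = x - q*n < 3n
    if r >= n { r -= n; }
    if r >= n { r -= n; }
    r
}

fn main() {
    let n: u64 = 0xffff_fffb; // a prime just below 2^32
    let mu = barrett_mu(n);
    let (a, b) = (0x1234_5678u64, 0x9abc_def0u64);
    assert_eq!(barrett_reduce(a * b, n, mu), (a * b) % n);
}
```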
What's the current status of this? Is there already a working implementation of Karatsuba multiplication or does that still need to be written?
As of #649 we now support Karatsuba. Closing.
Original issue description:

We currently use the inefficient schoolbook multiplication for `mul_wide`. It would be much better to use an optimized algorithm, e.g. Karatsuba. There's already a TODO left in the source code to do so.

This is important for our purposes, as optimizing `mul_wide()` will optimize `mul_montgomery_form()` in `DynResidue` and `Residue`, which is a bottleneck for `pow()`, which is, in turn, a bottleneck for our system. In fact, `mul_montgomery_form()` is just a `mul_wide()` followed by a `montgomery_reduction()`.

I evaluated this experimentally, and benchmark results showed that the running times of `mul_wide()` and `montgomery_reduction()` are roughly equal, at about 4µs each.
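To make the Karatsuba step concrete before the numbers below, here is a minimal single-level sketch on 32-bit halves; real code would recurse over `Limb` arrays, and this is illustrative rather than the crypto-bigint implementation.

```rust
/// One Karatsuba step: multiply two u64s via their 32-bit halves using
/// three half-width multiplications instead of schoolbook's four.
fn karatsuba_mul(a: u64, b: u64) -> u128 {
    let (a1, a0) = (a >> 32, a & 0xffff_ffff);
    let (b1, b0) = (b >> 32, b & 0xffff_ffff);

    let z0 = a0 * b0; // low product:  fits in u64
    let z2 = a1 * b1; // high product: fits in u64
    // Karatsuba's identity: (a0 + a1)(b0 + b1) - z0 - z2 = a0*b1 + a1*b0.
    let mid = (a0 + a1) as u128 * (b0 + b1) as u128; // up to ~2^66
    let z1 = mid - z0 as u128 - z2 as u128;

    ((z2 as u128) << 64) + (z1 << 32) + z0 as u128
}

fn main() {
    let (a, b) = (0xdead_beef_1234_5678u64, 0xfeed_face_9abc_def0u64);
    assert_eq!(karatsuba_mul(a, b), a as u128 * b as u128);
}
```

The identity trades schoolbook's four half-width products for three plus some additions; applied recursively it gives the ~O(n^1.585) bound, though for operands of only a few dozen `Limb`s the constant factors keep the practical gain closer to the ~2x discussed above.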
From our calculation, Karatsuba improves by a factor of 12.62x. This would leave the main bottleneck of `mul_montgomery_form()` to be `montgomery_reduction()`, achieving a 2x improvement. However, it also seems that it would actually be cheaper to perform a `mul_wide()` followed by a `rem` - which is 8x cheaper than `montgomery_reduction()` - over the integers, which could actually leave `rem` as the bottleneck, achieving roughly an 8x improvement of `mul_montgomery_form()`! This should translate directly to a similar improvement in `pow_montgomery_form()`, which is very significant. (Actually, from my benchmarking, a 4096-bit mod-pow costs 35ms in crypto-bigint now and 13ms in `rust-gmp`, so either I'm missing something or we can out-achieve them by 3x.)

Another issue is that Karatsuba is not constant time, but it can be made constant time by using the lazy interpolation method. My colleague Yuval Spiizer (@ylkylk1) in fact did his thesis on this matter, and in his paper (submitted and under review, not yet published) he improved upon the state-of-the-art by about 10% asymptotically by modifying the Toom-Cook algorithm (which is a generalization of Karatsuba) - and his algorithm is constant time as well.
We can therefore help with this process.
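On the constant-time point: the branchy conditional subtractions in the toy sketches above are exactly what a constant-time implementation replaces with masked selects. A minimal sketch follows; real code would use audited primitives such as the `subtle` crate.

```rust
/// Constant-time version of `if r >= n { r - n } else { r }`: compute the
/// subtraction unconditionally and undo it via an arithmetic mask.
fn ct_sub_mod(r: u64, n: u64) -> u64 {
    let (diff, borrow) = r.overflowing_sub(n);
    let mask = (borrow as u64).wrapping_neg(); // all-ones iff r < n
    diff.wrapping_add(n & mask)                // r - n if r >= n, else r
}

fn main() {
    assert_eq!(ct_sub_mod(10, 7), 3);
    assert_eq!(ct_sub_mod(5, 7), 5);
}
```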