Emill / P256-cortex-ecdh

P256 ECDH for Cortex-M0 and Cortex-M4
BSD 2-Clause "Simplified" License
21 stars 5 forks source link

Scalar field? #1

Open nickray opened 4 years ago

nickray commented 4 years ago

Hi Emil,

first of all thanks a lot for releasing these nice UMAAL-based implementations of the base field of P256 under a permissive license! I'm building a Rust implementation of P256 ECDH/ECDSA around them, using your speed optimized P256_{add,sub,mul,sqr}mod routines as computational core.

I now find myself in the strange situation where the entire ephemeral (public) point calculation in ECDSA is twice as fast as inverting the ephemeral scalar k with Euler's theorem and my own Barrett reduction-based Rust implementation :man_facepalming:. Did you ever put thought into speeding up the scalar field, or know of an existing UMAAL-based implementation for it? I think any "N256_mulmod" assembly routine would give a major speed bump, even if not completely optimized - my lack of assembly skills are currently preventing me from adapting your P256_mulmod to n = 0xffffffff00000000ffffffffffffffffbce6faada7179e84f3b9cac2fc632551.

Emill commented 4 years ago

I have a complete Ed25519 + ECDSA implementation that I have not had time to release yet :/

I suggest however to use Stein's algorithm on k * r, to get 1 / (k * r) and then multiply by r. The r here is a cryptographically secure random number. Due to the random structure of n, this will be much faster than doing Fermat's theorem. As long as the two multiplications can be done "securely", it doesn't matter if a hacker finds out what k * r is.

May I ask for which target platform you make a Rust implementation? Is it Cortex-M4 or more meant for Cortex-A processors?

nickray commented 4 years ago

I have a complete Ed25519 + ECDSA implementation that I have not had time to release yet :/

By this you mean 99% assembly? :)

I suggest however to use Stein's algorithm on k * r, to get 1 / (k * r) and then multiply by r. The r here is a cryptographically secure random number. Due to the random structure of n, this will be much faster than doing Fermat's theorem. As long as the two multiplications can be done "securely", it doesn't matter if a hacker finds out what k * r is.

Thanks for the rapid answer! I've seen this approach motivated with "masking scalar", assumed the point is to get around non-constant time nature; having a name for the approach is very helpful.

May I ask for which target platform you make a Rust implementation? Is it Cortex-M4 or more meant for Cortex-A processors?

Target is Cortex-M4/M33 (specifically NXP LPC55S69), use case the rewrite of SoloKeys (FIDO2/PIV) in Rust. The github.com/RustCrypto organization covers symmetric crypto/hashes quite well, but for asymmetric crypto I've found a dearth of usable+understandable implementations. By this I mean that the high-level structure should be easy to understand mathematically and follow in code, while the actual field arithmetic should be super fast and use UMAAL.

Previously I put together github.com/ycrypto/salty which merges Haase's fe25519 arithmetic (for his CPace/AuCPace PAKEs) with basically TweetNaCl, I intend to do the same with github.com/ycrypto/nisty (which currently just wraps micro-ecc, and for which on a high level there's github.com/RustCrypto/elliptic-curves/tree/master/p256 as prior art).

I'm actually not sure if your or Haase's fe25519 (code: https://github.com/BjoernMHaase/fe25519/blob/master/STM32F407/crypto/asm/cortex_m4_mpy_fe25519.S, approach: https://eprint.iacr.org/2018/286) is faster hehe.

If I may ask, did you craft the assembly entirely by hand, or use some assembly-generating framework based on general principles? The hole in my understandability argument above is following the assembly itself, which I hope to improve with "helpful explanations", but some kind of optimality argument based on principled building blocks would be even nicer...

Emill commented 3 years ago

All assembler code has been carefully been written by hand, to get an optimal register allocation etc. Haase's work is very similar, but has slightly more overhead. The multiplication routine is just a standard "schoolbook multiplication" implementation, but ordered in a way to reduce the memory overhead (since the full data does not fit in registers). The reduction part at the end of the P-256 field multiplication is just a carefully unrolled Montgomery reduction, again to optimize register allocation.

I've added a new repo here: https://github.com/Emill/P256-Cortex-M4 which adds ECDSA support and hence implements arithmetics for the group order. I've chosen to use a variant of Algorithm 12.1 from https://gcd.cr.yp.to/safegcd-20190413.pdf for the group inversion. This allows to avoid forcing the user to generate another 256-bit random value, but still gives more or less the same performance for the inversion itself as the method I described earlier. The repo is not 99% asm, but all relevant low level code is, where we need the performance. The high level parts are written in C.

nickray commented 3 years ago

Very interesting, thank you for sharing! The choice of license is also rather appreciated :)