Document the implementation of 25519 arithmetic

Casual reviews of Monocypher's elliptic curve code does not tell the reader where the code comes from, nor provides any hint about how potential limb overflow were addressed. There is also no warnings about implementation defined signed right shifts. This causes some reviewers to distrust Monocypher.

We should add a number of comments:

A justification of the choice of right shift, despite the standard saying it's "implementation defined".
A note that the code comes from SUPERCOP Ref10.
Comments demonstrating the lack of overflow. Preferably re-derived from scratch (so we can check we reach the same conclusions as Ref10). Possibly a proof in the main web site, as was done for Poly1305.

The manual could possibly mention the right shifts, but I don't think there's a single platform in current use that fails to propagate the sign bit. And if there is one, it will trigger loud failures, not vulnerabilities. I don't think this is worth writing it down there.

Some comments here as this is something I was concerned as well in my own library in the past: https://github.com/mratsim/constantine/blob/26133562/constantine/elliptic/ec_endomorphism_accel.nim#L81-L106 (note: currently the arithmetic right shift is not needed anymore after refactoring)

The C standard is indeed implementation defined but GCC does guarantee arithmetic right shift: https://gcc.gnu.org/onlinedocs/gcc/Integers-implementation.html#Integers-implementation Consequently, all compilers that target GCC compatibility (Clang, ICC) also implement it that way. I didn't search for MSVC documentation though but at least you can guarantee proper compilation on GCC.

Alternatively, I don't know where in the code you use arithmetic right shift but C bitfields can also be used for sign extension as per http://graphics.stanford.edu/~seander/bithacks.html#FixedSignExtend and guaranteed by the C standard. If you store small sign-integer say int25 or int26 this can be used to extract them properly sign-extended from packed representation without arithmetic right shift.

int x; // convert this from using 5 bits to a full int
int r; // resulting sign extended number goes here
struct {signed int x:5;} s;
r = s.x = x;

Good to know, thanks.

So, if I get this correctly, I could guarantee the sign extension after the right shift:

int64_t x = possibly_negative_number;
int64_t y = x >> 26;
struct { signed int64_t x:38 } s; // 38 = 64 - 26
int64_t r = s.x = x;

So if my compiler ends up using logical shift instead of arithmetic shift, I can extend the sign right back. The right shift is still implementation defined, though.

Another way would be to force the shift to be logical, and then extend the sign:

int64_t x  = possibly_negative_number;
uint64_t p = x; // that one is definitely positive
int64_t  y = x >> 26; // logical shift, totally defined
struct { signed int64_t x:38 } s; // 38 = 64 - 26
int64_t  r = s.x = x;

That's heavy, and I'd have to check whether optimizers simplify this down to an arithmetic shift instruction. In any case, it's good to know it's available if we want to get really paranoid.

Might be total overkill, considering GCC's decision to guarantee sign extension. But there is value in stopping meaningless nitpick. I'll try, this, thank you.

OK, I tried messing the the main carry propagation code, where going fully defined potentially had the most impact. I replaced this:

#define FE_CARRY                                                        \
    i64 c;                                                              \
    c = (t0 + ((i64)1<<25)) >> 26; t1 += c;      t0 -= c * ((i64)1 << 26); \
    c = (t4 + ((i64)1<<25)) >> 26; t5 += c;      t4 -= c * ((i64)1 << 26); \
    c = (t1 + ((i64)1<<24)) >> 25; t2 += c;      t1 -= c * ((i64)1 << 25); \
    c = (t5 + ((i64)1<<24)) >> 25; t6 += c;      t5 -= c * ((i64)1 << 25); \
    c = (t2 + ((i64)1<<25)) >> 26; t3 += c;      t2 -= c * ((i64)1 << 26); \
    c = (t6 + ((i64)1<<25)) >> 26; t7 += c;      t6 -= c * ((i64)1 << 26); \
    c = (t3 + ((i64)1<<24)) >> 25; t4 += c;      t3 -= c * ((i64)1 << 25); \
    c = (t7 + ((i64)1<<24)) >> 25; t8 += c;      t7 -= c * ((i64)1 << 25); \
    c = (t4 + ((i64)1<<25)) >> 26; t5 += c;      t4 -= c * ((i64)1 << 26); \
    c = (t8 + ((i64)1<<25)) >> 26; t9 += c;      t8 -= c * ((i64)1 << 26); \
    c = (t9 + ((i64)1<<24)) >> 25; t0 += c * 19; t9 -= c * ((i64)1 << 25); \
    c = (t0 + ((i64)1<<25)) >> 26; t1 += c;      t0 -= c * ((i64)1 << 26); \
    h[0]=(i32)t0;  h[1]=(i32)t1;  h[2]=(i32)t2;  h[3]=(i32)t3;  h[4]=(i32)t4; \
    h[5]=(i32)t5;  h[6]=(i32)t6;  h[7]=(i32)t7;  h[8]=(i32)t8;  h[9]=(i32)t9

by that:

typedef struct { i64 x:39; } s25;
typedef struct { i64 x:38; } s26;
i64 shift25(i64 x) { s25 s; s.x = ((u64)x)>>25; return s.x; }
i64 shift26(i64 x) { s26 s; s.x = ((u64)x)>>26; return s.x; }

#define FE_CARRY                                                        \
    i64 c;                                                              \
    c = shift26(t0 + ((i64)1<<25)); t1 += c;      t0 -= c * ((i64)1 << 26); \
    c = shift26(t4 + ((i64)1<<25)); t5 += c;      t4 -= c * ((i64)1 << 26); \
    c = shift25(t1 + ((i64)1<<24)); t2 += c;      t1 -= c * ((i64)1 << 25); \
    c = shift25(t5 + ((i64)1<<24)); t6 += c;      t5 -= c * ((i64)1 << 25); \
    c = shift26(t2 + ((i64)1<<25)); t3 += c;      t2 -= c * ((i64)1 << 26); \
    c = shift26(t6 + ((i64)1<<25)); t7 += c;      t6 -= c * ((i64)1 << 26); \
    c = shift25(t3 + ((i64)1<<24)); t4 += c;      t3 -= c * ((i64)1 << 25); \
    c = shift25(t7 + ((i64)1<<24)); t8 += c;      t7 -= c * ((i64)1 << 25); \
    c = shift26(t4 + ((i64)1<<25)); t5 += c;      t4 -= c * ((i64)1 << 26); \
    c = shift26(t8 + ((i64)1<<25)); t9 += c;      t8 -= c * ((i64)1 << 26); \
    c = shift25(t9 + ((i64)1<<24)); t0 += c * 19; t9 -= c * ((i64)1 << 25); \
    c = shift26(t0 + ((i64)1<<25)); t1 += c;      t0 -= c * ((i64)1 << 26); \
    h[0]=(i32)t0;  h[1]=(i32)t1;  h[2]=(i32)t2;  h[3]=(i32)t3;  h[4]=(i32)t4; \
    h[5]=(i32)t5;  h[6]=(i32)t6;  h[7]=(i32)t7;  h[8]=(i32)t8;  h[9]=(i32)t9

(The test suite passes, so I'm very confident this works as intended.)

The cost in complexity is surprisingly acceptable. The cost in performance however, is not. Before the change, I had these results on my Core i5 Skylake Laptop:

x25519              :  7993 exchanges  per second
EdDSA(sign)         : 14427 signatures per second
EdDSA(check)        :  6058 checks     per second

After the change, I get those:

x25519              :  5659 exchanges  per second
EdDSA(sign)         : 11384 signatures per second
EdDSA(check)        :  4268 checks     per second

Basically a 30% performance drop just because the optimizers didn't manage to translate shift25() and shift26() down to simple arithmetic shifts. I'm sure it could (it manages to optimise my byte-by-byte loads & stores down to a single instruction), but I guess this is too niche to be addressed just yet.

We could make it a compilation flag, but this would obscure the code for no practical benefit. I'll Add a comment explaining how to emulate arithmetic shifts if it's somehow unavailable.

LoupVaillant / Monocypher

Document the implementation of 25519 arithmetic #185