Open osresearch opened 4 years ago
Hi! Good catch! The solution should be easy enough as far as I see - just introduce another buffer to copy operation to both cases, which would force to always copy data. This would increase stack use by 32 bytes, which is affordable.
And, nope, my pull request doesn't fix it. gcc is too smart for us.
void
modp256k1_add (bn256 *X, const bn256 *A, const bn256 *B)
{
uint32_t cond;
bn256 tmp[1];
cond = (bn256_add (X, A, B) == 0);
cond &= bn256_sub (tmp, X, P256K1);
if (cond)
/* No-carry AND borrow */
memcpy (tmp, X, sizeof (bn256)); // constant time
else
memcpy (X, tmp, sizeof (bn256));
}
becomes:
00000000 <modp256k1_add>:
0: b530 push {r4, r5, lr}
2: b089 sub sp, #36 ; 0x24
4: 4604 mov r4, r0
6: f7ff fffe bl 0 <bn256_add>
a: 4621 mov r1, r4
c: fab0 f580 clz r5, r0
10: 4a0a ldr r2, [pc, #40] ; (3c <modp256k1_add+0x3c>)
12: 4668 mov r0, sp
14: f7ff fffe bl 0 <bn256_sub>
18: 096d lsrs r5, r5, #5
1a: 4205 tst r5, r0 // if (cond)
1c: d10b bne.n 36 <modp256k1_add+0x36> // goto early return without mempcy
1e: 466a mov r2, sp // else actually do a memcpy
20: 4623 mov r3, r4
22: ad08 add r5, sp, #32
24: 4614 mov r4, r2
26: cc03 ldmia r4!, {r0, r1}
28: 42ac cmp r4, r5
2a: 6018 str r0, [r3, #0]
2c: 6059 str r1, [r3, #4]
2e: 4622 mov r2, r4
30: f103 0308 add.w r3, r3, #8
34: d1f6 bne.n 24 <modp256k1_add+0x24>
36: b009 add sp, #36 ; 0x24
38: bd30 pop {r4, r5, pc}
3a: bf00 nop
3c: 00000000 .word 0x00000000
Since tmp
doesn't live on, the entire mempcy()
is discarded.
When in doubt, add more gcc sees through the volatile tricks, too.volatile
?
How about this one below? Looks like -Os
does not strip it away. Obviously less performant due to the additional copy.
In the example I might have used wrong compiler type, but should work in principle.
We need to use own copying routines, otherwise the volatility is stripped away.
Executed in Compiler Explorer:
Something like this:
void nk_memcpy_bn256(volatile bn256 *dest, volatile bn256 *src){
for(uint8_t i=0;i<BN256_WORDS;i++){
dest->word[i] = src->word[i];
}
}
void
modp256k1_add3 (bn256 *X, const bn256 *A, const bn256 *B)
{
uint32_t cond;
volatile bn256 tmp[1];
volatile bn256 tmp2[1];
cond = (bn256_add (X, A, B) == 0);
cond &= bn256_sub (tmp, X, P256K1);
if (cond){
/* No-carry AND borrow */
nk_memcpy_bn256 (tmp2, tmp);
nk_memcpy_bn256 (tmp, tmp2);
}
else{
nk_memcpy_bn256 (tmp2, tmp);
nk_memcpy_bn256 (X, tmp2);
}
}
This gets closer to constant time than before (the memory copy always happens), although there are two extra instructions on one path: https://github.com/Nitrokey/nitrokey-start-firmware/pull/54/commits/64d695726576e7bb5ebbc75033024283e52fc16b
Could this be treated like POD? I missed that, interesting. Regarding your implementation, how about this change? (updated)
void bn256_copy_if (int do_it, bn256 *dest, bn256 *src)
{
static volatile bn256 fake_dest[1];
if (do_it)
*dest = *src;
else
*fake_dest = *src;
}
Edit: https://godbolt.org/z/ozxY9b - looks like temp variable cannot be on function-local stack, otherwise its handled differently. Edit: https://godbolt.org/z/oab597 - this one looks promising
It looks like the modp256k1.c and modp256r1.c use
memcpy(tmp, tmp, ...)
to try to maintain a constant time implementation. This causes a warning with newerarm-none-eabi-gcc
(I'm testing with 9.2.1 from the Ubuntu 20.04 repo):The standard does not guarantee what will happen in this case:
Many libraries will check for overlap and fall back to
memove()
, which can short-circuit in the case of the src == dst. This would prevent the functions from actually being constant time.