bitcoin-core / secp256k1

Optimized C library for EC operations on curve secp256k1
MIT License
2.06k stars 1k forks source link

Low-footprint mode #189

Open gmaxwell opened 9 years ago

gmaxwell commented 9 years ago

The current strategy is optimized for large fast chips with huge cache. Especially signing would be useful on some embedded devices where multiple megabytes of pre-computation is not acceptable. Reasonably low memory can be accomplished just by lowering window usage, but we could also consider a separate compile-time low-footprint option.

sipa commented 9 years ago

Ideas:

gmaxwell commented 9 years ago

Signing table is now static by default, which delivers a significant fraction of the need here for small devices.

douglasbakkum commented 8 years ago

As a side note to PR #337, is there a security risk for using the wNAF algorithm (i.e. secp256k1_ecmult_const()) for signing? This would eliminate the need for the large precomputed context tables. The tradeoff is wNAF runs at half the signing speed:

Default setup
./bench_sign
ecdsa_sign: min 83.5us / avg 89.6us / max 95.3us

Using wNAF (with blinding)
./bench_sign 
ecdsa_sign: min 174us / avg 178us / max 182us

The verification precomputed context tables (~1.375MB in code comments) can also be eliminated by adding the result of two wNAF secp256k1_ecmult_const() calls instead of the current wNAF secp256k1_ecmult() implemenation, again running at half the speed. This should make verification small enough to run on embedded devices (not tested).

Using secp256k1_ecmult() (default)
./bench_verify 
ecdsa_verify: min 128us / avg 132us / max 135us

Using 2x secp256k1_ecmult_const()
./bench_verify 
ecdsa_verify: min 268us / avg 271us / max 275us

Overall, I see 3 different ecmult implementations: secp256k1_ecmult(), secp256k1_ecmult_gen(), secp256k1_ecmult_const(). Each requires its own precomputation table. Using one ecmult implemenation with the smallest table would be useful for embedded systems. It looks like the table size for wNAF secp256k1_ecmult_const() is small at 672 bytes (secp256k1_ge pre_a[ECMULT_TABLE_SIZE(WINDOW_A)]) with satisfactory speeds.

I am happy to give it a go if it makes sense.

gmaxwell commented 8 years ago

For security, the fixed-window NAF precludes one of the blinding techniques we're using (the nums blinding)-- which provides speculative improvements against power side-channel attacks. I'd be sad to lose it, but it's likely not critical. I've also personally scrutinized its side-channel behavior much less carefully than ecmult_gen (because its newish and it's also considerably more complex) though this can be fixed.

I'm surprised your signing with ecmult_const was only half the speed, esp since ecmult_const precomputes its own table on the fly. Haven't tried it myself. If so, it suggests to me that making ecmult_gen also use the fixed window NAF but with the current table size (which would take 32kb) would actually be faster than ecmult_gen as is, in spite of the NAF conversion costs and constant time negations (presumably due to needing half the number of constant time conditional moves).

In verify; Changing to 2x secp256k1_ecmult_const() is a 2.36x slowdown on my laptop-- does your version pass the tests?

ecdsa_verify: min 57.2us / avg 57.4us / max 57.7us ecdsa_verify: min 136us / avg 136us / max 137us

Probably better would be ecmult_gen + ecmult_const for this-- for me thats taking only 106 microseconds, so only a 1.84x slowdown; and is sharing the same table with signing. It's still needlessly losing a lot of performance due to constant size operation. But I suppose if you're being very size miserly you wouldn't want code bloat for non-constant time versions of these functions. Similarly, some further code size reduction by replacing some of the specialized code with more generic versions (e.g. the weaker normalization with the full normalization).

Alternatively, ecmult_const could be made to support performing multiple multiplies concurrently, which would also be much faster than calling it twice... esp if for G it used a pre-computed table. The design of ecmult_const wouldn't make it easy to use different table sizes for different bases in a multi-exp, so I suspect a multi-exp-ified ecmult_const with precomputed G would be slower than ecmult_gen + ecmult_const_variable_base_fixed_base.

We probably need to do some more IPR analysis of the fixed window NAF used in _const before deciding if we could use it in a non-module role.

It looks like the table size for wNAF secp256k1_ecmult_const() is small at 672 bytes

Take care not to conflate ROM tables with RAM ones, ecmult also only uses a window A sized dynamic table, the rest is constant (and ecmult_gen uses almost no dynamic memory). For many cases this makes a big difference; because ram is usually quite limited while rom usually much less so.

douglasbakkum commented 8 years ago

Thank you for the detailed reply.

In verify; Changing to 2x secp256k1_ecmult_const() is a 2.36x slowdown on my laptop-- does your version pass the tests?

Yes, it passes the tests. I just checked again and got the same performance numbers. This is the quick hack I used: https://github.com/dbitbox/secp256k1/blob/smallverify/src/ecmult_impl.h#L269

A 32kB ecmult_gen table would be workable for me. Then an ecmult_gen + ecmult_const sounds attractive for verification.

One other situation to give motivation for making the code size as small as possible is the use case of verifying the authenticity of signed firmware updates. In this case, the verification code would need to share space inside a bootloader, which of course should be small in order to give more room for the firmware.

vhnatyk commented 5 years ago

Hi - any progress on this? I see it's hanging for soo long. Need it desperately for an embedded device with 256KiB RAM and 1MiB flash and trying to find the right way and balance to reduce RAM footprint of ecmult precompute - as it tries to fill all RAM and kills the app.

real-or-random commented 5 years ago

@vhnatyk Have you checked the latest revisions? We made a lot of progress for use on embedded devices in the past weeks (PRs #557, #566, #595, #596). In particular you can configure the size of the ecmult table used for verification (#596) and building the table does not require a lot of additional RAM now (PR #557).

real-or-random commented 5 years ago

This issue is related to #622.

vhnatyk commented 5 years ago

@real-or-random Yes - sorry that was a bit lazy question in hopes someone competent and focused would reply.

Have you checked the latest revisions?

Y, by the time of your reply I got some bits of info from connected PRs about WINDOW_A and WINDOW_G tuning 😄 but, since this page was the only I got after a day or two of googling, hopefully, your reply as a contributor will save somebody else's good few hours, not to mention a couple more of research through the confusing tree of PR's.

Btw I came here from rust-secp256k1 noticed you there as well. Hopefully, it will get update sources soon as I see PR is open already. Not sure I'm fit for PR or review, but going to work on that and willing to help - saw there some discussion/PR's started in rust-secp256k1#115 - will continue there. Thanx 👍

elichai commented 5 years ago

Isn't it possible to verify/sign without any precumputed tables? Just manually doing all the needed multiplications

vhnatyk commented 5 years ago

Isn't it possible to verify/sign without any precumputed tables? Just manually doing all the needed multiplications

Yes, probably - but what about security of constant time and/or performance. Hmm need to find out with tests, what exactly from secp256k1 I'm using

elichai commented 5 years ago

hmm I'm not big on side channel attacks, if you do all the needed multiplications it will take a different time depending on the private key. can you leverage that? and is that not the same even with tables? like an odd vs even private key should take different time to compute the pubkey.

sipa commented 5 years ago

@elichai There really isn't any need for that. We have perfectly fine variable time and constant time algorithms. There is also no need to run without tables entirely; just make them small enough so it's ok (you can very reasonably make them just a few kB or even less). Doing so will still be vastly more efficient than using algorithm with no tables at all.

vhnatyk commented 5 years ago

There is also no need to run without tables entirely; just make them small enough so it's ok (you can very reasonably make them just a few kB or even less). Doing so will still be vastly more efficient than using algorithm with no tables at all.

Yes, exactly 👍 - the problem I couldn't make it work with all the defines like ECMULT_WINDOW_SIZE setting to 2 etc. I'll try to find exactly why it fails on mcu, and if will find out something meaningful will share here, thanx for help 😄

sipa commented 5 months ago

1058 will support signing & key generation with a 2 KiB precomputed table.