bitcoin-core / secp256k1

Optimized C library for EC operations on curve secp256k1
MIT License
2.06k stars 1k forks source link

Add extra API for offline/online phases in ECDSA signing #1610

Open RandomLattice opened 4 days ago

RandomLattice commented 4 days ago

We'd love to split the ECDSA signing operation into two steps:

The advantage here is that the offline step can be precomputed arbitrarily early and do "most of the computational work". This is very useful for signers that have a lot of idle time before the signing request comes in, but little time to compute and return the signature. For example, hardware wallets.

In ECDSA, the offline step can essentially compute nonce generation + scalar multiplication + nonce modular inversion. These values ("precomputed material") are input to the online part. The online part just finishes the ECDSA computation. The online part is extremely fast (we're seeing around 5000x faster than the offline phase in a non-libsecp256k1 proof-of-concept).

The precomputed material is secret, can only be used once and needs to be wiped after usage.

API-wise this would probably mean adding two functions:

The composition of the two functions compute secp256k1_ecdsa_sign(). The input/output behavior does not change.

Is there any appetite for this? Happy to write the code for this.

real-or-random commented 1 day ago

The precomputed material is secret, can only be used once and needs to be wiped after usage.

Indeed, and this is the drawback of this. It's a major footgun, and I'm sure this is the reason why it's currently implemented this way.

Personally, I'm a bit on the fence. We're about to add a MuSig2 module, which is also prone to nonce-reuse failures -- but in that case, it is unavoidable to split nonce generation and the actual signing into two separate steps. In the end, I don't think I'm against giving users advanced and dangerous features, but they should be clearly marked as such.

I'd love to hear your use case and scenario here. Noone has asked for this before, and I doubt many users will benefit. (And anyway, ECDSA can be considered somewhat deprecated in Bitcoin, but okay, the same could be done for Schnorr signing). Do you personally need/want this, or is this is just a suggestion for improvement?

It would be interesting to know what other contributors and maintainers think.

RandomLattice commented 1 day ago

this is the drawback of this. It's a major footgun

I think we have ways to make this less of a footgun. For example, here goes a simple mitigation. We can add to the precomputed material data struct a flag has_been_used meant to detect reuse. The consumer function of the struct uses this flag as follows:

Like this, a user has to go out of their way to use this unsafely (by modifying fields in an opaque struct). This doesn't cover the contrived case where the user is reusing precomputed material that got persisted before the flag change; but I think that's a narrow corner case that can be dealt with a clear warning: do not reuse nonce material if you're using this API.

I imagine you've considered similar things for MuSig2, curious what you think.

I don't think I'm against giving users advanced and dangerous features, but they should be clearly marked as such

I think this is the right tradeoff. We can clearly mark the precomputed material as secret and that it cannot be re-used.

I'd love to hear your use case and scenario here. [...] Do you personally need/want this, or is this is just a suggestion for improvement?

I'm asking others to chime in and give better context. Long story short is that we're hitting a performance bottleneck in Bitkey hardware wallet (https://github.com/proto-at-block/bitkey). We need to speed up (by a lot) the signing operation for large transactions if we want to use libsecp256k1.

real-or-random commented 22 hours ago

this is the drawback of this. It's a major footgun

I think we have ways to make this less of a footgun. For example, here goes a simple mitigation. We can add to the precomputed material data struct a flag has_been_used meant to detect reuse. [...]

Like this, a user has to go out of their way to use this unsafely (by modifying fields in an opaque struct).

This doesn't cover the contrived case where the user is reusing precomputed material that got persisted before the flag change; but I think that's a narrow corner case that can be dealt with a clear warning: do not reuse nonce material if you're using this API.

I don't think it's contrived. What if you run inside a VM and the VM is reset? What if the caller decides to memcpy the nonce, or write it to disk? (We can document that one must not do this, but this doesn't necessarily mean that people will adhere to the rules.)

The second large concern is that one cannot use deterministic nonce generation when the message is not known. So you'll need a good source of randomness during signing.

But yeah, we're apparently willing to accept that risk in the MuSig2 API, so it's not entirely crazy to add a similar API also for ordinary signatures.

I imagine you've considered similar things for MuSig2, curious what you think.

In the MuSig2 PR, we simply overwrite the nonce with zeros after using it. That's a tiny bit safer and doesn't need a flag.

Long story short is that we're hitting a performance bottleneck in Bitkey hardware wallet (proto-at-block/bitkey). We need to speed up (by a lot) the signing operation for large transactions if we want to use libsecp256k1.

I'm curious why you use ECDSA at all in a modern wallet. Schnorr signing is 27% faster on my machine, I'd be curious about the difference on the hardware wallet. The difference is probably mostly due to the RFC6979 hashing in ECDSA, which is overkill. So switching to a faster nonce derivation function should already speed up things a lot without introducing risks.

Also, have you tried with a recent version (>=0.5.0), and have you tried all precomputation table sizes on your platform? (See https://github.com/bitcoin-core/secp256k1/blob/master/CHANGELOG.md#changed-1, note that we changed the default in 0.5.1)

RandomLattice commented 21 hours ago

Thanks for the great discussion.

I don't think it's contrived. What if you run inside a VM and the VM is reset? What if the caller decides to memcpy the nonce, or write it to disk? (We can document that one must not do this, but this doesn't necessarily mean that people will adhere to the rules.)

I agree rollback is bad. I think in the VM case, one would not use the new proposed API. One would just use the existing API. The new proposed API is suitable for a specific use case we're hitting "in real life": an embedded device (=constrained computing device) that a) needs to sign "very fast" and b) has plenty of time for precomputation before the signing request comes in.

The second large concern is that one cannot use deterministic nonce generation when the message is not known.

We cannot use RFC6979 to generate the secret nonce when the message is not known. But that doesn't mean we can use other approaches for deterministic nonce generation (see next paragraph).

So you'll need a good source of randomness during signing.

Yes, that is one way, but not the only one. Alternatively, one can use a monotonic counter / unique (public) value as input to a keyed hash to construct the secret nonce. I'm not necessarily advocating for this; it all depends on the specific use case and definitely we're in the territory of "you need to know what you're doing". I think the new API should be generic enough to allow the user to do this kind of choices.

But yeah, we're apparently willing to accept that risk in the MuSig2 API, so it's not entirely crazy to add a similar API also for ordinary signatures.

Sounds reasonable, thanks.

In the MuSig2 PR, we simply overwrite the nonce with zeros after using it. That's a tiny bit safer and doesn't need a flag.

Sure, that would also work. Happy to take this approach too.

have you tried with a recent version (>=0.5.0), and have you tried all precomputation table sizes on your platform?

I'll let others chime in, I don't have the numbers right now with me. Thanks for all the suggestions!

jmecom commented 17 hours ago

Hopping in here -- this would be useful for Bitkey's firmware, which does transaction signing over NFC using libsecp256k1. The chip we're using is an EFR32MG24, so it's not particularly fast, and we've recently had some performance issues when users try to sign many inputs back-to-back in one NFC session on their phone.

We're considering using our chip's hardware accelerated ECDSA, but we prefer libsecp256k1 for various reasons; and the hardware doesn't support Schnorr signatures naturally, so we're a bit stuck there without a different optimization.

I'm curious why you use ECDSA at all in a modern wallet. Schnorr signing is 27% faster on my machine, I'd be curious about the difference on the hardware wallet.

Not all exchanges support sending to bech32m, so it's a bit difficult to eschew ECDSA completely. On our chip, ECDSA takes ~60ms and Schnorr takes ~55ms. We can't use large precomputation tables due to limited RAM.

I do think the optimization being proposed here is mostly useful for resource constrained embedded systems, but it could be very useful in those settings.

apoelstra commented 17 hours ago

I think it's reasonable to expose such an API for both ECDSA and Schnorr. I would suggest doing it in much the same way that we expose custom nonce functions -- provide a second API that takes a callback and auxiliary data, whose role it is to produce a nonce keypair and inverted nonce.

This would be symmetric with our existing ability to override the RFC6979, which is similarly niche and dangerous. (Though the "typical" use of overriding the nonce would be to use a faster PRNG, which is perfectly safe, while the "typical" use of overriding the nonce computation would be to use precomputed nonce values, which is fraught.)

I would not suggest exposing an easy-to-use or first-class API for this. It's much more difficult to obtain a monotonic counter (or strong RNG) and to ensure your software is never forked or reset, than it is to use an awkward C API. So if we were to provide a friendly API, we'd just be making the "easy" part easier and potentially encourage people to use it without carefully considering whether they can do so safely.

sipa commented 16 hours ago

@jmecom Sorry for the slight tangent here, but:

On our chip, ECDSA takes ~60ms and Schnorr takes ~55ms. We can't use large precomputation tables due to limited RAM.

Since #1058 landed in 0.5.0, the supported table sizes (for the key generation/signing code) are 2 KiB, 22 KiB, and 86 KiB. How much is acceptable to you? It's easy to add more configurations if there is a need for them.

jmecom commented 13 hours ago

@jmecom Sorry for the slight tangent here, but:

On our chip, ECDSA takes ~60ms and Schnorr takes ~55ms. We can't use large precomputation tables due to limited RAM.

Since #1058 landed in 0.5.0, the supported table sizes (for the key generation/signing code) are 2 KiB, 22 KiB, and 86 KiB. How much is acceptable to you? It's easy to add more configurations if there is a need for them.

I just pulled in 0.5.1 and experimented with this. I didn't see a huge impact on the table size with a window size of two:

    62 ms @ 2kb
    58 ms @ 22kb
    56 ms @ 86kb

Window size of five took about 50ms @ 22kb. So, I don't think more fine-grained options are really necessary for our usecase at least.

sipa commented 13 hours ago

@jmecom By "window size", you mean the ECMULT_WINDOW_SIZE setting? That should have no impact on key generation/signing speed, only on verification speed.

jmecom commented 12 hours ago

@jmecom By "window size", you mean the ECMULT_WINDOW_SIZE setting? That should have no impact on key generation/signing speed, only on verification speed.

Ah, woops. Due to the way I'm profiling, there is a little variation across each run, so I mistakenly thought that lowered the duration, but it was just an anomaly. Ignore that, but the rest is accurate.