bitcoin-core / secp256k1

Optimized C library for EC operations on curve secp256k1
MIT License
2.09k stars 1.01k forks source link

Compile time SHA256 override? #702

Open gmaxwell opened 4 years ago

gmaxwell commented 4 years ago

It's a bit silly that the library packs in its own sha256 even when the user has their own and on small devices it's a costly waste of space. In places like Bitcoin Core the environment has a fast SIMD or SHA-NI sha256, ... it actually makes a measurable difference in signing time to use SHA-NI.

It would be nice if there were some -Dlibsecp2561k_sha256init=x -Dlibsecp2561k_sha256update=y -Dlibsecp2561k_sha256final=z settings that could be used to compile time substitute in another library and leave the internal one out.

elichai commented 4 years ago

Last time we discussed this on IRC some people said that providing 3 functions might be a bit much and having a single sha256 function instead is better. I think that a single one might be a bit awkward because it requires allocating a big buffer on the stack that can contain everything (and then more things you need to clean).

What are your thoughts on this?

real-or-random commented 4 years ago

I think you'll want this feature only if you seriously care about performance. And in this case, you probably have the more complex API with 3 functions anyway.

apoelstra commented 4 years ago

Also, in practice a single function is often harder to use - if you want to hash a series of things it's nice to just throw them all into a hash engine ... if you need to create a buffer first then (depending on your language) you may need to think about buffer allocation and indexing.

Another alternate might be to just conditionally compile out the sha256 functions, and require the user provide replacements with exactly the correct names in order for linking to succeed. This is what we do for the default error callbacks for example.

gmaxwell commented 4 years ago

only if you seriously care about performance.

I think the vast majority of users would be ones that don't want to have an extra 19kilobytes in their binary.

Another alternate might be to just conditionally compile out the sha256 functions, and require the user provide replacements with exactly the correct names in order for linking to succeed.

That would be ducky too.

elichai commented 4 years ago

Unless anyone already works on this I'd like to give it a try :)

real-or-random commented 4 years ago

Unless anyone already works on this I'd like to give it a try :)

Could you look into this?

elichai commented 4 years ago

Could you look into this?

IIRC the problem I encountered was the SHA2 context. I remember a few solutions but none of them was great:

  1. Decide that the context needs to be size X, provide an opaque struct with unsigned array size X, and the implementer can memcpy this into his own Context back and forth (or type pun it? need to look into what the standard says on that).

  2. The implementer also needs to provide a header with the name sha256.h that defines the struct+functions using the same names/symbols we use.

  3. Provide a c_void pointer to the context in the function's API. (it then requires a reset function to reset the context, and it limits us because we can't hash two things "in parallel")

Any feedback on which is best or maybe a different solution is appreciated.

sipa commented 4 years ago

One possibility is keeping the SHA256 padding/chopping into blocks on the secp256k1 side, and only require external implementations to provide a SHA256 transformation function.

The API would look like:

/** Update the state (array of 8 uint32_t) pointed to by state with 64*blocks bytes of input pointed to by data. */
void sha256_transform(uint32_t *state, size_t blocks, unsigned char *data);

This means the external implementation may need to copy data from/to "array of 8 uint32_t" to their own internal representation - but I suspect almost any C-like implementation already uses that representation anyway.

elichai commented 4 years ago

That's an interesting idea. I do think it will probably require the user to slightly modify his implementation to match this (I've looked at a few sha2 implementations I've found on google, and they all have their own quirks when it comes to the "transform" function)

And is there any real advantage with passing a bunch of blocks at the same time? won't it just call the transform function in a loop? or is there some SIMD/SHA-NI magic this enables?

elichai commented 4 years ago

I played with this, it works pretty nice, but then I realized that one of the reasons were binary size, so I measured using counting asm lines in godbolt(I'm not sure if it's a good way to compare), and exposing RFC6979+HMAC+SHA2+SHA2_transform. with -Os (for minimal size): As-is: 3,523 lines of asm. Omitting the transform function: 429 lines. Omitting all of SHA2: 242 lines.

with -O2: As-is: 3,844 lines. Omitting transform: 511 lines. Omitting all of SHA2: 236 lines.

with -O3: As-is: 6,852 lines. Omitting transform: 3,470 lines. Omitting all of SHA2: 565 lines. (-O3 is crazy lol)

Anyone who wants to play with it: https://godbolt.org/z/MPF3w9 add -DUSE_EXTERNAL_SHA2_TRANSFORM to omit the transform function. add -DUSE_EXTERNAL_SHA2 to omit all of sha2 functions and leave only HMAC+RFC6979

real-or-random commented 4 years ago

For binary size you can really just look at the file size of the binary, see https://github.com/bitcoin-core/secp256k1/pull/700#issuecomment-587503446.

add -DUSE_EXTERNAL_SHA2 to omit all of sha2 functions and leave only HMAC+RFC6979

If at all, I think we would want the opposite. We'll need SHA256 for Schnorr sigs and ECDH, but HMAC/RFC6979 is only used for deriving nonces, and you can use SHA256, too.

elichai commented 4 years ago

If at all, I think we would want the opposite. We'll need SHA256 for Schnorr sigs and ECDH, but HMAC/RFC6979 is only used for deriving nonces, and you can use SHA256, too.

I just had to expose some end result through godbolt to get asm, I used RFC6979 because it doesn't have any complex logic and it uses SHA256 in a complex way that prevent inlining everything and optimizing write together with finalize and initialize (so it's an example to show the asm code of SHA2 not of RFC6979)

real-or-random commented 4 years ago

This really depends on whether we have a compile-time override or a runtime override but we should think about a self-test in the spirit of https://github.com/bitcoin/bitcoin/blob/99813a9745fe10a58bedd7a4cb721faf14f907a4/src/crypto/sha256.cpp#L465 .

elichai commented 4 years ago

I think the main advantage is actually code size, because my tests don't show big advantages for optimized sha2 implementations, although I haven't tested with SHA-NI but I don't think it's going to be a huge improvement in terms of performance, as SHA2 is really fast compared to anything EC related

switck commented 3 years ago

Im making embedded library containing libsecp256k1, see https://github.com/switck/libngu

thoughts:

RE: sharing SHA256 code

real-or-random commented 2 years ago

@elichai Any update here? Maybe, if you currently don't have a lot of time, it would still be good to share your WIP branch, so someone else could adopt the PR.

elichai commented 2 years ago

@elichai Any update here? Maybe, if you currently don't have a lot of time, it would still be good to share your WIP branch, so someone else could adopt the PR.

Sadly I can't find it :(

One possibility is keeping the SHA256 padding/chopping into blocks on the secp256k1 side, and only require external implementations to provide a SHA256 transformation function.

The API would look like:

/** Update the state (array of 8 uint32_t) pointed to by state with 64*blocks bytes of input pointed to by data. */
void sha256_transform(uint32_t *state, size_t blocks, unsigned char *data);

This means the external implementation may need to copy data from/to "array of 8 uint32_t" to their own internal representation - but I suspect almost any C-like implementation already uses that representation anyway.

Do you know if implementations would want to keep the state as a __m128i instead of uint32_t? (I see that bitcoin's implementations load it to vector registers each time, but wondering if that's optimal)

real-or-random commented 2 years ago

It's a bit silly that the library packs in its own sha256 even when the user has their own and on small devices it's a costly waste of space.

In places like Bitcoin Core the environment has a fast SIMD or SHA-NI sha256, ... it actually makes a measurable difference in signing time to use SHA-NI.

Going a step back, I believe that these are separate concerns. As per the discussion here, an override helps mostly to save space.

If we want to profit from hardware optimizations, this is somewhat orthogonal. An override would help here but only if the caller controls the compilation. This is true for Bitcoin Core but in general it's rather an exception. Also, if you look at Core's SHA256 implementation (https://github.com/bitcoin/bitcoin/blob/master/src/crypto/sha256.cpp), it's not clear how the override would look like. Depending on the available hardware, the essential function is either Transform, Transform_2way, Transform_4way, Transform_8way.

This is an argument in favor of optimizing our SHA256 implementation (independent of the possibility to override), and I tend to believe that this is desirable. We didn't want to do this in the past because SHA256 was an implementation detail. But as it's an integral part of Schnorr verification and signing now, things have changed. It's just somewhat messy: We'd need to duplicate code from Core (and worse, change it to C). Or move SHA256 (and maybe other stuff such as ChaCha20 for #760) to a separate library that's linked to secp256k1 and Core. Or expose the SHA256 here and make Core use them. None of this sounds great.

Thoughts?

benma commented 3 months ago

I'd very much like being able to plug in a different sha256 implementation to save binary space.