LoupVaillant / Monocypher

An easy to use, easy to deploy crypto library
https://monocypher.org

Explicit support for embedded targets #253

Open LoupVaillant opened 1 year ago

LoupVaillant commented 1 year ago

I've been approached lately by @Laczen about possibly making an "embedded edition" of Monocypher, less focused on speed and more focused on reducing footprint (binary & stack sizes). One particular step in this direction is implementing Blake2s.

I have resisted Blake2s for a long time, most notably because of the redundancy with Blake2b (we already have SHA-512 for compatibility, I don't want to make things even worse), but I have to admit it makes a whole lot of sense for what is arguably our biggest user base.

Ideally we would be able to share code between Blake2b and Blake2s, and select Blake2s with a flag on the current Blake2 functions. To be checked, but I believe such sharing is not actually possible, because the internal loops of Blake2b use 64-bit integers while Blake2s uses 32-bit integers. So we should probably provide Blake2s separately, in a new optional compilation unit: src/optional/blake2s.h and src/optional/blake2s.c.
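
For concreteness, here is what that optional unit's header might look like: a sketch mirroring the crypto_blake2b interface with 32-bit state words. Names and layout are hypothetical, not a committed design.

```c
/* src/optional/blake2s.h, hypothetical sketch (not a committed
 * design): same shape as the crypto_blake2b interface, with 32-bit
 * state words and 64-byte blocks instead of 64-bit words and
 * 128-byte blocks. */
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t hash[8];
    uint32_t input_offset[2];
    uint32_t input[16];   /* one 64-byte block */
    size_t   input_idx;
    size_t   hash_size;
} crypto_blake2s_ctx;

void crypto_blake2s_init  (crypto_blake2s_ctx *ctx, size_t hash_size);
void crypto_blake2s_update(crypto_blake2s_ctx *ctx,
                           const uint8_t *message, size_t message_size);
void crypto_blake2s_final (crypto_blake2s_ctx *ctx, uint8_t *hash);
```

Halving the word size also halves the state and the block size, which is where the footprint savings would come from.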

There's also Blake3 to consider, since it's 32-bit too, but takes advantage of vector instructions on bigger processors. I have two problems with it though: first, it has tighter security margins, in line with Aumasson's Too Much Crypto. Not exactly unsafe, but perhaps a bit risky? Second, it would replace Blake2b, which would be a hard break, and so far that has never happened with Monocypher (we broke the API several times, but we never completely broke users' ability to keep their old wire formats; even for AEAD they could rebuild the old construction from the basic primitives). Another problem with removing Blake2b would be the inability to implement Argon2. A variant would be easy for sure, but we'd lose standard compliance. Oh, and I recall @fscoto warning me that Blake3 needs a bigger footprint than Blake2s, making it less ideal for small embedded targets.

Anyway, adding Blake2s would be a step towards the support of embedded targets.

The next step would be writing an actual embedded edition of Monocypher, something that sacrifices speed so it can be smaller (this makes sense on the smallest targets, which have limited program memory and teeny tiny stacks). The problem here is how far we should be willing to go, and that very much depends on the embedded market: are 8-bit and 16-bit processors relevant? Would we even try to do cryptography on them? How about extremely low-powered 32-bit processors? Do they have enough memory that we should optimise for speed to reduce energy consumption? Or do they need smaller crypto code? What multipliers should our bignum arithmetic target? Are the current 32->64 bit multiplications okay, or should we go down to 16->32, or even 8->16 like C25519 does?
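
To make the multiplier question concrete, here is the inner multiply-accumulate step of schoolbook bignum multiplication at two limb sizes (a sketch, not Monocypher's actual code):

```c
#include <stdint.h>

/* 32-bit limbs: each step needs a 32x32 -> 64 multiply,
 * in hardware or through a helper routine. */
static void step32(uint32_t *r, uint32_t a, uint32_t b, uint32_t *carry)
{
    uint64_t t = (uint64_t)a * b + *r + *carry;
    *r     = (uint32_t)t;
    *carry = (uint32_t)(t >> 32);
}

/* 16-bit limbs, as C25519 does: only a 16x16 -> 32 multiply is
 * needed, which small cores can do cheaply. */
static void step16(uint16_t *r, uint16_t a, uint16_t b, uint16_t *carry)
{
    uint32_t t = (uint32_t)a * b + *r + *carry;
    *r     = (uint16_t)t;
    *carry = (uint16_t)(t >> 16);
}
```

Halving the limb size quadruples the number of steps for operands of the same size, but each step only needs the narrow multiplier the small cores actually have.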

All hard questions I don't think I can answer right now. What I can answer for sure is that 32-bit targets all prefer Blake2s over Blake2b. And since it looks like we have actual demand for it, we should probably provide it. Only problem being, well… adding a primitive. We want to stay focused and not turn ourselves into a kitchen sink. @fscoto, do you have some thoughts about it all?

fscoto commented 1 year ago

I've advocated several times in the past for an embedded Monocypher, and I am still firmly of the belief that it's the way forward.

"Normal" Monocypher has its place in fire-and-forget things. But those are often also served by libsodium or (shudder) OpenSSL or the Windows/macOS platform stack with the exception of Elligator. I don't think that's where most of the deployments are. Of course, it's hard to tell without telemetry where the code mostly goes.

I disagree with BLAKE2s for embedded use in normal Monocypher. Monocypher aims for a small API and that implies few primitives.

You can get a large part of the benefit of BLAKE3's single-core performance by taking the lower round counts mentioned in Aumasson's Too Much Crypto (dubbed BLAKE2bf and BLAKE2sf). BLAKE3's tree structure and larger chunk size greatly aid parallelism and time-per-chunk, but also lead to a larger state size (the reference implementation in C weighs in at over 2000 bytes, https://github.com/BLAKE3-team/BLAKE3/blob/67e4d04a3c025288ecb1276800d435577e2008bd/c/blake3.h). However, we also know that standards tick checkboxes, so deviating from the established standard is rarely met favorably, especially by counterparts who run the server and tooling implementations on big iron, where they'll use things like libsodium or OpenSSL.

The primitives of embedded Monocypher would thus be the subset of the most lightweight primitives implemented in either OpenSSL or libsodium.

As for implementation, people have been working on SipHash on 16-bit-byte processors. There's evidently a market for C25519 (8-bit MCU firmware signing), but you could argue that 32->64 multiplication can be expected, because ChaCha20 is "the" fundamental primitive and that is just not going anywhere on 8-bit processors; though I assume they'd also hand-optimize cryptographic code wherever possible, because C compilers leave too much performance on the table there. I think embedded Monocypher should probably aim at the class of the ARM Cortex-M0 (which only has 32x32 -> 32), SuperH (which only has 32x32 -> 32 and 16x16 -> 32), and RISC-V (base RV32/RV64 have no multiplication; the M extension adds 32x32 -> 32, with a way to get the upper bits through a "recommended" sequence of instructions, in hopes that the processor fuses it into a 32x32 -> 64).
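
To illustrate what a 32x32 -> 32-only multiplier implies, this is the kind of widening-multiply helper a compiler (or hand-written bignum code) has to fall back to on such cores; a sketch, not Monocypher code:

```c
#include <stdint.h>

/* Widening 32x32 -> 64 multiply built from 16x16 -> 32 pieces, for
 * cores whose multiply instruction only returns the low 32 bits.
 * Each partial product fits in 32 bits. */
static uint64_t mul32x32_64(uint32_t a, uint32_t b)
{
    uint32_t al = a & 0xffff, ah = a >> 16;
    uint32_t bl = b & 0xffff, bh = b >> 16;
    uint64_t lo  = (uint64_t)(al * bl);
    uint64_t mid = (uint64_t)(al * bh) + (ah * bl); /* fits in 33 bits */
    uint64_t hi  = (uint64_t)(ah * bh);
    return lo + (mid << 16) + (hi << 32);
}
```

Four multiplies and a few shifts and adds per 64-bit product is exactly why code size and speed both suffer on these cores.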

Would I really prefer embedded users to migrate to a safe language like Rust instead? Of course I do. Is it happening right now? Maybe. Will people still ask for C in the near and medium term? Yes.

LoupVaillant commented 1 year ago

I disagree with BLAKE2s for embedded use in normal Monocypher. Monocypher aims for a small API and that implies few primitives.

To some extent, so do I, actually. In all cases Blake2b is good enough; Blake2s serves basically no purpose, and just adds code, bloat, and more API for no substantial benefit. The main reason I suggested adding it anyway is to keep mainline Monocypher API-identical with a future embedded edition.

But do we want identical APIs? Argon2 in particular makes absolutely no sense on small embedded targets. Embedded users would naturally cut it out one way or another, so why keep it? And if we don't keep it, why not replace Blake2b with Blake2s as well? There would be no need to pollute mainline Monocypher with an additional hash.

Conversely, are the benefits of Blake2s that significant? Blake2b requires a bit more code and a little more stack, but not that much, to be honest: 64-bit additions aren't ideal on 32-bit platforms, but they're not as big a deal as multiplications. As for the speed difference, the original paper reports Blake2s being 2x faster on a Tegra 2. That's… not that much faster, considering. Especially if you throw elliptic curves on top. And from the reports I've had from actual embedded users, Blake2b isn't such a problem. Only its unrolled inner loop is (it's slower on some lower-end CPUs), and we already have a compilation option for that.


Personally, I think we do need full compatibility between editions. It makes things easier on users. Sure, they'll never use Argon2 on a PIC (please tell me they won't), but at least they wouldn't have to think about the availability of various algorithms: it's either everywhere, or nowhere. Not only that, but in some contexts like IoT the bottleneck is more likely to be the server than the swarm of IoT devices, and that would favour Blake2b.

Another question I'm not sure about is, do we even need a 32-bit embedded edition? As far as I can tell, mainline Monocypher is already close to ideal there: without Argon2 its binary size is well under 50KiB, Blake2b isn't that bad, and all the rest is pretty much optimised for 32-bit CPUs, including the bignum arithmetic (except perhaps the unrolled multiplication code, but that's easily addressed with a NO_UNROLLING compilation flag). Finally, I'm not certain there are that many 32-bit platforms out there for which this is still too big, especially if you don't use everything. Thus, it is very possible that a couple more compilation flags would be enough to fit Monocypher on weak 32-bit platforms.
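
For the record, this is roughly how such a flag would work; a sketch rather than Monocypher's actual source (Blake2b has 12 rounds):

```c
#include <stdint.h>

/* Illustrative stand-in for one Blake2b round; the real round mixes
 * the hash state with the message block. */
static void blake2b_round(uint64_t st[16], int i)
{
    (void)st; (void)i;
}

/* Rolled loop: small code, extra loop overhead.
 * Unrolled: twelve inlined copies, bigger code, no branches. */
static void blake2b_rounds(uint64_t st[16])
{
#ifdef NO_UNROLLING
    for (int i = 0; i < 12; i++) {
        blake2b_round(st, i);
    }
#else
    blake2b_round(st, 0);  blake2b_round(st, 1);  blake2b_round(st, 2);
    blake2b_round(st, 3);  blake2b_round(st, 4);  blake2b_round(st, 5);
    blake2b_round(st, 6);  blake2b_round(st, 7);  blake2b_round(st, 8);
    blake2b_round(st, 9);  blake2b_round(st, 10); blake2b_round(st, 11);
#endif
}
```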

If I'm right about this, then an embedded edition should probably target even smaller, 16-bit CPUs. Binary sizes for those are horrible (unrolled loops containing lots of 64-bit multiplications are no joke), and the performance characteristics are different enough that optimal C code would likely look quite different.

Then again, I don't know the embedded market well enough. A quick look at Wikipedia shows that not all STM32 microcontrollers have enough static RAM to fit all of Monocypher, suggesting I'm dead wrong. However, they do have flash memory, and maybe they don't have to load the entire program into RAM before they execute it? I have no idea, to be honest.

My take right this minute is that more tests are needed before we commit to anything.

konovod commented 1 year ago

However, they do have flash memory, and maybe they don't have to load the entire program into RAM before they execute it? I have no idea, to be honest.

They don't load the program into RAM at all. RAM is for the stack and heap; ROM (flash) is for program code. So flash is (relatively) abundant, and the main optimisation microcontrollers would need is not smaller code but "less RAM used".

Laczen commented 1 year ago

@LoupVaillant, regarding the 50kB: that is a lot on devices that have only 256kB of flash. A general NO_UNROLLING option should probably provide sufficient optimisation.

Regarding the necessity of Blake2s, I am unsure a hash is even required when using AEAD. The main reason a hash would be needed is for some key derivation function.

@konovod, a general NO_UNROLLING option would reduce both RAM and ROM use.

LoupVaillant commented 1 year ago

@konovod that's good to know. Though I suspect there's also a small hidden instruction cache to avoid loading loop code over and over again, and I believe this may explain why some microcontrollers don't like the unrolled Blake2b loop.

@Laczen Simple key exchange + AEAD doesn't require a hash. More elaborate key exchanges such as Noise, however, do make use of hashes. It is possible to get by with only ChaCha20 and HChaCha20, but I confess my own attempt at this is very experimental.
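
To illustrate, a minimal sketch of hash-free key derivation. The labels are invented, crypto_chacha20_h is the Monocypher 4.x spelling (3.x calls it crypto_hchacha20), and this assumes the input secret is already a uniform random key:

```c
#include <stdint.h>
#include "monocypher.h"

/* Derive one subkey per direction from a single shared secret, using
 * HChaCha20 as a tiny KDF, with no hash function linked in. */
void derive_session_keys(uint8_t key_tx[32], uint8_t key_rx[32],
                         const uint8_t secret[32])
{
    static const uint8_t label_tx[16] = "client->server"; /* zero-padded */
    static const uint8_t label_rx[16] = "server->client";
    crypto_chacha20_h(key_tx, secret, label_tx);
    crypto_chacha20_h(key_rx, secret, label_rx);
}
```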

konovod commented 1 year ago

Though I suspect there's also a small hidden instruction cache to avoid loading loop code over and over again

Cache isn't needed at, say, 72 MHz (STM32F107), because flash memory can be read every cycle. But for more powerful MCUs like the STM32H7 (450 MHz), yes, there is a prefetch mechanism (I'm not sure about the details).

general NO_UNROLLING option would reduce both RAM and ROM use

When I used some parts of Monocypher (ChaCha20) on an MCU, the main problem for me was stack overflow. I prefer to limit task stacks to something like 1 KB, so I had to change most buffer variables to static. This is only possible when a single task works with crypto, so I'm not sure it's a good solution for the general case.
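
Roughly, the change looked like this (a simplified sketch with made-up sizes and names, not my actual code):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* The work buffer moves from the task stack to static storage. The
 * 1 KB task stack survives, but the function is no longer reentrant,
 * so only one task may call it. */
#define MSG_MAX 512

static uint8_t msg_buf[MSG_MAX]; /* was a local (stack) array */

void handle_message(const uint8_t *msg, size_t len)
{
    if (len > MSG_MAX) {
        return;
    }
    memcpy(msg_buf, msg, len);
    /* ... encrypt or decrypt msg_buf in place with ChaCha20 ... */
}
```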

LoupVaillant commented 1 year ago

@konovod You had to limit the stack usage of ChaCha20?? My, it's worse than I thought: this thing is supposed to require only 200 bytes of stack space. This would make X25519 (1KB of stack) and signature verification (2-3KB of stack) prohibitively expensive…

In any case, it looks like Monocypher may not be your only stack eater. Your program must have others like… the actual business logic I guess. And apparently they're eating your stack at the same time. I mean, the only way to blow a 1KB stack with ChaCha20 is to call it when your stack already exceeds 800 bytes.

More generally, you have several serious stack eaters (Monocypher is one of them), and you are calling several at the same time. Now you found a neat trick by modifying the source code, but many users are unwilling to touch cryptographic code. So I'm wondering: could you reasonably make sure your various stack eaters are called separately, such that your maximum stack height grows no bigger than your biggest stack eater?
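
Schematically, with invented function names, the idea is to keep the deep call chains siblings rather than nested:

```c
/* Hypothetical task structure: crypto and business logic are both
 * deep stack eaters, but they are never on the stack at the same
 * time, so peak usage is the max of the two rather than their sum. */
void receive_frame(void); /* shallow */
void decrypt_frame(void); /* deep: Monocypher, 2-3 KB for verification */
void handle_frame(void);  /* deep: business logic, crypto already done */

void task_loop(void)
{
    for (;;) {
        receive_frame();
        decrypt_frame();
        handle_frame();
    }
}
```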

mliberty1 commented 1 year ago

Just to add perspective, the new Joulescope JS220 uses Monocypher on an ARM Cortex-M7 microcontroller with 1 MB flash (128 kB available for the updater) and 378 kB RAM, of which 256 kB is available for data. No issues with code, data, or stack size. Monocypher works great as-is for validating and decrypting firmware updates.

I would have liked to use ChaCha-20 & EdDSA for firmware updates to a soft-core 32-bit RISC-V IMC running inside a Lattice ECP5 FPGA with 20 kB RAM for both program and data. The rest of the code is very small, so Monocypher could take more than half of this space. Monocypher was close to fitting, but I couldn't figure out how to make EdDSA smaller. If I recall correctly, the reference EdDSA implementation fit, but it took 17 seconds to run.

In both cases, compatibility with anything is not a requirement. I can encrypt & sign firmware images in implementation-specific ways.

Regardless, thank you for your excellent work on Monocypher!

LoupVaillant commented 1 year ago

I would have liked to use ChaCha-20 & EdDSA for firmware updates to a soft-core 32-bit RISC-V IMC running inside a Lattice ECP5 FPGA with 20 kB RAM for both program and data. The rest of the code is very small, so Monocypher could take more than half of this space. Monocypher was close to fitting, but I couldn't figure out how to make EdDSA smaller. If I recall correctly, the reference EdDSA implementation fit, but it took 17 seconds to run.

Off the top of my head, it depends. One thing that influences code size is the presence of a 32->64 bit mul-add instruction. Without it, bignum code gets quite a bit bigger, and rolling the loops there has a significant effect (see #254 for the savings on Cortex-M0). The best option, though, would probably be to scale the bignum arithmetic down to 16->32 multiplications.

Then we have the more generic code size reductions:

Doing all of the above would probably divide speed by 3, give or take. It needs to be measured, of course. It also depends on what exactly we do: some of the speed losses compound, others become less important as other parts get slower. Achieving the most code reduction for the least speed loss will require some measurement. I'll need either real chips, or cycle-accurate emulators.

fscoto commented 1 year ago

For reference, the NIST lightweight cryptography competition finished recently, and Ascon won for both AEAD and hashing. It wouldn't really be a drop-in replacement for ChaCha20 and BLAKE2b because it has a 128-bit key size (AEAD) and a 256-bit output size (hash), so a Monocypher edition/sister project built around it would probably need to be thought of differently.

LoupVaillant commented 1 year ago

Ah, I see there's a case for making incompatible changes there. My biggest problem is the IoT/server use case: if we specialise the library for IoT to the point of using different algorithms, we would perhaps need a server version with the embedded algorithms as well.

I'm also not too keen on 128-bit ciphers. 128-bit collision resistance (SHA-256, Curve25519…) is more than enough, but when it comes to key search and preimages it's a bit shakier. I mean, it's most probably well beyond the means of any criminal organisation, but state actors… it's less boring than I'd like.

Finally, we need to see performance charts for comparable security margins. I wonder how much faster than ChaCha8/Poly1305 this new AEAD is.