How to reduce code size

skoe commented 3 years ago

Hi Loup,

we are still working on integrating Monocypher. As our target has quite limited code space, size is always an issue. We are happy about BLAKE2_NO_UNROLLING and wondering whether there are more potential places to re-roll some code, even if it comes with a (small) speed penalty. Maybe you even have an older, rolled version of some functions we could benchmark?

These are the largest functions at the moment:

0000026a t poly_block
00000458 t fe_sq
000005f8 t fe_mul
00000898 t blake2b_compress
0000095a t scalarmult.constprop.5

tankf33der commented 3 years ago

@skoe

Whats is your target and tool chain?
Did you already tried “-Os” switch for size optimization in your compiler?
Also feel free to remove functions you dont use.

skoe commented 3 years ago

It's an Cortex M3, compiled with arm-none-eabi-gcc -c -std=c99 -Wall -Wextra -Werror -pedantic-errors -Os -ggdb -mthumb -mcpu=cortex-m3 -fno-builtin -DBLAKE2_NO_UNROLLING. All unused functions are #ifdef'd.

tankf33der commented 3 years ago

This repo contains patch from @LoupVaillant for demo and experimenting to fit ed25519 code to RL78 (16bit, 32KB) for some user from internet. After benchmarking (slow on this chip) they switch to more powerful model and more memory and get satisfied.

I could imagine @LoupVaillant re-implement functions in 32bit space and create a separate repo for this to reduce size and performance for embedded devices.

LoupVaillant commented 3 years ago

Monocypher does not play nice with 16 bit CPUs. 64-bit arithmetic has a nasty tendency to bloat the code sizes. Someday I'll write a 16-bit edition. Now on to my advice:

#ifdef'ing functions is often not needed: compilers can often prune out unused symbols.
64-bit multiplication is the biggest source of bloat in 16-bit CPUs. This affect Poly1305 and Curve25519 (both key exchange and sign/verify). Monocypher is currently not the best tool this job. If you're really really tight, some parts should currently be sought elsewhere:
- I don't know what to recommend for Poly1305.
- For public key crypto, my money is currently on C25519 (Monocypher's tarball contains a copy for performance benchmarks). It will be slower, but also much smaller.
- For Blake2b, I'd recommend changing the hash altogether. Use Blake2s or Blake3 (SHA-256 is possible, but beware length extension attacks with it). I believe the reference implementations in C are good.
If you need Ed25519 compatibility (in the optional files), you won't be able to avoid SHA-512. In that case, drop Blake2b entirely and use SHA-512 (and HMAC) instead. No need for another hash.
fe_sq is a squaring function. Just replace its implementation by fe_mul(h, f, f);. Easy win.
The function fe_mul can be un-unrolled. This should seriously reduce its size. Also, the carry propagation code in FE_CARRY could be simplified and rolled, but that's more delicate (carry propagation is a notorious source of hard to test bugs, I wouldn't trust even my test suite with such fire).
I have no quick fix for Poly1305. Consider using a hash instead for authenticated encryption (Blake2b, Blake2s, or HMAC if you're using SHA-512).
I have no quick fix for the scalarmult constant: the algorithm are intimately tied to the tables, they're not easily removed. Note however that b_comb is only used for signing, and b_window is only used for verification. Also, b_window can easily be reduced by reducing the value of the B_W_WIDTH macro (I believe that with a value of 2, the table only needs one element). b_comb however is harder to shrink, we'd have to recompute it all.
if you can avoid EdDSA altogether, that should remove a significant bit of bloat. Consider doing everything with key exchange, if possible. For instance, instead of signing something, establish an encrypted channel between the trusted source and the device, and send the data through there. This will have far reaching implications of course, but if you're really really really tight on memory, that could be considered.
Heck, in the extreme case, you could consider dropping public key encryption altogether, and manage symmetric keys instead. Mighty cumbersome for servers, but still possible, and fairly secure to boot: you can even keep forward secrecy if you use a ratchet.

skoe commented 3 years ago

Thanks for the quick response.

To prevent a misunderstanding: The Cortex-M3 is a plain 32 bit ARM core, it usually comes in microcontrollers with between a few kByte to 1 MByte of program memory. It has only 32 bit registers and operations, with a few exceptions like multiply with accumulate (32 x 32 + 64) which has a 64-bit result.

We still have enough flash at the moment but there is not much space left for improvements and new features, that's why I want to look for options we might need in future, to avoid a dead end popping up in a few months or so.

It's a really nice list of options. I'll look into some of them, starting with the ones that look easiest and safest, e.g., to replace fe_sq with fe_mul and benchmark the result (I saw the comment in the source already yesterday but didn't try it yet :). I'll let you know the result here.

ghost commented 3 years ago

I had a similar size issue, and was using sha512. I managed to drop the blake2b functions by just removing the vtable for those: https://github.com/LoupVaillant/Monocypher/pull/198

LoupVaillant / Monocypher

How to reduce code size #196