LoupVaillant / Monocypher

An easy to use, easy to deploy crypto library
https://monocypher.org

How to reduce code size #196

Closed skoe closed 3 years ago

skoe commented 3 years ago

Hi Loup,

We are still working on integrating Monocypher. As our target has quite limited code space, size is always an issue. We are happy about BLAKE2_NO_UNROLLING and are wondering whether there are more places where code could be re-rolled, even if it comes with a (small) speed penalty. Maybe you even have an older, rolled version of some functions that we could benchmark?

These are the largest functions at the moment:

0000026a t poly_block
00000458 t fe_sq
000005f8 t fe_mul
00000898 t blake2b_compress
0000095a t scalarmult.constprop.5

tankf33der commented 3 years ago

@skoe

skoe commented 3 years ago

It's a Cortex-M3, compiled with arm-none-eabi-gcc -c -std=c99 -Wall -Wextra -Werror -pedantic-errors -Os -ggdb -mthumb -mcpu=cortex-m3 -fno-builtin -DBLAKE2_NO_UNROLLING. All unused functions are #ifdef'd out.

tankf33der commented 3 years ago

This repo contains a patch from @LoupVaillant, written as a demo and experiment to fit the ed25519 code onto an RL78 (16-bit, 32 KB) for a user from the internet. After benchmarking (it was slow on this chip), they switched to a more powerful model with more memory and were satisfied.

I could imagine @LoupVaillant re-implementing the functions for 32-bit targets and creating a separate repo for this, to reduce size and improve performance on embedded devices.

LoupVaillant commented 3 years ago

Monocypher does not play nice with 16-bit CPUs. 64-bit arithmetic has a nasty tendency to bloat the code size. Someday I'll write a 16-bit edition. Now on to my advice:

skoe commented 3 years ago

Thanks for the quick response.

To prevent a misunderstanding: the Cortex-M3 is a plain 32-bit ARM core. It usually comes in microcontrollers with anywhere from a few kBytes to 1 MByte of program memory. It has only 32-bit registers and operations, with a few exceptions such as multiply-accumulate (32 x 32 + 64), which has a 64-bit result.
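To illustrate the 32 x 32 + 64 operation mentioned above, here is a minimal C sketch. The function names `mul_acc` and `dot32` are made up for this example and are not part of Monocypher. Written this way, a compiler targeting the Cortex-M3 can usually map the expression to a single long multiply-accumulate instruction, though that depends on the compiler and flags.

```c
#include <stdint.h>

/* Hypothetical helper (not from Monocypher): 32 x 32 + 64 -> 64.
 * Widening one operand to uint64_t before the multiplication keeps
 * the full 64-bit product instead of truncating it to 32 bits. */
uint64_t mul_acc(uint64_t acc, uint32_t a, uint32_t b)
{
    return acc + (uint64_t)a * b;
}

/* Hypothetical usage: dot product of two 32-bit vectors with a
 * 64-bit accumulator (wraps modulo 2^64 if it overflows). */
uint64_t dot32(const uint32_t *x, const uint32_t *y, int n)
{
    uint64_t acc = 0;
    for (int i = 0; i < n; i++) {
        acc = mul_acc(acc, x[i], y[i]);
    }
    return acc;
}
```

Inspecting the generated assembly (for example with -S) is the reliable way to check what the compiler actually emits for a given build.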

We still have enough flash at the moment, but there is not much space left for improvements and new features. That's why I want to look at options we might need in the future, to avoid a dead end popping up in a few months or so.

It's a really nice list of options. I'll look into some of them, starting with the ones that look easiest and safest, e.g., replacing fe_sq with fe_mul and benchmarking the result (I already saw the comment in the source yesterday but haven't tried it yet :). I'll let you know the result here.
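The idea of replacing a dedicated squaring routine with the general multiplication can be sketched as follows. This is a toy model, not Monocypher's code: `toy_fe`, `TOY_P`, `toy_fe_mul` and `toy_fe_sq` are hypothetical names, and the field here is the integers modulo 2^31 - 1 rather than Monocypher's multi-limb field.

```c
#include <stdint.h>

/* Toy stand-in for a field element: integers modulo p = 2^31 - 1.
 * This single-word type is an assumption for illustration only. */
typedef uint32_t toy_fe;
#define TOY_P 2147483647u

/* General multiplication: returns f * g mod p.
 * Inputs are assumed to be already reduced (less than TOY_P). */
toy_fe toy_fe_mul(toy_fe f, toy_fe g)
{
    return (toy_fe)(((uint64_t)f * g) % TOY_P);
}

/* Size-oriented squaring: no dedicated routine, just reuse the
 * multiplication. A specialised squaring can exploit symmetry to be
 * faster, at the cost of a second large function in flash. */
toy_fe toy_fe_sq(toy_fe f)
{
    return toy_fe_mul(f, f);
}
```

The trade-off is the one described in the thread: one large function disappears from the binary, and every squaring pays the cost of a full multiplication, so the change is worth benchmarking on the target.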

ghost commented 3 years ago

I had a similar size issue, and was using sha512. I managed to drop the blake2b functions by just removing the vtable for those: https://github.com/LoupVaillant/Monocypher/pull/198
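One plausible reason this helps: a table of function pointers that the rest of the code references counts as a use of every function it points to, so the toolchain generally has to keep those functions. The sketch below shows the pattern with made-up names (`toy_hash_vtable`, `toy_hash_a`, `toy_hash_b`, `TOY_NO_HASH_B`); it is not Monocypher's actual interface and does not reproduce the linked pull request.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical hash interface selected through a function pointer. */
typedef struct {
    uint32_t (*hash)(const uint8_t *msg, size_t len);
} toy_hash_vtable;

/* Toy "hash" A: sum of all bytes. */
uint32_t toy_hash_a(const uint8_t *msg, size_t len)
{
    uint32_t h = 0;
    for (size_t i = 0; i < len; i++) { h += msg[i]; }
    return h;
}
const toy_hash_vtable toy_vtable_a = { toy_hash_a };

#ifndef TOY_NO_HASH_B
/* Toy "hash" B: XOR of all bytes. Defining TOY_NO_HASH_B removes both
 * the function and the table that would otherwise reference it. */
uint32_t toy_hash_b(const uint8_t *msg, size_t len)
{
    uint32_t h = 0;
    for (size_t i = 0; i < len; i++) { h ^= msg[i]; }
    return h;
}
const toy_hash_vtable toy_vtable_b = { toy_hash_b };
#endif

/* Caller that works with whichever table it is given. */
uint32_t toy_digest(const toy_hash_vtable *v, const uint8_t *msg, size_t len)
{
    return v->hash(msg, len);
}
```

Building with -DTOY_NO_HASH_B drops the second implementation entirely, which is the same effect as removing an unused vtable by hand.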