Add `crypto_argon2i_with_ctx` for a more stack friendly argon2i function

Hello!

I have been trying to use crypto_argon2i on a ESP8266 which has only 80KB of RAM and when using the Arduino framework, the stack size is restricted to 4KB (that cannot be changed). Unfortunately, crypto_argon2i is making the device crashing because this function allocates roughly between 3KB to 4KB of stack space.

I have changed my local monocypher.h/monocypher.c to pass a heap allocated crypto_argon2i_ctx that gives these 3KB space from a different place, and I'm now able to run it on this device. The normal function would use these ctx function by having crypto_argon2i_ctx declared locally.

I would like to propose an additional API for argon2i that would allow this. For example:

typedef struct {
    uint8_t buffer[3 * 1024];
} crypto_argon2i_ctx;

void crypto_argon2i_with_ctx(crypto_argon2i_ctx* ctx, 
                    uint8_t       *hash,      uint32_t hash_size,     // >= 4
                    void          *work_area, uint32_t nb_blocks,     // >= 8
                    uint32_t       nb_iterations,                     // >= 3
                    const uint8_t *password,  uint32_t password_size,
                    const uint8_t *salt,      uint32_t salt_size);    // >= 8

void crypto_argon2i_general_with_ctx(crypto_argon2i_ctx* ctx, 
                            uint8_t       *hash,      uint32_t hash_size,// >= 4
                            void          *work_area, uint32_t nb_blocks,// >= 8
                            uint32_t       nb_iterations,                // >= 3
                            const uint8_t *password,  uint32_t password_size,
                            const uint8_t *salt,      uint32_t salt_size,// >= 8
                            const uint8_t *key,       uint32_t key_size,
                            const uint8_t *ad,        uint32_t ad_size);

Then inside crypto_argon2i_general_with_ctx, the code is using crypto_argon2i_ctx.buffer and bump into it, so instead of having something like that:

        u8 initial_hash[72]; // 64 bytes plus 2 words for future hashes
        crypto_blake2b_final(pctx, initial_hash);

it would be replaced by something like that:

        u8* initial_hash = (u8*)pbuffer;   // pbuffer preinitialized at the beginning with ctx->buffer
        pbuffer += 72; // 64 bytes plus 2 words for future hashes
        crypto_blake2b_final(pctx, initial_hash);

What do you think?

Hmm, I believe you are using Argon2i way outside its intended scope. Not only is it memory hard, it uses 64-bit modular multiplication internally to maximise circuit depth. To be any good, it must run on a fast 64-bit CPU, with lots of memory. You are using a slow 32-bit processor with little memory. That's a bad idea.

While you can't do anything about your CPU's speed, you can give it a slow-ish hash that plays to its strengths. Forget about memory hardness, and forget about 64-bit operations. What you need is PBKDF2 with a 32-bit hash: SHA-256, Blake2s, Blake3… or HChacha20. That last one is not a "real" hash, but it's 32-bits, and since it is iterated, its weaknesses presumably don't matter.

My advice: hash your password once with Blake2b, then run PBKDF2/HChacha20 on that hash. With your microcontroller, that's likely the best Monocypher can give you to minimise attacker advantage. Mind you, it won't be terrific (you're still using a relatively slow CPU), but it will almost certainly give you better security than Argon2. To confirm this, you should compare the performance of your microcontroller and your fastest desktop. You want to pick the primitive with the smallest gap. My money is on PBKDF2/Chacha20.

Do tell me if that works for you.

Back to your proposal, that could work, but it's only useful in cases where using Argon2i is (in my opinion) a bad idea to begin with. I can't condone this change. I did notice however that I was using one too many blocks at some point. It hardly changes anything, but it does make the code a bit cleaner. Thanks for bringing this to my attention.

Hmm, I believe you are using Argon2i way outside its intended scope. Not only is it memory hard, it uses 64-bit modular multiplication internally to maximise circuit depth. To be any good, it must run on a fast 64-bit CPU, with lots of memory. You are using a slow 32-bit processor with little memory. That's a bad idea.

Ok, I trust you, I'm definitely not a cryptographic specialist! 😅

Just for the context: I want to create a simple protocol between an IoT and a server where the message are encrypted via crypto_lock/unlock with a key that is changing randomly based on a (crypto)-PRNG.

I thought about using argon2i as a KDF, by feeding it with an initial 128 bytes true random number, and create a new hash of 256 bytes from this input (and reuse part of it for the 32 byte key for crypto_lock...)... iteratively...

hence, why I thought about using argon2i in a simple manner: 8 blocks of 1024 bytes, 3 iterations. It takes roughly 2 to 5ms on an ESP8266 to generate the new hash, which is not that bad for something that would change from time to time (not on every packet).

I could certainly use a chain of Blake2b + PBKDF2/HChacha20 (haven't checked if available in Monocypher?) for that, though I don't know if it would be better than argon2i.

What's your thoughts?

Just for the context: I want to create a simple protocol between an IoT and a server where the message are encrypted via crypto_lock/unlock with a key that is changing randomly based on a (crypto)-PRNG.

Oh, I see. First, Argon2i is a password key derivation algorithm. It is only needed to derive key from low entropy inputs, such as passwords. So is PBKDF2, by the way. If you can use a high entropy input, you want a regular key derivation mechanism: HKDF is the one you generally hear about, but Blake2b in keyed mode also works. Second, you don't even need key derivation. What you are describing seems to call for fast key erasure or a ratchet.

You can implement a CSPRNG on top of Chacha20 the following way:

Set up a work buffer that will contain the random pool. It must be some multiple of 64 bytes.
Fill the first 32 bytes (256-bits) of your buffer with a random seed, and mark the random pool as "zero byte left".
Every time you need random bytes:
- Take what you need from the pool.
- If the pool is empty, then:
  - Refill the whole pool with crypto_chacha20(). Use the first 32 byte of the pool as the key. The nonce is zero. The plaintext is a string of zeroes (Monocypher allows you to use a NULL pointer instead).
  - Pretend the first 32 bytes of the pool are already used (they will be used to fill the next pool).
  - Mark the pool as "pool_size - 32 bytes left".

The minimum pool size you want is 64 bytes. On a desktop computer, I would use 512 bytes. On your micro-controller, I would advise something like 128 bytes, maybe more if you have memory to spare. You can also erase the random bits as soon as you use them. This method is called fast key erasure.

Note that such a deterministic RNG is prone to some subtle yet catastrophic errors: if you copy it in any way, (for instance by forking processes), you will end up repeating random numbers. Only use it when you don't have a choice, like… on that micro-controller, I guess.

Ratchets have a different purpose, but they work the same: the idea is to change your encryption key from time to time. You can use the same method as the RNG above, with one difference: instead using a random seed, you use a shared key, known to both client and server. That shared key can possibly have been determined by a prior authenticated key exchange. Or it may have been manually copied from the server to the device by a trusted employee, or yourself.

Then, every time you send a message, you generate a fresh key from the ratchet. When the server receives a message, it first generates a fresh key from its own ratchet. Since they have the same seed, their state is kept in synchrony, and both client & server will generate the same keys. Note that changing keys for each message that way means you no longer need a nonce, which you can set to zero. You can even gain a bit of efficiency by using Chacha20 instead of XChacha20 in crypto_lock(), so it becomes fully compatible with RFC 8439.

What I like about these schemes is that they only need a stream cipher like Chacha20. So if your program doesn't need to use general purpose hashes elsewhere, you can enjoy a smaller binary size. And before you ask whether we should add those obviously very useful utilities to Monocypher, the answer is no. They belong to a separate project, which I have yet to initiate (right now I'm busy working on authenticated key exchange).

I could certainly use a chain of Blake2b + PBKDF2/HChacha20 (haven't checked if available in Monocypher?) for that, though I don't know if it would be better than argon2i.

Monocypher only provides HChacha20. You'd have to write PBKDF2 on top of it (it's fairly simple). As for which is better, that is highly contextual. I would never recommend PBKDF2 over Argon2i on a modern iOS/Android palmtop for instance: they're not as powerful as desktop computers, but they often have a 64-bit CPU, and they definitely have enough memory to run Argon2i. In the specific case of your micro-controller, you need to test. (Unless you don't: your use case after all doesn't seem to call for a password hash to begin with.)

Thanks @LoupVaillant, I really appreciate all these helpful advices. Indeed I can see that ChaCha20 should be plenty enough to do what I would like to achieve.

Btw, as I had some trouble inferring which functions might cause me some stacksize trouble on this micro-controller, I wrote a small tool to evaluate a maximum stack size (transitive) from an Intel x64 compilation. You might find this useful or it could be handy somewhere maybe in the doc to help understand the usability of these functions on a micro-controller:

# Max stackalloc size in bytes
crypto_argon2i                           4384   (self: 104)
crypto_argon2i_general                   4272   (self: 3352)
crypto_check                             2192   (self: 400)
crypto_check_final                       1784   (self: 1016)
crypto_sign                              1736   (self: 552)
crypto_hidden_key_pair                   1576   (self: 144)
crypto_x25519_dirty_fast                 1424   (self: 552)
crypto_x25519_inverse                    1304   (self: 208)
crypto_sign_init_first_pass_custom_hash  1176   (self: 40)
crypto_x25519                            1168   (self: 72)
crypto_x25519_public_key                 1168   (self: 72)
crypto_x25519_dirty_small                1168   (self: 72)
crypto_key_exchange                      1160   (self: 64)
crypto_sign_public_key_custom_hash       1128   (self: 256)
crypto_sign_init_second_pass             1072   (self: 200)
crypto_hidden_to_curve                   928    (self: 312)
crypto_curve_to_hidden                   848    (self: 232)
crypto_blake2b_general                   704    (self: 392)
crypto_from_eddsa_public                 680    (self: 128)
crypto_from_eddsa_private                640    (self: 328)
crypto_blake2b                           576    (self: 264)
crypto_unlock                            512    (self: 72)
crypto_lock                              496    (self: 72)
crypto_sign_final                        488    (self: 104)
crypto_unlock_aead                       432    (self: 168)
crypto_lock_aead                         416    (self: 152)
crypto_xchacha20_ctr                     352    (self: 88)
crypto_xchacha20                         344    (self: 80)
crypto_chacha20                          320    (self: 56)
crypto_ietf_chacha20                     320    (self: 56)
crypto_ietf_chacha20_ctr                 312    (self: 48)
crypto_blake2b_update                    304    (self: 56)
crypto_blake2b_final                     280    (self: 32)
crypto_chacha20_ctr                      256    (self: 224)
crypto_poly1305                          200    (self: 120)
crypto_blake2b_general_init              160    (self: 160)
crypto_hchacha20                         128    (self: 96)
crypto_poly1305_update                   72     (self: 48)
crypto_poly1305_final                    56     (self: 32)
crypto_check_init_custom_hash            32     (self: 32)

[Edit]Updated as the calculation was cumulating calls instead of taking the max[/Edit]

Oh, that's neat! We should definitely document this somewhere, perhaps add your tools to the tests/ folder. What did you write your tool in?

What did you write your tool in?

It's in C#, can be written easily in python. I'm just using the output of clang -S

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;

namespace StackSizeApp
{
    class Program
    {
        static void Main(string[] args)
        {
            // generated with "C:\Program Files\LLVM\bin\clang.exe"  -O3 monocypher.c -masm=intel -S
            var lines = File.ReadAllLines("monocypher.s");
            var functions = new Dictionary<string, Function>();

            // Parse asm file on .seh sections
            Function currentFunction = null;
            foreach (var line in lines)
            {
                if (currentFunction == null)
                {
                    var match = Regex.Match(line, @"^\.seh_proc\s+([\w_]+)");
                    if (match.Success)
                    {
                        currentFunction = new Function(match.Groups[1].Value);
                        functions.Add(currentFunction.Name, currentFunction);
                    }
                    continue;
                }

                if (line.Contains(".seh_endproc"))
                {
                    currentFunction = null;
                    continue;
                }

                // Record calls
                var call = Regex.Match(line, @"^\s+call\s+([\w_]+)$");
                if (call.Success)
                {
                    currentFunction.CalledFunctionList.Add(call.Groups[1].Value);
                }

                // Record stack alloc
                var matchStackAlloc = Regex.Match(line, @"^\s*\.seh_stackalloc\s+(\d+)");
                if (matchStackAlloc.Success)
                {
                    currentFunction.StackSize = int.Parse(matchStackAlloc.Groups[1].Value);
                }
            }

            // Propagate stack size
            foreach(var func in functions.Values)
            {
                VisitStackSize(func, functions, new HashSet<string>() { func.Name });
            }

            // Display ordered by decreasing stack size
            foreach (var func in functions.Values.Where(x => x.Name.StartsWith("crypto_")).OrderByDescending(x => x.MaxStackSize))
            {
                Console.WriteLine($"{func.Name,-40} {(func.MaxStackSize == int.MaxValue ? "recursive"  : func.MaxStackSize) , -6} (self: {func.StackSize})");
            }
        }

        private const int CallPointerSize = 8;

        private static void VisitStackSize(Function function, Dictionary<string, Function> functions, HashSet<string> visited)
        {
            if (function.MaxStackSize > 0) return;

            var maxChildStackSize = 0;
            foreach (var called in function.CalledFunctionList)
            {
                if (functions.TryGetValue(called, out var calledFunction))
                {
                    // Make sure that recursive function report as recursive
                    var newVisited = new HashSet<string>(visited);
                    if (!newVisited.Add(calledFunction.Name))
                    {
                        function.MaxStackSize = int.MaxValue;
                        return;
                    }
                    VisitStackSize(calledFunction, functions, newVisited);

                    // Propagate recursive
                    if (calledFunction.MaxStackSize == int.MaxValue)
                    {
                        function.MaxStackSize = int.MaxValue;
                        return;
                    }

                    // Record stack size (+ saved return pointer of call)
                    var callSize = CallPointerSize + calledFunction.MaxStackSize;
                    if (callSize > maxChildStackSize)
                    {
                        maxChildStackSize = callSize;
                    }
                }
            }

            function.MaxStackSize = function.StackSize + maxChildStackSize;
        }

        private class Function
        {
            public Function(string name)
            {
                Name = name;
                CalledFunctionList = new HashSet<string>();
            }

            public string Name { get; }

            public int StackSize { get; set; }

            // if == int.MaxValue, the function is recursive
            public int MaxStackSize { get; set; }

            public HashSet<string> CalledFunctionList { get; }

        }
    }
}

Nice, thanks! Do you think you can port it to Python yourself (under the same "public domain" dual license as everything else)? I figured with all your efforts that you deserved to have your name written somewhere, so this could be it.

Or I can write it myself using your code as a guide, and credit you for the idea. Which do you prefer?

Nice, thanks! Do you think you can port it to Python yourself (under the same "public domain" dual license as everything else)?

Ahah, I wrote maybe some python like 20 years ago, I could write it, but I'm likely too rusty and will have to google search every single line equivalent! 😅

I figured with all your efforts that you deserved to have your name written somewhere, so this could be it. Or I can write it myself using your code as a guide, and credit you for the idea. Which do you prefer?

Yeah, you can reuse it entirely, it's quite simple so it can be translated to whatever language you feel productive with.

OK, I'll do it then. Thanks!

I just thought I'd note that gcc can generate stack usage itself using -fstack-usage:

$ gcc -v
...
gcc version 11.2.1 20220115 (Gentoo 11.2.1_p20220115 p4)
$ gcc -c -I./include -fstack-usage src/monocypher.c
$ sort monocypher.su -t '      ' -k 2 -h > monocypher.su.sorted

producing the following output:

src/monocypher.c:1064:13:fe_0               16  static
src/monocypher.c:1065:13:fe_1               16  static
src/monocypher.c:1067:13:fe_copy            16  static
src/monocypher.c:1068:13:fe_neg             16  static
src/monocypher.c:1069:13:fe_add             16  static
src/monocypher.c:1070:13:fe_sub             16  static
src/monocypher.c:1072:13:fe_cswap           16  static
src/monocypher.c:1082:13:fe_ccopy           16  static
src/monocypher.c:110:13:store32_le          16  static
src/monocypher.c:1287:13:fe_mul_small           16  static
src/monocypher.c:137:12:rotr64              16  static
src/monocypher.c:138:12:rotl32              16  static
src/monocypher.c:140:12:neq0                16  static
src/monocypher.c:1566:13:trim_scalar            16  static
src/monocypher.c:1574:12:scalar_bit         16  static
src/monocypher.c:158:6:crypto_wipe          16  static
src/monocypher.c:1668:13:multiply           16  static
src/monocypher.c:1681:12:is_above_l         16  static
src/monocypher.c:368:13:poly_clear_c            16  static
src/monocypher.c:374:13:poly_take_input         16  static
src/monocypher.c:486:13:blake2b_incr            16  static
src/monocypher.c:557:13:blake2b_set_input       16  static
src/monocypher.c:695:13:wipe_block          16  static
src/monocypher.c:720:13:copy_block          16  static
src/monocypher.c:721:14:xor_block           16  static
src/monocypher.c:85:15:align                16  static
src/monocypher.c:90:12:load24_le            16  static
src/monocypher.c:97:12:load32_le            16  static
src/monocypher.c:1787:13:ge_zero            24  static
src/monocypher.c:568:13:blake2b_end_block       24  static
src/monocypher.c:105:12:load64_le           32  static
src/monocypher.c:118:13:store64_le          32  static
src/monocypher.c:1427:13:fe_sq2             32  static
src/monocypher.c:154:5:crypto_verify16          32  static
src/monocypher.c:155:5:crypto_verify32          32  static
src/monocypher.c:156:5:crypto_verify64          32  static
src/monocypher.c:1654:6:crypto_x25519_public_key    32  static
src/monocypher.c:1857:13:ge_cache           32  static
src/monocypher.c:197:13:chacha20_init_key       32  static
src/monocypher.c:2312:6:crypto_sign_public_key      32  static
src/monocypher.c:609:6:crypto_blake2b_init      32  static
src/monocypher.c:670:13:blake2b_vtable_init     32  static
src/monocypher.c:676:13:blake2b_vtable_final        32  static
src/monocypher.c:710:13:load_block          32  static
src/monocypher.c:715:13:store_block         32  static
src/monocypher.c:820:13:unary_g             32  static
src/monocypher.c:152:12:x32             40  static
src/monocypher.c:153:12:x64             40  static
src/monocypher.c:1941:13:ge_double          40  static
src/monocypher.c:673:13:blake2b_vtable_update       40  static
src/monocypher.c:147:12:x16             48  static
src/monocypher.c:1836:12:ge_frombytes_vartime       48  static
src/monocypher.c:2019:13:slide_init         48  static
src/monocypher.c:2344:6:crypto_sign_init_first_pass 48  static
src/monocypher.c:2352:6:crypto_sign_update      48  static
src/monocypher.c:2419:6:crypto_check_init       48  static
src/monocypher.c:2426:6:crypto_check_update     48  static
src/monocypher.c:2840:6:crypto_key_exchange     48  static
src/monocypher.c:665:6:crypto_blake2b           48  static
src/monocypher.c:702:13:blake_update_32         48  static
src/monocypher.c:771:13:g_rounds            48  static
src/monocypher.c:798:13:g_copy              48  static
src/monocypher.c:808:13:g_xor               48  static
src/monocypher.c:866:13:gidx_init           48  static
src/monocypher.c:130:13:store32_le_buf          56  static
src/monocypher.c:133:13:store64_le_buf          56  static
src/monocypher.c:1901:13:ge_madd            56  static
src/monocypher.c:1921:13:ge_msub            56  static
src/monocypher.c:382:13:poly_update         56  static
src/monocypher.c:577:13:blake2b_update          56  static
src/monocypher.c:614:6:crypto_blake2b_update        56  static
src/monocypher.c:124:13:load32_le_buf           64  static
src/monocypher.c:127:13:load64_le_buf           64  static
src/monocypher.c:1697:13:remove_l           64  static
src/monocypher.c:2406:6:crypto_check_init_custom_hash   64  static
src/monocypher.c:2529:13:add_xl             64  static
src/monocypher.c:295:6:crypto_chacha20          64  static
src/monocypher.c:301:6:crypto_ietf_chacha20     64  static
src/monocypher.c:308:6:crypto_xchacha20         64  static
src/monocypher.c:325:13:poly_block          64  static
src/monocypher.c:394:6:crypto_poly1305_init     64  static
src/monocypher.c:436:6:crypto_poly1305_final        64  static
src/monocypher.c:639:6:crypto_blake2b_final     64  static
src/monocypher.c:408:6:crypto_poly1305_update       72  static
src/monocypher.c:1457:13:fe_invert          80  static
src/monocypher.c:2040:12:slide_step         80  static
src/monocypher.c:2318:6:crypto_sign_init_first_pass_custom_hash 80  static
src/monocypher.c:3024:6:crypto_lock         80  dynamic,bounded
src/monocypher.c:3031:5:crypto_unlock           80  dynamic,bounded
src/monocypher.c:2230:13:lookup_add         88  static
src/monocypher.c:1469:12:fe_isodd           96  static
src/monocypher.c:2553:6:crypto_x25519_dirty_small   96  static
src/monocypher.c:274:5:crypto_ietf_chacha20_ctr     96  static
src/monocypher.c:1363:13:fe_sq              104 static
src/monocypher.c:1243:13:fe_tobytes         112 static
src/monocypher.c:1642:6:crypto_x25519           112 static
src/monocypher.c:173:13:chacha20_rounds         112 static
src/monocypher.c:1751:13:reduce             112 static
src/monocypher.c:283:5:crypto_xchacha20_ctr     112 static
src/monocypher.c:888:12:gidx_next           112 static
src/monocypher.c:1022:6:crypto_argon2i          128 dynamic,bounded
src/monocypher.c:1217:13:fe_frombytes           128 static
src/monocypher.c:1479:12:fe_isequal         128 static
src/monocypher.c:203:6:crypto_hchacha20         128 static
src/monocypher.c:2459:6:crypto_from_eddsa_private   128 static
src/monocypher.c:2467:6:crypto_from_eddsa_public    128 static
src/monocypher.c:1307:13:fe_mul             144 static
src/monocypher.c:1548:12:invsqrt            144 static
src/monocypher.c:1867:13:ge_add             144 static
src/monocypher.c:2378:6:crypto_sign_final       144 static
src/monocypher.c:2432:5:crypto_check_final      144 static
src/monocypher.c:466:6:crypto_poly1305          144 static
src/monocypher.c:1795:13:ge_tobytes         176 static
src/monocypher.c:2967:13:lock_auth          176 static
src/monocypher.c:496:13:blake2b_compress        176 static
src/monocypher.c:2815:6:crypto_hidden_key_pair      192 static
src/monocypher.c:2864:13:redc               192 static
src/monocypher.c:2984:6:crypto_lock_aead        192 static
src/monocypher.c:1760:13:mul_add            208 static
src/monocypher.c:2781:5:crypto_curve_to_hidden      208 static
src/monocypher.c:3000:5:crypto_unlock_aead      208 static
src/monocypher.c:587:6:crypto_blake2b_general_init  208 static
src/monocypher.c:1709:13:mod_l              224 static
src/monocypher.c:1891:13:ge_sub             224 static
src/monocypher.c:2358:6:crypto_sign_init_second_pass    224 static
src/monocypher.c:1434:13:fe_pow22523            240 static
src/monocypher.c:2111:12:ge_r_check         256 static
src/monocypher.c:2298:6:crypto_sign_public_key_custom_hash  288 static
src/monocypher.c:2903:6:crypto_x25519_inverse       288 static
src/monocypher.c:655:6:crypto_blake2b_general       304 static
src/monocypher.c:728:13:extended_hash           304 static
src/monocypher.c:216:5:crypto_chacha20_ctr      336 static
src/monocypher.c:2708:6:crypto_hidden_to_curve      336 static
src/monocypher.c:1583:13:scalarmult         400 static
src/monocypher.c:2446:5:crypto_check            432 static
src/monocypher.c:2392:6:crypto_sign         448 static
src/monocypher.c:2252:13:ge_scalarmult_base     496 dynamic,bounded
src/monocypher.c:2589:6:crypto_x25519_dirty_fast    544 static
src/monocypher.c:2073:13:ge_double_scalarmult_vartime   800 static
src/monocypher.c:846:13:gidx_refresh            1088    static
src/monocypher.c:926:6:crypto_argon2i_general       3376    static

Apologies for the overlong comment.

Thanks for the suggestion @milesrout, but I tried this and I believe it is not enough: the stack usage reported in this option doesn't seem to take function calls into account. crypto_check for instance reports 432 bytes used, but I know for a fact it's actually a couple KB, much of which are consumed in ge_double_scalarmult_vartime.

I personally find very frustrating that the dedicated stack usage function from the compiler itself gives us such a useless metric. I don't care about the size of the local stack frame, I care about the size of the tallest tower of stack frames my function ultimately conjures.

You are right indeed. That's a pain. I think one could conjure up a little script to calculate this using -fcallgraph-info.

I did just that. Quite the hack, but here it is.

First, I compile the stack usage in monocypher.su:

add_xl 64
align 16
blake2b_compress 192
blake2b_set_input 16
blake2b_vtable_final 32
blake2b_vtable_init 32
blake2b_vtable_update 40
blake_update_32 48
chacha20_rounds 112
copy_block 16
...

Then I compiled the edges of the call graph in monocypher.cg:

add_xl load32_le
add_xl store32_le
blake2b_compress rotr64
blake2b_vtable_final crypto_blake2b_final
blake2b_vtable_init crypto_blake2b_init
blake2b_vtable_update crypto_blake2b_update
blake_update_32 crypto_blake2b_update
blake_update_32 crypto_wipe
blake_update_32 store32_le
chacha20_rounds rotl32
...

Finally, I ran the following Python3 script:

cg = open("monocypher.cg", 'r')

su = {}
for l in open("monocypher.su", 'r'):
    s = l.split()
    key = s[0]
    val = int(s[1])
    su[key] = val

cg = {}
for l in open("monocypher.cg", 'r'):
    s = l.split()
    key = s[0]
    val = s[1]
    if key in cg: cg[key].append(val)
    else        : cg[key] = [val]

def total_stack_usage(f):
    local = su[f] if f in su else 0
    calls = cg[f] if f in cg else []
    usage = map(total_stack_usage, calls)
    return local + max(usage, default=0)

af = {}
for f in su.keys(): af[f] = True
for f in cg.keys(): af[f] = True

ttu = []
for f in af.keys():
    ttu.append((total_stack_usage(f), f))
ttu = sorted(ttu, reverse=True)

for usage, f in ttu:
    print(usage, f)

Here's the result for the functions that matter:

4752    crypto_argon2i
2064    crypto_check
1872    crypto_sign
1680    crypto_hidden_key_pair
1336    crypto_x25519_inverse
1192    crypto_x25519_public_key
1160    crypto_x25519
896 crypto_unlock
864 crypto_lock
856 crypto_hidden_to_curve
776 crypto_curve_to_hidden
672 crypto_blake2b
640 crypto_xchacha20
624 crypto_blake2b_general
560 crypto_poly1305
528 crypto_chacha20
256 crypto_hchacha20
208 crypto_verify64
168 crypto_verify32
128 crypto_verify16
16  crypto_wipe

I could write something cleaner to get those results in a systematic manner, but I'm feeling lazy. Still, I'm glad to have taken the effort, there's a couple insights there:

Argon2i requires a ton of stack space: 4.7KB
EdDSA verification comes at a distant second with 2KB. A good compromise between speed and stack usage in my opinion.
Key exchange itself requires a little over 1KB, not much we can do about it.
I am fairly disappointed that AEAD seems to require close to 900 bytes. Likewise, Chacha20 and Poly1305 seem to require more space than I anticipated. I hope that's because my method over-estimates stack usage, but it could also be some design flaw… or a compilation option that leaves tons of stuff on the stack. Hmm :thinking:

LoupVaillant / Monocypher

Add `crypto_argon2i_with_ctx` for a more stack friendly argon2i function #207