Streaming Interface for authenticated encryption? #218

Closed. LoupVaillant closed this issue 1 year ago.

LoupVaillant commented 3 years ago

Now that Monocypher is just under 1900 lines of code, we have room for more functionality. One that is very tempting to add is streaming AEAD, similar to Libsodium's.

One reason for the temptation is the ability to share code with crypto_lock() and crypto_unlock(). If the streaming API is provided separately, then a bunch of code, most notably authentication, must be duplicated. Inside Monocypher however, we can define the current AEAD interface in terms of the streaming API. I have written a prototype that currently costs 49 lines of code (and 6 new functions, and one additional struct).

On the other hand, we don't want to include something of limited value. We may want other functionality in the future, and it would be a pity if we had to break the 2KLoC psychological barrier to get it. To be included, this streaming interface should satisfy the following criteria:

If possible, it should also satisfy those bonus criteria:

One major question remains about the "no major footguns" criterion: key commitment. As they stand, polynomial-hash-based ciphers are vulnerable to attacks where the same ciphertext can be successfully decrypted under several possible keys. And most importantly, an attacker could find those keys and exploit that for nefarious purposes. There's also a related question about message commitment, where it's possible to find several messages that produce the same authentication tag (that requires knowledge of the key).

We could solve those problems by using a random-key-robust AEAD, such as ChaCha20/BLAKE2b, or by adding a bunch of encrypted zeroes to the message (well, to each chunk, actually). This is either slower or unwieldy, not to mention the increased size overhead.

I believe there is another way: make sure the protocol leaves no room for tricking the recipient into using the wrong decryption key. Commit the key once, then use ordinary AEAD. Also, we need to determine how much of an issue the lack of message commitment may be.
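
A minimal sketch of that "commit first" idea, using Monocypher's existing BLAKE2b (my own illustration, not a vetted construction): derive both the encryption key and a public commitment from the shared secret, send the commitment in the clear, and have the recipient check it before attempting any decryption.

#include <string.h>
#include <monocypher.h>

/* Hypothetical helper: split one BLAKE2b hash of the shared secret into a
 * public commitment and the actual encryption key. The recipient recomputes
 * the commitment and rejects the message early if it does not match, leaving
 * no room to trick them into decrypting under the wrong key. */
void commit_key(uint8_t commitment[32], uint8_t enc_key[32],
                const uint8_t secret[32])
{
    uint8_t hash[64];
    crypto_blake2b(hash, secret, 32);  /* one 64-byte hash, split in two */
    memcpy(commitment, hash     , 32); /* transmitted in the clear       */
    memcpy(enc_key   , hash + 32, 32); /* used with crypto_lock()        */
    crypto_wipe(hash, sizeof(hash));
}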

If this works, then a ChaPoly based streaming interface is probably worth adding. If it does not work, then even crypto_lock() was probably not worth adding, and we'd better not compound the problem with yet another broken construction.

Food for thought, to be considered when we can make the time.

LoupVaillant commented 3 years ago

Okay, I found why people worry about commitment in the context of encrypted messages. Following this paper about partitioning oracles, Age decided to mitigate the attack by limiting the size of its chunks.

Partitioning oracles are about exploiting the fact that some AEAD constructions, including ChaPoly, allow attackers to craft a ciphertext that successfully decrypts under more than one key. Instinctively, you would think that a ciphertext can only decrypt under the key that generated it. Not quite:

The partitioning oracle attack works thus: first, the attacker crafts a ciphertext that decrypts under half of all possible keys, such that the attacker knows which half. Then they send that message to the victim, who candidly leaks whether decryption was successful or not (through an error message or a timing leak). Now the attacker has gained one bit of information about the key. Repeat the operation 255 more times, and voilà, the attacker has the whole key, and the victim is pwned.
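
For a rough sense of scale, here is my own back-of-the-envelope arithmetic (not figures from the paper). In the idealised case where every forged ciphertext splits the remaining candidate keys exactly in half, recovering a 256-bit key takes

    \log_2\left(2^{256}\right) = 256 \text{ queries,}

and in the more realistic case where a single ciphertext covers k candidate keys out of a pool of D low-entropy (say, password-derived) keys, exhausting the pool takes about D/k queries instead of D.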

While practical attacks aren't that powerful (against AES-GCM, the paper talks about covering up to 2^16 keys in a single message, and only up to 10 in the case of ChaPoly), they can still be quite scary in two scenarios:

Age mitigates those attacks by limiting the size of its chunks: with this limitation, it's only possible to cover two keys per query instead of a gazillion.


Now, as much as we could discourage scenarios under which there is a decryption oracle, we're going to have to assume users may fall into this trap. So as a crypto library, it would be best to only provide committing AEAD. Unfortunately, (i) Monocypher already provides a non-committing ChaPoly construction, and (ii) polynomial hashes are fast. So there seems to be a tension between usability and performance.

That said, there is a way to defeat partitioning oracles: commit the key before actually using it:

So, if we manage to commit the key before starting the secure session or decrypting the actual message, we should be good. No need to limit the size of chunks like Age does. And since this naturally happens in a number of situations, a non-committing AEAD could be viewed as legitimate.

Still, we must recognise that partitioning oracle attacks are a footgun, and we must carefully consider whether that should stop us. In any case, we probably want to add a security considerations section about them to the current manual.

snej commented 2 years ago

Encrypted streams would be very good to have.

Sodium's crypto_stream isn't really a stream, however: it's a sequence of messages, and the encrypting and decrypting parties have to agree on the sizes of those messages, because the decryption API requires the caller to pass in an entire encrypted message.

This is a pain in the butt for use in network connections, because

LoupVaillant commented 2 years ago

Hmm, I'm not sure we can meaningfully hide message sizes from traffic analysis in an interactive protocol, so I'm not sure Scuttlebutt's workaround really works: when you're sending a message and there's no other message right behind it, you have to send it anyway, and encrypting its length won't do you much good if the eavesdropper can just notice a pause in the data stream. If you really want to hide message sizes & boundaries, I have two strategies in mind:

How effective those strategies really are is application-dependent. And more to the point, I don't think I can come up with a nice, simple, low-overhead streaming scheme that supports any of the above out of the box.

Then there are the use cases where you do know chunk sizes in advance: file encryption and streaming lots of data. In both cases, chunk length can be a hard coded convention, written at the beginning of the file, or agreed upon during the handshake. For these use cases, libsodium's approach minimises overhead.

Personally, I don't think I can find a lightweight approach that satisfies your use case out of the box. I'm afraid users will have to endure butt pain, at least until I get around to writing an actual network library.

LoupVaillant commented 2 years ago

Seems like I finally have a design worth considering (latest version). Here's the API:

typedef struct {
    uint32_t counter; // MSB of the ChaCha20 counter
    uint8_t  key[32];
    uint8_t  nonce[8];
} crypto_stream_ctx;

void crypto_stream_init(crypto_stream_ctx *ctx,
                        const uint8_t key[32], const uint8_t nonce[24]);
void crypto_stream_write(crypto_stream_ctx *ctx,
                         uint8_t            mac[16],
                         uint8_t           *cipher_text,
                         const uint8_t     *plain_text, size_t text_size);
int crypto_stream_read(crypto_stream_ctx *ctx,
                       uint8_t           *plain_text,
                       const uint8_t      mac[16],
                       const uint8_t     *cipher_text, size_t text_size);

void crypto_stream_rekey(crypto_stream_ctx *ctx);
void crypto_stream_init_raw(crypto_stream_ctx *ctx,
                            const uint8_t key[32],
                            const uint8_t nonce[8],
                            uint32_t counter);

// with additional data
void crypto_stream_write_aead(crypto_stream_ctx *ctx,
                              uint8_t            mac[16],
                              uint8_t           *cipher_text,
                              const uint8_t     *ad        , size_t ad_size,
                              const uint8_t     *plain_text, size_t text_size);
int crypto_stream_read_aead(crypto_stream_ctx *ctx,
                            uint8_t           *plain_text,
                            const uint8_t      mac[16],
                            const uint8_t     *ad        , size_t ad_size,
                            const uint8_t     *cipher_text, size_t text_size);

A few things to consider:

void crypto_stream_rfc8439_init(crypto_stream_ctx *ctx,
                                const uint8_t key[32], const uint8_t nonce[12])
{
    crypto_stream_init_raw(ctx, key, nonce + 4, load32_le(nonce));
}

void crypto_stream_djb_init(crypto_stream_ctx *ctx,
                            const uint8_t key[32], const uint8_t nonce[8])
{
    crypto_stream_init_raw(ctx, key, nonce, 0);
}
fscoto commented 2 years ago

I haven't actually used the API yet, so take the following with a grain of salt, but I'd like to know these ahead of time going in:

  1. Is the libsodium crypto_secretstream_xchacha20poly1305_* family compatible with the crypto_stream_* family? Considering that a large number of the questions we get are about libsodium compatibility, I would assume this to be a consideration.
  2. Considering this is a high-level offering, are you sure you want a detachable authentication tag? The rest of Monocypher consistently has it detached, but those are also lower-level building blocks. My understanding of the purpose of this new interface is, e.g. in network communications, to agree on a secret key and then just have both sides talk to each other with the streaming interface, or, in age-style file encryption, to just dump the stream output and treat it as opaque. I'm not sure a higher-level construction should detach the MAC, especially considering RFC 8439 § 2.8 formally defines the AEAD output as the ciphertext followed by the MAC.
  3. Why rfc8439 instead of ietf in the function name? This seems inconsistent with the other ChaCha20 functions.
snej commented 2 years ago

This looks reasonable to me (and similar to the libSodium API that I recently built a byte-oriented stream around.) I assume the behavior of crypto_stream_rekey is such that both the sender and receiver have to call it at the same "time" in the message stream?

LoupVaillant commented 2 years ago

@fscoto:

  1. No it's not. I considered it, but I disagree pretty strongly with Libsodium's design choices on the matter: its streaming interface is incompatible with its monolithic interfaces, and requires too much code to be implemented without significant runtime overhead.

  2. Actually, this is quite a low-level offering, to the point where the monolithic AEAD is implemented in terms of it. I couldn't do that without a detached tag. Another reason is consistency with the existing API. Yet another reason is buffer lengths: I'd rather have the size of the plaintext and the ciphertext be the same, which makes it easier to know how long each buffer must be. (If I could go back in time, maybe I would reattach the tag everywhere, though to be honest I'm not sure.)

  3. Ah, you're right, ietf is better. Blame the naming scheme I (mistakenly) chose for the test suite. Note however that this alternate init function will probably not be included: not only would it require more code, but consistency would almost demand that we implement the corresponding monolithic versions as well, meaning 8 functions. Way too much code for too little benefit.


@snej:

Yes, both parties need to call rekey() at the same time. There's no tag for the recipient to detect & trigger a rekey with; it must be decided manually by the application writer (either by convention, or by communicating the rekey order in-band).

LoupVaillant commented 2 years ago

Added some tests, and refined the design. I have managed to lift the size limit on chunks, at the cost of increased rekey overhead (the automatic rekey occurs every 256 chunks instead of every 4 billion).

The API maintains an internal context with the following information:

There are 3 ways to initialise this context that make some sense:

_(I have realised that the 12-byte nonce cannot be used for the incremental interface because of catastrophic nonce reuse problems with chunks beyond the first one, as well as with rekeying. Because of this, the current design only provides the short and extended nonce versions, which do not have this problem. I removed the raw_init() function, which is impossible to document. That being said, full compliance with RFC 8439 is still possible for messages that use only one chunk. It requires poking at the internals of the context, but this beats copying & editing the entire AEAD code.)_

Nonce sizes have their usual pros & cons: short nonces have no overhead, but they cannot be random. Long nonces can be random, but they incur an additional HChaCha20 call. Once the context is initialised, we can start sending message chunks. Each chunk is built thus:

The AEAD works the same as RFC 8439, except with a 64-bit counter and a 64-bit nonce, and the start value of the counter is not always zero. The counter is then incremented, and if it cycles back to zero we rekey.

_With the chosen increment, we end up rekeying every 256 messages. This means one additional HChaCha20 call per 256 messages. In the absolute worst case (empty messages), the overhead is less than 0.4%. In a reasonable worst case (short messages), the overhead halves to 0.2%. In a typical streaming scenario with 32KiB chunks, even if we assume perfect vector instruction parallelism we get a 0.006% overhead, which I deem negligible. The reason for this big increment is that it ensures message chunk lengths are unlimited (the theoretical limit is 2^62 bytes, which is close enough to infinity)._
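
For the record, here is the arithmetic behind those percentages as I reconstruct it (my assumptions: one HChaCha20 costs about one ChaCha20 block, and a vectorised implementation processes the bulk data roughly eight blocks at a time but cannot parallelise the lone HChaCha20 call):

    \text{empty chunks:}\quad 1 / (256 \times 1) \approx 0.4\%
    \text{short chunks (2 blocks each):}\quad 1 / (256 \times 2) \approx 0.2\%
    \text{32 KiB chunks, 8-way vectorised:}\quad 1 / (256 \times 512 / 8) \approx 0.006\%
    \text{maximum chunk size:}\quad 2^{56} \text{ blocks} \times 64 \text{ bytes} = 2^{62} \text{ bytes}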

Finally, the rekey operation derives the next key from the current one:


With this design, I believe we have a versatile, limitless, low-overhead streaming API, that as a bonus even allows RFC 8439 compliance if we can tolerate a tiny little hack (I currently use that hack in the tests to verify RFC compliance). Though I'll still seek peer review, I'm fairly confident the current design is sound. The only thing left to do now is discuss tradeoffs.

LoupVaillant commented 2 years ago

One small change to the rekeying algorithm: we should reset the counter to zero.

This changes nothing for automatic rekeying, because the counter is already zero. However, manual rekeying can happen at any time, and when it happens, we can safely delay automatic rekeying to reduce overhead.

LoupVaillant commented 2 years ago

I had an epiphany. :sunglasses:

Rekeying can be achieved by copying 32 previously unused bytes. See, RFC 8439 requires one ChaCha20 call to generate the authentication key. Strangely enough, it does not use HChaCha20, but regular ChaCha20. This incurs a small overhead, but Monocypher conforms to it because it provides cheap compatibility with Libsodium. Thing is, a ChaCha20 block is 64 bytes, and the authentication key only uses the first 32. The last 32 can be used however we want.

This is ideal for rekeying. And since the overhead is so low (just copy 32 bytes), we can afford to do it for every single message, which by the way makes the rekey() function redundant. The result is very cheap, automatic forward secrecy with a symmetric ratchet.

Even better, since we change the key on a systematic basis, we no longer need to mess with the counter. As a consequence, we can now use 96-bit nonces with multi-chunk messages without fearing any collision, so I can justify core support for those (and with them RFC 8439 compatibility).

Icing on the cake, users now have a (somewhat low-performance) fast key erasure user-space CSPRNG at their disposal. When they need new random bytes, they can just write a new message from a stream context, and the current seed will automatically be erased. With this poor man's RNG, the temptation to add a dedicated one plummets.
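
As a sketch of what that "poor man's RNG" could look like from the caller's side (my illustration, assuming the proposed crypto_stream_* API and that in-place encryption is allowed, as it is elsewhere in Monocypher):

#include <string.h>
#include <monocypher.h>

/* Hypothetical helper, not part of the proposal: encrypt zeroes with a
 * dedicated stream context to get pseudo-random bytes. Every call ratchets
 * the internal key forward, which erases the previous seed (fast key
 * erasure). */
void poor_mans_random(crypto_stream_ctx *ctx, uint8_t *out, size_t size)
{
    uint8_t mac[16];                               /* discarded            */
    memset(out, 0, size);                          /* all-zero "plaintext" */
    crypto_stream_write(ctx, mac, out, out, size); /* out <- keystream     */
    crypto_wipe(mac, sizeof(mac));
}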

This design is also very close to full blown key commitment support: for single chunk messages, we can use the key as an additional tag instead of a key for the next chunk, and have the recipient compare that tag along with the authentication tag. I'm not sure this is worth the trouble however, because of the increased message size overhead.

And on top of it the whole thing is simpler. I saved 8 lines of code.


The API didn't change much: I've removed the now redundant rekey() function, and added crypto_stream_init_ietf():

void crypto_stream_init(crypto_stream_ctx *ctx,
                        const uint8_t key[32], const uint8_t nonce[24]);
void crypto_stream_init_djb(crypto_stream_ctx *ctx,
                            const uint8_t key[32], const uint8_t nonce[8]);
void crypto_stream_init_ietf(crypto_stream_ctx *ctx,
                             const uint8_t key[32], const uint8_t nonce[12]);

void crypto_stream_write(crypto_stream_ctx *ctx,
                         uint8_t            mac[16],
                         uint8_t           *cipher_text,
                         const uint8_t     *plain_text, size_t text_size);
int crypto_stream_read(crypto_stream_ctx *ctx,
                       uint8_t           *plain_text,
                       const uint8_t      mac[16],
                       const uint8_t     *cipher_text, size_t text_size);

void crypto_stream_write_aead(crypto_stream_ctx *ctx,
                              uint8_t            mac[16],
                              uint8_t           *cipher_text,
                              const uint8_t     *ad        , size_t ad_size,
                              const uint8_t     *plain_text, size_t text_size);
int crypto_stream_read_aead(crypto_stream_ctx *ctx,
                            uint8_t           *plain_text,
                            const uint8_t      mac[16],
                            const uint8_t     *ad        , size_t ad_size,
                            const uint8_t     *cipher_text, size_t text_size);

Internally, initialisation is unchanged, but the way chunks are generated is simplified:

ciphertext, tag = AEAD(ctx.key, ctx.nonce, ctx.counter)
ctx.key         = free_bytes_from_AEAD

That simple. If I zoom in a little and detail the AEAD we get this:

auth_key, new_key = ChaCha20(ctx.key, ctx.nonce, ctx.counter, zeroes, 64)
ciphertext        = ChaCha20(ctx.key, ctx.nonce, ctx.counter + 1, plaintext, message_size)
tag               = AUTH(auth_key, ad, ad_size, ciphertext, text_size)
ctx.key           = new_key

The AUTH() step is directly taken from RFC 8439. The new_key is basically free, because a ChaCha block is 64 bytes anyway.
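
To make the "free 32 bytes" concrete, here is roughly what deriving the two keys from a single ChaCha20 block could look like with Monocypher's existing primitives (my sketch of the pseudocode above, not the actual implementation):

#include <string.h>
#include <monocypher.h>

/* One ChaCha20 block (64 bytes of keystream over zeroes): the first half is
 * the Poly1305 key, exactly as in RFC 8439; the second half, which RFC 8439
 * simply throws away, becomes the key for the next chunk (the ratchet). */
static void chunk_keys(uint8_t auth_key[32], uint8_t next_key[32],
                       const uint8_t key[32], const uint8_t nonce[8],
                       uint64_t counter)
{
    uint8_t block[64] = {0};
    crypto_chacha20_ctr(block, block, 64, key, nonce, counter);
    memcpy(auth_key, block     , 32);
    memcpy(next_key, block + 32, 32);
    crypto_wipe(block, sizeof(block));
}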

fscoto commented 2 years ago

So essentially, you're doing away with the rekey method because every stream chunk is an immediate re-key anyway at the cost of an extra 32-byte copy for each message. Do you have your RPi on hand? I wonder how bad the extra copy expenses end up being on a smaller platform.

Going back to try and rewrite fk to use the new API, I notice that I'm missing libsodium's crypto_secretstream tag handling. Namely, I'm missing a way to signal end-of-stream. The current API works well enough for a network-based API where the connection just gets dropped or there's a protocol-level "we are going to stop communicating" message. However, you cannot detect a premature end-of-stream with the current API; i.e., the size of the entire stream can only be effectively authenticated with user-level code rather than library code, which involves non-trivial extra effort in the form of branches and relatively fragile additional code that is easy to forget to test.

Was this part of your threat model?

LoupVaillant commented 2 years ago

So essentially, you're doing away with the rekey method because every stream chunk is an immediate re-key anyway at the cost of an extra 32-byte copy for each message.

Yup.

Do you have your RPi on hand? I wonder how bad the extra copy expenses end up being on a smaller platform.

I don't have an R-Pi, but we do have the compiler explorer. Here's what the extra COPY() loop turns into under -O2 with the latest GCC:

        add     lr, sp, #72
        add     ip, sp, #40
        ldmia   lr!, {r0, r1, r2, r3}
        str     r0, [r7, #8]      @ unaligned
        str     r1, [r7, #12]     @ unaligned
        str     r2, [r7, #16]     @ unaligned
        str     r3, [r7, #20]     @ unaligned
        ldmia   lr!, {r0, r1, r2, r3}
        str     r0, [r7, #24]     @ unaligned
        str     r1, [r7, #28]     @ unaligned
        str     r2, [r7, #32]     @ unaligned
        str     r3, [r7, #36]     @ unaligned

-O3 is basically identical (it maybe saves one add), but -Os does keep a loop:

       add     r6, sp, #64
[...]
        mov     r3, r5
.L71:
        mov     r2, r6
        adds    r3, r3, #8
        ldmia   r2!, {r0, r1}
        str     r0, [r3, #-8]     @ unaligned
        str     r1, [r3, #-4]     @ unaligned
        mov     r6, r2
        cmp     r2, r4
        bne     .L71

In both cases we're copying word by word, which should be quite a bit faster than going byte by byte. Overall, I don't expect the slowdown will be noticeable, except perhaps for empty messages (and even then we still process one Chacha20 block and one Poly1305 block for the authentication tag).

Was this part of your threat model?

Oops, it was not, to be honest. This tag looked like an unnecessary complication, so I just scrapped it without much thought. That said, I think we have an ace up our sleeve: the additional data. We just need to add something to it for the last block. If the stream cuts off too soon (or too late!), the recipient will have the wrong additional data for their last block, and decryption will automatically fail.

#include <monocypher.h>

#define BLOCK_SIZE (32 << 10)
uint8_t END_TAG[4] = "LAST";

void encrypt(uint8_t *out, size_t *out_size,
             const uint8_t key[32], const uint8_t nonce[24],
             const uint8_t *in, size_t in_size)
{
    *out_size = 0;
    crypto_stream_ctx ctx;
    crypto_stream_init(&ctx, key, nonce);
    uint64_t nb_blocks = (in_size + BLOCK_SIZE - 1) / BLOCK_SIZE;
    for (uint64_t i = 0; i < nb_blocks - 1; i++) {
        crypto_stream_write(&ctx, out, out+16, in, BLOCK_SIZE);
        in        += BLOCK_SIZE;
        in_size   -= BLOCK_SIZE;
        out       += BLOCK_SIZE + 16; // advance the output pointer too
        *out_size += BLOCK_SIZE + 16;
    }
    crypto_stream_write_aead(&ctx, out, out+16, END_TAG, 4, in, in_size);
    *out_size += in_size + 16;
}

int decrypt(uint8_t *out, size_t *out_size,
            const uint8_t key[32], const uint8_t nonce[24],
            const uint8_t *in, size_t in_size)
{
    *out_size = 0;
    crypto_stream_ctx ctx;
    crypto_stream_init(&ctx, key, nonce);
    uint64_t nb_blocks = (in_size + BLOCK_SIZE + 15) / (BLOCK_SIZE + 16);
    for (uint64_t i = 0; i < nb_blocks - 1; i++) {
        if (crypto_stream_read(&ctx, out, in, in+16, BLOCK_SIZE)) {
            return -1;
        }
        in        += BLOCK_SIZE + 16;
        in_size   -= BLOCK_SIZE + 16;
        out       += BLOCK_SIZE;      // advance the output pointer too
        *out_size += BLOCK_SIZE;
    }
    if (crypto_stream_read_aead(&ctx, out, in, END_TAG, 4, in+16, in_size-16)) {
        return -1;
    }
    *out_size += in_size - 16;        // the last chunk may be shorter
    return 0;
}

There, no additional branch, and I believe very little additional code overall. How would that translate to your file encryption utility?

fscoto commented 2 years ago

Doesn't this handling of the end tag imply that you know and can transmit the size of the entire stream ahead of time so that you could just authenticate the entire message length as AD in the first block and just error out then? I'm more concerned about scenarios where the full size of the streamed data isn't known ahead of time, such as when JSON or encoded ASN.1 output is streamed out on the fly.

Re fk in particular: I wanted to avoid having to determine the file size ahead of time and just stream the chunks in, since fseeko() isn't part of the C standard library and ftell() only returns a long, breaking large files on platforms where long is 32 bits. I guess it ends up working, just not with as much standards purity as I hoped for.

LoupVaillant commented 2 years ago

OK, in this case there is no escaping the overhead associated with transmitting that tag across the wire. Libsodium by the way does it quite inefficiently, reserving an entire additional ChaCha20 block for the tag, then authenticating all 64 bytes (16 would have sufficed, the only reason this is not too slow is vector instructions). I'm not sure there's an ideal solution to be honest. Right now I see three approaches:

With the interface I'm proposing, I believe all three approaches are possible. They're not trivial, but if we chose & provided one out of the box, we'd paint ourselves into a corner just like Libsodium, which has etched its overhead & extra complexity into the wire format itself. I'd like to avoid that. Some use cases will likely benefit from one approach more than another, and others won't even need the overhead. I want to allow them all if possible.

That being said, I'll need to write actual code to get a better feel for this API. One core design goal after all is for it to be suitable for file formats & network protocols, so this has to be tested. And though I'm currently fairly confident about the performance, simplicity, and versatility of my design, further tests may yet send me back to the drawing board.

LoupVaillant commented 2 years ago

Yet another approach just occurred to me:

ewd340 commented 2 years ago

First of all, thank you for your work on Monocypher. I like how clean it is.

I've read with interest this thread and wanted to chime in to tell you that I came across Monocrypt, a WIP by @skeeto, that uses Monocypher to encrypt/decrypt files using the high level crypto_lock and crypto_unlock primitives of Monocypher.

This tool doesn't need to know the file's size a priori. So I thought that it may give you some inspiration for this streaming interface.

Again, thank you for your work, and sorry if my comment is off topic.

LoupVaillant commented 2 years ago

Thanks for the heads up @ewd340, I’ll take a look.

ewd340 commented 2 years ago

I finally got some time to play with the streaming interface (from the stream branch), and wrote a very basic utility (heavily inspired by Monocrypt). Now, I have to admit, I'm not a cryptographer, but this exercise was fun. I have tried it (not heavily tested) on a small 'hello world' text file, and on a 360 MB video file, and it seems to work.

#include <stdio.h>
#include <sys/random.h>  /* getrandom(), GRND_NONBLOCK (Linux-specific) */
#include <monocypher.h>

#define CHUNKLEN ((128 << 10) - 16)
#define VERSION 0
uint8_t END_TAG[4] = "LAST";

enum error {ERR_OK, ERR_KEYFILE, ERR_READ, ERR_WRITE, ERR_TRUNC, ERR_INVALID};

I have defined a small header, which actually could have been just a nonce:

struct aead_header {
    uint8_t nonce[24];
    uint8_t version;
};

The encryption function:

static enum error fencrypt(FILE *in, FILE *out, const uint8_t key[32]){
    int eof;
    int len;
    uint8_t *ad;
    size_t adlen;
    enum error err = ERR_OK;

    struct aead_header header;
    uint8_t buf_in[CHUNKLEN];
    uint8_t buf_out[CHUNKLEN+16];

    getrandom(header.nonce, 24, GRND_NONBLOCK);
    header.version = VERSION;

    crypto_stream_ctx ctx;
    crypto_stream_init(&ctx, key, header.nonce);

    if (!fwrite(&header, sizeof(header), 1, out)) {
        return ERR_WRITE;
    }

    do{
        len = fread(buf_in, 1, CHUNKLEN, in);
        if (!len && ferror(in)) {
            err = ERR_READ;
            break;
        }
        eof = feof(in);
        ad = (eof)?END_TAG:0;
        adlen = (eof)?4:0;
        crypto_stream_write_aead(&ctx, buf_out, buf_out+16, ad, adlen, 
                                 buf_in, len);

        if (!fwrite(buf_out, 1, (size_t) len+16, out)) {
            err = ERR_WRITE;
            break;
        }  
    } while(!eof);

    return err;
}

And the decryption one:

static enum error fdecrypt(FILE *in, FILE *out, const uint8_t key[32])
{
    int eof;
    int len;
    uint8_t *ad;
    size_t adlen;
    enum error err = ERR_OK;

    struct aead_header header;
    uint8_t buf_out[CHUNKLEN];
    uint8_t buf_in[CHUNKLEN+16];

    crypto_stream_ctx ctx;

    /* Read the header to get some info about the encryption: nonce, ver... */
    if (!fread(&header, 1, sizeof(header), in)) {
        return ferror(in) ? ERR_READ : ERR_TRUNC;
    }

    crypto_stream_init(&ctx, key, header.nonce);

    do {
        len = fread(buf_in, 1, CHUNKLEN+16, in);

        if (!len && ferror(in)) {
            err = ERR_READ;
            break;
        }

        eof = feof(in);
        ad = (eof)?END_TAG:0;
        adlen = (eof)?4:0;

        if(crypto_stream_read_aead(&ctx, buf_out, buf_in, ad, adlen, 
                                   buf_in+16, len-16)) {
            err = ERR_INVALID;
            break;
        }

        if(!fwrite(buf_out, (size_t) len-16, 1, out)) {
            err = ERR_WRITE;
            break;
        }
    } while(!eof);

    return err;
}

Again, this was just a small POC, please let me know if I'm missing something.

snej commented 2 years ago

For what it's worth, my network-stream code is in my secret-handshake-cpp repo.

There are two layers: CryptoBox works with chunks of encrypted data, reading/writing an entire chunk; and CryptoStream is a byte-oriented stream interface that reads and writes CryptoBoxes.

This has not been heavily used yet, but it works well enough to support the CapnProto RPC protocol on a TCP socket.

LoupVaillant commented 2 years ago

@ewd340, though it may have been intentional for readability, you may have missed one thing: getrandom(2) is a low-level system call that is allowed to fail. Worse, it may not give you as many random bytes as you want even if it succeeds. I would recommend arc4random_buf(3) instead if it is available. If not, maybe implement it yourself with getrandom(2) under the hood, such that the only possible outcomes are success or a clean crash.

The way you use additional data to signal the end of the stream and avoid truncation attacks looks correct. It's also simple and cheap, I like it.

LoupVaillant commented 2 years ago

@snej, thanks for the reference, I'll take a look, see if what I'm proposing here could have been used to simplify your code.

ewd340 commented 2 years ago

Thank you so much for the feedback, @LoupVaillant! I honestly didn't pay enough attention to the fact that the getrandom(2) primitive may return fewer bytes than requested for buffer lengths greater than 255. Yet, yes, about its return value, the man page clearly states:

The careful programmer will check for this anyway!

Would something like this do the trick?

static enum error random_buf(uint8_t *buf, size_t buflen)
{
    enum error err = ERR_OK;
    if (buflen > 255) {
        err = ERR_NO_RANDOM;
    } else {
       ssize_t len = getrandom(buf, buflen, 0);
       err = (len ==  (ssize_t)buflen)?ERR_OK:ERR_NO_RANDOM;
    }
    return err;
}
ewd340 commented 2 years ago

Speaking of this interface, I got a little confused by the parameter order of the functions:

void crypto_stream_write_aead(crypto_stream_ctx *ctx,
                              uint8_t            mac[16],
                              uint8_t           *cipher_text,
                              const uint8_t     *ad        , size_t ad_size,
                              const uint8_t     *plain_text, size_t text_size);
int crypto_stream_read_aead(crypto_stream_ctx *ctx,
                            uint8_t           *plain_text,
                            const uint8_t      mac[16],
                            const uint8_t     *ad        , size_t ad_size,
                            const uint8_t     *cipher_text, size_t text_size);

Not a big deal, it is indeed easy to read, but I somewhat expected the same order (especially for the mac).

fscoto commented 2 years ago

@ewd340 If I may... I have some input on getrandom and its relatives.

First of all: The cutoff is at 256 bytes, not 255 bytes. For everything else, read on.

getrandom(2) started as a Linux-specific syscall, but there are vaguely compatible interfaces in other systems, namely at least FreeBSD, Solaris and illumos (but not OpenBSD or macOS, more on those later). There is no standard to adhere to.

The 256-byte limitation is arbitrary (though likely there to avoid spending too much time in the kernel per syscall) and you may get a larger buffer filled only partially. Since there is no real distinction between a syscall and a "normal" C standard library function, an operating system is free to implement a semantically identical userspace getrandom(3) instead, too. I don't think you should guard against the size limit pre-emptively. Just let the call fail and go from there. I would therefore instead write approximately this (snippet hereby placed under the CC0-1.0):

static enum error random_buf(uint8_t *buf, size_t buflen)
{
    ssize_t len;

    while (buflen != 0) {
        len = getrandom(buf, buflen, 0);

        if (len == -1) {
            /* errno is only meaningful when getrandom() actually failed */
            switch (errno) {
            case EAGAIN:
            case EINTR:
                continue;
            default:
                return ERR_NO_RANDOM;
            }
        }

        /* a partial read is not an error: advance and ask for the rest */
        buf    += len;
        buflen -= (size_t)len;
    }
    return ERR_OK;
}

(Since there is no standard, any implementation may also fail when fewer bytes are requested, but since most platforms will add it for compatibility with Linux, there is a natural incentive not to lower the maximum number of bytes returned.)

On a similar note: getrandom(2) is somewhat new, introduced in Linux 3.17 (2014); perhaps some embedded platform's vendor kernel doesn't have it. glibc added support in 2.25 (2017). If you have a new glibc but an old kernel, glibc may return ENOSYS, which is a permanent, non-user error, even though it may look like getrandom(2) exists on a naive check. Consider ensuring a fallback at runtime or a thorough compile-time functionality check in your build system if possible.
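
A rough sketch of such a runtime fallback (my own illustration; the fallback path and error handling are deliberately minimal): try getrandom(2) first, and read /dev/urandom if the kernel reports ENOSYS.

#include <errno.h>
#include <stdio.h>
#include <sys/random.h>

/* Illustrative only: retries and partial reads are handled by random_buf()
 * above; this just shows where the ENOSYS fallback would slot in. */
static int fill_random(unsigned char *buf, size_t len)
{
    ssize_t r = getrandom(buf, len, 0);
    if (r == (ssize_t)len) {
        return 0;
    }
    if (r == -1 && errno == ENOSYS) {  /* kernel predates getrandom(2) */
        FILE *f = fopen("/dev/urandom", "rb");
        if (f != NULL) {
            size_t n = fread(buf, 1, len, f);
            fclose(f);
            if (n == len) {
                return 0;
            }
        }
    }
    return -1;
}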


Miscellaneous portability notes born from frustration:

OpenBSD and macOS have not implemented getrandom(2). They use getentropy(2). The interface is intended to seed (cryptographically secure) RNGs in userspace with kernel randomness; e.g. OpenBSD's arc4random(3) seeds itself with it. Applications are expected to call arc4random(3) directly instead. While almost all of the platforms mentioned thus far implement arc4random(3), there is one major platform that does not: Linux/glibc.

POSIX will, in a future issue, standardize getentropy() only. There is no help coming by way of the standards committees of the world.

A portable solution would, therefore, have to check in this order and at least partially at build time:


About Windows: Windows has a crypto API and it expects you to use it; getting random bytes involves setting up a cryptography context and then calling a function called BCryptGenRandom that uses it to generate random bytes. (The function also has "BCrypt" in its name, yet has nothing to do with the bcrypt hash algorithm.)

This isn't strictly necessary, however. Some people work around this limitation by calling into the deprecated RtlGenRandom() function, which doesn't require keeping state. This leads to hacks such as the one found in libsodium.

ewd340 commented 2 years ago

Thanks a lot @fscoto for your very instructive input, I liked every bit of it! Thanks also for the better random_buf function.

Re: portable solutions, I appreciate the details you provided. I wish every OS would just provide arc4random, but that's not the case, and I guess we'll have to deal with this reality. I, in fact, thought to myself that since this random_buf will be used (in my case) only for generating a nonce, maybe I could use a PRNG that would be seeded with (pardon my naivety) the current time?

LoupVaillant commented 2 years ago

maybe I could use a PRNG that would be seeded with ---pardon my naivety--- the current time?

Goodness, no! (Edit: Oops, my mistake, I misunderstood the question.)

Current time is way too predictable. There's not enough entropy to be extracted from it. Even if you get down to the millisecond, an attacker who can guess when you performed the seeding down to the second is left with only 10 bits of entropy to search through. And I'm being reasonable; the best attacks may be even better.
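
To spell out the arithmetic behind that figure: knowing the second leaves only the millisecond within it unknown, i.e. about 1000 possibilities, and

    \log_2(1000) \approx 9.97 \approx 10 \text{ bits.}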

If you have an OS that can give you randomness, you should use it. If the system API is too horrible, you may initialise a user-space RNG when the program starts up, and after that there are no more errors to deal with. User-space RNGs are error-prone though, multithreading and all that, so be careful.

And sometimes, you don't even get randomness. Perhaps not even jitter entropy, if you're running a single process on an in-order embedded CPU. In that case, you need to come up with 32 random bytes from another source (like your desktop), record that seed in your chip, and seed your RNG with that. And of course, make sure you renew the seed every time you reboot, so you don't end up reusing the same "random" stream over and over.

ewd340 commented 2 years ago

Thanks again @LoupVaillant and sorry if I'm diverting the discussion from its main topic with my randomness issues. I enjoy the very informative input I'm getting from you, guys.

The" Goodness, no!" made me chuckle... I saw it coming :) Still, I thought that, since the nonce need not be unpredictable (only unique per key, if I have understood correctly), I can get away with something basic. Anyways... even for a nonce, it won't hurt to have good entropy... so my laziness is not justified, I admit.

Back to the interface, again not a big deal, but I humbly think that the mac argument in crypto_stream_read_aead should be in the same position as in crypto_stream_write_aead. That is, something like this:

void crypto_stream_write_aead(crypto_stream_ctx *ctx,
                              uint8_t            mac[16],
                              uint8_t           *cipher_text,
                              const uint8_t     *ad        , size_t ad_size,
                              const uint8_t     *plain_text, size_t text_size);
int crypto_stream_read_aead(crypto_stream_ctx *ctx,
                            const uint8_t      mac[16],
                            uint8_t           *plain_text,
                            const uint8_t     *ad        , size_t ad_size,
                            const uint8_t     *cipher_text, size_t text_size);

Or is there a reasoning behind the proposed order that I don't get?

LoupVaillant commented 2 years ago

Still, I thought that, since the nonce need not be unpredictable (only unique per key, if I have understood correctly), I can get away with something basic.

Oh, sorry, I misunderstood. I shouldn't have answered when I was this tired. Yes, a nonce doesn't have to be unpredictable, only unique. If you have a reliable clock (one that never displays the same time twice), you can use it. There's one big caveat though: make sure you don't run two encryptions at the same time.

Alternatively, you can maintain a counter, but the same reuse/misuse issues that apply to user-space RNGs will apply to that counter.

Or is there a reasoning behind the proposed order that I don't get?

There is: in all of Monocypher, the arguments go in the following order:

* context
* output arguments
* input arguments

And yes, sometimes this particular consistency sacrifices other kinds of consistency. Your proposal makes sense, it's just not the logic I chose to follow. Also, note the order of arguments for the current AEAD API:

void crypto_lock_aead(uint8_t        mac[16],
                      uint8_t       *cipher_text,
                      const uint8_t  key[32],
                      const uint8_t  nonce[24],
                      const uint8_t *ad        , size_t ad_size,
                      const uint8_t *plain_text, size_t text_size);
int crypto_unlock_aead(uint8_t       *plain_text,
                       const uint8_t  key[32],
                       const uint8_t  nonce[24],
                       const uint8_t  mac[16],
                       const uint8_t *ad         , size_t ad_size,
                       const uint8_t *cipher_text, size_t text_size);

Same logic, similar order. (We have a key and nonce instead of the context, so they don't come first, but the rest of the arguments are in the same order.)

(Edit: Alternatively, I could swap mac and ciphertext in the read() function and still preserve my logic, but it still wouldn't be consistent with the existing functions.)

ewd340 commented 2 years ago

There is: in all of Monocypher, the arguments go in the following order:

* context

* output arguments

* input arguments

This is it! I kind of felt it, but failed to see it. Thanks again.

LoupVaillant commented 1 year ago

The streaming interface has been added, and the design gelled into what I believe to be a satisfactory shape.

There is one thing I must explain that's not in the commit message: the choice not to imitate libsodium. There are two reasons:

  1. With my design, the direct interface is implemented in terms of the streaming interface. All fully compatible, and users can have full RFC compliance on top. It also requires less code, makes everything simpler… the works.
  2. Libsodium provides an additional tag that I don't. One reason I don't is the systematic overhead incurred by their tag (each chunk is 1 byte bigger, and in Libsodium's case requires processing an additional ChaCha block). This tag is not needed in all cases. Very often it can be transmitted implicitly, by knowing the size of the input in advance, or simply by closing the connection (and changing the additional data for the last block). In cases where that's not possible (can't close the connection, can't know the size in advance…), users can implement this tag on top of this interface.

Oh, and about this key commitment thing and partitioning oracle attacks: that should be addressed by committing the key at the beginning of the stream, typically by outputting a hash derived from the same seed as the encryption key.

With this, I believe we are done.