aws / s2n-tls

An implementation of the TLS/SSL protocols
https://aws.github.io/s2n-tls/usage-guide/
Apache License 2.0

Issues around including a PRNG #26

Closed colmmacc closed 9 years ago

colmmacc commented 9 years ago

Right now s2n uses /dev/urandom for all entropy, per the example of the excellent NaCl library, best summarized by Thomas Ptacek: http://sockpuppet.org/blog/2014/02/25/safely-generate-random-numbers/

The performance problem with urandom

The problem with using /dev/urandom, though, is that it is slow; at present it is the main (and really the only) performance bottleneck in s2n. Reading from /dev/urandom involves kernel-level locks and context switching. A quick benchmark shows that it's about a hundred times slower than using an in-process PRNG.

Will a PRNG work?

A modern PRNG can safely generate unpredictable, randomly distributed data once it has been seeded with about 256 bits of meaningful entropy, so an in-process PRNG can be secure in that sense.

Unfortunately, PRNGs are also inherently stateful, and it is very difficult to guarantee that the internal state won't be replicated between processes. C11 does offer strong guarantees around thread-local storage, and it's relatively easy to ensure that two threads are seeded independently, but so far I've found no ironclad way to ensure that a fork(), or worse ... syscall(SYS_fork...), won't result in two processes having identically seeded PRNGs. This is particularly bad with TLS, because randomly-distributed data can sometimes be public (explicit IVs) and sometimes be very sensitive and private (PFS key generation).

One initially promising solution is to guard access to the PRNG with a getpid() call, detecting changes in PID and reseeding when needed. Unfortunately this also turns out not to be reliable: libc can cache the PID value and get it wrong in some circumstances (see http://yarchive.net/comp/linux/getpid_caching.html).
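
For concreteness, the guard pattern being described looks roughly like the sketch below; prng_reseed() is a hypothetical stub, and, as the linked thread explains, the getpid() result cannot always be trusted:

#include <unistd.h>
#include <sys/types.h>

static void prng_reseed(void)
{
    /* hypothetical: pull fresh seed material, e.g. from /dev/urandom */
}

static pid_t prng_owner_pid = 0;

static void prng_check_pid(void)
{
    pid_t pid = getpid();
    if (pid != prng_owner_pid) {
        /* PID changed: assume we are in a freshly forked process */
        prng_reseed();
        prng_owner_pid = pid;
    }
}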

Why this matters

Since it's feasible that s2n will be used inside language runtimes (e.g. the JVM, Ruby, Python ...), it's not unrealistic to expect the calling application to call fork(), vfork(), or even syscall(SYS_fork), so it's best to protect against this in a robust way.

Potential solutions

If getentropy()/getrandom() have better locking/performance semantics, using them may help, but they are not widely available. Another solution may be to use the RDRAND instruction directly where available, but this comes with its own challenges. See also djb's write-up at http://blog.cr.yp.to/20140205-entropy.html.
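
Where getrandom(2) does exist, the call itself is straightforward. A hedged sketch of using it with a /dev/urandom fallback (the get_entropy() name is mine; the <sys/random.h> header assumes Linux 3.17+ with glibc 2.25+):

#include <sys/random.h>
#include <fcntl.h>
#include <unistd.h>
#include <stddef.h>
#include <stdint.h>

static int get_entropy(uint8_t *buf, size_t len)
{
    ssize_t n = getrandom(buf, len, 0);
    if (n >= 0 && (size_t) n == len) {
        return 0;
    }

    /* Fallback: read the full amount from /dev/urandom */
    int fd = open("/dev/urandom", O_RDONLY);
    if (fd < 0) {
        return -1;
    }
    size_t done = 0;
    while (done < len) {
        n = read(fd, buf + done, len - done);
        if (n <= 0) {
            close(fd);
            return -1;
        }
        done += (size_t) n;
    }
    close(fd);
    return 0;
}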

colmmacc commented 9 years ago

Entropy in s2n, an update and a proposal

I've spent the last week or two reading up in more detail on current developments with PRNGs; thanks to Stan from Intuit for sending me some good references. Based on everything I've read so far, I'd like to start with a proposal.

A quick refresher on some of the background

s2n uses randomly-generated data in a few places.

In all but the case of Diffie-Hellman parameters, the entropy data is effectively "public" and is visible directly on the wire in plaintext form. The theoretical worst case for entropy use is a cipher suite such as TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA, which requires all 5 types of randomly-generated data. In particular, because each record requires randomly generated data (for the explicit IV), our source of entropy can quickly become the main performance bottleneck in s2n.

Can we simply use /dev/urandom or getrandom()/getentropy()?

I've spent some time profiling both reads from /dev/urandom and the getrandom()/getentropy() system calls, and I couldn't find a significant performance difference. In all cases they perform poorly when faced with many contended reads from different threads. Even if these patterns could be made lock-free within the kernel, system calls come with their own performance penalties: even the very fastest system calls seem to incur a ~10 microsecond penalty, which is quite a lot in the context of sending a TLS record.

Still, getrandom()/getentropy() are better to use where available, as they avoid fatal error cases around misconfigured chroot jails and file-descriptor exhaustion.

Can we use a cache to alleviate the bottleneck?

An early version of s2n had a thread-local 4KB buffer/cache of data read from /dev/urandom. This drastically reduces contention in the kernel, but it has two other problems:

  1. It means that performance could be quite peaky: occasionally, when the buffer runs out, an s2n operation would appear to pause, which is not ideal.
  2. The buffer was not fork() safe. The BSD/Mac family have the minherit() system call, which can be used to mark areas of memory that should be zeroed on fork(), but this isn't portable to Linux, so we have no robust mechanism to ensure that the same buffer isn't being used in two processes.

    The pthread_atfork callback can be used, but clone() can be called directly, which bypasses it. getpid() can be used as a guard, but this does not protect against PID caching, or against "double fork" problems where a grand-child process may end up with a duplicate PID. getpid() and getppid() can be used together as a guard, but this is not safe against "triple fork" problems (although a grandchild and a great-grandchild each having PID collisions is unlikely, it is possible), and both introduce all of the context-switch costs that come with system calls.
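
A minimal sketch of the pthread_atfork() guard just mentioned: the child handler clears a flag, and callers reseed when they see it cleared. As noted, a direct clone() or syscall(SYS_fork) bypasses this entirely; the names here are illustrative:

#include <pthread.h>

static volatile int prng_seeded = 0;

static void prng_on_fork_child(void)
{
    prng_seeded = 0;   /* force a reseed in the child process */
}

static void prng_install_fork_guard(void)
{
    /* prepare and parent handlers are not needed for this guard */
    pthread_atfork(NULL, NULL, prng_on_fork_child);
    prng_seeded = 1;
}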

Proposal: move to per-connection CSRNG streams

In s2n we have s2n_connection objects, which application authors are encouraged to keep in pools and re-use across many TCP connections/sessions. Part of the s2n API contract is that a connection may only be safely accessed from at most two threads at a time: a single reader and a single writer.

This makes the s2n_connection object a good place to keep some state around entropy generation. I'd like to propose adding two new pieces of state to the object:

struct s2n_connection {
    ...
    s2n_entropy_generator public;   /* for data visible on the wire, e.g. explicit IVs */
    s2n_entropy_generator private;  /* for sensitive data, e.g. PFS key generation */
};

Each entropy generator would be periodically seeded from /dev/urandom, or getrandom()/getentropy() as appropriate, and would be used for any randomly generated data associated with that s2n_connection.

Seeding would occur when s2n_connection_new() is called, and when a hard-coded limit on the entropy volume is hit. To avoid performance stalls, we could choose to reseed as part of s2n_connection_wipe(), which happens outside of the protocol-handling phase. So, for example, we might put a hard limit of 2^62 bytes before reseeding MUST occur, but opportunistically perform it within s2n_connection_wipe() if more than 2^60 bytes have been generated.
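
A sketch of those two thresholds; the struct layout and names here are hypothetical, not s2n's actual types:

#include <stdint.h>

#define S2N_RESEED_HARD_LIMIT (1ULL << 62)  /* generate path MUST reseed by here */
#define S2N_RESEED_SOFT_LIMIT (1ULL << 60)  /* opportunistic reseed in the wipe path */

struct s2n_entropy_generator {
    uint8_t  seed[32];
    uint64_t bytes_generated;
};

static void s2n_generator_reseed(struct s2n_entropy_generator *gen)
{
    /* hypothetical: refill gen->seed from the kernel, reset the counter */
    gen->bytes_generated = 0;
}

/* Called from s2n_connection_wipe(), outside of protocol handling */
static void s2n_generator_maybe_reseed(struct s2n_entropy_generator *gen)
{
    if (gen->bytes_generated > S2N_RESEED_SOFT_LIMIT) {
        s2n_generator_reseed(gen);
    }
}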

Additionally, we will use minherit() and pthread_atfork() to zero a small piece of global state. Reseeding will also be forced whenever this global state is detected to be zero.

Why public and private?

Since there is still some small risk of a very broken application fork()/clone()-ing and copying s2n_connection object state between processes, we do our best to be defensive against this by never mixing the "private" entropy we use (e.g. for DH parameters) with the "public" entropy that is visible on the wire.

Which PRNG should we use?

Since the data we randomly generate must be cryptographically secure and unpredictable, even when previous outputs are public, we must use a CSRNG. arc4random, Yarrow, and the DRBG family are all examples. Since we would like to add support for ChaCha20 as a stream cipher anyway, and ChaCha20 can be used as a CSRNG, I'm proposing that we use ChaCha20 here. ChaCha20 is in the public domain, and BSD-licensed references on using ChaCha20 as a CSRNG are available (see http://cvsweb.netbsd.org/bsdweb.cgi/src/lib/libc/gen/arc4random.c?rev=1.28&content-type=text/x-cvsweb-markup for example).

A brief summary of the algorithm for using ChaCha20 as a CSRNG is:

  1. s is the initial 32-byte (256-bit) seed.
  2. Compute the 64-byte quantity x = ChaCha20_s(0).
  3. Split x into two 32-byte quantities, s' and k.
  4. Replace s with s'.
  5. Use k as output.
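
To make those five steps concrete, here is a minimal sketch assuming OpenSSL 1.1.0+'s EVP_chacha20() for the keystream; the chacha_rng type and function names are illustrative only, not s2n's actual API:

#include <stdint.h>
#include <string.h>
#include <openssl/evp.h>
#include <openssl/crypto.h>

struct chacha_rng {
    uint8_t seed[32];                /* s: the current 256-bit key */
};

/* Produce one 32-byte output block, replacing the seed on each call */
static int chacha_rng_generate(struct chacha_rng *rng, uint8_t out[32])
{
    uint8_t zeros[64] = { 0 };
    uint8_t x[64];                   /* x = ChaCha20_s(0): 64 keystream bytes */
    uint8_t iv[16] = { 0 };          /* counter and nonce both zero */
    int len, ok = 0;

    EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
    if (ctx == NULL) {
        return -1;
    }

    if (EVP_EncryptInit_ex(ctx, EVP_chacha20(), NULL, rng->seed, iv) == 1 &&
        EVP_EncryptUpdate(ctx, x, &len, zeros, sizeof(zeros)) == 1) {
        memcpy(rng->seed, x, 32);    /* s <- s': the old key is erased */
        memcpy(out, x + 32, 32);     /* k: handed back as output */
        ok = 1;
    }

    EVP_CIPHER_CTX_free(ctx);
    OPENSSL_cleanse(x, sizeof(x));
    return ok ? 0 : -1;
}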

ChaCha20 is presently in both LibreSSL and BoringSSL, though not yet in mainline OpenSSL 1.0.2 (but it should be shortly).

Trade-offs

The downside of this is more memory and state usage. ChaCha20 requires about 128 bytes of state, so with two generators we would be adding around 256 bytes of state to each connection object.

Using our generators

"Public" data can be used directly, even with ECDSA signatures, and no special tricks are required. For the private data, we need to over-ride libcrypto's random method. This is supported by both LibreSSL and OpenSSL, but not BoringSSL (who have removed the ability to over-ride entirely). Thus BoringSSL would use its own, slower, buffer-based system for the private random data. Since there is no risk of overlap between the streams in any cases, this seems acceptable.

colmmacc commented 9 years ago

Update:

I've got an experimental local branch part way to implementing the above proposal, but in the process I've realised that it's not possible to use different RNGs within libcrypto on a per-connection basis without being non-reentrant. It is possible to set a new default ENGINE for handling RAND, and internally it boils down to a small number of pointer manipulations, so it could be speedy, but if we were interrupted by a signal, the random method would be left pointing at the per-connection method.

Because of that, pragmatically the best thing may be to use a single, thread-local generator for the private libcrypto-generated data (i.e. the DH parameters).

colmmacc commented 9 years ago

PR #61 addresses this issue.