I've spent the last week or two reading up in more detail on current developments with PRNGs, thanks to Stan from Intuit for sending me some good references. Based on everything I've read so far, I'd like to start with a proposal.
s2n uses randomly-generated data in a few places:
In all but the case of Diffie Hellman parameters, the entropy data is effectively "public" and is visible directly on the wire in plaintext form. The theoretical worst case for entropy use is a cipher suite such as TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA, which requires all 5 types of randomly-generated data. In particular, because each record requires randomly generated data (for the explicit IV), our source of entropy can quickly become the main performance bottleneck in s2n.
I've spent some time profiling both reads from /dev/urandom and the getrandom()/getentropy() system calls, and I couldn't find a significant performance difference. In all cases they perform poorly when faced with many contended reads from different threads. Even if these patterns could be made lock-free within the kernel, system calls come with their own performance penalties: even the very fastest system calls seem to incur a ~10 microsecond penalty, which is quite a lot in the context of sending a TLS record.
Still, getrandom()/getentropy() are better to use where available, as they avoid fatal error cases around mis-configured chroot jails and file descriptor exhaustion.
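To make that concrete, here's a rough sketch (not s2n's actual code; the helper name is made up) of a seeding routine that prefers the getrandom() syscall and falls back to reading /dev/urandom:

/* Hypothetical seeding helper: prefer getrandom(2), fall back to /dev/urandom. */
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>
#if defined(__linux__)
#include <sys/syscall.h>
#endif

static int get_seed(void *buf, size_t len)
{
#if defined(SYS_getrandom)
    /* Use the raw syscall so we don't depend on a libc wrapper being present. */
    long r = syscall(SYS_getrandom, buf, len, 0);
    if (r == (long) len) {
        return 0;
    }
#endif
    /* Fall back to /dev/urandom, handling short reads and EINTR. */
    int fd = open("/dev/urandom", O_RDONLY);
    if (fd < 0) {
        return -1;   /* e.g. fd exhaustion or a chroot without /dev/urandom */
    }
    size_t off = 0;
    while (off < len) {
        ssize_t n = read(fd, (char *) buf + off, len - off);
        if (n < 0 && errno == EINTR) {
            continue;
        }
        if (n <= 0) {
            close(fd);
            return -1;
        }
        off += (size_t) n;
    }
    close(fd);
    return 0;
}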
An early version of s2n had a thread-local 4KB buffer/cache of data read from /dev/urandom. This drastically reduced contention on the kernel, but it had two other problems:
In s2n we have s2n_connection objects, which application authors are encouraged to pool and re-use across many TCP connections/sessions. Part of the s2n API contract is that a connection may be safely accessed by at most two threads at a time: a single reader and a single writer.
This makes the s2n_connection object a good place to keep some state around entropy generation. I'd like to propose adding two new pieces of state to the object:
struct s2n_connection {
    ...
    /* Seeded independently of each other: "public" feeds randomly-generated
     * data that is visible on the wire, "private" feeds data that must stay
     * secret (e.g. DH parameters). */
    s2n_entropy_generator public;
    s2n_entropy_generator private;
};
Each entropy generator would be periodically seeded from /dev/urandom, or getrandom()/getentropy() as appropriate, and would be used for any randomly generated data associated with that s2n_connection.
Seeding would occur when s2n_connection_new() is called, and when a hard-coded limit on the entropy volume is hit. To avoid performance stalls, we could choose to reseed as part of s2n_connection_wipe(), which is out of line with the protocol handling phase. So for example: we might put a hard limit of 2^62 bytes before reseeding MUST occur, but opportunistically perform it within s2n_connection_wipe() if more than 2^60 bytes have been generated.
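As a rough sketch of that policy (the constants, the bytes_generated field, and the reseed helper are hypothetical, not s2n's actual API):

/* Hypothetical limits matching the policy above. */
#define S2N_ENTROPY_HARD_LIMIT (1ULL << 62)  /* reseeding MUST happen before this */
#define S2N_ENTROPY_SOFT_LIMIT (1ULL << 60)  /* opportunistic reseed threshold */

/* Opportunistic reseed, out of line with the protocol handling phase. */
static void s2n_connection_wipe_reseed(struct s2n_connection *conn)
{
    if (conn->public.bytes_generated > S2N_ENTROPY_SOFT_LIMIT) {
        s2n_entropy_generator_reseed(&conn->public);     /* hypothetical helper */
    }
    if (conn->private.bytes_generated > S2N_ENTROPY_SOFT_LIMIT) {
        s2n_entropy_generator_reseed(&conn->private);
    }
}

/* Hard limit, enforced whenever a generator is asked for output. */
static void s2n_entropy_generator_check(s2n_entropy_generator *gen)
{
    if (gen->bytes_generated >= S2N_ENTROPY_HARD_LIMIT) {
        s2n_entropy_generator_reseed(gen);
    }
}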
Additionally, we will use minherit() and pthread_atfork() to zero a small piece of global state. Reseeding will also be forced whenever this global state is detected to be zero.
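Sketched roughly (the globals, the seeded_fork_count field, and the reseed hook are hypothetical, and locking is glossed over):

#include <pthread.h>
#include <stdint.h>

/* s2n_fork_flag would live on a page also marked minherit(INHERIT_ZERO) where
 * available, so it reads as zero in any child process, even one created with a
 * raw fork()/clone() syscall that bypasses the pthread_atfork() handlers. */
static volatile uint32_t s2n_fork_flag = 1;  /* zeroed in forked children */
static uint64_t s2n_fork_count = 0;          /* bumped each time a fork is noticed */

static void s2n_atfork_child(void)
{
    s2n_fork_flag = 0;
}

static void s2n_fork_detection_init(void)
{
    pthread_atfork(NULL, NULL, s2n_atfork_child);
}

/* Called before a generator produces any output. */
static void s2n_entropy_generator_check_fork(s2n_entropy_generator *gen)
{
    if (s2n_fork_flag == 0) {
        /* We are in a freshly forked child: note it, exactly once. */
        s2n_fork_count++;
        s2n_fork_flag = 1;
    }
    if (gen->seeded_fork_count != s2n_fork_count) {
        s2n_entropy_generator_reseed(gen);       /* hypothetical reseed hook */
        gen->seeded_fork_count = s2n_fork_count;
    }
}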
Since there is still some small risk of a very broken application fork()/clone()-ing and copying s2n_connection object state between processes, we do our best to be defensive against this by never mixing the "private" entropy we use (e.g. for DH parameters) with the "public" entropy that is visible on the wire.
Since the data we randomly generate must be cryptographically secure and unpredictable, even when previous results are public, we must use a CSRNG; arc4random, Yarrow, and the DRBG family are all examples. Since we would like to add support for ChaCha20 as a stream cipher, and ChaCha20 can be used as a CSRNG, I'm proposing that we use ChaCha20 here. ChaCha20 is available in the public domain, and BSD-licensed references are available on using ChaCha20 as a CSRNG (see http://cvsweb.netbsd.org/bsdweb.cgi/src/lib/libc/gen/arc4random.c?rev=1.28&content-type=text/x-cvsweb-markup for example).
A brief summary of the algorithm for using ChaCha20 as a CSRNG is:
s is the initial 32-byte (256-bit) seed
1. compute the 64-byte quantity x = ChaCha20_s(0)
2. split x into two 32-byte quantities s' and k
3. replace s with s'
4. use k as output
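A minimal C sketch of one step of this construction, assuming a chacha20_keystream() primitive is available (e.g. from LibreSSL or BoringSSL; the function name and glue here are illustrative, not s2n's actual code):

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed primitive: write `len` bytes of ChaCha20 keystream for the given
 * 32-byte key, starting at block counter 0 with a fixed nonce. */
void chacha20_keystream(const uint8_t key[32], uint8_t *out, size_t len);

typedef struct {
    uint8_t seed[32];   /* s: the current 256-bit seed */
} chacha20_rng;

/* One step: x = ChaCha20_s(0); s <- first 32 bytes of x; output the rest. */
static void chacha20_rng_step(chacha20_rng *rng, uint8_t out[32])
{
    uint8_t x[64];

    chacha20_keystream(rng->seed, x, sizeof(x));

    memcpy(rng->seed, x, 32);   /* replace s with s', erasing the prior state */
    memcpy(out, x + 32, 32);    /* k goes to the caller */

    memset(x, 0, sizeof(x));    /* scrub key material from the stack */
}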
ChaCha20 is presently in both LibreSSL and BoringSSL, though not yet in mainline OpenSSL 1.0.2 (it should land shortly).
The downside of this is more memory and state usage. ChaCha20 requires about 128 bytes of state, so we would be adding around 256 bytes of state to each connection object.
"Public" data can be used directly, even with ECDSA signatures, and no special tricks are required. For the private data, we need to over-ride libcrypto's random method. This is supported by both LibreSSL and OpenSSL, but not BoringSSL (who have removed the ability to over-ride entirely). Thus BoringSSL would use its own, slower, buffer-based system for the private random data. Since there is no risk of overlap between the streams in any cases, this seems acceptable.
I've got an experimental local branch part way to implementing the above proposal, but in the process I've realised that it's not possible to use different RNGs within libcrypto on a per-connection basis without becoming non-reentrant. It is possible to set a new default ENGINE for handling RAND, and internally it boils down to a small number of pointer manipulations, so it could be speedy, but if we were interrupted by a signal the random method would be left pointing at the per-connection method.
Because of that, pragmatically the best thing may be to use a single, thread-local generator for the private libcrypto-generated data (i.e. the DH parameters).
PR #61 addresses this issue.
Right now s2n uses /dev/urandom for all entropy, following the example of the excellent NaCl library; the rationale is well summarized by Thomas Ptacek: http://sockpuppet.org/blog/2014/02/25/safely-generate-random-numbers/ .
The performance problem with urandom
The problem with /dev/urandom, though, is that it is slow; at present it is the main (and really the only) performance bottleneck in s2n. Reading from /dev/urandom involves kernel-level locks and context switching. A quick benchmark shows that it's about a hundred times slower than using an in-process PRNG.
Will a PRNG work?
A modern PRNG should be safe for generating unpredictable and randomly distributed data once it has been seeded with about 256 bits of meaningful entropy, so an in-process PRNG can be secure in that sense.
Unfortunately PRNGs are also inherently stateful, and it is very difficult to guarantee that the internal state won't be replicated between processes. C11 (and the widely supported __thread extension) does offer strong guarantees around thread-local storage, and it's relatively easy to ensure that two threads are seeded independently, but so far I've found no ironclad way to ensure that a fork(), or worse a syscall(SYS_fork...), won't result in two processes having identically seeded PRNGs. This is particularly bad with TLS, because randomly-distributed data can sometimes be public (explicit IVs) and sometimes be very sensitive and private (PFS key generation).
One initially promising solution is to guard access to the PRNG with a getpid() call, and to detect changes in PID and reseed when needed. Unfortunately this also turns out not to be reliable: libc can cache the PID value and get it wrong in some circumstances (see http://yarchive.net/comp/linux/getpid_caching.html).
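For reference, the pattern being rejected looks roughly like this (the reseed hook is hypothetical):

#include <sys/types.h>
#include <unistd.h>

/* Hypothetical reseed hook. */
extern void prng_reseed(void);

static pid_t seeded_pid;

/* Reseed whenever the PID appears to change. Unreliable in practice:
 * libc may cache the PID, so a fork can go unnoticed. */
static void prng_check_pid(void)
{
    pid_t pid = getpid();
    if (pid != seeded_pid) {
        prng_reseed();
        seeded_pid = pid;
    }
}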
Why this matters
Since it's feasible that s2n will be used inside language runtimes (e.g. the JVM, Ruby, Python, ...), it's not unrealistic to expect the calling application to call fork(), vfork() or even syscall(SYS_fork), so it's best to protect against this in a robust way.
Potential solutions
If getentropy()/getrandom() have better locking/performance semantics, using them may help, but these are not widely available. Another solution may be to use the RDRAND instruction directly where available, but this comes with its own challenges. See also djb's write-up at: http://blog.cr.yp.to/20140205-entropy.html.
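As an illustration of the RDRAND option (compile with -mrdrnd and check CPUID at runtime before relying on it; this is only a sketch, and whether to trust RDRAND as a sole entropy source is one of the challenges mentioned above):

#include <immintrin.h>
#include <stdint.h>

/* RDRAND can transiently fail, so Intel's guidance is to retry a few times. */
static int rdrand64(uint64_t *out)
{
    for (int i = 0; i < 10; i++) {
        unsigned long long value;
        if (_rdrand64_step(&value)) {
            *out = value;
            return 0;
        }
    }
    return -1;  /* persistent failure: fall back to another entropy source */
}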