heartsucker commented 6 years ago

Feature request

Description

Currently we store source keys in the GnuPG keyring using their hashed codename as a UID and codename as a password. We keep the codename in the session in order to be able to reaccess the PGP keys.

Instead, we could add three fields to the sources table: key_salt, encrypted_key, public_key. These would be used with a PBKDF (maybe just scrypt again) for encrypting / decrypting a key. Alternatively, we could use the output of the PBKDF as the seed value for an ed25519 key and not store the private key at all.

This key would be used for encrypting/decrypting messages to/from the journalists. We could store the files on disk using just the source's id alone. The filesystem_id is stored in the DB anyway, so we're not obfuscating anything in this case (neither the PGP key nor the encrypted files on disk).

If we use the PDKDF to seed an ed25519 key, the key would only exist in memory while a source is interacting with SecureDrop. This would make it more difficult to recover journalist-to-source communications in the event of a breach.

It would also mean we could take the codename out of the session (related: https://github.com/freedomofpress/securedrop/issues/204).

Further, we could implement message signing using these keys so that if the server is breached, injecting false messages into the submissions table would be detected and rejected. This would stop an attacker from adding messages into past conversations in hopes of getting a journalist to reveal information that could de-anonymize the source.

This would not remove the submission PGP key from the workflow. It would still be used for the main encryption, but once the SD Workstation code is implemented, messaging would look like this:

pbkdf-sequence

The above sequence diagram shows the ideal case using the SD Workstation, but this approach would still be beneficial even with the current architecture.

User Stories

As one 'noided hombre, I don't want someone with access to a compromised SecureDrop to be able to read messages intended for sources.

As someone who is sometimes skeptical of GPG, I want to use simpler tools for handling message encryption/decryption.

As GCHQ, I want my dastardly plans foiled as often as possible because I'm basically useless.

psivesely commented 6 years ago

PBKDF (maybe just scrypt again)

scrypt is not a key derivation function, it is a sequential memory-hard function. If you read the paper there's no reduction of the pseudorandomness of scrypt(U_n) to the pseudorandomness of its underlying components [1]. Chapter 7 tries to argue that it can provide a strong key derivation function, but it lacks an actual reduction proof to the PRF it uses. That said, we don't actually need a key derivation function because the codename can already be represented as a pseudouniformly distributed string*, and key derivation functions are for generating pseudorandom strings from "key source material" (i.e., something not uniformly distributed, but that contains a good amount of randomness, and that an attacker may have partial knowledge of) [2].

If you want to derive a key from a codename the best way to do it is (a) use the codename-as-pseudouniformly-distributed-string to seed a RNG, and (b) use that RNG to deterministically generate a key. What's wrong with this approach is it assumes that the set of codenames is sufficiently large, which at ~90.5 bits (for the EFF English list), it's not. This is where memory-hard functions come in to slow brute-force attacks. What we really want is a seed that is both a pseudorandom string and can only be computed by a memory-hard function. Here's a simple example of how to generate such a seed:

seed = PRF(salt, codename-as-string) ⊕ MHF(salt, codename-as-string)

where PRF is any pseudorandom function and MHF is any memory-hard function and their output lengths are the same. Since a pseudorandom string XOR'd with an arbitrary equal length string produces a pseudorandom string, clearly our seed will be pseudorandom. Since to compute our seed, we need to compute a memory-hard function, this seed generation algorithm will be memory-hard. An instantiation of the PRF might be ChaCha20, keyed BLAKE2b, or KMAC and an instantiation of the MHF might be Argon2i.

I note it's crucial to separate the notions of a KDF, PRF, and MHF.

A benefit of generating a key like this that you don't mention is that this way we don't have to wait for journalist to flag a source and then the source to log back in before we generate them a key, which is a really bad user experience IMO. Clearly, this is not actually mutually exclusive with storing their encrypted private key on disk, the real take away is that the pseudorandomness of the key primes is based on the pseudorandomness of the codename, instead of generating an additional 4096 bits on the server**.

Instead, we could add three fields to the sources table: key_salt, encrypted_key, public_key. These would be used with a PBKDF (maybe just scrypt again) for encrypting / decrypting a key.

Generating a user-specific salt will not make it harder to guess a codename-derived key. Users are identified by a scrypt hash that is generated using an instance-specific salt. If an attacker gains access to the database, then they what they would do is guess random codenames, scrypt them with this instance-specific salt, and compare the result against all user identifiers. If there's a match, then they would use the user-specific salt and that codename to generate the private key for that user.

Without usernames, I don't see a way to make user-specific salts work. To use source-specific salts we need to associate a user identifier, but we can only store source identifiers that have been computed by MHFs, and to compute those identifiers we have to use a fixed-salt (since we don't know who the user is yet).

Therefore, nobody will pick a particular source ID and salt, and guess codenames until there is a public key match. Instead they will use the fixed-salt and match against identifiers because that way they increase their success chances by a factor of s where s is the total number of sources (not to mention the overhead on additionally running the key generation algorithm after running the MHF they avoid this way.)

Alternatively, we could use the output of the PBKDF as the seed value for an ed25519 key and not store the private key at all.

ed25519 is an instantiation of edDSA, which is for signatures. We do want to sign for reasons you mention, but we also need to be able to asymmetrically encrypt to the source's key. We can however use a single key for both decryption and signing, even if we weren't trusting the server to perform all the operations for us, and decryption and signing were implemented in signed JS served to the client [3] (see theorem 2).

That said, switching to ECC is an excellent idea. Especially if this key were to be derived in the clients browser each time, it's considerably faster to generate an ECC key. ECC is also faster for signing, verification, encryption, and decryption. Really the only downside is that it takes a much smaller quantum computer than RSA to break for the same classical security parameters.

If we use the PDKDF to seed an ed25519 key, the key would only exist in memory while a source is interacting with SecureDrop. This would make it more difficult to recover journalist-to-source communications in the event of a breach.

That's already true (if we make the unwarranted assumption that memory is overwritten as soon as the server is finished using the source's private key, which would apply to your suggestion as well). Unless you count the scrypt(salt, codename)-password encrypted private key on disk, which we assume to be useless to an attacker without the codename.

It would also mean we could take the codename out of the session (related: #204).

We can already take the codename out of the session by putting the private key in the client-side session instead. This was something I've been meaning to suggest, but haven't gotten around to. Two things to note here:

This that this saves a lot of computational effort on part of the server because scrypt is heavy and currently we have to run it every single time the source wants to do something with their private key (even just refreshing the submissions page). This will improve both user experience and server load.
The source's private key should be encrypted to a private key stored on the server before being put in the session. Specifically, a new private key should be generated on each initialization of the journalist web application, and stored in memory. This private key should also be used to encrypt the expires field (#1494) while we're at it.

Further, we could implement message signing using these keys so that if the server is breached, injecting false messages into the submissions table would be detected and rejected. This would stop an attacker from adding messages into past conversations in hopes of getting a journalist to reveal information that could de-anonymize the source.

If the server is breached they can serve any message they want journalists. The only way I can see for this to work is to develop a mechanism to store state on the JW (pretty sure there's ongoing progress in this direction). Identity of sources will then be correlated with public keys stored on the JW. Since the journalist codenames are the human identifiers that journalists use to distinguish sources, they will need to be stored JW-side as well.

As someone who is sometimes skeptical of GPG, I want to use simpler tools for handling message encryption/decryption.

You didn't explicitly mention this, but I assume you're saying that GPG should not be used on the server, but that (maybe) compatibility should be maintained for sources who wish to encrypt their messages with GPG before uploading them via the source interface. I agree that GPG is not the highest quality cryptography library, and suggest SD move to https://github.com/pyca/pynacl (potentially exclusively for all crypto needs unless there's some type of function missing), which bundles libsodium, a very highly regarded library.

* Pseudorandom generators work when they are seeded with a pseudorandom string in {0,1}*. Thus we need a way to make sure that the set of all possible codenames is of length 2^n for some n ϵ ℕ. Currently we have n ≈ 90.5 for the English wordlist. We can do this for the EFF wordlist by sampling a 90-bit pseudorandom string, encoding it in base 7,776, and padding the left side of the resulting base-7776 string with 0s if necessary so that it's of length num_words (= 7). Each word is then related with its line number [0,7775], and we pick the 7 words which correspond to the 7 values in our base-7776 string. This defines a simple bijection between codenames and pseudouniform strings.

** I'm skeptical if this is still necessary anyway because of changes to GPG mentioned in #1584.

[1] STRONGER KEY DERIVATION VIA SEQUENTIAL MEMORY-HARD FUNCTIONS https://www.tarsnap.com/scrypt/scrypt.pdf [2] Cryptographic Extraction and Key Derivation: The HKDF Scheme. https://eprint.iacr.org/2010/264.pdf [3] On the Joint Security of Encryption and Signature in EMV*. https://eprint.iacr.org/2011/615.pdf

One more thought: it may be possible to reduce the number of codewords a source codename needs. I would recommend switching to argon2i because among MHFs it gives the honest party the least relative performance hit versus the adversary, and is side-channel resistant. Regardless, here's the formula I would use to determine the minimum number of codewords:

log₂(|c|^n) + f = 128 + s

Where |c| is length of the codeword list, n is the number of codewords, 2^s is the total number of honest sources you expect for a single SD instance to serve, and 2^f is the number of operations (hash function invocations) the parameters we're using for our MHF require. Basically, we want a 128 bit security level, the + s factor on the r.h.s. is to take into account we're working in the single-function, multi-target model [4], and the + f factor on the l.h.s. takes into account how much harder the MHF makes brute-force attacks.

More specifically, you should first figure out what the highest parameters are you can set on your MHF while keeping its runtime below 100ms (this is generally considered reasonable) on the recommended SD hardware. Next, you figure out what f is based on these parameters, optimistically estimate s, and plugin the values to determine n, the number of codewords to use. Since MHFs were designed to protect passwords, which are much easier to guess than a 90 bit pseudouniformly distributed string, it seems reasonable that we might be able to cut off a couple words without risk (that said, we won't know until someone does the benchmarking and calculations). Reducing the number of codewords a source needs to memorize would really improve the user experience.

[4] Mitigating Multi-Target Attacks in Hash-based Signatures. https://eprint.iacr.org/2015/1256.pdf

Edit: updated instantiation recommendations to remove HMAC-based constructions (slow) and scrypt (bad tradeoffs compared to argon2i and suffers from side-channels), and to add ChaCha20 and BLAKE2b (both fast).

psivesely commented 5 years ago

seed = PRF(salt, codename-as-string) ⊕ MHF(salt, codename-as-string)

In practice this is probably safe, but you can create a contrived example such that you can't prove the scheme above is pseudorandom. MHF needs to be treated as an adversary A. If A outputs PRF(salt, codename-as-string), then seed is the zero string. So you need to keep the input to PRF secret from the input to A. This can be achieved by stretching the codename-as-string into two pre-seeds, seed1 and seed2, using a pseudorandom generator (PRG) first, then deriving seed as below.

seed1 || seed2 = PRG(codename-as-string) seed = PRF(salt1, seed1) ⊕ MHF(salt2, seed2)

Note, you may be able to set salt1 = salt2, but I'd have to think about it more. Anyway, I'm just correcting this for the sake of being correct lol.

While I'm here though, I'll again note that scrypt and even argon lacks a security reduction for memory hardness. OTOH, lately there's been a lot of recent research on "proofs of sequential work" or a "verifiable delay functions" that could be used as a better replacement imho, granted they probably need modifications for the side channel resistance that isn't required of their standard use cases.

I still think it's worth looking into whether you can make the codenames shorter safely as I mentioned before. Would be a UX win.

psivesely commented 5 years ago

Continuing this conversation with myself, it turns out that subsequently lower bounds for the memory hardness of Scrypt were proven in the parallel random oracle model: https://eprint.iacr.org/2016/989.pdf. That's re-assuring. To my knowledge, nobody has shown Scrypt is a PRF, but I bet that's also possible in the parallel random oracle model. In short, while there was theoretical substance to my initial objections, I believe in the interest of adhering to practical modern cryptographic practices its use can be considered sufficiently safe. That said, you'd still achieve better security guarantees on the quality of the private keys by instantiating the little scheme I proposed in my last comment with a PRG with a security reduction in the standard model.

freedomofpress / securedrop

Replace per-source PGP keys with PBKDF encrypted/generated ed25519 keys #3281

Feature request

Description

User Stories