freedomofpress / securedrop-protocol

Research and proof of concept to develop the next SecureDrop with end to end encryption.
GNU Affero General Public License v3.0

message_id enumeration requirements #43

Open ayende opened 2 months ago

ayende commented 2 months ago

It looks like a hugely complicated aspect of SecureDrop is the need to avoid message_id enumeration. That is why there is the three-party signing, etc.

What is the impact on the security model if you instead do something like:

  • Source encrypts a file locally using the journalist public key retrieved offline. crypto_seal_box(file, journalist_public_key) => encrypted_file
  • Source POSTs /file/upload encrypted_file to the SecureDrop server. The server computes filename = token_hex(32) and message_id = token_hex(32) and persists that mapping. The server returns the message_id to the source.
  • Source computes crypto_seal_box(message_id, journalist_public_key) => encrypted_message_id

Some time later:

The key aspect is that the Source will also send a few additional "messages". Each one is ~80 bytes, IIRC. That means that on the server, you have no way to tell what is real and what is a dummy. The same message_id can be sent to multiple journalists, or to just one with the rest faked.

In addition to that, the server will proactively generate fake messages on a random basis.
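For concreteness, here is a minimal sketch of that padding idea using PyNaCl sealed boxes; the function names and batching details are my own illustration, not part of any actual protocol:

```python
# Minimal sketch of client- and server-side chaff (illustrative only).
# Requires PyNaCl: pip install pynacl
import secrets
from nacl.public import PrivateKey, SealedBox

def seal_message_id(message_id: bytes, journalist_pk) -> bytes:
    # crypto_box_seal adds 48 bytes (32-byte ephemeral key + 16-byte MAC),
    # so a 32-byte message_id becomes an 80-byte blob (the "~80 bytes" above).
    return SealedBox(journalist_pk).encrypt(message_id)

def source_batch(message_id: bytes, recipients: list, n_dummies: int) -> list:
    """Seal the real message_id to each chosen journalist, then pad with
    dummies sealed to throwaway keys so every entry looks identical."""
    entries = [seal_message_id(message_id, pk) for pk in recipients]
    for _ in range(n_dummies):
        throwaway_pk = PrivateKey.generate().public_key
        entries.append(seal_message_id(secrets.token_bytes(32), throwaway_pk))
    secrets.SystemRandom().shuffle(entries)  # hide which entries are real
    return entries

def server_dummy() -> bytes:
    # The server can mint indistinguishable chaff the same way, on a random schedule.
    throwaway_pk = PrivateKey.generate().public_key
    return SealedBox(throwaway_pk).encrypt(secrets.token_bytes(32))
```

Since a sealed box carries only an ephemeral public key and ciphertext, nothing in the 80-byte blob identifies the recipient, which is what makes the dummies indistinguishable from real entries.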

A journalist (or attacker) can observe:

Note that as it currently stands, even though SecureDrop isn't meant to scale, there are fairly easy ways to tell how many messages there are (and to detect when a new one arrives). I can push 1001 dummy messages to the SecureDrop, leading to either:

Given that you need to allow anyone to scan through all the messages in the system, and that anyone can also add items, it is easy to test how many items are there. Given that, I don't know what the impact would be of just skipping the three-party dance and dealing with it directly.
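As a sketch of that counting probe (the /messages endpoint, the server address, and the use of requests are assumptions for illustration, not the real API):

```python
# Hypothetical probe: list the public message set, push dummies, diff the counts.
import requests

BASE = "http://example-securedrop-server.onion"  # hypothetical address

def count_messages() -> int:
    return len(requests.get(f"{BASE}/messages").json())

before = count_messages()
for _ in range(1001):                    # the 1001 dummy submissions above
    requests.post(f"{BASE}/messages", json={"message": "00" * 80})
after = count_messages()

# Any growth beyond our own 1001 submissions came from real sources, and
# re-running the probe over time detects each new arrival.
others = after - before - 1001
```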

lsd-cat commented 2 months ago

Rate limiting in whistleblowing systems is a very complicated and, AFAIK, unsolvable issue server-side. We cannot really limit spam submissions in general, or filter legitimate sources from malicious ones. With everything happening over the Tor network and no accounts required (which is already the case now), there is no known way to rate limit server-side without risking impairing the experience of legitimate sources. This is common to all whistleblowing systems of this kind that deal with anonymous users and encrypted content; our goal is instead to make filtering and management easy for journalists via the SecureDrop Client.

Note that as it currently stands, even though SecureDrop isn't meant to scale, there are fairly easy ways to tell how many messages there are (and to detect when a new one arrives). I can push 1001 dummy messages to the SecureDrop, leading to either:

If you do that, it is an easy-to-detect attack, and we then have knowledge that someone is attacking the server. It is also not complicated to mitigate as things are now: the server just needs to refuse new messages and alert the administrator when it reaches 10000 - rand(0, 500).
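A minimal sketch of that cutoff, assuming an in-memory store and a stand-in alerting hook:

```python
# Randomized refusal threshold: the attacker never learns the exact capacity.
import secrets

HARD_CAP = 10_000
cutoff = HARD_CAP - secrets.randbelow(501)   # 10000 - rand(0, 500)

def alert_administrator(reason: str) -> None:
    print(f"ALERT: {reason}")                # stand-in for real paging

def accept_submission(store: list, message: bytes) -> bool:
    if len(store) >= cutoff:
        alert_administrator("message store near capacity; refusing submissions")
        return False                          # refuse new messages
    store.append(message)
    return True
```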

What is the impact on the security model if you instead do something like:

  • Source encrypts a file locally using the journalist public key retrieved offline. crypto_seal_box(file, journalist_public_key) => encrypted_file
  • Source POSTs /file/upload encrypted_file to the SecureDrop server. The server computes filename = token_hex(32) and message_id = token_hex(32) and persists that mapping. The server returns the message_id to the source.
  • Source computes crypto_seal_box(message_id, journalist_public_key) => encrypted_message_id
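
Putting those three steps together, a minimal sketch with PyNaCl sealed boxes; an in-memory dict stands in for the server's persistence and the HTTP hop is elided:

```python
import secrets
from nacl.public import PrivateKey, SealedBox

# --- server side ------------------------------------------------------
STORE: dict = {}                              # message_id -> filename

def upload_file(encrypted_file: bytes) -> str:
    filename = secrets.token_hex(32)
    message_id = secrets.token_hex(32)
    STORE[message_id] = filename
    # ...write encrypted_file to disk under `filename`...
    return message_id                         # returned to the source

# --- source side ------------------------------------------------------
journalist_sk = PrivateKey.generate()         # held by the journalist
journalist_pk = journalist_sk.public_key      # fetched offline by the source

encrypted_file = SealedBox(journalist_pk).encrypt(b"the submission")
message_id = upload_file(encrypted_file)      # POST /file/upload in practice
encrypted_message_id = SealedBox(journalist_pk).encrypt(message_id.encode())
```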

This has the same practical upper limit as our current mechanism. I picked 10K because it seemed a decent upper limit given both HTTP requests that would otherwise grow to megabytes each and the amount of CPU we can spend client-side on trial decryption. If I am not mistaken, those limits would end up being roughly the same, plus we would lose, as you said, one of the requirements.

ayende commented 2 months ago

I'm asking why you care about that requirement. You are spending a lot of engineering effort on it, and it imposes limits elsewhere in your architecture.

Consider the scenario of just aggregating crypto_seal_box(message_id, journalist_public_key) values plus dummy values. You can generate a constant rate of ~200 of those every 5 minutes.

You can then output a file with those values (real messages and dummies from the clients, plus dummies from the server).

You generate a 16KB file every 5 minutes and put it in a public folder. Anyone can read those files. As a client, you try to decrypt the values.

As an attacker, you cannot tell whether those values are dummies or real. The total stored volume is ~1.6GB/year, and you are able to handle >21M messages a year.

Note that from a client perspective, every 5 minutes you download the latest file. You could even publish it as something like an RSS feed.
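
A minimal sketch of that batch-and-trial-decrypt flow; the constants and file layout are assumptions taken from the numbers above:

```python
import secrets
from nacl.exceptions import CryptoError
from nacl.public import PrivateKey, SealedBox

BATCH_SIZE = 200    # ~200 entries per 5-minute batch
ENTRY_LEN = 80      # sealed 32-byte message_id: 32 + 48 bytes of overhead
# 200 * 80 B = 16 KB per file; 105,120 files/year ~ 1.6 GB total,
# carrying up to 200 * 105,120 ~ 21M messages/year, as stated above.

def publish_batch(real_entries: list) -> bytes:
    """Pad the real sealed message_ids with server dummies up to a fixed size."""
    entries = list(real_entries)
    while len(entries) < BATCH_SIZE:
        throwaway_pk = PrivateKey.generate().public_key
        entries.append(SealedBox(throwaway_pk).encrypt(secrets.token_bytes(32)))
    secrets.SystemRandom().shuffle(entries)
    return b"".join(entries)                  # constant 16 KB public file

def fetch_mine(batch: bytes, my_sk: PrivateKey) -> list:
    """Trial-decrypt every entry; only entries sealed to my key will open."""
    mine = []
    for i in range(0, len(batch), ENTRY_LEN):
        try:
            mine.append(SealedBox(my_sk).decrypt(batch[i:i + ENTRY_LEN]))
        except CryptoError:
            pass                              # a dummy, or someone else's entry
    return mine
```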

Given this scheme, I'm trying to figure out what the hidden message_id gives you over this.

lsd-cat commented 2 months ago

The difference is that right now, only the server can run statistical attacks on submission and access patterns. If we drop that requirement, then anybody on the internet could. That is one of the reasons why we care about it: in the end we have to produce decoy traffic anyway to make such attacks difficult for the server, but in the meantime, nobody outside that privileged position can observe anything.

We could debate the engineering effort, but if we are discussing scalability, I believe the limits are going to be basically the same whether we do what we are proposing or a different iteration of trial decryption as you suggest.

ayende commented 2 months ago

I disagree with that. From any outside observer's point of view, there is a constant rate of new submissions, and no way for anyone to tell whether they are real or fake.

This is the same principle as always sending data over the wire even when there is nothing to say, so that traffic analysis cannot detect when real communication happens.