Clients may throw away one-time keys which are still published on the server, or have messages in flight

richvdh commented 7 years ago

Background: On first login, clients generate about 50 one-time-keys. Each key consists of a public part and a secret part. The public parts are uploaded to the server, and the secret parts are retained on the device. A public one-time-key can then be "claimed" later by other devices, to initiate a secure Olm channel between the two devices. (The secure channel is then used for sharing Megolm message keys.) Whenever a client finds that there are less than 50 keys on the server, it will generate more key pairs and upload the public parts.

However, there is a limit to the number of secret keys that a client can keep hold of, and over time it may accumulate unused secret keys that appear to have never been used. The client has little alternative but to throw away such old keys.

Problem: there is nothing in the spec about the order in which servers give out one-time keys. In particular, Synapse does not give them out in the obvious first-in first-out order (see below). It is therefore possible for some very old public keys to hang around on the server. By the time those keys are eventually claimed by the sender, it is quite possible that the receiving device has forgotten the secret part of the corresponding key.

The net effect is that the recipient cannot read the messages sent over the Olm channel, does not receive the message keys, and the user sees an Unable To Decrypt error.

(Worse: we have mitigations to "unwedge" such broken Olm channels, but they don't actually work very well in this case due to https://github.com/matrix-org/matrix-rust-sdk/issues/3427.)

[x] Fix in Synapse: https://github.com/element-hq/synapse/pull/17903
[x] Write MSC to mandate FIFO for OTKs: https://github.com/matrix-org/matrix-spec-proposals/pull/4225
[x] Write complement test: https://github.com/matrix-org/complement/pull/744

Original description:

element-hq/element-web#2782 was a special-case of this, but there are other cases when we can throw away important one-time keys. It may be impossible to make this completely water-tight, but I'd like to add some logging when we throw away one-time keys, and reconsider the heuristic. (Does /claim give out the oldest key? which is the one we expire first?)

richvdh commented 7 years ago

(Does /claim give out the oldest key? which is the one we expire first?)

The specific scenario I'm worrying about here is:

Alice creates 50 OTKs, numbered 1 to 50, and uploads them to the server.
Bob wants to message Alice and claims one of the keys. Alice's server hands out one of the keys randomly: let's say key 10. Alice does not immediately receive Bob's pre-key message. (Potentially, Bob's client goes offline. Or Bob may even claim the key maliciously.)
Alice's server tells her that she has only 49 OTKs left on the server. She must make a new key; to do so, she must throw away the private part of one of her existing OTKs. She throws away the oldest one - key 1. However, the public part of that key is still on Alice's server.
Later, Carol wants to message Alice, and claims a key. She is given key 1. Alice cannot decrypt it, because she threw away the private part. Sad times.

Synapse's algorithm for picking a key for /keys/claim is:

SELECT key_id, key_json FROM e2e_one_time_keys_json WHERE user_id = ? AND device_id = ?
 AND algorithm = ? LIMIT 1

Since that's totally unsorted, you've got a good chance of some very old keys sitting around for ages.

MadLittleMods commented 1 year ago

@richvdh Is this issue still useful to you?

Besides fixing the issue itself, has logging around this been added with the latest crypto updates (perhaps in the Rust implementation)? Is this area easier to trace now?

richvdh commented 1 year ago

It's still a potential problem. I don't really remember if more logging got added; the fact that nothing is linked here suggests that it didn't.

Element-R won't necessarily fix the problem in itself but is rewriting this area, so it's not worth investigating in the legacy implementation.

BillCarsonFr commented 8 months ago

Rust SDK is keeping a lot more OTK (50 * 100). Lib olm appears to keep 100. So this would be a lot less common on rust.

richvdh commented 8 months ago

This is mitigated in Element R, because the Rust SDK keeps up to 5000 OTKs before it starts throwing them away.

However, it is still a problem, because that 5000 limit will eventually fill up (from network blips etc which cause keys to be claimed but not used), and then the Rust SDK will start throwing keys away. Whilst the Rust SDK will throw away the "oldest" key, we have no guarantee that that key is not still active on the server, so the problem recurs.

kegsay commented 8 months ago

To further clarify why this is a problem, consider what happens when all 5000 keys are used on the client and 50 of those keys are on the server and someone calls /keys/claim for that device:

Synapse picks a random key to give to the caller of /keys/claim,
Synapse then drops the count to 49 and sends it down /sync,
The client sees 49 and uploads one more OTK. It has 5001 keys now so needs to throw away a key, which it does by age (oldest key gets thrown).
What happens if the oldest key was the one that Synapse provided to the caller?

In this case, when the caller encrypts for that device, the pre-key message will be undecryptable because the client threw away the OTK private key.

kegsay commented 7 months ago

This is a ticking time bomb. We are likely to see this as the age of client's DBs get older and more OTKs accumulate which are never claimed, consuming the 5000 buffer.

A heuristic like "oldest" could work, or perhaps more usefully, the server could send back the key IDs it gave out rather than just decrementing the count by 1. The latter is more invasive though.

t3chguy commented 7 months ago

Moving as this seems to apply to all clients, including the Rust SDK based ones

richvdh commented 4 months ago

Synapse's algorithm for picking a key for /keys/claim is:
SELECT key_id, key_json FROM e2e_one_time_keys_json WHERE user_id = ? AND device_id = ?
 AND algorithm = ? LIMIT 1
Since that's totally unsorted, you've got a good chance of some very old keys sitting around for ages.

It's actually worse than "totally unsorted". In practice, postgres will use the first matching entry from the e2e_one_time_keys_json_uniqueness index, which covers (user_id, device_id, algorithm, key_id) -- so Synapse will end up handing out the OTK with the lexicographically-lowest key_id.

Now, the key ID is a base64-encoding of a 32 bit int; in particular that means that, for example, the key ids for the 208th through 255th keys (AAAA0A - AAAA/w) sort before those of the 0th through 207th keys (AAAAAA-AAAAxw). So as soon as the 208th key is published, it will be issued by the next /keys/claim, even though we may have many of the earlier keys sitting around -- none of which will be issued until the client publishes another 50 keys or so. This pattern repeats for every 4-character prefix (256 keys) -- and gets even worse at key 13312 (key ID AAA0AA), which sorts before any earlier-generated key.

Key IDs later in the alphabet (AAA.?., where . is any character and ? is, say, s-z) are particularly likely to get dropped because of this: they are the keys likely to be unclaimed at the point the client starts publishing earlier-sorted key IDs, and therefore those most likely to still be unclaimed while the client is generating keys with much later key IDs.

kegsay commented 4 months ago

https://github.com/element-hq/synapse/issues/17267 can cause us to have indefinitely claimed OTKs because the receiver timed out the request.

richvdh commented 4 months ago

element-hq/synapse#17267 can cause us to have indefinitely claimed OTKs because the receiver timed out the request.

What do you mean by "indefinitely claimed OTKs"? An OTK that is claimed but never used, causing it to get "stuck" on the client? That's true; the same can result from clients issuing a /keys/claim request but losing connectivity before the response arrives. But that's a separate issue to this one?

Edit: well, it's not a separate issue: it's the reason that clients end up having to throw away the private part of OTKs. What I really mean is: it's pretty much expected behaviour.

kegsay commented 4 months ago

Indeed, the reason why I'm flagging it is because it outlines a real-world example of how we can accumulate keys on the client.

element-hq / element-meta

Clients may throw away one-time keys which are still published on the server, or have messages in flight #2356