Epic: One-Time Keys can desync between client and server causing UTDs #2406

Open kegsay opened 2 months ago

kegsay commented 2 months ago

To establish a secure channel between a pair of devices, all Matrix clients use a double-ratchet algorithm called Olm. As part of this establishment, a one-time key (OTK) must be "claimed" (via /keys/claim) by the sending device from the receiving device's server. OTKs are public/private keypairs where the private key is always kept on the device in a database. The public key is uploaded to the user's homeserver, so that even if the device is offline anyone can establish a secure channel to it. OTKs are used once and then deleted from both the client and server databases. All E2EE devices upload a batch of OTKs to the server, and try to keep the number held by the server above a certain threshold as other people constantly claim (and hence delete) OTKs.
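
To make that lifecycle concrete, here is a minimal std-only Rust sketch of the client-side bookkeeping. The type names, the target count and the key generation are all placeholders, not the vodozemac or matrix-rust-sdk API:

```rust
use std::collections::HashMap;
use std::time::{SystemTime, UNIX_EPOCH};

// Hypothetical target; real clients use whatever threshold their crypto library picks.
const TARGET_OTK_COUNT: usize = 50;

#[derive(Clone)]
struct OneTimeKey {
    key_id: String,
    public_key: String,  // the half uploaded to the homeserver via /keys/upload
    private_key: String, // the half that never leaves the device's local database
}

struct OtkStore {
    // Private halves of every OTK this device has uploaded and not yet used.
    private_keys: HashMap<String, OneTimeKey>,
}

impl OtkStore {
    // Called after /sync: `server_count` is how many unclaimed OTKs the server still holds.
    fn replenish(&mut self, server_count: usize) -> Vec<OneTimeKey> {
        let mut to_upload = Vec::new();
        for _ in server_count..TARGET_OTK_COUNT {
            let otk = generate_key();
            self.private_keys.insert(otk.key_id.clone(), otk.clone());
            to_upload.push(otk); // the public halves go to POST /keys/upload
        }
        to_upload
    }

    // Called when a pre-key message arrives: the private half is used once, then deleted.
    fn use_and_delete(&mut self, key_id: &str) -> Option<OneTimeKey> {
        self.private_keys.remove(key_id)
    }
}

// Placeholder: a real client derives a fresh Curve25519 keypair here.
fn generate_key() -> OneTimeKey {
    let id = SystemTime::now().duration_since(UNIX_EPOCH).unwrap().as_nanos();
    OneTimeKey {
        key_id: format!("signed_curve25519:{id}"),
        public_key: "<public>".into(),
        private_key: "<private>".into(),
    }
}

fn main() {
    let mut store = OtkStore { private_keys: HashMap::new() };
    // The server reported only 10 unclaimed OTKs left, so top back up to the target.
    let batch = store.replenish(10);
    println!("uploading {} new one-time keys", batch.len());
    let _ = store.use_and_delete("signed_curve25519:missing"); // returns None: unknown key
}
```

The important properties are that the private half never leaves the device and that each key is deleted after exactly one use, which is what makes any disagreement between the two databases so damaging.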

This issue describes many scenarios where the private keys stored on the device no longer match the public keys stored on the server: in other words, the set of OTKs is no longer synchronised between client and server. If this happens, there will be at least one undecryptable message (UTD). If clients do not automatically recover from wedged Olm sessions, it will potentially affect more than one message.


Client Issues

Clients may forget which OTKs were uploaded to the server, whether through idempotency failures, use of multiple clients in separate processes which do not talk to each other, or other unknown reasons. If this happens, things desync because the client no longer has the private keys for the OTKs stored on the server.

Clients should only delete the private key once the OTK has been successfully used. There have been bugs where the client incorrectly deleted the private key on decryption failures.
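
A small sketch of that rule, using made-up store types rather than any real SDK API: the private key is looked up for decryption but only removed once decryption has actually succeeded.

```rust
use std::collections::HashMap;

#[derive(Debug)]
enum DecryptError {
    UnknownKey,
    BadMessage,
}

struct OtkStore {
    private_keys: HashMap<String, Vec<u8>>, // key id -> private key material
}

impl OtkStore {
    fn peek(&self, key_id: &str) -> Option<&Vec<u8>> {
        self.private_keys.get(key_id)
    }
    fn delete(&mut self, key_id: &str) {
        self.private_keys.remove(key_id);
    }
}

// Placeholder for real Olm session establishment and decryption.
fn establish_session_and_decrypt(_otk: &[u8], ciphertext: &[u8]) -> Result<Vec<u8>, DecryptError> {
    if ciphertext.is_empty() {
        Err(DecryptError::BadMessage)
    } else {
        Ok(ciphertext.to_vec())
    }
}

// The rule above: the private OTK is only removed once decryption has succeeded, so a
// transient failure cannot orphan a key the server has already handed out.
fn handle_prekey_message(
    store: &mut OtkStore,
    key_id: &str,
    ciphertext: &[u8],
) -> Result<Vec<u8>, DecryptError> {
    let otk = store.peek(key_id).ok_or(DecryptError::UnknownKey)?.clone();
    let plaintext = establish_session_and_decrypt(&otk, ciphertext)?;
    store.delete(key_id); // only now is it safe to discard the private key
    Ok(plaintext)
}

fn main() {
    let mut store = OtkStore {
        private_keys: HashMap::from([("AAAAHg".to_string(), vec![1, 2, 3])]),
    };
    match handle_prekey_message(&mut store, "AAAAHg", b"ciphertext") {
        Ok(_) => println!("decrypted; OTK deleted"),
        Err(e) => println!("failed ({e:?}); OTK kept so the message can be retried"),
    }
}
```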

Server Issues

The server database could become corrupted, through a hardware failure or a bad homeserver update. Server admins will usually roll back to a "known good" version of the database when this happens. However, if anyone claimed an OTK in the window between the bad homeserver update and the rollback, that claim will be forgotten by the server, causing a desync. The server may then provide the same OTK to a different user in response to /keys/claim. This is https://github.com/element-hq/element-meta/issues/2155

Protocol Issues

Databases are not infinite. This is especially true on mobile devices. As a result, clients will often keep around the most recent N private OTKs, where N is very large (Vodozemac sets this to const MAX_ONE_TIME_KEYS: usize = 100 * PUBLIC_MAX_ONE_TIME_KEYS; which is currently 100*50 = 5,000). If this number is reached, the client will delete OTKs. Unfortunately, there is no API to tell the server to also delete these OTKs, meaning things will desync. This is https://github.com/element-hq/element-meta/issues/2356
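
For illustration, here is a sketch of that pruning. The constants are the vodozemac values quoted above, but the data structure and the oldest-first eviction are assumptions for illustration, not the vodozemac implementation:

```rust
use std::collections::VecDeque;

// Values copied from the vodozemac constants quoted above.
const PUBLIC_MAX_ONE_TIME_KEYS: usize = 50;
const MAX_ONE_TIME_KEYS: usize = 100 * PUBLIC_MAX_ONE_TIME_KEYS; // 5,000

struct PrivateOtk {
    key_id: String,
}

// Keep at most MAX_ONE_TIME_KEYS private OTKs, dropping the oldest first. Note the
// desync described above: there is no client-server API to tell the homeserver that the
// dropped keys are gone, so their public halves may still be handed out by /keys/claim.
fn prune(keys: &mut VecDeque<PrivateOtk>) -> Vec<PrivateOtk> {
    let mut dropped = Vec::new();
    while keys.len() > MAX_ONE_TIME_KEYS {
        if let Some(oldest) = keys.pop_front() {
            dropped.push(oldest); // forgotten locally, but potentially still claimable server-side
        }
    }
    dropped
}

fn main() {
    let mut keys: VecDeque<PrivateOtk> = (0..MAX_ONE_TIME_KEYS + 3)
        .map(|i| PrivateOtk { key_id: format!("key{i}") })
        .collect();
    let dropped = prune(&mut keys);
    println!("dropped {} keys the server may still try to hand out", dropped.len());
}
```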

Multiple users could hit /keys/claim at the same time the target user is uploading keys. This creates race conditions which can cause the same key to be used multiple times.


What do we do about this?

Some of these issues are just gnarly bugs. Some are failures in the protocol, and some are fundamental problems that cannot be resolved easily. What all these bugs have in common, though, is a desync between client and server. We can do something to detect this and take corrective action.

For example, the server could send a hash of all the public keys (or some other O(1) checksum which does not scale with the number of OTKs on the server) down /sync, which would allow clients to check that the keys stored server-side are in fact the same keys they have stored. If there is a mismatch, there needs to be some API for the client to control which keys are kept and which keys are discarded. If such an API existed, it would address the vast majority of these issues in one go. There are tradeoffs, however.
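
As a very rough sketch of what the client-side check might look like (entirely hypothetical: no such /sync field or checksum is specified, and a real design would want a proper cryptographic hash and canonical encoding rather than Rust's DefaultHasher):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Both sides hash the sorted set of unclaimed public key IDs; any divergence in the
// sets shows up as a digest mismatch.
fn otk_checksum(mut key_ids: Vec<String>) -> u64 {
    key_ids.sort(); // hash in a canonical order so both sides agree
    let mut hasher = DefaultHasher::new();
    for id in &key_ids {
        id.hash(&mut hasher);
    }
    hasher.finish()
}

fn main() {
    // Key IDs the client believes are still unclaimed on the server.
    let client_view = vec!["AAAAHg".to_string(), "AAAAHQ".to_string()];
    // Digest the server would report down /sync (a hypothetical field): here the server
    // has already handed one of the keys out, so the sets no longer match.
    let server_digest = otk_checksum(vec!["AAAAHg".to_string()]);

    if otk_checksum(client_view) != server_digest {
        // Desync detected: this is where the (also hypothetical) reconciliation API for
        // deciding which keys to keep and which to discard would be invoked.
        eprintln!("one-time keys have desynced; reconciliation needed");
    }
}
```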

richvdh commented 2 months ago

There's a lot of words here, and a lot of links to closed issues. I think the actual issues still outstanding are:

There is also https://github.com/element-hq/element-meta/issues/1992, but that is more of a discussion of recovering the situation than a root cause.

Is all that fair so far?

Of the four issues linked above:

So is your proposal really all about dealing with https://github.com/element-hq/element-meta/issues/2155 ?

kegsay commented 2 months ago

So overall no, it's not really just about dealing with server rollbacks.

richvdh commented 2 months ago

> A desync solution could help with https://github.com/element-hq/element-meta/issues/2356 if the server included in-flight key IDs.

How would the server know which keys were in-flight though? The server knows that a given OTK has been claimed, but, particularly in the case of federation, it doesn't know any better than the receiving client whether that means there is a message on its way using that OTK, or if the client that claimed the OTK exploded in a ball of flames just after the claim, so the OTK will never actually be used. We could maybe have the server do some introspection of the to-device-message queue, but... eeeehrgh.

Incidentally, the main problem with https://github.com/element-hq/element-meta/issues/2356 is that the server can hand out OTKs in any order it likes, so the client has no idea whatsoever which keys it should consider useless. Fixing that is easy, and doesn't require any fancy synchronisation system.

> Re concurrency fails, there are other sources which we may end up hitting, e.g. multi-tab in Element-Web. We could fix each and every one of these issues, but that's not the point of this meta-issue. It's the same reason why we recover from wedged Olm sessions rather than, you know, "just" fixing the wedging in the first place.

Well, in both cases I'd say that we wouldn't (and shouldn't) have the recovery mechanism if the only reason for it was to cover up concurrency bugs. The downsides (papering over implementation bugs) would outweigh the advantages in that situation. The recovery mechanism is inherently racy so you're still going to have failures; you just make the underlying bugs harder to track down and fix.

So sure, we can implement a "resync" mechanism, but IMHO "it will help cover up concurrency bugs" is not a compelling reason to do so.


One of the reasons I'm trying to get a handle on what we're actually trying to fix here is that I think it will determine the design of a fix:

> For example, the server could send a hash of all the public keys (or some other O(1) checksum which does not scale with the number of OTKs on the server) down /sync, which would allow clients to check that the keys stored server-side are in fact the same keys they have stored.

I'm not convinced this hash mechanism will work. Given there can be a significant delay between an OTK being claimed and it being used in a message, it's not unusual for clients to have more keys stored than are still available on the server.

If we could agree that the thing we're trying to fix here is client- or server- rollbacks, then I think possibly what we need is a sequence number: the server can say "the most recent OTK I knew about was number 563". The client can then tell if the server has been rolled back and forgotten some (and the client should re-upload them), or if the client has been rolled back and forgotten some (and needs to tell the server to forget the lost keys asap). I think there's a lot of fiddly detail to sort out, but the principle is something like that.
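
As a rough illustration of that principle, assuming a hypothetical monotonically increasing upload counter that the server echoes back (nothing like this exists in the spec today, and the fiddly detail is deliberately ignored):

```rust
enum Desync {
    InSync,
    // Server forgot keys (server rollback): the client should re-upload the missing ones.
    ServerRolledBack { reupload_from: u64 },
    // Client forgot keys (client rollback): the server should discard the orphaned ones.
    ClientRolledBack { forget_after: u64 },
}

fn compare(client_latest: u64, server_latest: u64) -> Desync {
    if client_latest == server_latest {
        Desync::InSync
    } else if client_latest > server_latest {
        Desync::ServerRolledBack { reupload_from: server_latest + 1 }
    } else {
        Desync::ClientRolledBack { forget_after: client_latest }
    }
}

fn main() {
    // "The most recent OTK I knew about was number 563", says the server,
    // but this client has uploaded keys up to number 600.
    match compare(600, 563) {
        Desync::ServerRolledBack { reupload_from } => {
            println!("server rolled back: re-upload keys {reupload_from}..=600");
        }
        Desync::ClientRolledBack { forget_after } => {
            println!("client rolled back: ask the server to forget keys after {forget_after}");
        }
        Desync::InSync => println!("nothing to do"),
    }
}
```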

kegsay commented 3 weeks ago

Regarding server rollbacks, @uhoreg had a rather simple and elegant solution: just drop the OTKs table on rollback. This will cause an OTK count of 0 to be sent down /sync, prompting clients to upload new OTKs. Claiming keys in the interim would need to use the fallback key.
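
For illustration, a sketch of the moving parts this relies on, with made-up types rather than any real homeserver code: /keys/claim prefers a one-off OTK but falls back to the device's fallback key once the table has been dropped, and the zero count reported down /sync is what prompts the client to upload a fresh batch.

```rust
use std::collections::VecDeque;

struct DeviceKeys {
    one_time_keys: VecDeque<String>, // public OTKs; emptied by the rollback-recovery "drop"
    fallback_key: Option<String>,    // reusable key of last resort
}

// Prefer a real OTK (each is handed out at most once); once the table has been dropped
// or exhausted, hand out the fallback key instead.
fn claim(device: &mut DeviceKeys) -> Option<String> {
    if let Some(otk) = device.one_time_keys.pop_front() {
        return Some(otk);
    }
    device.fallback_key.clone()
}

// The count reported down /sync; a count of 0 is what prompts the client to upload a
// fresh batch of OTKs.
fn otk_count_for_sync(device: &DeviceKeys) -> usize {
    device.one_time_keys.len()
}

fn main() {
    // Immediately after the rollback recovery: the OTK table has been dropped.
    let mut device = DeviceKeys {
        one_time_keys: VecDeque::new(),
        fallback_key: Some("fallback-key".to_string()),
    };
    assert_eq!(otk_count_for_sync(&device), 0); // client sees 0 and re-uploads
    println!("claim served by: {:?}", claim(&mut device)); // fallback key in the interim
}
```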

kegsay commented 1 week ago

Thinking more on this, we could also apply a similar solution for clients: if the /keys/upload endpoint returns an HTTP 400 indicating "key already exists", delete all OTKs on the server for that device. This will then cause a fresh upload by the client, resyncing things.

We are almost certainly going to need this (or something similar) because we have a growing list of devices that have desynced their OTKs. Even if we fix the cause of the desync, it doesn't help that there is this ticking time bomb unless we delete the OTKs.
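
One possible reading of that proposal, sketched as server-side pseudologic (illustrative only, not Synapse code, and exactly the behaviour richvdh pushes back on below): when an uploaded key id collides with a stored key holding a different value, the server forgets every OTK for the device before returning the 400, so the client's next upload starts from a clean slate.

```rust
use std::collections::HashMap;

struct UploadError {
    status: u16,
    message: String,
}

// Proposed resync behaviour: on a key-id collision with a different key value, wipe all
// stored OTKs for the device and return the 400, so the next upload resyncs both sides.
fn handle_upload(
    stored: &mut HashMap<String, String>, // key id -> public key, for one device
    uploaded: HashMap<String, String>,
) -> Result<usize, UploadError> {
    for (key_id, public_key) in &uploaded {
        if let Some(existing) = stored.get(key_id) {
            if existing != public_key {
                stored.clear(); // forget everything for this device
                return Err(UploadError {
                    status: 400,
                    message: format!("one-time key {key_id} already exists"),
                });
            }
        }
    }
    stored.extend(uploaded);
    Ok(stored.len())
}

fn main() {
    let mut stored = HashMap::from([("signed_curve25519:AAAAHg".to_string(), "keyA".to_string())]);
    let conflicting = HashMap::from([("signed_curve25519:AAAAHg".to_string(), "keyB".to_string())]);
    match handle_upload(&mut stored, conflicting) {
        Err(e) => println!("HTTP {}: {} ({} OTKs left server-side)", e.status, e.message, stored.len()),
        Ok(n) => println!("upload accepted, {n} OTKs now stored"),
    }
}
```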

richvdh commented 1 week ago

> if the /keys/upload endpoint returns an HTTP 400 indicating "key already exists", delete all OTKs on the server for that device.

I think you've proposed this, or something very like it, before (https://github.com/matrix-org/matrix-rust-sdk/issues/1415#issuecomment-1881250093). I remain hard against it, at least until we have reasonable confidence that we've fixed the underlying bugs.

> it doesn't help that there is this ticking time bomb unless we delete the OTKs.

Which ticking time bomb are you referring to? https://github.com/element-hq/element-meta/issues/2356? I don't think that deleting OTKs on a "key already exists" upload error would do anything to help with that.

kegsay commented 9 hours ago

I've made https://github.com/matrix-org/matrix-spec-proposals/pull/4162 which I think addresses this.