element-hq / element-meta

Shared/meta documentation and project artefacts for Element clients
65 stars 11 forks source link

Key backups: some backups have incorrect MAC causing UTDs on login #2338

Closed kegsay closed 1 month ago

kegsay commented 3 months ago

Filing in element-meta as it's not clear which client is uploaded the bad backups currently. Part of https://github.com/element-hq/element-meta/issues/245

Context

We have had rageshakes on Element X where some messages are UTD. These messages are historical messages, which arrived prior to the device logging in. In this case, we expect the messages to be undecryptable if there is no key backup set up. However, these rageshakes do have key backup set up, and there is a key for the given room/session. Therefore, we expect the message to be decryptable but it isn't, hence the bug.

Problem

The key fails to decrypt with the following error: 2024-03-12T11:02:22.098722Z WARN matrix_sdk::encryption::backups: Couldn't decrypt a room key we downloaded from backups, session ID: ..., error: InvalidPadding(UnpadError) | crates/matrix-sdk/src/encryption/backups/mod.rs:469 | spans: root

The padding is in reference to PKCS Padding which is padding added to the plaintext prior to encryption. From a look at the underlying source this error can happen when:

Through some digging, it appears that the affected ciphertext is a multiple of the block size (16), indicating that the padding may be malformed. Needs confirmation.

If this is true, then some client somewhere is calculating padding incorrectly, causing EX to fail to decrypt the key backup, causing UTDs.

We have reason to believe libolm may be too lax as vdh puts it: "if the last byte is, say, 20, then libolm will happily decrypt it and strip off the last 20 bytes, but presumably another impl will reject it". This merely means that libolm based clients (e.g legacy Element-Web) will not return an error if the padding is malformed, rather than implying Element-Web is at fault for uploading a bad backup.

Frequency

The pad error has subsequently been seen on Element X iOS and Android clients on a range of homeservers, including matrix.org. There are at least 2 independent reports of this specifically mentioning the failure mode described e.g "Fresh login, unable to decrypt some of the latest messages". However, it clearly isn't prolific, as we have many many instances of users using key backup without issue, involving thousands of keys.

BillCarsonFr commented 3 months ago

@kegsay FWIW if you try to decrypt a room key with the wrong key you will get this same error InvalidPadding(UnpadError)

So it could also be a problem with a client having an incorrect key in cache?

poljar commented 3 months ago

We're going to return a more sensible error once https://github.com/matrix-org/matrix-rust-sdk/pull/3250 is merged.

dkasak commented 3 months ago

@davidegirardi was also hit by this bug, so we went into a deep rabbit hole (mostly in this thread) of trying to figure out a root cause. We hit a dead end before being able to do so, but we did eliminate a bunch of possibilities.

We first suspected that a client might be incorrectly calculating the padding (but otherwise encrypting the key entry correctly), but this was eliminated as an option by manually deriving the AES and MAC keys and then attempting to decrypt using low-level methods which don't error out on incorrect padding. If it was only the padding that was wrong, this should've gotten us a mostly reasonable plaintext with some incorrect padding. Instead, we got gibberish.

The remaining possibilities are that the encrypted key entry ciphertext was somehow corrupted as a whole or that the wrong decryption key is being tried, which should only happen if some client had inserted the encrypted key entry into the wrong backup version.

BillCarsonFr commented 2 months ago

We're going to return a more sensible error once matrix-org/matrix-rust-sdk#3250 is merged.

The error looks to be now


Failed to decrypt megolm session from backup The MAC of the ciphertext didn't pass validation MAC tag mismatch
Error: The MAC of the ciphertext didn't pass validation MAC tag mismatch
dkasak commented 2 months ago

On Friday 2024-04-12, we spotted a client using two different backup keys for the same user and backup version which could also explain this bug.

More generally, there seem to be several spots in the Rust SDK backup code where we could ostensibly desync on the backup public key due to using too few checks when enabling backups.

One example is here, where the first if branch matches only on backup versions rather than checking for a matching backup info. So if for some reason two different processes end up with two different backups that both have version 2, we could desync. IMO, that first branch condition should be changed to also check the backup public key.

I can see this potentially happening in case a user deletes a backup manually and then creates a new one, in which case I think we'll reuse the same backup version for different backups. (No, we won't, at least not in Synapse. The rest of the point that follows still stands in theory, because different server implementations could make this mistake) The root cause of this is of course that backup versions are a footgun: they should either be the backup public key or deterministically derived from one, not a separate piece of information.

Another hypothetical way this could happen is due to a server bug.

EDIT (2024-04-20): While the above is not wrong, we now know that (most?) occurrences of this bug were due to https://github.com/matrix-org/internal-config/issues/1518.

kegsay commented 2 months ago

Indeed, we know the root cause here and a fix is incoming.

andybalaam commented 1 month ago

Fixed in the EXA and EXI releases from earlier this week.

richvdh commented 3 weeks ago

Since I don't see it explained here, the problem here was that we were using the secret key to encrypt backed up keys, instead of the public key. (The intention of key backup is that any device should be able to add keys to your key backup, whereas only a verified device should be able to fetch keys; hence the use of a public key to add new keys).

The same problem caused https://github.com/advisories/GHSA-9ggc-845v-gcgv. The fix was https://github.com/matrix-org/matrix-rust-sdk/commit/11de0449face33adfcd4faa976b3a9bbbcf4bc5b (backported into matrix-sdk-crypto 0.7.1 by https://github.com/matrix-org/matrix-rust-sdk/pull/3402).