element-hq / element-meta

Shared/meta documentation and project artefacts for Element clients
66 stars 11 forks source link

UTD: Some clients are encrypting Olm messages with the same ratchet key and chain index #2302

Closed kegsay closed 2 months ago

kegsay commented 4 months ago

Describe the bug @Hywan completely nerd sniped me. Blame him. He pointed me at this UTD rageshake on EIX. It turns out that the cause was with the Olm session between the two devices, not the delivery of to-device events or room keys. For some reason, the Olm session had "wedged". These are the reasons why sessions may wedge, so I set about ruling out the sender being at fault.

I collected all undelivered to-device events on all sliding sync shards and analysed the message content. The ciphertext has a particular format and there should be only 1 event with the same (sender key, chain index, ratchet key). If there were multiple, this indicates using the same key for encrypting multiple events. The receiver won't know this has happened, and so the Olm session will wedge. The results:

When this happens, the receiver will see a UTD. Wedged session recovery should fix this for subsequent messages.

To Reproduce

Unsure. EI and ED don't really share any code, so this feels like a systemic problem with how the SDKs are being used.

Expected behavior Olm messages should always be decryptable.


My working theory on this:

I've looked at the JS SDK and cannot cause the read-modify-write to trip up. I cannot get indexeddb locking to fail cross tab / web worker on Firefox or Chrome either (with a test rig that increments an integer in a loop). Worryingly though, both browsers do not provide durable writes:

The complete event may thus be delivered quicker than before, however, there exists a small chance that the entire transaction will be lost if the OS crashes or there is a loss of system power before the data is flushed to disk. Since such catastrophic events are rare, most consumers should not need to concern themselves further.

https://developer.mozilla.org/en-US/docs/Web/API/IDBTransaction#firefox_durability_guarantees

It's important to note that strict does not ensure that changes are actually written immediately to disk. After a site calls put(), there's still some finite amount of time during which a power failure could cause the change to not make it to disk and therefore be missing the next time the app runs.

https://developer.chrome.com/blog/indexeddb-durability-mode-now-defaults-to-relaxed

I'm unsure how much this matters in practice, as if there is a power failure, it's reasonable to assume that the newly encrypted event did not have enough time to be sent.

richvdh commented 4 months ago

Possible EW cause: https://github.com/element-hq/element-web/issues/25157

I don't really understand how that issue, which was closed back in August, could cause this?

richvdh commented 4 months ago

AFAICT The concern over writes not being durable is not relevant for cross-tab locking. (If a claim to exclusive ownership of the indexeddb is lost due to power-failure / process crash, then we can also be sure that the now-dead process is not mutating the indexeddb, so it doesnt matter that the claim is lost.)

However, it is relevant for updates to the Olm Session in indexeddb. It seems like we should be setting durability:strict on indexeddb transactions that bump the Olm sender ratchet.

richvdh commented 2 months ago

We've now tracked down a big cause of this in Element X iOS in https://github.com/matrix-org/matrix-rust-sdk/issues/3310. I also have suspicions that https://github.com/matrix-org/matrix-rust-sdk/issues/3313 could be a source of problems.

I have also opened https://github.com/matrix-org/matrix-rust-sdk/issues/3354 regarding the indexeddb durability mode.

I don't really think this issue is giving much value: it's somewhat handwavey and unactionable. I'm going to close it in favour of the above issues.