Closed kegsay closed 2 months ago
Possible EW cause: https://github.com/element-hq/element-web/issues/25157
I don't really understand how that issue, which was closed back in August, could cause this?
AFAICT The concern over writes not being durable is not relevant for cross-tab locking. (If a claim to exclusive ownership of the indexeddb is lost due to power-failure / process crash, then we can also be sure that the now-dead process is not mutating the indexeddb, so it doesnt matter that the claim is lost.)
However, it is relevant for updates to the Olm Session in indexeddb. It seems like we should be setting durability:strict
on indexeddb transactions that bump the Olm sender ratchet.
We've now tracked down a big cause of this in Element X iOS in https://github.com/matrix-org/matrix-rust-sdk/issues/3310. I also have suspicions that https://github.com/matrix-org/matrix-rust-sdk/issues/3313 could be a source of problems.
I have also opened https://github.com/matrix-org/matrix-rust-sdk/issues/3354 regarding the indexeddb durability mode.
I don't really think this issue is giving much value: it's somewhat handwavey and unactionable. I'm going to close it in favour of the above issues.
Describe the bug @Hywan completely nerd sniped me. Blame him. He pointed me at this UTD rageshake on EIX. It turns out that the cause was with the Olm session between the two devices, not the delivery of to-device events or room keys. For some reason, the Olm session had "wedged". These are the reasons why sessions may wedge, so I set about ruling out the sender being at fault.
I collected all undelivered to-device events on all sliding sync shards and analysed the message content. The ciphertext has a particular format and there should be only 1 event with the same (sender key, chain index, ratchet key). If there were multiple, this indicates using the same key for encrypting multiple events. The receiver won't know this has happened, and so the Olm session will wedge. The results:
type: 1
(already established Olm sessions),sender_key
to a device ID and hence display name shows the affected clients are:When this happens, the receiver will see a UTD. Wedged session recovery should fix this for subsequent messages.
To Reproduce
Unsure. EI and ED don't really share any code, so this feels like a systemic problem with how the SDKs are being used.
Expected behavior Olm messages should always be decryptable.
My working theory on this:
I've looked at the JS SDK and cannot cause the read-modify-write to trip up. I cannot get indexeddb locking to fail cross tab / web worker on Firefox or Chrome either (with a test rig that increments an integer in a loop). Worryingly though, both browsers do not provide durable writes:
https://developer.mozilla.org/en-US/docs/Web/API/IDBTransaction#firefox_durability_guarantees
https://developer.chrome.com/blog/indexeddb-durability-mode-now-defaults-to-relaxed
I'm unsure how much this matters in practice, as if there is a power failure, it's reasonable to assume that the newly encrypted event did not have enough time to be sent.