Open pmaier1 opened 1 year ago
@pmaier1 I'm not really sure what's to be done here. We already have recovery from wedged olm sessions. Can you explain what precise scenario we're trying to cover here?
Well, to my understanding there still are cases where a user has to manually type /discardsession
in order to recover. The idea here was to automate this, if possible.
Hrm. /discardsession
replaces the megolm session, and megolm sessions can't really get "wedged" in the same way as olm sessions. /discardsession
won't do anything to help with a wedged olm session.
Generally I'd say we should figure out what cases a /discardsession
actually helps with, and propose real fixes for them on a case-by-case basis, rather than automating a /discardsession
(which I suspect would be hard, in any case).
Any idea what those cases are?
We only expect to see this when clients or servers need to rollback their database.
There are other cases where /discardsession
will help fix things, but those other cases should be fixable, whereas server/client rollbacks aren't.
The solution to this ticket is supposed to also solve https://github.com/element-hq/element-meta/issues/2155.
We think this would be usefull for robustness. But for now we focus efforts on finding root causes of wedging
It isn't made explicit anywhere above: I believe the intention of this issue is to improve the current olm session recovery: the current implementation does nothing to help with Olm messages that have already been sent. (Which is why /discardsession
helps: it ensures that the next message the user sends will cause a new megolm keyshare, over a new Olm session.)
As a recipient, if we detect a wedged olm session (which causes us to make a new olm session), we need to tell the sender about the situation so that they know they still need to send us the key. Ideally, the sender needs to send us all megolm keys they already tried to send, but at the very least they need to mark all existing keys as "not yet shared with this device".
As a partial solution to this, we could not worry about past messages, but at least improve the situation for future messages.
Currently: the olm-session unwedging doesn't help with existing megolm sessions until the sending user does a /discardsession
, or the megolm session is rotated for another reason (eg, someone leaving the room).
Instead we could:
m.dummy
message (as now), they instead send a more explicit "failure" message.
/discardsession
in this scenario, in a rather more efficient way.:warning: we would need to be careful to still rotate the megolm session if the device leaves the room after sending the "failure" notification, in case it was lying about the failure.
As a partial solution to this, we could not worry about past messages, but at least improve the situation for future messages....
I have split this out to https://github.com/element-hq/element-meta/issues/2389.
Some more context for the record. Issues like the following https://github.com/element-hq/synapse/issues/17117 (following mx.org outage), could result also in dlivery failures. This is a bug that need to be fixed, but currently there is no way to recover from it (until next session rotation)
I asked some questions about how we recover from wedged Olm sessions. The purpose of sending an m.dummy
Olm message is:
to try to make the receiver use the new session and hence not cause the sender to get a UTD from them by continuing to use the wedged session
There are potentially problems around this mechanism though. Rust SDK seems to not follow the spec which says:
If a client has multiple sessions established with another device, it should use the session from which it last received and successfully decrypted a message.
It seems to use creation time instead. We need to align on which is correct. In addition, vdh points out that client session timestamps have been corrupted so we likely need to consider that too.
In all cases, recovering from a wedged OIm session involves a new /keys/claim
which itself may cause UTDs which is particularly ironic, given we're doing this to fix UTDs.
High Level:
Explore a way to (selectively!) re-request encryption keys to increase crypto reliability. Given that olm session can get wedged, could it be possible find a way to recover from that: i.e be able to get back the failed to decrypt message.
IT IS NOT the re-implementation of key-requesting that was terrible (creating a lot of traffic and slowing down everything)
Scenario:
m.room_key
(not a forwarded key)The recipient would send the to_device failed to decrypt ack only to the exact sender of the to device! (different from key requesting that would ask to all the current user devices)