element-hq / element-meta

Shared/meta documentation and project artefacts for Element clients

Gracefully recover from wedged session (discard session for next distribution, or reshare as m.room.key) #1992

Open pmaier1 opened 1 year ago

pmaier1 commented 1 year ago

High Level:

Explore a way to (selectively!) re-request encryption keys to increase crypto reliability. Given that Olm sessions can get wedged, could we find a way to recover from that, i.e. be able to get back the message that failed to decrypt?

This is NOT a re-implementation of key requesting, which was terrible (it created a lot of traffic and slowed everything down).

Scenario:

The recipient would send the to-device "failed to decrypt" ack only to the exact sender of the to-device message! (This differs from key requesting, which would ask all of the current user's devices.)
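To illustrate the targeting described above, here is a minimal sketch of a `sendToDevice` request body addressed to a single device. The `messages` map keyed by user ID and then device ID is the shape used by the Matrix client-server API; the event type `org.example.olm_decryption_failure`, the `reporting_device_id` field, and the function name are assumptions made up for this example and are not defined anywhere in the spec or in this issue.

```python
import json

# Hypothetical event type, invented for this sketch; not part of the Matrix spec.
FAILURE_EVENT_TYPE = "org.example.olm_decryption_failure"


def build_failure_notice(sender_user_id, sender_device_id, my_device_id):
    """Build a sendToDevice body that targets exactly one device.

    The body would be sent as the payload of
    PUT /_matrix/client/v3/sendToDevice/{eventType}/{txnId}.
    """
    content = {
        # Assumed field name: lets the sender know which of our devices failed.
        "reporting_device_id": my_device_id,
    }
    # One user, one device: no wildcard, no fan-out to other devices.
    return {"messages": {sender_user_id: {sender_device_id: content}}}


body = build_failure_notice("@alice:example.org", "ALICEDEVICE", "BOBDEVICE")
print(FAILURE_EVENT_TYPE)
print(json.dumps(body, indent=2))
```

The point of the sketch is only the addressing: a single `(user, device)` entry, rather than a request spread across many devices.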

richvdh commented 11 months ago

@pmaier1 I'm not really sure what's to be done here. We already have recovery from wedged olm sessions. Can you explain what precise scenario we're trying to cover here?

pmaier1 commented 11 months ago

Well, to my understanding there still are cases where a user has to manually type /discardsession in order to recover. The idea here was to automate this, if possible.

richvdh commented 11 months ago

Hrm. /discardsession replaces the megolm session, and megolm sessions can't really get "wedged" in the same way as olm sessions. /discardsession won't do anything to help with a wedged olm session.

Generally I'd say we should figure out what cases a /discardsession actually helps with, and propose real fixes for them on a case-by-case basis, rather than automating a /discardsession (which I suspect would be hard, in any case).

Any idea what those cases are?

kegsay commented 8 months ago

We only expect to see this when clients or servers need to rollback their database.

There are other cases where /discardsession will help fix things, but those other cases should be fixable, whereas server/client rollbacks aren't.

pmaier1 commented 8 months ago

The solution to this ticket is supposed to also solve https://github.com/element-hq/element-meta/issues/2155.

BillCarsonFr commented 7 months ago

We think this would be useful for robustness, but for now we are focusing our efforts on finding the root causes of wedging.

richvdh commented 6 months ago

It isn't made explicit anywhere above: I believe the intention of this issue is to improve the current olm session recovery: the current implementation does nothing to help with Olm messages that have already been sent. (Which is why /discardsession helps: it ensures that the next message the user sends will cause a new megolm keyshare, over a new Olm session.)

As a recipient, if we detect a wedged olm session (which causes us to make a new olm session), we need to tell the sender about the situation so that they know they still need to send us the key. Ideally, the sender needs to send us all megolm keys they already tried to send, but at the very least they need to mark all existing keys as "not yet shared with this device".
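As a rough sketch of the "at the very least" option above, the sender could keep a record of which devices each outbound megolm session has been shared with, and forget those records for a device that reports a wedged Olm session. The class and method names below are hypothetical and do not correspond to any existing SDK API.

```python
from collections import defaultdict


class ShareTracker:
    """Tracks which (user, device) pairs each outbound megolm session was shared with."""

    def __init__(self):
        # session_id -> set of (user_id, device_id)
        self._shared = defaultdict(set)

    def mark_shared(self, session_id, user_id, device_id):
        self._shared[session_id].add((user_id, device_id))

    def is_shared(self, session_id, user_id, device_id):
        # .get() avoids creating empty entries for unknown sessions.
        return (user_id, device_id) in self._shared.get(session_id, set())

    def mark_unshared_for_device(self, user_id, device_id):
        """Forget every share with this device; return the affected session IDs."""
        affected = []
        for session_id, devices in self._shared.items():
            if (user_id, device_id) in devices:
                devices.discard((user_id, device_id))
                affected.append(session_id)
        return sorted(affected)


tracker = ShareTracker()
tracker.mark_shared("megolm1", "@bob:example.org", "BOBDEVICE")
tracker.mark_shared("megolm2", "@bob:example.org", "BOBDEVICE")
tracker.mark_shared("megolm2", "@carol:example.org", "CAROLDEVICE")

# Bob's device tells us its Olm session was wedged.
affected = tracker.mark_unshared_for_device("@bob:example.org", "BOBDEVICE")
print(affected)
```

With the shares forgotten, the next message sent in either session would find Bob's device in the "not yet shared" state and send it the key again, this time over the new Olm session.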

richvdh commented 5 months ago

As a partial solution to this, we could not worry about past messages, but at least improve the situation for future messages.

Currently: the olm-session unwedging doesn't help with existing megolm sessions until the sending user does a /discardsession, or the megolm session is rotated for another reason (eg, someone leaving the room).

Instead we could:

:warning: we would need to be careful to still rotate the megolm session if the device leaves the room after sending the "failure" notification, in case it was lying about the failure.
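One way to picture the caveat above is to keep two separate records per outbound megolm session: the devices that may ever have received the key, and the devices we still owe a re-send to. A failure report then only adds to the second set, so a later leave still triggers rotation. This is a sketch under that assumption; the names are hypothetical and the real bookkeeping in any SDK may look quite different.

```python
from dataclasses import dataclass, field


@dataclass
class OutboundSession:
    session_id: str
    # Devices that may hold the key, even if they later claimed a failure.
    ever_shared_with: set = field(default_factory=set)
    # Devices that reported a failure and should be sent the key again.
    pending_reshare: set = field(default_factory=set)

    def share_with(self, device):
        self.ever_shared_with.add(device)
        self.pending_reshare.discard(device)

    def report_failure(self, device):
        # Deliberately do NOT remove the device from ever_shared_with:
        # the report could be untrue, and the device may hold the key.
        if device in self.ever_shared_with:
            self.pending_reshare.add(device)

    def must_rotate_on_leave(self, device):
        # Rotation depends on whether the key may have reached the device,
        # not on what the device later told us.
        return device in self.ever_shared_with


session = OutboundSession("megolm1")
session.share_with(("@bob:example.org", "BOBDEVICE"))
session.report_failure(("@bob:example.org", "BOBDEVICE"))
print(session.must_rotate_on_leave(("@bob:example.org", "BOBDEVICE")))
```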

richvdh commented 5 months ago

As a partial solution to this, we could not worry about past messages, but at least improve the situation for future messages....

I have split this out to https://github.com/element-hq/element-meta/issues/2389.

BillCarsonFr commented 5 months ago

Some more context for the record: issues like https://github.com/element-hq/synapse/issues/17117 (following the mx.org outage) could also result in delivery failures. This is a bug that needs to be fixed, but currently there is no way to recover from it (until the next session rotation).

kegsay commented 5 months ago

I asked some questions about how we recover from wedged Olm sessions. The purpose of sending an m.dummy Olm message is:

to try to make the receiver use the new session and hence not cause the sender to get a UTD from them by continuing to use the wedged session

There are potentially problems around this mechanism though. Rust SDK seems to not follow the spec which says:

If a client has multiple sessions established with another device, it should use the session from which it last received and successfully decrypted a message.

It seems to use creation time instead. We need to align on which is correct. In addition, vdh points out that client session timestamps have been corrupted so we likely need to consider that too.
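To make the difference between the two selection rules concrete, here is a small sketch comparing "session we last successfully decrypted with" against "most recently created session". The record type, field names, and the tie-breaking rule for sessions that have never decrypted anything are assumptions for this example; it does not reflect the actual data model of any SDK.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OlmSessionInfo:
    session_id: str
    created_at: int                    # assumed: timestamp in ms
    last_decrypted_at: Optional[int]   # None if nothing was ever decrypted


def pick_by_last_decrypted(sessions):
    """Rule quoted from the spec: prefer the session we last decrypted a message with."""
    # Assumption: sessions that never decrypted anything rank lowest,
    # with creation time as a tie-breaker.
    return max(
        sessions,
        key=lambda s: (
            s.last_decrypted_at if s.last_decrypted_at is not None else -1,
            s.created_at,
        ),
    )


def pick_by_creation(sessions):
    """Alternative rule: prefer the most recently created session."""
    return max(sessions, key=lambda s: s.created_at)


sessions = [
    OlmSessionInfo("old", created_at=100, last_decrypted_at=500),
    OlmSessionInfo("new", created_at=400, last_decrypted_at=None),
]

print(pick_by_last_decrypted(sessions).session_id)
print(pick_by_creation(sessions).session_id)
```

In this example the two rules disagree: the first keeps using the older session, while the second switches to the newly created one. Both rules depend on stored timestamps, so corrupted timestamps would affect either choice.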

In all cases, recovering from a wedged Olm session involves a new /keys/claim, which itself may cause UTDs. That is particularly ironic, given we're doing this to fix UTDs.

BillCarsonFr commented 4 months ago

Crosslink https://github.com/matrix-org/matrix-rust-sdk/issues/3356