element-hq / element-meta

Shared/meta documentation and project artefacts for Element clients
65 stars 11 forks source link

Ensure that existing megolm sessions are re-shared after a wedged olm session #2389

Closed richvdh closed 4 days ago

richvdh commented 2 months ago

TL;DR: implement a partial solution to https://github.com/element-hq/element-meta/issues/1992 which at least fixes the problem for future messages.


Suppose a recipient fails to decrypt an Olm message, due to either database rollback or other Olm implementation errors such as database corruption. Obviously, if the Olm message contained a Megolm key, the recipient is then unable to decrypt any messages encrypted with that Megolm session.

Currently, when this happens, the recipient attempts an "unwedging" operation, which involves creating a new Olm session (and sending a dummy message over it). This should help for any future Megolm sessions, but does nothing to help with any Megolm sessions whose keys have already been shared over the broken Olm session. Indeed, since an existing Megolm session can be used for some time, the sender may continue to send room messages that the recipient cannot decrypt for some time after a "successful" unwedging operation. (In this case, a manual /discardsession on the sender can provide a workaround.)

A full solution to this problem involves the sender re-sharing keys for all Megolm messages that have already been sent over the broken session. This is tracked at https://github.com/element-hq/element-meta/issues/1992. However, that full operation is somewhat thorny, as it means keeping good records of exactly which Megolm keys have been shared with which recipient devices, and at what ratchet state.

As a partial solution, we could not worry about past messages, but at least improve the situation for future messages.

So, we could:

:warning: we would need to be careful to still rotate the megolm session if the device leaves the room after sending the "failure" notification, in case it was lying about the failure.


I think this would be easier than a full solution to https://github.com/element-hq/element-meta/issues/1992, whilst still providing a useful base on which to build a full solution.

richvdh commented 2 weeks ago

@uhoreg wrote:

I think that there are two main ways to fix this:

  1. when we get an m.dummy from someone, we search through all outbound group sessions, and mark them as "needs-resharing" for that group session if we've already shared with them. "needs-reshare" counts as not-shared when determining whether we should share the session with a device, but counts as shared when determining whether the session needs to be rotated. This should be fairly simple to do, but could be slow if there are many outbound group sessions. The main problem is that we don't seem to have any function to fetch all outbound group sessions, so we'd need a new function in all the CryptoStore implementations.

  2. for each device, we store some sort of counter. When we share a group session with a device, we store the current value of the counter with the group session. When we check if a group session is shared with a device, we compare the stored counter for that device (if available) with the device's current counter. If the device's current counter is bigger than the counter stored with the group session, it needs to be re-shared. When we get an m.dummy from someone, we increment that device's counter. This may be a bit more complicated, but it won't matter how many outbound sessions there are.

I think I agree, though it's not completely obvious to me that fetching all outbound group sessions is going to be prohibitively slow (It's worth noting that we have a maximum of one OGS per (encrypted) room, since whenever we create a new session in a room it replaces the old one. So maybe a few hundred?)

That said, even a few hundred could take a while to iterate, and it would be better to avoid it. The counter mechanism you suggest is elegant and has similarity to our solution to slow key backup reset (https://github.com/element-hq/element-web/issues/26892).

I think it should be fairly easy to extend the SharedWith struct to include a counter, and likewise to add a counter to (the confusingly-named) ReadOnlyDevice.

uhoreg commented 1 week ago

The approach that I took in https://github.com/matrix-org/matrix-rust-sdk/pull/3604 was to check, when we decrypt an Olm message, to see whether that Olm message starts a brand now Olm session that we haven't seen before. If this happens, then we consider it to be an unwedging attempt from the other party. This should be more robust that checking that the event is an m.dummy, and also allows us to keep treating m.dummy itself as a no-op.