element-hq / element-meta

Shared/meta documentation and project artefacts for Element clients
66 stars 11 forks source link

Users who join an encrypted room at the same time as a message is sent may receive a UTD (join/send race) #2268

Open richvdh opened 5 months ago

richvdh commented 5 months ago

This is exacerbated by federation, but the same thing can happen even on a single server. Ultimately, there is always a race between Alice's client calculating/fetching the membership list, and the message being sent.

Now: it's not clear that Bob should be able to decrypt a message that was sent 3 seconds after he joined the room — at a point where Alice didn't even know he was in the room — any more than he should be able to decrypt a message sent before he joined the room. The problem here is that the message is perceived as a product defect (and contributes to our "user saw a UISI" metric). So probably what we want to do is convey to Bob's client that it should not expect to be able to decrypt this message.

richvdh commented 5 months ago

I think we could address this with something like the following:

Step 1: make sure that Alice's client's idea of the room membership matches that of the room DAG at the point of the sent message. We could do this by having the client submit a hash of what it thinks the room membership is, when it sends the message. Alice's server can then cross-reference the hash with the actual room membership once it has decided where in the DAG to hang the new event. If there is a mismatch, the server rejects the send attempt and Alice's client resyncs the membership somehow. (If https://github.com/matrix-org/matrix-spec/issues/1209 (edit: and https://github.com/element-hq/synapse/issues/16940#issuecomment-1956348215) were fixed, that might just be a matter of having Alice's client redo the encryption attempt after a new /sync, but more generally a /members request might be necessary.)

Step 2: It now crosses over with historical room key sharing (cf https://github.com/element-hq/element-web/issues/26867), but in short: when an event is served to Bob by Bob's server, we could include an indication of whether Bob was a member of the room at the point in the DAG of the event. This would give Bob's client an indication of whether it should expect to be able to decrypt the event.

pmaier1 commented 5 months ago

Outcome from workshop today

andybalaam commented 5 months ago

In a room with no history visibility, imagine these events happen (because there is a netsplit):

graph TD;
    Before-->Send;
    Send-->After;
    Before-->Join;
    Join-->After;

Before: User1 and User2 are members Send: User1 sends a message, encrypted for User2 Join: User3 joins the room After: netsplit is over

Imagine that Join happens well before Send in wall-clock time.

Does User3's client receive the event sent in Send? If so, it will always be a UTD because User3 was not a member at that point in the room graph, so it would be incorrect for User1 to send keys to decrypt it.

It would be great if we could have a test from @kegsay that demonstrates whether this event arrives at User3's client.

kegsay commented 5 months ago

I encountered this in the wild today. Matthew did not realise I was part of the room and hence did not encrypt his message for me. As a result, all my clients are unable to decrypt this message. See screenshots.

Element X Android: argh

Element Desktop:

Screenshot 2024-01-22 at 12 40 10

In this case:

Total lag: 47s.

I don't think the hash solution works here because, from matrix.org's pov:

Alice's server can then cross-reference the hash with the actual room membership once it has decided where in the DAG to hang the new event.

I am not in the room, hence the hash would always be valid (assuming the lag isn't due to intra-HS components). TODO: check the room to see if the room forked at this time to be sure.


I think we can all agree that in this case, it is an expected UTD and should be suppressed by the receiving client. The problem is how to reliably detect this. The lag could have been minutes or hours, causing >1 UTD from Matthew. Sending some extra information in the m.room.encrypted event feels like the only valid way to detect this, as only the client has the necessary data.

In the most naive solution, if the client added the entire set of user|device ID pairs to the event, my client could clearly see I'm not part of that list and hence suppress the event. However, large E2EE rooms with lots of members exist, as well as pathological users with many devices. This data could be large. We could use a bloom filter here to provided probablistic suppression, and then sent the bitmask with the event.

We can calculate the values here. Assuming we want:

In extremely large rooms we could tweak these values:

The sending logic would be something like:

The receiving logic would be something like:

Critically, these properties means:

ara4n commented 5 months ago

How about simply hiding all consecutive UTDs you receive after joining a room on the assumption that they're old messages which predate you joining?

kegsay commented 5 months ago

How about simply hiding all consecutive UTDs you receive after joining a room on the assumption that they're old messages which predate you joining?

Assuming the hiding is per-sender, rather than just the entire timeline (remember this was just you here having this problem), that won't reliably stop UTDs from being displayed. It definitely helps in this particular case (you joined and get UTDs) but there are related failure modes which this doesn't help with because there isn't a convenient start point (the join event) which you can use as a guide for when to start hiding messages. Notably, when you login on a new device, until that device is synchronised with the sender, you'll get UTDs from the sender. We could generalise this even further and just hide all UTDs in the room until you successfully decrypt a single event from that sender (though really it should be that device), but I worry that would mask a ton of real failure modes.

richvdh commented 5 months ago

How about simply hiding all consecutive UTDs you receive after joining a room on the assumption that they're old messages which predate you joining?

We did something similar in the past for historical messages, and it was extremely flaky. The problem with any heuristics around UTDs is that it's very easy to end up catching lots of other failure modes by accident, and hiding half the timeline.

kegsay commented 5 months ago

Clarity around https://github.com/element-hq/element-meta/issues/2268#issuecomment-1889756997 there's basically two orthogonal solutions being proposed, both of which help guard against UTDs:

richvdh commented 5 months ago

My client should never have got that message at all.

Or if it did (eg, because history visibility settings permit it), your server should have given your client a caveat along the lines of "your membership state was actually leave at the point of this event"

kegsay commented 5 months ago

your server should have given your client a caveat along the lines of "your membership state was actually leave at the point of this event"

This would be a lovely way to suppress UTDs of messages prior to joining in a more reliable way than entirely client-side imo. My concern with all of this though is it's only a partial solution. You can get the same failure mode when you login with a new device, and then the DAG markup solution falls short. The bloom filter approach would work for both joins and new devices, and handles race conditions through the entire stack (CS and federation). The downside is extra bandwidth consumed, but for very low bandwidth use cases you'd not be using E2EE at all anyway (as ciphertexts cannot be compressed well, and typically you'd rely on network-level encryption if you're that focused on byte saving).

richvdh commented 5 months ago

The downside is extra bandwidth consume

The other downsides are:

I do think the bloom filter could be useful for some problems where we don't have better ideas (https://github.com/element-hq/synapse/issues/2165, maybe?) but I think there's room for both solutions.

kegsay commented 5 months ago

but I think there's room for both solutions.

I agree, but with limited time and developer effort, we surely want the biggest bang for our collective buck. All these solutions need MSCs. The sending hash would require client and server changes. The receiving server markup also requires client and server changes. Even with both of those, they don't help new device logins. The bloom filter just needs client changes, adding a few keys to the m.room.encrypted["content"] and helps for new device logins. The false positive rate can be tweaked, and given this is a UX solution, having a 1 in 10,000 chance of seeing it when these UTDs are pretty infrequent anyway may be acceptable enough to Product?

Assuming the bloom filter tells you you weren't included, it doesn't tell you why. This is useful for UX and metrics.

We could probably layer more bloom filters on top to include that information. E.g:

{
  "sent": 00010101... // the main "did I send this to your user|device key?" bloom filter
  "only_verified": 000101... // if you're in this map then this is because we only send to verified devices
  "no_otk": 0001... // if you're in this map then this is because there were no otks/fallback key available to use
}

This would then form multiple sets:

Because we assume these other filters are only going to contain a few elements if any, they can be very small whlist keep very low false positive rates: 1 in 10 million false positive for 20 entries is 84 bytes. This assumes that if you're in sent, you don't look in the other filters. You can also alternatively just keep the reason filters and not have the sent filter, in which case this becomes mostly diagnostic information.

richvdh commented 4 months ago

Step 2: It now crosses over with historical room key sharing (cf https://github.com/element-hq/element-web/issues/26867), but in short: when an event is served to Bob by Bob's server, we could include an indication of whether Bob was a member of the room at the point in the DAG of the event. This would give Bob's client an indication of whether it should expect to be able to decrypt the event.

https://github.com/matrix-org/matrix-spec-proposals/pull/4115 proposes a protocol change for this part of my idea.

uhoreg commented 4 months ago

Step 1: make sure that Alice's client's idea of the room membership matches that of the room DAG at the point of the sent message. We could do this by having the client submit a hash of what it thinks the room membership is, when it sends the message. Alice's server can then cross-reference the hash with the actual room membership once it has decided where in the DAG to hang the new event. If there is a mismatch, the server rejects the send attempt and Alice's client resyncs the membership somehow. ...

We have to be careful here: the sender needs to be aware that their message is going to be sent to people who were not in the room when they wrote the message. This may mean that we may need to do something like having the client prompt the sender to let them know about the membership changes and give them a chance to cancel sending.