Users who join an encrypted room at the same time as a message is sent may receive a UTD (join/send race)

richvdh commented 5 months ago

Alice is in an encrypted room
Bob joins the room just before Alice sends a message
Alice's client doesn't know about Bob, so doesn't share the key with him
Bob observes a UTD.

This is exacerbated by federation, but the same thing can happen even on a single server. Ultimately, there is always a race between Alice's client calculating/fetching the membership list, and the message being sent.

Now: it's not clear that Bob should be able to decrypt a message that was sent 3 seconds after he joined the room — at a point where Alice didn't even know he was in the room — any more than he should be able to decrypt a message sent before he joined the room. The problem here is that the message is perceived as a product defect (and contributes to our "user saw a UISI" metric). So probably what we want to do is convey to Bob's client that it should not expect to be able to decrypt this message.

richvdh commented 5 months ago

I think we could address this with something like the following:

Step 1: make sure that Alice's client's idea of the room membership matches that of the room DAG at the point of the sent message. We could do this by having the client submit a hash of what it thinks the room membership is, when it sends the message. Alice's server can then cross-reference the hash with the actual room membership once it has decided where in the DAG to hang the new event. If there is a mismatch, the server rejects the send attempt and Alice's client resyncs the membership somehow. (If https://github.com/matrix-org/matrix-spec/issues/1209 (edit: and https://github.com/element-hq/synapse/issues/16940#issuecomment-1956348215) were fixed, that might just be a matter of having Alice's client redo the encryption attempt after a new /sync, but more generally a /members request might be necessary.)

Step 2: It now crosses over with historical room key sharing (cf https://github.com/element-hq/element-web/issues/26867), but in short: when an event is served to Bob by Bob's server, we could include an indication of whether Bob was a member of the room at the point in the DAG of the event. This would give Bob's client an indication of whether it should expect to be able to decrypt the event.

pmaier1 commented 5 months ago

Outcome from workshop today

Product view: Users who haven't fully joined a room when a message was sent are not supposed to be able to read the message. They should therefore not see the message (and should not get a UTD).
We need to take care of providing good UX for this

andybalaam commented 5 months ago

In a room with no history visibility, imagine these events happen (because there is a netsplit):

graph TD;
    Before-->Send;
    Send-->After;
    Before-->Join;
    Join-->After;

Before: User1 and User2 are members Send: User1 sends a message, encrypted for User2 Join: User3 joins the room After: netsplit is over

Imagine that Join happens well before Send in wall-clock time.

Does User3's client receive the event sent in Send? If so, it will always be a UTD because User3 was not a member at that point in the room graph, so it would be incorrect for User1 to send keys to decrypt it.

It would be great if we could have a test from @kegsay that demonstrates whether this event arrives at User3's client.

kegsay commented 5 months ago

I encountered this in the wild today. Matthew did not realise I was part of the room and hence did not encrypt his message for me. As a result, all my clients are unable to decrypt this message. See screenshots.

Element X Android: argh

Element Desktop:

In this case:

Matthew was on matrix.org
Hugh was on element.io
Hugh invited me at 12:37:50
Matthew sent his message at 12:38:13
A full 24s after he sends his message, my membership comes through.

Total lag: 47s.

I don't think the hash solution works here because, from matrix.org's pov:

Alice's server can then cross-reference the hash with the actual room membership once it has decided where in the DAG to hang the new event.

I am not in the room, hence the hash would always be valid (assuming the lag isn't due to intra-HS components). TODO: check the room to see if the room forked at this time to be sure.

I think we can all agree that in this case, it is an expected UTD and should be suppressed by the receiving client. The problem is how to reliably detect this. The lag could have been minutes or hours, causing >1 UTD from Matthew. Sending some extra information in the m.room.encrypted event feels like the only valid way to detect this, as only the client has the necessary data.

In the most naive solution, if the client added the entire set of user|device ID pairs to the event, my client could clearly see I'm not part of that list and hence suppress the event. However, large E2EE rooms with lots of members exist, as well as pathological users with many devices. This data could be large. We could use a bloom filter here to provided probablistic suppression, and then sent the bitmask with the event.

We can calculate the values here. Assuming we want:

1 in 10,000 chance of getting the wrong answer.
300 devices in total (total set of user|device pairs)
=> 719 bytes in the bitmask.

In extremely large rooms we could tweak these values:

1 in 100 change of getting the wrong answer.
4000 devices in total.
=> 4.68KB bitmask.

The sending logic would be something like:

Sending client collects user|device ID pairs.
Bitset all pairs into the bloom filter (latching the bit to 1)
Sends the final bitmask.

The receiving logic would be something like:

Receiving client gets bitmask.
Receiving client works out its user|device pair.
What is the value of that pair in the bloom filter?
- 1: set, therefore this message was sent for us.
- 0: unset, therefore this message was not sent for us, so suppress.

Critically, these properties means:

If the value is 0, the message definitely was not sent for us. This is important as we NEVER want to suppress an event which was sent for us.
If the value is 1, the message probably was sent for us (with the probabilities outlined above of guessing wrong).

ara4n commented 5 months ago

How about simply hiding all consecutive UTDs you receive after joining a room on the assumption that they're old messages which predate you joining?

kegsay commented 5 months ago

How about simply hiding all consecutive UTDs you receive after joining a room on the assumption that they're old messages which predate you joining?

Assuming the hiding is per-sender, rather than just the entire timeline (remember this was just you here having this problem), that won't reliably stop UTDs from being displayed. It definitely helps in this particular case (you joined and get UTDs) but there are related failure modes which this doesn't help with because there isn't a convenient start point (the join event) which you can use as a guide for when to start hiding messages. Notably, when you login on a new device, until that device is synchronised with the sender, you'll get UTDs from the sender. We could generalise this even further and just hide all UTDs in the room until you successfully decrypt a single event from that sender (though really it should be that device), but I worry that would mask a ton of real failure modes.

richvdh commented 5 months ago

How about simply hiding all consecutive UTDs you receive after joining a room on the assumption that they're old messages which predate you joining?

We did something similar in the past for historical messages, and it was extremely flaky. The problem with any heuristics around UTDs is that it's very easy to end up catching lots of other failure modes by accident, and hiding half the timeline.

kegsay commented 5 months ago

Clarity around https://github.com/element-hq/element-meta/issues/2268#issuecomment-1889756997 there's basically two orthogonal solutions being proposed, both of which help guard against UTDs:

The sender includes a hash of the member list. The sending server checks that this hash matches the server's view atomically when adding the event to the room DAG. If mismatch, reject the /send request. This hash does not need to be included in the event. This catches race conditions between the client and server where they disagree about member lists. I don't know how frequently this happens in the wild though, as the particular UTD I had was not due to this race.
The receiving server marks up the event (e.g in unsigned) to indicate if the user was joined to the room at the position in the DAG when the event was sent. This is basically Andy's graph https://github.com/element-hq/element-meta/issues/2268#issuecomment-1892412493 - and this is precisely the failure mode I had (thanks vdh for the viz). My client should never have got that message at all. To quote vdh:

The problem is that the timeline isn't linear, and information gets lost when you flatten the dag

richvdh commented 5 months ago

My client should never have got that message at all.

Or if it did (eg, because history visibility settings permit it), your server should have given your client a caveat along the lines of "your membership state was actually leave at the point of this event"

kegsay commented 5 months ago

your server should have given your client a caveat along the lines of "your membership state was actually leave at the point of this event"

This would be a lovely way to suppress UTDs of messages prior to joining in a more reliable way than entirely client-side imo. My concern with all of this though is it's only a partial solution. You can get the same failure mode when you login with a new device, and then the DAG markup solution falls short. The bloom filter approach would work for both joins and new devices, and handles race conditions through the entire stack (CS and federation). The downside is extra bandwidth consumed, but for very low bandwidth use cases you'd not be using E2EE at all anyway (as ciphertexts cannot be compressed well, and typically you'd rely on network-level encryption if you're that focused on byte saving).

richvdh commented 5 months ago

The downside is extra bandwidth consume

The other downsides are:

(a) It's probabalistic - the bloom filter cannot tell you with certainty if you were included in the recipient list
(b) Assuming the bloom filter tells you you weren't included, it doesn't tell you why. This is useful for UX and metrics.

I do think the bloom filter could be useful for some problems where we don't have better ideas (https://github.com/element-hq/synapse/issues/2165, maybe?) but I think there's room for both solutions.

kegsay commented 5 months ago

but I think there's room for both solutions.

I agree, but with limited time and developer effort, we surely want the biggest bang for our collective buck. All these solutions need MSCs. The sending hash would require client and server changes. The receiving server markup also requires client and server changes. Even with both of those, they don't help new device logins. The bloom filter just needs client changes, adding a few keys to the m.room.encrypted["content"] and helps for new device logins. The false positive rate can be tweaked, and given this is a UX solution, having a 1 in 10,000 chance of seeing it when these UTDs are pretty infrequent anyway may be acceptable enough to Product?

Assuming the bloom filter tells you you weren't included, it doesn't tell you why. This is useful for UX and metrics.

We could probably layer more bloom filters on top to include that information. E.g:

{
  "sent": 00010101... // the main "did I send this to your user|device key?" bloom filter
  "only_verified": 000101... // if you're in this map then this is because we only send to verified devices
  "no_otk": 0001... // if you're in this map then this is because there were no otks/fallback key available to use
}

This would then form multiple sets:

If in sent, hurray. If not..
In one of the other bloom filters? That's why. If not..
The sender didn't know about your user|device at the time the message was sent (e.g this github issue).

Because we assume these other filters are only going to contain a few elements if any, they can be very small whlist keep very low false positive rates: 1 in 10 million false positive for 20 entries is 84 bytes. This assumes that if you're in sent, you don't look in the other filters. You can also alternatively just keep the reason filters and not have the sent filter, in which case this becomes mostly diagnostic information.

richvdh commented 4 months ago

Step 2: It now crosses over with historical room key sharing (cf https://github.com/element-hq/element-web/issues/26867), but in short: when an event is served to Bob by Bob's server, we could include an indication of whether Bob was a member of the room at the point in the DAG of the event. This would give Bob's client an indication of whether it should expect to be able to decrypt the event.

https://github.com/matrix-org/matrix-spec-proposals/pull/4115 proposes a protocol change for this part of my idea.

uhoreg commented 4 months ago

Step 1: make sure that Alice's client's idea of the room membership matches that of the room DAG at the point of the sent message. We could do this by having the client submit a hash of what it thinks the room membership is, when it sends the message. Alice's server can then cross-reference the hash with the actual room membership once it has decided where in the DAG to hang the new event. If there is a mismatch, the server rejects the send attempt and Alice's client resyncs the membership somehow. ...

We have to be careful here: the sender needs to be aware that their message is going to be sent to people who were not in the room when they wrote the message. This may mean that we may need to do something like having the client prompt the sender to let them know about the membership changes and give them a chance to cancel sending.

element-hq / element-meta

Users who join an encrypted room at the same time as a message is sent may receive a UTD (join/send race) #2268