Open richvdh opened 5 months ago
I think we could address this with something like the following:
Step 1: make sure that Alice's client's idea of the room membership matches that of the room DAG at the point of the sent message. We could do this by having the client submit a hash of what it thinks the room membership is, when it sends the message. Alice's server can then cross-reference the hash with the actual room membership once it has decided where in the DAG to hang the new event. If there is a mismatch, the server rejects the send attempt and Alice's client resyncs the membership somehow. (If https://github.com/matrix-org/matrix-spec/issues/1209 (edit: and https://github.com/element-hq/synapse/issues/16940#issuecomment-1956348215) were fixed, that might just be a matter of having Alice's client redo the encryption attempt after a new /sync
, but more generally a /members
request might be necessary.)
Step 2: It now crosses over with historical room key sharing (cf https://github.com/element-hq/element-web/issues/26867), but in short: when an event is served to Bob by Bob's server, we could include an indication of whether Bob was a member of the room at the point in the DAG of the event. This would give Bob's client an indication of whether it should expect to be able to decrypt the event.
Outcome from workshop today
In a room with no history visibility, imagine these events happen (because there is a netsplit):
graph TD;
Before-->Send;
Send-->After;
Before-->Join;
Join-->After;
Before: User1 and User2 are members Send: User1 sends a message, encrypted for User2 Join: User3 joins the room After: netsplit is over
Imagine that Join happens well before Send in wall-clock time.
Does User3's client receive the event sent in Send
? If so, it will always be a UTD because User3 was not a member at that point in the room graph, so it would be incorrect for User1 to send keys to decrypt it.
It would be great if we could have a test from @kegsay that demonstrates whether this event arrives at User3's client.
I encountered this in the wild today. Matthew did not realise I was part of the room and hence did not encrypt his message for me. As a result, all my clients are unable to decrypt this message. See screenshots.
Element X Android:
Element Desktop:
In this case:
Total lag: 47s.
I don't think the hash solution works here because, from matrix.org's pov:
Alice's server can then cross-reference the hash with the actual room membership once it has decided where in the DAG to hang the new event.
I am not in the room, hence the hash would always be valid (assuming the lag isn't due to intra-HS components). TODO: check the room to see if the room forked at this time to be sure.
I think we can all agree that in this case, it is an expected UTD and should be suppressed by the receiving client. The problem is how to reliably detect this. The lag could have been minutes or hours, causing >1 UTD from Matthew. Sending some extra information in the m.room.encrypted
event feels like the only valid way to detect this, as only the client has the necessary data.
In the most naive solution, if the client added the entire set of user|device ID pairs to the event, my client could clearly see I'm not part of that list and hence suppress the event. However, large E2EE rooms with lots of members exist, as well as pathological users with many devices. This data could be large. We could use a bloom filter here to provided probablistic suppression, and then sent the bitmask with the event.
We can calculate the values here. Assuming we want:
In extremely large rooms we could tweak these values:
The sending logic would be something like:
The receiving logic would be something like:
Critically, these properties means:
How about simply hiding all consecutive UTDs you receive after joining a room on the assumption that they're old messages which predate you joining?
How about simply hiding all consecutive UTDs you receive after joining a room on the assumption that they're old messages which predate you joining?
Assuming the hiding is per-sender, rather than just the entire timeline (remember this was just you here having this problem), that won't reliably stop UTDs from being displayed. It definitely helps in this particular case (you joined and get UTDs) but there are related failure modes which this doesn't help with because there isn't a convenient start point (the join event) which you can use as a guide for when to start hiding messages. Notably, when you login on a new device, until that device is synchronised with the sender, you'll get UTDs from the sender. We could generalise this even further and just hide all UTDs in the room until you successfully decrypt a single event from that sender (though really it should be that device), but I worry that would mask a ton of real failure modes.
How about simply hiding all consecutive UTDs you receive after joining a room on the assumption that they're old messages which predate you joining?
We did something similar in the past for historical messages, and it was extremely flaky. The problem with any heuristics around UTDs is that it's very easy to end up catching lots of other failure modes by accident, and hiding half the timeline.
Clarity around https://github.com/element-hq/element-meta/issues/2268#issuecomment-1889756997 there's basically two orthogonal solutions being proposed, both of which help guard against UTDs:
/send
request. This hash does not need to be included in the event. This catches race conditions between the client and server where they disagree about member lists. I don't know how frequently this happens in the wild though, as the particular UTD I had was not due to this race.unsigned
) to indicate if the user was joined to the room at the position in the DAG when the event was sent. This is basically Andy's graph https://github.com/element-hq/element-meta/issues/2268#issuecomment-1892412493 - and this is precisely the failure mode I had (thanks vdh for the viz). My client should never have got that message at all. To quote vdh:
The problem is that the timeline isn't linear, and information gets lost when you flatten the dag
My client should never have got that message at all.
Or if it did (eg, because history visibility settings permit it), your server should have given your client a caveat along the lines of "your membership state was actually leave
at the point of this event"
your server should have given your client a caveat along the lines of "your membership state was actually leave at the point of this event"
This would be a lovely way to suppress UTDs of messages prior to joining in a more reliable way than entirely client-side imo. My concern with all of this though is it's only a partial solution. You can get the same failure mode when you login with a new device, and then the DAG markup solution falls short. The bloom filter approach would work for both joins and new devices, and handles race conditions through the entire stack (CS and federation). The downside is extra bandwidth consumed, but for very low bandwidth use cases you'd not be using E2EE at all anyway (as ciphertexts cannot be compressed well, and typically you'd rely on network-level encryption if you're that focused on byte saving).
The downside is extra bandwidth consume
The other downsides are:
I do think the bloom filter could be useful for some problems where we don't have better ideas (https://github.com/element-hq/synapse/issues/2165, maybe?) but I think there's room for both solutions.
but I think there's room for both solutions.
I agree, but with limited time and developer effort, we surely want the biggest bang for our collective buck. All these solutions need MSCs. The sending hash would require client and server changes. The receiving server markup also requires client and server changes. Even with both of those, they don't help new device logins. The bloom filter just needs client changes, adding a few keys to the m.room.encrypted["content"]
and helps for new device logins. The false positive rate can be tweaked, and given this is a UX solution, having a 1 in 10,000 chance of seeing it when these UTDs are pretty infrequent anyway may be acceptable enough to Product?
Assuming the bloom filter tells you you weren't included, it doesn't tell you why. This is useful for UX and metrics.
We could probably layer more bloom filters on top to include that information. E.g:
{
"sent": 00010101... // the main "did I send this to your user|device key?" bloom filter
"only_verified": 000101... // if you're in this map then this is because we only send to verified devices
"no_otk": 0001... // if you're in this map then this is because there were no otks/fallback key available to use
}
This would then form multiple sets:
sent
, hurray. If not..Because we assume these other filters are only going to contain a few elements if any, they can be very small whlist keep very low false positive rates: 1 in 10 million false positive for 20 entries is 84 bytes. This assumes that if you're in sent
, you don't look in the other filters. You can also alternatively just keep the reason filters and not have the sent
filter, in which case this becomes mostly diagnostic information.
Step 2: It now crosses over with historical room key sharing (cf https://github.com/element-hq/element-web/issues/26867), but in short: when an event is served to Bob by Bob's server, we could include an indication of whether Bob was a member of the room at the point in the DAG of the event. This would give Bob's client an indication of whether it should expect to be able to decrypt the event.
https://github.com/matrix-org/matrix-spec-proposals/pull/4115 proposes a protocol change for this part of my idea.
Step 1: make sure that Alice's client's idea of the room membership matches that of the room DAG at the point of the sent message. We could do this by having the client submit a hash of what it thinks the room membership is, when it sends the message. Alice's server can then cross-reference the hash with the actual room membership once it has decided where in the DAG to hang the new event. If there is a mismatch, the server rejects the send attempt and Alice's client resyncs the membership somehow. ...
We have to be careful here: the sender needs to be aware that their message is going to be sent to people who were not in the room when they wrote the message. This may mean that we may need to do something like having the client prompt the sender to let them know about the membership changes and give them a chance to cancel sending.
This is exacerbated by federation, but the same thing can happen even on a single server. Ultimately, there is always a race between Alice's client calculating/fetching the membership list, and the message being sent.
Now: it's not clear that Bob should be able to decrypt a message that was sent 3 seconds after he joined the room — at a point where Alice didn't even know he was in the room — any more than he should be able to decrypt a message sent before he joined the room. The problem here is that the message is perceived as a product defect (and contributes to our "user saw a UISI" metric). So probably what we want to do is convey to Bob's client that it should not expect to be able to decrypt this message.