E2EE sending is slowed down badly by devices on distant/slow servers.

ara4n commented 1 year ago

Steps to reproduce

1. Send a message in a new megolm session to an E2EE room with lots of servers.
2. Discover that setting up Olm sessions and claiming keys on servers in the room which are distant or struggling can take a long time. This slows down message sending for all users in the room, even fast local users.

Therefore, bad servers have the ability to DoS message sending in E2EE rooms.
The problem is even worse if you retry setting up Olm to a dead server (or OTK-expired user) every time you try to send a new message in the room: https://github.com/vector-im/element-web/issues/26375

Outcome

What did you expect?

Message sending to healthy devices shouldn't be blocked behind slow sending to unhealthy devices.

Are we sure that we're doing the right thing by only showing a message as sent once the full Olm setup + megolm share process has completed?

Pros:
- By the time the message is sent, the user knows their client has encrypted and sent it to everyone it could successfully contact.
Cons:
- This could take ages, as servers which are slow/down/distant will take a long time to respond to /keys/claim requests, and the user may care more about getting the message to 'healthy' devices than about making those devices wait while we try to contact unhealthy ones.

I wonder whether a better UX would be to track the health of devices (e.g. ones where in the past we've been able to set up Olm in less than a second) and show messages as sent (and send the m.room.encrypted event) once we've shared keys to those healthy devices. We would then try to share with the unhealthy devices slowly in the background. This could be shown to the user as an additional send state per message, or a global "syncing keys" state for the client as a whole.

As a result, the app would feel faster to send messages, healthy clients would get them sooner, but on the other hand unhealthy clients might see UTDs for longer while waiting for the keys to eventually get sent.
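
A minimal sketch of the ordering described above, to make the trade-off concrete. It is a simulation only: `Device`, `share_key`, `send_message` and the `healthy` flag are hypothetical names rather than anything from an existing SDK, and the network calls are replaced by `asyncio.sleep`.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Device:
    device_id: str
    setup_delay: float  # simulated seconds to claim keys and set up Olm
    healthy: bool       # hypothetical flag from some health tracker


async def share_key(device: Device, log: list[str]) -> None:
    # Stand-in for claiming a one-time key, setting up Olm and sharing the megolm key.
    await asyncio.sleep(device.setup_delay)
    log.append(f"key:{device.device_id}")


async def send_message(devices: list[Device], log: list[str]) -> list[asyncio.Task]:
    healthy = [d for d in devices if d.healthy]
    unhealthy = [d for d in devices if not d.healthy]

    # Slow devices are handled in background tasks that we do not wait for here.
    background = [asyncio.create_task(share_key(d, log)) for d in unhealthy]

    # Only the healthy devices gate the room event.
    await asyncio.gather(*(share_key(d, log) for d in healthy))
    log.append("event:m.room.encrypted")
    return background


async def demo() -> tuple[list[str], list[str]]:
    log: list[str] = []
    devices = [
        Device("FAST1", 0.01, True),
        Device("FAST2", 0.02, True),
        Device("SLOW", 0.2, False),
    ]
    background = await send_message(devices, log)
    at_send_time = list(log)           # what had happened when the message showed as sent
    await asyncio.gather(*background)  # the slow share finishes later
    return at_send_time, list(log)


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In this simulation the room event goes out as soon as the two fast devices have the key, and the slow device only gets its key afterwards. That is exactly the window in which it would show a UTD.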

What happened instead?

Minor DoS vector, and slow E2EE.

Will you send logs?

No

richvdh commented 1 year ago

To be clear, it sounds like the proposal is:

When attempting to send an encrypted message, if there is at least one device in the room with which we already have an established Olm session:

- send the megolm key to those devices (if necessary),
- send the m.room.encrypted in-room message,
- mark the message as (partially?) sent

... before (or in parallel with) establishing Olm sessions for other devices and sending the megolm keys to them.

I think the main downside of this is that any devices in the room with whom we do not already have an established olm session will receive the room message before the keys, which they will perceive as a UISI.
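
The restated proposal and its downside can be sketched as a small planning function. This is an illustration under assumptions, not an existing API: `Device`, `plan_send`, the step names and the `may_see_uisi` field are all made up for the example. It also assumes that when no device has an established session, the current ordering is kept.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    device_id: str
    has_olm_session: bool  # True if an Olm session is already established


def plan_send(devices: list[Device]) -> dict:
    """Return the ordered steps for one send, plus the devices that may get the event before the key."""
    ready = [d.device_id for d in devices if d.has_olm_session]
    pending = [d.device_id for d in devices if not d.has_olm_session]

    if not ready:
        # Assumption: with no established session anywhere, keep the existing order
        # (set up Olm and share keys first, then send the event).
        return {
            "steps": [
                ("setup_olm_and_share_key", pending),
                ("send_event", []),
                ("mark_sent", []),
            ],
            "may_see_uisi": [],
        }

    return {
        "steps": [
            ("share_key", ready),                  # devices we can reach immediately
            ("send_event", []),                    # the m.room.encrypted event
            ("mark_partially_sent", []),
            ("setup_olm_and_share_key", pending),  # afterwards, or in parallel
        ],
        # These devices can receive the room event before the megolm key.
        "may_see_uisi": pending,
    }
```

For a room with devices A and C already set up and B not, the plan shares with A and C, sends the event, and lists B under `may_see_uisi`.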

richvdh commented 1 year ago

> track the health of devices (e.g. ones where in the past we've been able to set up Olm in less than a second)

This isn't terribly useful information. You only need to set up Olm once per device, so whether it happened quickly is pretty irrelevant.

ell1e commented 11 months ago

Sorry if my feedback here isn't wanted as an end user. But maybe the criterion could be something more dynamic: either A. the message has been sent to at least 25% of users in the room, but at least 5 (so it isn't too weird for small rooms), or B. the client has tried to send it for at least something like 2 seconds and has reached at least one other user's homeserver, so the internet is clearly not down. Whichever is met earlier, the message is then shown as sent. I don't think the average user in a large group chat expects every random niche homeserver to have been reached by the time the message is shown as sent, only that a reasonable subset got it with reasonable effort and that the whole network isn't down. Also, I've often seen people say that encrypted group chats are currently basically unusable for large rooms because of how sluggish they get, and I imagine this issue is potentially one of the bigger reasons for that.
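
One possible reading of the A-or-B criterion above, written as a pure function. The name `can_show_as_sent` and its parameters are hypothetical, and two details are my assumptions rather than part of the comment: the "at least 5" floor is capped at the room size so that rooms with fewer than 5 other users can still satisfy rule A, and `room_users` counts the other users in the room.

```python
import math


def can_show_as_sent(
    room_users: int,
    users_reached: int,
    seconds_elapsed: float,
    remote_servers_reached: int,
    min_fraction: float = 0.25,
    min_users: int = 5,
    min_seconds: float = 2.0,
) -> bool:
    # Rule A: reached at least 25% of users and at least 5 users,
    # capped at the room size (assumption) so tiny rooms are not stuck.
    needed = min(room_users, max(min_users, math.ceil(min_fraction * room_users)))
    rule_a = users_reached >= needed

    # Rule B: tried for about 2 seconds and reached at least one other
    # homeserver, so the network is evidently not down.
    rule_b = seconds_elapsed >= min_seconds and remote_servers_reached >= 1

    # "Whichever is met earlier": either rule is enough.
    return rule_a or rule_b
```

With these defaults, a 100-user room counts as sent once 25 users have been reached, or after 2 seconds if at least one remote homeserver has responded. It never counts as sent while no server has been reached and rule A is unmet.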

Edit: I think the average user would also assume that key propagation attempts continue if the message hasn't reached everyone yet, even if they close the client after the message was shown as sent. If that's too ripe for abuse to actually do, then at the very least it should be picked up again whenever the user's own client comes back up and the target client's homeserver is also online, unless it has already been tried unsuccessfully for something like literal hours.

richvdh commented 11 months ago

@ell1e I'm afraid I'm struggling to understand exactly what you are proposing. Could you maybe structure it more clearly?
