element-hq / element-call

Group calls powered by Matrix
https://call.element.io
Apache License 2.0
554 stars 88 forks source link

Knocking failed #2342

Closed ara4n closed 3 months ago

ara4n commented 4 months ago

Steps to reproduce

  1. Amandine tried to knock
  2. @AndrewFerr didn't see the knock to let her in
  3. She was lost, LOST, lost

Outcome

What did you expect?

Amandine not to get lost

What happened instead?

She was lost

Operating system

No response

Browser information

No response

URL for webapp

call.guest.ess-ecall-poc.ems-support.element.dev

Will you send logs?

Yes

toger5 commented 4 months ago

Did it eventually converge? @AndrewFerr what was the knocking federation issue talked about last week? Could that be realted?

AndrewFerr commented 4 months ago

@AndrewFerr what was the knocking federation issue talked about last week? Could that be realted?

No, that was the spec issue of knock rejections/cancellations not working over federation.

Amandine's member events on each server are as follows:

Guest homeserver

This is the homeserver that A used to create her account & send the knock.

{
  "type": "m.room.member",
  "sender": "@af:ess-ecall-poc.ems-support.element.dev",
  "content": {
    "displayname": "Amandine",
    "membership": "invite"
  },
  "state_key": "@cyan-scientific-wallaby:guest.ess-ecall-poc.ems-support.element.dev",
  "origin_server_ts": 1714146824085,
  "unsigned": {
    "age": 9
  },
  "event_id": "$_EcVT4f8HiexSmVxYgBe7YksV0O3pVfR5D6gSCztbtM",
  "room_id": "!kZfediMOZQzfcRWOZN:ess-ecall-poc.ems-support.element.dev",
  "user_id": "@af:ess-ecall-poc.ems-support.element.dev",
  "age": 9
}

Curiously, the initial "membership": "knock" event is missing...

Main homeserver

This is the homeserver where I accepted knock requests.

{
  "type": "m.room.member",
  "sender": "@cyan-scientific-wallaby:guest.ess-ecall-poc.ems-support.element.dev",
  "content": {
    "avatar_url": null,
    "membership": "knock",
    "displayname": "Amandine"
  },
  "state_key": "@cyan-scientific-wallaby:guest.ess-ecall-poc.ems-support.element.dev",
  "origin_server_ts": 1714146737878,
  "unsigned": {
    "age": 6
  },
  "event_id": "$N19pasFE4QhwEiZu3kiCaECHZkOsNyfM7iMUJbTfGI4",
  "room_id": "!kZfediMOZQzfcRWOZN:ess-ecall-poc.ems-support.element.dev",
  "user_id": "@cyan-scientific-wallaby:guest.ess-ecall-poc.ems-support.element.dev",
  "age": 6
}
{
  "type": "m.room.member",
  "sender": "@af:ess-ecall-poc.ems-support.element.dev",
  "content": {
    "displayname": "Amandine",
    "membership": "invite"
  },
  "state_key": "@cyan-scientific-wallaby:guest.ess-ecall-poc.ems-support.element.dev",
  "origin_server_ts": 1714146824085,
  "unsigned": {
    "replaces_state": "$N19pasFE4QhwEiZu3kiCaECHZkOsNyfM7iMUJbTfGI4",
    "prev_content": {
      "avatar_url": null,
      "displayname": "Amandine",
      "membership": "knock"
    },
    "prev_sender": "@cyan-scientific-wallaby:guest.ess-ecall-poc.ems-support.element.dev",
    "age": 597
  },
  "event_id": "$_EcVT4f8HiexSmVxYgBe7YksV0O3pVfR5D6gSCztbtM",
  "room_id": "!kZfediMOZQzfcRWOZN:ess-ecall-poc.ems-support.element.dev"
}

So it turns out that I did see & accept A's invite after all: image

...and the real issue is that the guest homeserver somehow didn't see A's knock!

ara4n commented 4 months ago

this is a little terrifying, and a great smoking gun to pursue. perhaps @kegsay could suggest a way to use complement to provide a failing test

AndrewFerr commented 4 months ago

@fkwp and I discussed this, and our leading theory is that this might be knocking federation issue after all.

When a user sends a federated knock, they still can't sync any events in that room until they're invited to it. So what might have happened is that the ECall client somehow forgot about its own knock & was unable to remember it via a sync, and thus didn't know what to do with the invite it eventually received.

We should investigate how ECall remembers knocks. Ideally, it should have an accurate local cache of who sent knocks to which rooms & never needs to ask the server about them (thus bypassing the federation issue).

We should also consider how ECall should respond to invites for rooms that a user hasn't sent a knock to. If ECall were to handle/accept arbitrary invites, then the case of a knock being forgotten would be recoverable, as the end goal of a user being invited to a room one way or another would still happen.

AndrewFerr commented 4 months ago

Turns out that /sync makes a special consideration for knocks:

The knock will appear as an entry in the response of the /sync API. [source]

This is good news because it means ECall (or any client for that matter) doesn't have to keep its own cache of knocks it has sent. Even when refreshing the page, ECall remembers when it sent a knock, so it must be figuring that out via /sync.

But once a knock is responded to with an invite, ECall will respond to it only if the page is left untouched. Refreshing the page (and presumably whatever condition caused A's failure case to happen) has the next /sync retrieve an invite instead of a knock, which ECall doesn't treat as a condition to join the room, so it offers to knock again -- which will fail with a 403 because the user is already invited!

So, probably the best solution is to make ECall smarter about when to respond to a /synced invite with a join (or an offer to join), which is what the spec recommends anyhow:

Homeservers are permitted to automatically accept invites as a result of knocks as they should be aware of the user’s intent to join the room. If the homeserver is not auto-accepting invites (or there was an unrecoverable problem with accepting it), the invite is expected to be passed down normally to the client to handle. [source]

toger5 commented 4 months ago

Since we do know which room the user is trying to join/ (its in the url) we just first see if there is an invite and then knock? Is it that simple?

AndrewFerr commented 4 months ago

It's the opposite: if there is an invite, there is no need to follow up with a knock (in fact, knocking on a room you've been invited to fails with a 403).

What an invite should be followed up with is a join (or a button asking the user to join). Ideally this would be done only for rooms the user had knocked on, which can be checked by looking up the prior value of the m.room.member state event (with the user's ID as the state_key), either with prev_content or replaces_state.

AndrewFerr commented 4 months ago

Ah, my mistake, ECall does auto-accept invites in response to knocks. The "join" button it shows is for joining a call, not joining a room.

This means all we need is for ECall to know that visiting a knock room you've already been invited to should join you to that room, instead of offering you to send a (redundant & fruitless) knock.

AndrewFerr commented 4 months ago

https://github.com/element-hq/element-call/pull/2355 helps with the case of visiting an ECall page for a knock room you're already invited to.

The other half is to have ECall join auto-accept invites to rooms you've knocked on. ECall already uses callbacks to handle this, but they're lost on page refresh. To make this more resilient, ECall should use the initial /sync on page load to retrieve all knocked rooms & set the callback on each of them.

AndrewFerr commented 4 months ago

https://github.com/element-hq/element-call/pull/2361 should help with the rest of this. It was less work than expected thanks to ECall's existing callbacks already handling most auto-join scenarios.

AndrewFerr commented 3 months ago

Both PRs mentioned above are merged, so knocking flows should be much more reliable now. Closing until we discover otherwise :slightly_smiling_face: