element-hq / element-call

Group calls powered by Matrix
https://call.element.io
Apache License 2.0
535 stars 84 forks source link

Joining a room call in embedded mode sometimes fails within Element Web #2458

Open hughns opened 1 week ago

hughns commented 1 week ago

When running Element Call embedded within Element Web using the "New group call experience" labs feature I sometimes see a behaviour where I attempt to start a room call and after pressing "Join call" I get returned to the timeline rather than actually entering the call.

In the console you see The widget died; treating this as a user hangup logged

@toger5 had spotted this previously and sent messages in #matrix-dev about it. The message describing when this was observed matches my own observations:

The reason I am looking into this is that Calls sometimes just stop when clicking the join button in the lobby. I noticed that this makes sense since the condition for them to stop is that the state changes to no memberships memberships=[].

Here is how it looks from a UX point of view:

https://github.com/element-hq/element-call/assets/6955675/30d6fa12-9838-4ca3-af40-5a8178dd4dbb

When looking at it I also came to the conclusion that the sequence of room rate events being (re-)emitted and (re-)processed by https://github.com/matrix-org/matrix-js-sdk/blob/d90292bff5ed026fba72b1ffa8b5284b1fb76577/src/matrixrtc/MatrixRTCSessionManager.ts#L133-L148 that causes an issue.

How I originally came on to this issue was that I was looking at https://github.com/element-hq/element-call/issues/2415 and found that the key distribution failed after encountering this behaviour. I will add some context on that issue shortly too.

Possibly related issues:

Thoughts:

erikjohnston commented 6 days ago
The HAR of the sync requests ``` https://call-unstable.ems.host/_matrix/client/v3/sync?filter=0&timeout=30000&set_presence=online&since=s1002136_1_12_1333_4049_1_426_5717_0_1 { "next_batch": "s1002138_1_12_1333_4050_1_426_5717_0_1", "timeline": { "events": [ { "type": "org.matrix.msc3401.call.member", "sender": "@hughns2:call-unstable.ems.host", "state_key": "@hughns2:call-unstable.ems.host", "unsigned": { "replaces_state": "$91MdSs1OcBO5xguVL2zeNhBQlo3hAt_DyFaEZkW8SfQ", }, "event_id": "$q5m5Fk6Q0WuGC6PAR5QqrfnFPnXqz30amvy3vpf5NqM" } --------------- https://call-unstable.ems.host/_matrix/client/v3/sync?filter=0&timeout=30000&set_presence=online&since=s1002138_1_12_1333_4050_1_426_5717_0_1 { "next_batch": "s1002140_1_12_1334_4051_1_426_5717_0_1", "state": { "events": [ { "type": "org.matrix.msc3401.call.member", "sender": "@hughns2:call-unstable.ems.host", "state_key": "@hughns2:call-unstable.ems.host", "unsigned": { "replaces_state": "$9mFjm_Ox58jeAvdSL8OWlSShagn65niqyLLAHX23QGs", }, "event_id": "$91MdSs1OcBO5xguVL2zeNhBQlo3hAt_DyFaEZkW8SfQ" } ] }, "ephemeral": { "events": [ { "type": "m.receipt", "content": { "$q5m5Fk6Q0WuGC6PAR5QqrfnFPnXqz30amvy3vpf5NqM": { "m.read": { "@hughns2:call-unstable.ems.host": { "ts": 1719851861973, "thread_id": "main" } } ---------------- https://call-unstable.ems.host/_matrix/client/v3/sync?filter=0&timeout=30000&set_presence=online&since=s1002143_1_12_1335_4052_1_426_5717_0_1 { "next_batch": "s1002146_1_12_1335_4052_1_426_5717_0_1", "timeline": { "events": [ { "type": "org.matrix.msc3401.call.member", "sender": "@hughns2:call-unstable.ems.host", "state_key": "@hughns2:call-unstable.ems.host", "unsigned": { "replaces_state": "$q5m5Fk6Q0WuGC6PAR5QqrfnFPnXqz30amvy3vpf5NqM", } "event_id": "$aemXSHbzKzePfTDzuX00mRFZJmJIJFdQ3I55EPllOUY" } ----------------------------------- https://call-unstable.ems.host/_matrix/client/v3/sync?filter=0&timeout=30000&set_presence=online&since=s1002146_1_12_1335_4052_1_426_5717_0_1 { "next_batch": "s1002148_1_12_1335_4052_1_426_5717_0_1", "state": { "events": [ { "type": "org.matrix.msc3401.call.member", "sender": "@hughns2:call-unstable.ems.host", "state_key": "@hughns2:call-unstable.ems.host", "origin_server_ts": 1719851861110, "unsigned": { "replaces_state": "$91MdSs1OcBO5xguVL2zeNhBQlo3hAt_DyFaEZkW8SfQ", }, "event_id": "$q5m5Fk6Q0WuGC6PAR5QqrfnFPnXqz30amvy3vpf5NqM" } ---------------- https://call-unstable.ems.host/_matrix/client/v3/sync?filter=0&timeout=30000&set_presence=online&since=s1002148_1_12_1335_4052_1_426_5717_0_1 { "next_batch": "s1002149_1_12_1335_4052_1_426_5717_0_1", "timeline": { "events": [ { "type": "org.matrix.msc3401.call.member", "sender": "@hughns2:call-unstable.ems.host", "state_key": "@hughns2:call-unstable.ems.host", "unsigned": { "replaces_state": "$aemXSHbzKzePfTDzuX00mRFZJmJIJFdQ3I55EPllOUY", }, "event_id": "$8LDSegFL8fkNUaNI_Jc2pgXla17t2FUmsE6JusnYCyU" } ], -------------------- ```

This can be summarised as the following changes:

timeline: $91M -> $q5m
state:    $9mF -> $91M
timeline: $q5m -> $aem
state:    $91M -> $q5m
timeline: $aem -> $8LD
Looking at the DB we get ``` select stream_ordering, event_id, type, state_key, (select array_agg(prev_event_id) from event_edges where events.event_id = event_edges.event_id), (select state_group from event_to_state_groups as sg where events.event_id = sg.event_id) from events where room_id = '!totKtECNNKyhzLOiNl:call-unstable.ems.host' and 1002136 < stream_ordering and stream_ordering <= 1002149 order by stream_ordering; stream_ordering | event_id | type | state_key | array_agg | state_group -----------------+----------------------------------------------+--------------------------------+---------------------------------+---------------------------------------------------------------------------------------------+------------- 1002136 | $YPhOcZiyzEJyMHsrPFtJKxCljsZzQZOFjOE2k7_9am0 | org.matrix.msc3401.call.member | @hughns:call-unstable.ems.host | {$89_MkkYqF-ViNH0qGkqmQyb5LPnslb4VqQarURn0iUA} | 37615 1002138 | $q5m5Fk6Q0WuGC6PAR5QqrfnFPnXqz30amvy3vpf5NqM | org.matrix.msc3401.call.member | @hughns2:call-unstable.ems.host | {$YPhOcZiyzEJyMHsrPFtJKxCljsZzQZOFjOE2k7_9am0} | 37616 1002139 | $tIhui6kBfzQDGV5FaZZJn_yKsPqeUwM22JeUIikdFRw | m.room.encrypted | | {$YPhOcZiyzEJyMHsrPFtJKxCljsZzQZOFjOE2k7_9am0} | 37615 1002140 | $bTqD6YwK-cwSv4-F6G7_BlWgItaUUssQ5bRFAET-ReA | m.room.encrypted | | {$q5m5Fk6Q0WuGC6PAR5QqrfnFPnXqz30amvy3vpf5NqM,$tIhui6kBfzQDGV5FaZZJn_yKsPqeUwM22JeUIikdFRw} | 37616 1002146 | $aemXSHbzKzePfTDzuX00mRFZJmJIJFdQ3I55EPllOUY | org.matrix.msc3401.call.member | @hughns2:call-unstable.ems.host | {$bTqD6YwK-cwSv4-F6G7_BlWgItaUUssQ5bRFAET-ReA} | 37617 1002147 | $ugzpWM2F6VRnEcpnp0dYSzfXnKKRyTCHH-TQc9dcwcM | m.room.encrypted | | {$bTqD6YwK-cwSv4-F6G7_BlWgItaUUssQ5bRFAET-ReA} | 37616 1002148 | $xH5t5qKv0tQO7_wUdxoatp9pz-YsV_wnNSLxntLYuqg | m.room.encrypted | | {$aemXSHbzKzePfTDzuX00mRFZJmJIJFdQ3I55EPllOUY,$ugzpWM2F6VRnEcpnp0dYSzfXnKKRyTCHH-TQc9dcwcM} | 37617 1002149 | $8LDSegFL8fkNUaNI_Jc2pgXla17t2FUmsE6JusnYCyU | org.matrix.msc3401.call.member | @hughns2:call-unstable.ems.host | {$xH5t5qKv0tQO7_wUdxoatp9pz-YsV_wnNSLxntLYuqg} | 37618 (8 rows) ```

Which is roughly:

weird_sync dot

Where bottom is most recent and background colour is the state groups.

If you look at the DB output, which is ordered by stream_ordering you can see that the state keeps bouncing back and forth. This is likely confusing the algorithm for computing the state deltas to send to clients, due to poor handling of these cases in Synapse. The fix for this is probably to move to tracking state changes via current state, but that is not a small change.