Closed — anoadragon453 closed this 1 month ago
> and curiously membership join events were being sent.

I find this especially odd, given the stream position wasn't moving.
Also, we found that the `appservice_stream_position` table has an unused(?) `lock` column that we couldn't match to any code in Synapse.
Potentially a duplicate of https://github.com/matrix-org/synapse/issues/13950.
@anoadragon453 did we think this was related to the timing out requests for user queries in the end?
@Half-Shot yes indeed. For context, this turned out to be due to requests from Synapse to the application service for user lookup queries timing out. Each of these calls to `self._check_user_exists` has a timeout of 60s, and this code runs for every event that is meant to be sent to an AS. You can see why the AS's queuer may get backed up as new events are created!
I think it's best to close this issue, and instead create a new one with suggestions for how to better handle a poor network connection between Synapse and an AS in this case.
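To illustrate the back-pressure described above, here is a minimal arithmetic sketch (not Synapse code; the arrival rate is an assumption for illustration): if every event destined for an AS triggers a blocking user lookup that times out at 60s, the sender drains at most one event per minute while new events keep arriving.

```python
LOOKUP_TIMEOUT_S = 60      # per-event user query timeout described above
EVENT_ARRIVAL_RATE = 1.0   # assumed: one new event per second


def backlog_after(seconds: float) -> int:
    """Events queued minus events drained after `seconds` of constant timeouts."""
    arrived = int(seconds * EVENT_ARRIVAL_RATE)
    drained = int(seconds // LOOKUP_TIMEOUT_S)  # one event per 60s timeout
    return max(arrived - drained, 0)


# After a single day of every lookup timing out:
print(backlog_after(86_400))  # 86400 arrived - 1440 drained = 84960
```

Under these (assumed) rates, a backlog on the order of the ~80K events observed in this incident accumulates within a day.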
Description
There was a failure mode today on a customer's ESS deployment where their Synapse stopped sending PDUs to registered appservices. Ephemeral events (to-device, presence, read receipts) were sent, and curiously membership join events were being sent. But messages, `m.room.create` and membership leave events all weren't making it over (likely as join memberships have a separate notifier path). The deployment had no workers.

The `synapse_event_processing_lag_by_event` metric reported that the `appservice_sender` label was ~80K events behind (the build-up starting ~20 days ago). We tried setting up workers on the deployment, but not an appservice worker. Then as the problem persisted, an appservice worker was configured and started, yet events were still not being sent. The `synapse_event_processing_lag_by_event` metric was also not being reported by the appservice worker, which means this code was not being reached:

https://github.com/element-hq/synapse/blob/23740eaa3da94fbe2e25180376961979fc4e8cd3/synapse/handlers/appservice.py#L179-L181
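For reference, the catch-up step behind that link can be paraphrased as follows (a simplified sketch with invented names, not the actual Synapse implementation): the sender repeatedly pulls batches of events between its persisted stream position and the current head of the events stream, advancing its position after each batch.

```python
def catch_up(persisted_pos: int, current_max: int, events: dict[int, str],
             limit: int = 100) -> tuple[int, list[str]]:
    """Drain events from persisted_pos (exclusive) up to current_max.

    `events` maps stream ordering -> event; names and shape are illustrative.
    """
    delivered: list[str] = []
    upper_bound = persisted_pos
    while upper_bound < current_max:
        batch_end = min(upper_bound + limit, current_max)
        # Fetch events in (upper_bound, batch_end]; gaps in orderings are skipped.
        batch = [events[i] for i in range(upper_bound + 1, batch_end + 1)
                 if i in events]
        delivered.extend(batch)     # "send" to the appservice queuer here
        upper_bound = batch_end     # persist the new stream position per batch
    return upper_bound, delivered
```

The incident suggests this loop (or the bookkeeping feeding it) can stall when the persisted position falls very far behind the stream head.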
Looking in the database, we saw that the stream IDs (which concern ephemeral events) in the `application_services_state` table were fine and up to date. However, looking at the `appservice_stream_position` table revealed a stream ID of ~20K, while in the `stream_positions` table one could see the `events` stream was at ~80K. It had fallen behind, and was failing to catch up.

The only thing that worked was to manually update the stream position in `appservice_stream_position` to match the `events` row in the `stream_positions` table and restart the appservice worker. This led us to believe that there's a deadlock case that's possible upon falling very far behind.

The code that compares the event stream positions and pulls the necessary events out of the database is:

https://github.com/element-hq/synapse/blob/23740eaa3da94fbe2e25180376961979fc4e8cd3/synapse/handlers/appservice.py#L121-L138
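The manual fix described above amounts to a one-row `UPDATE`. Here is a sketch of its shape using an in-memory SQLite database (table layouts simplified and assumed; the real fix was run against PostgreSQL and the actual schemas may differ):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Simplified stand-ins for the real tables:
    CREATE TABLE appservice_stream_position (lock CHAR(1), stream_ordering BIGINT);
    CREATE TABLE stream_positions (stream_name TEXT, stream_id BIGINT);
    INSERT INTO appservice_stream_position VALUES ('X', 20000);  -- stuck at ~20K
    INSERT INTO stream_positions VALUES ('events', 80000);       -- head at ~80K
""")

# Bring the appservice sender's position up to the head of the events stream:
con.execute("""
    UPDATE appservice_stream_position
    SET stream_ordering = (SELECT stream_id FROM stream_positions
                           WHERE stream_name = 'events')
""")

pos = con.execute(
    "SELECT stream_ordering FROM appservice_stream_position").fetchone()[0]
print(pos)  # 80000
```

After this update, restarting the appservice worker let it resume from the stream head rather than the stuck position.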
The suspicion is that either `upper_bound` or `self.current_max` was somehow incorrectly set. Some questions we're left with:

- How did the `appservice_sender` stream get stuck in the first place?

It's possible that this could be reproduced with a unit test with appropriately out-of-sync stream positions.
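A self-contained sketch of what such a unit test might look like (the helper here is a hypothetical stand-in for the sender's catch-up step, not Synapse's API):

```python
import unittest


def advance_stream_position(persisted: int, current_max: int) -> int:
    """Stand-in for the sender's catch-up step: regardless of how far behind
    `persisted` is, it should always end at `current_max`."""
    upper_bound = persisted
    while upper_bound < current_max:
        upper_bound = min(upper_bound + 100, current_max)  # assumed batch limit
    return upper_bound


class TestOutOfSyncStreamPositions(unittest.TestCase):
    def test_catches_up_from_far_behind(self):
        # Positions mirroring the incident: sender at ~20K, events head at ~80K.
        self.assertEqual(advance_stream_position(20_000, 80_000), 80_000)

    def test_noop_when_already_caught_up(self):
        self.assertEqual(advance_stream_position(80_000, 80_000), 80_000)
```

A real regression test would instead drive the actual appservice handler with seeded `appservice_stream_position` and `stream_positions` rows and assert that events are eventually handed to the AS queuer.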
Synapse Version
Synapse 1.99.0+lts.3 for the full duration of the problem
Database
PostgreSQL
Workers
Multiple workers
Platform
Element Server Suite, so Kubernetes.
Relevant log output