element-hq / synapse

Synapse: Matrix homeserver written in Python/Twisted.
https://element-hq.github.io/synapse
GNU Affero General Public License v3.0

Synapse stops sending persistent events (other than join memberships) to application services #17206

Closed · anoadragon453 closed this issue 1 month ago

anoadragon453 commented 4 months ago

Description

There was a failure mode today on a customer's ESS deployment where their Synapse stopped sending PDUs to registered appservices. Ephemeral events (to-device, presence, read receipts) were still being sent, and curiously membership join events were as well. But messages, m.room.create events and membership leave events were not making it over (likely because join memberships take a separate notifier path). The deployment had no workers.

The synapse_event_processing_lag_by_event metric reported that the appservice_sender label was ~80K events behind (the build-up started ~20 days ago). We tried setting up workers on the deployment, but initially not an appservice worker. As the problem persisted, an appservice worker was configured and started, yet events were still not being sent. The synapse_event_processing_lag_by_event metric was also not being reported by the appservice worker, which means this code was not being reached:

https://github.com/element-hq/synapse/blob/23740eaa3da94fbe2e25180376961979fc4e8cd3/synapse/handlers/appservice.py#L179-L181

Looking in the database, we saw that the stream IDs (which concern ephemeral events) in the application_services_state table were fine and up to date. However, the appservice_stream_position table contained a stream ID of ~20K, while the stream_positions table showed the events stream at ~80K. The appservice position had fallen behind and was failing to catch up.

The only thing that worked was to manually update the stream position in appservice_stream_position to match the events row in the stream_positions table and restart the appservice worker. This led us to believe that there's a deadlock case that becomes possible once the sender falls very far behind.
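
For reference, a minimal sketch of the manual repair described above, assuming psycopg2, direct access to Synapse's Postgres database, and the column names shown below (they match what we saw, but treat them as assumptions and take a backup before attempting anything like this):

```python
# Hypothetical sketch of the manual fix described above, not an official tool.
# Assumes psycopg2 and the column names stream_positions.stream_id and
# appservice_stream_position.stream_ordering.
import psycopg2

conn = psycopg2.connect("dbname=synapse user=synapse_user host=localhost")
try:
    with conn, conn.cursor() as cur:
        # Read the current position of the events stream.
        cur.execute(
            "SELECT MAX(stream_id) FROM stream_positions WHERE stream_name = 'events'"
        )
        (events_pos,) = cur.fetchone()

        # Bump the appservice sender's persisted position up to it.
        cur.execute(
            "UPDATE appservice_stream_position SET stream_ordering = %s",
            (events_pos,),
        )
        print(f"appservice_stream_position set to {events_pos}")
finally:
    conn.close()

# The appservice worker (or the main process, if there are no workers)
# then needs a restart to pick the new position up.
```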

The code that compares the event stream positions and pulls the necessary events out of the database is:

https://github.com/element-hq/synapse/blob/23740eaa3da94fbe2e25180376961979fc4e8cd3/synapse/handlers/appservice.py#L121-L138
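
For readers without the source open, the loop there does roughly the following (a simplified paraphrase, not the actual code; the store method names and batching details are approximations):

```python
# Simplified paraphrase of the ApplicationServicesHandler send loop
# (illustrative only; names and batching details are approximate).
async def _notify_interested_services(self, max_token):
    # self.current_max tracks the newest events-stream position we've been told about.
    self.current_max = max(self.current_max, max_token.stream)

    limit = 100
    upper_bound = -1
    while upper_bound < self.current_max:
        # Last position the appservice sender has fully processed,
        # persisted in the appservice_stream_position table.
        last_token = await self.store.get_appservice_last_pos()

        # Pull the next batch of events between the two positions.
        upper_bound, event_ids = await self.store.get_all_new_event_ids_stream(
            last_token, self.current_max, limit
        )
        events = await self.store.get_events_as_list(event_ids)

        # ... schedule each event for the interested appservices ...

        # Advance the persisted position so this batch isn't re-sent.
        await self.store.set_appservice_last_pos(upper_bound)
```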

The suspicion is that either upper_bound or self.current_max was somehow being set incorrectly, which leaves us with some open questions about how that could happen.

It may be possible to reproduce this in a unit test with appropriately out-of-sync stream positions, along the lines of the sketch below.
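
A rough sketch of what such a regression test might look like; it leans on Synapse's HomeserverTestCase helpers, and the servlet list, store methods and handler calls used here are assumptions that would need checking against the current test utilities:

```python
# Hypothetical regression test sketch -- not runnable as-is. It assumes an
# appservice is registered in the test homeserver's config (otherwise the
# handler returns early) and that set_appservice_last_pos /
# get_appservice_last_pos behave as named here.
from synapse.rest import admin
from synapse.rest.client import login, room

from tests import unittest


class AppserviceStreamCatchupTestCase(unittest.HomeserverTestCase):
    servlets = [
        admin.register_servlets,
        login.register_servlets,
        room.register_servlets,
    ]

    def test_sender_catches_up_from_a_stale_position(self) -> None:
        store = self.hs.get_datastores().main

        # Simulate the observed state: the appservice sender's persisted
        # position (appservice_stream_position) is far behind the events stream.
        self.get_success(store.set_appservice_last_pos(0))

        # Create some events to open up a gap on the events stream.
        user = self.register_user("alice", "password")
        tok = self.login("alice", "password")
        room_id = self.helper.create_room_as(user, tok=tok)
        for _ in range(5):
            self.helper.send(room_id, body="hello", tok=tok)

        # Poke the handler with the current maximum, let the background send
        # loop run, and assert the persisted position catches up rather than
        # stalling.
        current_token = self.hs.get_event_sources().get_current_token().room_key
        self.hs.get_application_service_handler().notify_interested_services(
            current_token
        )
        self.pump(1)

        last_pos = self.get_success(store.get_appservice_last_pos())
        self.assertEqual(last_pos, current_token.stream)
```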

Synapse Version

Synapse 1.99.0+lts.3 for the full duration of the problem

Database

PostgreSQL

Workers

Multiple workers

Platform

Element Server Suite, so Kubernetes.

Relevant log output

We didn't spot any relevant logs.
Half-Shot commented 4 months ago

and curiously membership join events were being sent.

I find this especially odd, given the stream position wasn't moving.

Also, we found that the appservice_stream_position table has an apparently unused lock column that we couldn't match to any code in Synapse.

anoadragon453 commented 4 months ago

Potentially a duplicate of https://github.com/matrix-org/synapse/issues/13950.

Half-Shot commented 1 month ago

@anoadragon453 did we think this was related to the timing out requests for user queries in the end?

anoadragon453 commented 1 month ago

@Half-Shot yes indeed. For context, this turned out to be due to requests from Synapse to the application service for user lookup queries timing out. Each of these calls to self._check_user_exists has a timeout of 60s:

https://github.com/element-hq/synapse-private/blob/550d760364bf79147046c7c939af59db06ae17f1/synapse/handlers/appservice.py#L179-L187

and this code runs for every event that is meant to be sent to an AS. You can see how the AS's queue gets backed up as new events are created!
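
To make the back-pressure concrete, a back-of-the-envelope calculation (the incoming event rate is an assumption derived from the ~80K backlog over ~20 days reported above):

```python
# Rough back-of-the-envelope numbers, assuming events are processed serially
# and every _check_user_exists call hits the 60s timeout.
TIMEOUT_S = 60                      # per-event user query timeout
incoming_per_day = 80_000 / 20      # ~4,000 events/day, from the reported backlog
drain_per_day = 24 * 60 * 60 / TIMEOUT_S   # at most 1,440 events/day get through

print(f"incoming ~{incoming_per_day:.0f}/day, drained ~{drain_per_day:.0f}/day")
# incoming ~4000/day, drained ~1440/day -> the queue can only ever grow.
```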

I think it's best to close this issue, and instead create a new one with suggestions for how to better handle a poor network connection between Synapse and an AS in this case.