element-hq / synapse

Synapse: Matrix homeserver written in Python/Twisted.
https://element-hq.github.io/synapse
GNU Affero General Public License v3.0
1.48k stars 182 forks source link

Upgrading large rooms takes 1000s+, times out resulting in confusing client error #17530

Open ara4n opened 2 months ago

ara4n commented 2 months ago

Description

Upgrading large rooms takes more than 10s, but after 10s synapse emits some kind of timeout error, leaving the client in a mess.

Steps to reproduce

Homeserver

matrix.org

Synapse Version

Whatever was live 2024-07-29T14:23:58.642Z

Installation Method

Other (please mention below)

Database

postgres

Workers

Multiple workers

Platform

debian

Configuration

No response

Relevant log output

Go look at the logs, if they're still there...

Anything else that would be useful to know?

No response

anoadragon453 commented 2 months ago

Here are the logs from Synapse:

3670827:2024-07-29 14:53:13,497 - synapse.http.server - 782 - WARNING - POST-8aadc2a6593fbef6-LHR - Not sending response to request <XForwardedForRequest at 0x7f18a38b3ca0 method='POST' uri='/_matrix/client/v3/rooms/!YTvKGNlinIzlkMTVRl%3Amatrix.org/upgrade' clientproto='HTTP/1.1' site='8080'>, already disconnected.
3670828:2024-07-29 14:53:13,498 - synapse.access.http.8080 - 473 - INFO - POST-8aadc2a6593fbef6-LHR - 77.108.142.78 - 8080 - {@matthew:matrix.org} Processed request: 1854.869sec/-1674.869sec (132.897sec, 15.687sec) (90.976sec/342.120sec/42165) 0B 200! "POST /_matrix/client/v3/rooms/!YTvKGNlinIzlkMTVRl%3Amatrix.org/upgrade HTTP/1.1" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) ElementNightly/2024072901 Chrome/126.0.6478.127 Electron/31.2.1 Safari/537.36" [35554 dbevts]

Parsing the request log, we can see that the request took 1854s to complete. It looks like Synapse did not time out the request - but potentially the reverse proxy? Either way, this is exceedingly long.

It looks like this request took a long time due to copying over user's push rules and account data from this room to the other. The logs are littered with:

...
Waiting for repl stream 'account_data' to reach 3244591758 (account_data-1); currently at: 3244591756
Waiting for repl stream 'push_rules' to reach 236699064 (push_rules-1); currently at: 236699062
...

This appears to happen user-by-user sequentially over replication:

https://github.com/element-hq/synapse/blob/ceb3686dcde1f22cd28cee17a0922b6a382563c2/synapse/handlers/room_member.py#L1344-L1389

The main process handles the room upgrade request, while the push-rules-1 and account-data-1 workers handles those changes respectively.

We could shave a lot of time and resources off processing this request by just doing those two operations in larger batches than 1.