element-hq / synapse

Synapse: Matrix homeserver written in Python/Twisted.
https://element-hq.github.io/synapse

Presence change storm putting increased load on server #16843

Open FlyveHest opened 7 months ago

FlyveHest commented 7 months ago

Description

Six hours ago, my private Synapse server suddenly started putting five times its usual load on the machine.

I have narrowed it down to what seems to be a handful of clients, based on IP, that send around 150 GET and OPTIONS sync requests with presence changes every second.

Users are running either Element Web latest version, or official clients on phones or desktop.

Searching the issues, this seems somewhat related to #16705.

Steps to reproduce

I don't know; the flood started 6 hours ago with no change to either the server or the clients (from what I can tell) and has been running since.

Restarting the server did nothing; the presence changes continued as soon as it came back up.

Also tried waiting a minute before starting, same issue.

Homeserver

matrix.gladblad.dk

Synapse Version

1.99.0

Installation Method

Docker (matrixdotorg/synapse)

Database

SQLite

Workers

Single process

Platform

Ubuntu 20, in docker

Configuration

No response

Relevant log output

x.x.x.x - - [23/Jan/2024:17:53:32 +0100] "GET /_matrix/client/v3/sync?filter=2&timeout=30000&set_presence=unavailable&since=s322387_6844284_4_361249_403590_75_6308_2754_0_1 HTTP/1.1" 200 916 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36 OPR/106.0.0.0"
x.x.x.x - - [23/Jan/2024:17:53:32 +0100] "GET /_matrix/client/v3/sync?filter=3&timeout=30000&set_presence=unavailable&since=s322387_6844284_4_361249_403590_75_6308_2754_0_1 HTTP/1.1" 200 916 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
x.x.x.x - - [23/Jan/2024:17:53:32 +0100] "OPTIONS /_matrix/client/v3/sync?filter=3&timeout=30000&set_presence=unavailable&since=s322387_6844285_4_361249_403590_75_6308_2754_0_1 HTTP/1.1" 204 406 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
x.x.x.x - - [23/Jan/2024:17:53:32 +0100] "GET /_matrix/client/v3/sync?filter=3&timeout=30000&set_presence=online&since=s322387_6844284_4_361249_403590_75_6308_2754_0_1 HTTP/1.1" 200 917 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Element/1.11.55 Chrome/120.0.6099.56 Electron/28.0.0 Safari/537.36"
x.x.x.x - - [23/Jan/2024:17:53:32 +0100] "OPTIONS /_matrix/client/v3/sync?filter=2&timeout=30000&set_presence=unavailable&since=s322387_6844285_4_361249_403590_75_6308_2754_0_1 HTTP/1.1" 204 406 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36 OPR/106.0.0.0"
x.x.x.x - - [23/Jan/2024:17:53:32 +0100] "OPTIONS /_matrix/client/v3/sync?filter=3&timeout=30000&set_presence=unavailable&since=s322387_6844286_4_361249_403590_75_6308_2754_0_1 HTTP/1.1" 204 406 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Element/1.11.55 Chrome/120.0.6099.56 Electron/28.0.0 Safari/537.36"

Anything else that would be useful to know?

No response

FlyveHest commented 7 months ago

Following up a bit: overnight the problem subsided for a few hours, possibly because clients were disconnected entirely (machines shut down), but it started again early this morning, with 2 clients showing the same symptoms.

Over the course of the morning it has grown to 6 clients doing this, at around 35 requests a second, in what looks like a GET-then-OPTIONS cycle.

FlyveHest commented 7 months ago

Disabling presence on the homeserver seems to have removed the problem (not that surprising).

I am wondering how it began, though: nothing has changed on the homeserver, which has been running for years, and all of a sudden a multitude of different clients started flooding presence state changes.

agrimpard commented 7 months ago

Hello, I've had exactly the same problem. I realised it when disk space alerts appeared: the apache/matrix/systemd logs were taking up more than 2 GB each per day and the processor was struggling.

[Screenshots: disk space usage, Apache CPU, and Apache total requests graphs]

After investigation, I found the same thing: clients spam the server with GET/OPTIONS sync requests, and disabling presence solves the problem.

When I noticed the problem, the desktop clients were on Element 1.55 and the web clients on 1.53; Synapse was at 1.93. After updating the clients to the latest version and Synapse to 1.99, I still had the same problem. There were no configuration changes before or during the problem, apart from enabling/disabling presence.

I have around thirty active users and it's becoming a problem very, very quickly.

Installation Method: Debian package
Database: Postgres
Workers: Single process
Platform: Debian 11

rda0 commented 7 months ago

I have the same problem: extreme CPU usage in the presence and generic-sync workers. It gradually started appearing on 22.01.2024 at around 13:00 CET on version 1.98.0 (with no changes to the server since the restart into 1.98.0 on 22.12.2023). During the evening/night the graphs looked quiet as usual. The CPU usage suddenly went up again the next day (the 23rd) during office hours (and continued after an upgrade to 1.99.0). I also suspect it could be caused by certain clients or client versions.

[Graph: presence worker CPU usage, 2024-01-25]

rda0 commented 7 months ago

The symptoms look like https://github.com/matrix-org/synapse/issues/16057, which had not reappeared for months until now.

plui29989 commented 7 months ago

We have this problem as well.

Synapse Version: 1.99.0
Installation Method: Docker (matrixdotorg/synapse)
Database: Postgres
Workers: Single process
Platform: Ubuntu 22.04, in docker

gc-holzapfel commented 7 months ago

Here as well, same config:

Synapse 1.99.0 (matrixdotorg/synapse) Docker & Postgres on Ubuntu 22.04

The rogue client is on macOS, v1.11.56-rc.0.

rda0 commented 7 months ago

@gc-holzapfel How did you identify the client? I am just interested, since I looked at client versions and usernames in log snippets from good and bad minutes but could not find a pattern. It always seemed as if all of them were involved in the ~10x request count during bad minutes.

gc-holzapfel commented 7 months ago

I had a look at the Synapse Docker logs and there is (so far) only one misbehaving client, although we have other users with the same OS and client that do not cause problems.

meise commented 7 months ago

I can see this behaviour on one of our Synapse 1.99 instances as well. CPU usage and HTTP requests increase 10 to 20 times over usual. general _load_next_mult_id and general MultiWriterIdGenerator._update_table in the DB-transactions-by-txn-rate graph increase in the same range.

[Screenshots: HTTP request rate and DB transaction rate graphs]

OneBlue commented 7 months ago

I'm hitting the same issue with multiple clients on: {"server_version":"1.87.0","python_version":"3.7.3"}

I confirm that disabling presence via:

presence:
  enabled: false

Solves the issue.

Ronchonchon45 commented 7 months ago

I had the same issue starting on the same date, January 23, 2024, around 10:30 a.m. My server was on version 1.98; it was updated to 1.100, but the issue persists. We disabled presence to stop the issue, but we need to have this option enabled.

Synapse Version: 1.100
Installation Method: apt repository
Database: Postgres 15
Python: 3.11.2
Platform: Debian 12

agrimpard commented 6 months ago

I have three separate environments (test with a Synapse monolith, preprod with Synapse workers, prod with Synapse workers) and I still see the problem in all of them. It doesn't matter whether I update Synapse or Element: as soon as I re-enable presence, we DDoS ourselves.

Would it be possible, even if it's a Synapse bug, to limit the number of GET/OPTIONS sync calls in Element to, say, one every 10 seconds? At least not 15 per second... is it really useful to make so many calls in such a short space of time?

Presence is a feature we'd really like to keep active!

jlemangarin commented 5 months ago

Still HARD flooding the browser and the server on 1.98.0.

raphaelbadawi commented 5 months ago

I'm hitting the same issue with multiple clients on: {"server_version":"1.87.0","python_version":"3.7.3"}

I confirm that disabling presence via:

presence:
  enabled: false

Solves the issue.

In Synapse 1.98 this configuration doesn't seem to work anymore, so the flooding issue is still present and cannot be worked around...

avanhovedev commented 5 months ago

Same problem on 1.98.0.

bjoe2k4 commented 5 months ago

I'm on 1.101.0 (~150 MAU, no workers) and disabling presence is working for me. However, it is a disaster to have to disable such a basic and important feature of Synapse with no fix in sight while new releases (which don't fix the problem) keep flying in.

I'm a small fish, but I'm wondering what the guys with the big servers are doing to help with the issue.

jlemangarin commented 5 months ago

@clokep Do you have insight on this considering your work on https://github.com/matrix-org/synapse/pull/16544 ?

clokep commented 5 months ago

@clokep Do you have insight on this considering your work on https://github.com/matrix-org/synapse/pull/16544 ?

No. Maybe there's a misconfiguration? Multiple workers fighting or something. I haven't read in depth what's going on though -- I'm no longer working at Element.

jlemangarin commented 5 months ago

Actually this is really simple to reproduce; I just did it in 2 minutes. You just have to:

[Screenshot: reproduction steps]

Whatever the issue is, it shows a lack of defensive safeguards at multiple layers:

We will try to dirty-patch it for now and post the patch here if it's relevant.

FlyveHest commented 5 months ago

Actually this is really simple to reproduce; I just did it in 2 minutes. You just have to:

Interesting. Does this make it a client issue more than a server one, I wonder? I'm not really sure who triggers the syncing, whether it's the client misbehaving or whether the server should actively try to stop this behaviour.

agrimpard commented 5 months ago

Interesting. Does this make it a client issue more than a server one, I wonder? I'm not really sure who triggers the syncing, whether it's the client misbehaving or whether the server should actively try to stop this behaviour.

For me it's both. Element shouldn't make so many calls on its own, and Synapse should block clients that make too many calls; otherwise any client can effectively DDoS its APIs.
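
To illustrate the kind of server-side guard being discussed here, a generic token-bucket sketch follows. It is not Synapse's actual ratelimiter; the parameters and helper names are made up. The idea is to cap the sustained sync rate per (user, device) before doing any expensive work.

# Generic token-bucket sketch, not Synapse code: allow bursts but cap the
# sustained request rate per client before doing any expensive work.
import time

class TokenBucket:
    def __init__(self, rate_per_sec: float, burst: int):
        self.rate = rate_per_sec      # tokens added per second
        self.capacity = burst         # maximum bucket size
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller should respond 429 / tell the client to back off

# One bucket per (user, device); ~1 sync per 5 s sustained, bursts of 10 allowed.
buckets: dict[tuple[str, str], TokenBucket] = {}

def check_sync_allowed(user_id: str, device_id: str) -> bool:
    key = (user_id, device_id)
    bucket = buckets.setdefault(key, TokenBucket(rate_per_sec=0.2, burst=10))
    return bucket.allow()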

FlyveHest commented 5 months ago

For me it's both.

That would make the most sense, yes; at the very least, the server shouldn't be DDoSable through its official APIs.

jlemangarin commented 5 months ago

I think the client properly sends its own status, then just polls and waits for an answer as classic polling would do (even if, as you mentioned, a polling flood could also be identified client-side and prevented/delayed).

But yes, here the server has the main role: processing the conflicting presence requests and properly controlling the polling system.

Aside from polling flood control, a simple fix in the presence feature might be:

So it would stay in the same defined order: BUSY > ONLINE > ...

[Screenshot: proposed presence state ordering]
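
A minimal sketch of the merging idea suggested above, assuming presence reports from all of a user's connections are available at merge time; the helper is hypothetical and not the actual Synapse implementation. Instead of letting the most recent report win, the highest state in a fixed order is kept, so the merged value stops flip-flopping.

# Sketch only: merge presence reported by multiple connections of one user
# using the fixed priority order suggested above; the highest state wins.
PRIORITY = ["busy", "online", "unavailable", "offline"]  # most to least "present"

def merge_presence(reported_states: list[str]) -> str:
    """Return the highest-priority state among those currently reported."""
    for state in PRIORITY:
        if state in reported_states:
            return state
    return "offline"

# Example: one tab says "online", an idle tab says "unavailable";
# the merged value stays "online" instead of blinking between the two.
assert merge_presence(["unavailable", "online"]) == "online"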

I think this is approximately the hotfix we are trying to apply right now; we'll keep you updated on that.

cc. @raphaelbadawi

raphaelbadawi commented 5 months ago

Hello,

I've made a little patch (tested on 1.98 and the current master branch). For users like us who may have force-enabled multi-tab sessions and find themselves with inconsistent states within the same device, this solves the "blinking" between online and idle states, which flooded the server and sometimes blocked the client's thread.

synapse_16843.txt

git apply --ignore-space-change --ignore-whitespace synapse_16843.txt

clokep commented 5 months ago

force-enabled multi-tab sessions

What does this mean?

raphaelbadawi commented 5 months ago

force-enabled multi-tab sessions

What does this mean?

We patched matrix-react-sdk so we can have multi-tab sessions. This is why we had this flood: if one tab was awake and another was inactive, the session kept cycling online->inactive->online->inactive for the same device ID. The previous fix avoided flooding across different devices, but not in this particular case.

clokep commented 5 months ago

What does "multi-tab sessions" mean? Does it mean you're logging into multiple different accounts? This sounds like it might be a client issue then if the data isn't properly segmented between the accounts or something like that.

jlemangarin commented 5 months ago

Confirmed: it (finally 🥳) passed all QA tests, with the different presence states properly merged and no longer flooding the polling!

[Screenshots: presence states before and after the patch]

@clokep I think @raphaelbadawi means that he has multiple tabs (on the same account) sending different presence states, as if you had multiple clients, plus a third one in another browser that got flooded; this is the test case I use.

Thank you, it finally passed the tests for production!

The next step would be to confirm how this could have been introduced in the recently changed state-merging mechanism.

raphaelbadawi commented 5 months ago

he has multiple tabs (on the same account) sending different presence states, as if you had multiple clients

Yes, this was my use case: same user ID, same token, but in multiple tabs.
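
For reference, a rough sketch of that multi-tab scenario as described in this thread; this is an assumption-based illustration rather than an official test script, and the homeserver URL and token are placeholders. Two long-poll loops share one access token but send conflicting set_presence values, the way two tabs of the same session would.

# Rough reproduction sketch (assumption, not an official test): two "tabs" of
# the same session long-polling /sync with conflicting set_presence values.
import threading
import requests

HOMESERVER = "https://matrix.example.org"   # placeholder
ACCESS_TOKEN = "REDACTED"                   # same token for both "tabs"

def sync_loop(set_presence: str) -> None:
    since = None
    headers = {"Authorization": f"Bearer {ACCESS_TOKEN}"}
    while True:
        params = {"timeout": 30000, "set_presence": set_presence}
        if since:
            params["since"] = since
        r = requests.get(f"{HOMESERVER}/_matrix/client/v3/sync",
                         headers=headers, params=params, timeout=60)
        if r.ok:
            since = r.json().get("next_batch", since)

# One "tab" claims to be online, the other idle; watch the request rate
# in the homeserver / reverse-proxy logs while both loops run.
threading.Thread(target=sync_loop, args=("online",), daemon=True).start()
sync_loop("unavailable")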

jlemangarin commented 5 months ago

Hello,

I've made a little patch (tested on 1.98 and the current master branch). For users like us who may have force-enabled multi-tab sessions and find themselves with inconsistent states within the same device, this solves the "blinking" between online and idle states, which flooded the server and sometimes blocked the client's thread.

synapse_16843.txt

git apply --ignore-space-change --ignore-whitespace synapse_16843.txt

@agrimpard Could you try it and give us feedback? It just went live in production today and seems to work!

cc. @Ronchonchon45 @OneBlue @gc-holzapfel @plui29989 @FlyveHest

agrimpard commented 5 months ago

@agrimpard Could you try it and give us feedback? It just went live in production today and seems to work!

Do you have a git repository or the files concerned? I don't use git with Synapse ...

agrimpard commented 5 months ago

I've just tested the patch.

I started by stopping Matrix, applied the patch to /opt/venvs/matrix-synapse/lib/python3.11/site-packages/synapse/handlers/presence.py, re-enabled presence, and restarted Matrix. I don't see any "presence storm".

I then stopped Matrix again, removed the patch from /opt/venvs/matrix-synapse/lib/python3.11/site-packages/synapse/handlers/presence.py, left presence enabled, and restarted Matrix. I still see no "presence storm".

I should have started by re-enabling presence without the patch to have a real baseline. I wonder whether the patch did some sort of reset on presence, so that even without it I'm now clear of the presence bug. I need to do some more tests, but it doesn't look bad at first glance :)

For those who want to test, the patched presence.py source is available above this message... Make sure you test presence without the patch first!

Ronchonchon45 commented 5 months ago

I applied the patch on Monday (the 25th). Everything was fine, but this morning at 10 AM the problem came back. Now, even if I disable presence, the problem persists. I've disabled the access_log for nginx and INFO logging for the application to stop the problem.

raphaelbadawi commented 5 months ago

Hello @Ronchonchon45. How do you reproduce the issue on your side? For me it happened when logged in with the same user in several tabs at the same time.

Ronchonchon45 commented 5 months ago

I don't know how the problem came back. There was no problem for 2 days after the patch, then on Wednesday at 3 pm I saw that disk space was low and that the log file had been growing since 10 am (1.5 GB in 5 hours). How can I check whether users are logged in on several tabs at the same time?
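
One way to get a hint from the logs, as a sketch: it assumes Synapse INFO request logs are available in a file (homeserver.log is a placeholder) and that the authenticated user appears in curly braces, as the synapse.access.http logger normally prints. A user with a very high sync count and several different set_presence values is a good candidate for a multi-tab session.

# Rough sketch (assumes Synapse INFO request logs in homeserver.log, with the
# authenticated user in curly braces as the synapse.access.http logger prints).
import re
from collections import Counter, defaultdict

user_re = re.compile(r"\{(@[^}]+)\}")
presence_re = re.compile(r"set_presence=(\w+)")

sync_counts = Counter()                  # user -> number of sync requests
presence_values = defaultdict(set)       # user -> set_presence values seen

with open("homeserver.log") as f:
    for line in f:
        if "/_matrix/client/v3/sync" not in line:
            continue
        user = user_re.search(line)
        if not user:
            continue
        sync_counts[user.group(1)] += 1
        p = presence_re.search(line)
        if p:
            presence_values[user.group(1)].add(p.group(1))

for user, count in sync_counts.most_common(10):
    # A very high count plus several different set_presence values for one
    # user suggests multiple tabs/devices fighting over presence.
    print(user, count, sorted(presence_values[user]))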

raphaelbadawi commented 5 months ago

What do you have in the actual logs? Is it flooding user presence state updates (if a user's state blinks rapidly between two states, it may be related to multi-tab) or is it something else?