RocketChat / Rocket.Chat

The communications platform that puts data protection first.
https://rocket.chat/

Rocket.Chat stops working with 1000 active users #11288

Closed AmShaegar13 closed 5 years ago

AmShaegar13 commented 6 years ago

Description:

For us, Rocket.Chat does not work with more than 1000 active users. Rebooting a server, restarting Apache, or restarting Rocket.Chat after an update causes all clients to face serious issues connecting to the chat.

Steps to reproduce:

  1. Set up a chat with 1000 simultaneously active users
  2. Restart all instances at once.

Expected behavior:

Clients can reconnect to the chat.

Actual behavior:

While reconnecting, the server sends an enormous number of the following messages over the websocket:

{"msg":"added","collection":"users","id":"$userId","fields":{"name":"$displayName","status":"online","username":"$username","utcOffset":2}}
{"msg":"changed","collection":"users","id":"$userId","fields":{"status":"online"}}
{"msg":"removed","collection":"users","id":"$userId"}

This continues until the server closes the websocket. I assume this is due to the lack of ping/pong messages during this time. The client instantly requests a new websocket, starting the whole thing over and over again.

The only effective way to get the cluster up and working again is to force-logout all users by deleting their loginTokens from MongoDB directly.
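
For reference, we do this directly in the mongo shell. A minimal sketch of that kind of query, assuming the standard Meteor layout where resume tokens live in services.resume.loginTokens (adjust to your own setup):

// Force-logout all users by clearing their Meteor resume tokens.
// Every client has to authenticate again afterwards.
db.users.update(
  { "services.resume.loginTokens.0": { $exists: true } },
  { $set: { "services.resume.loginTokens": [] } },
  { multi: true }
)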

Server Setup Information:

Additional context

The high number of instances we operate is a direct result of this issue. When we first ran into it with about 700 users, we assumed we might need to scale the cluster accordingly, but we are not willing to add another server to the cluster for every 40 new users. We planned to support around 8000 users, approximately half of them active.

For now, we do not allow mobile clients yet. We would really love to do so, but with the current state of the cluster this won't happen soon.

magicbelette commented 6 years ago

Do you have Apache2 or Nginx as the frontend? Maybe you've reached some limit (MaxClients?) on the frontend.

What about system usage (RAM, CPU, Network, FS) for the machines of the cluster?

Cheers

AmShaegar13 commented 6 years ago

We are using Apache as a reverse proxy. The servers have 16 GB RAM available and only 1.5 GB used per instance. CPU usage goes up to its limit during the reconnects.

(Screenshots: system monitoring from 2018-06-28, 14:53)

kaiiiiiiiii commented 6 years ago

Sounds like you reached the maximum number of MongoDB connections (1024 by default on Linux, as far as I know).

vynmera commented 6 years ago

@AmShaegar13 You could try using nginx instead of Apache, and as suggested check your mongo settings?

qchn commented 6 years ago

Hi @kaiiiiiiiii, I am the admin of @AmShaegar13's Rocket.Chat setup; he kindly asked me to post this here:

001-rs:PRIMARY> db.serverStatus().connections
{ "current" : 182, "available" : 51018, "totalCreated" : 3234457 }

root@rocketchatdb:~# lsof -i | grep mongodb | wc -l
186

So this shouldn't be a thing…

Best, qchn

magicbelette commented 6 years ago

Did you check the Apache2 log for MaxClients reached?

qchn commented 6 years ago

Yes, @magicbelette, thanks for the hint. We configured MaxClients to 1500 per node and we're far from reaching that.

AmShaegar13 commented 6 years ago

@qchn Thanks. ;)

@magicbelette Yup, no errors regarding MaxClients. At most, rare proxy connection timeouts (about once an hour).

@vynmera Trying things out is not something I can easily do; that requires another downtime for our users. Additionally, I don't really suspect Apache to be the problem here. Node is causing the CPU load and HTTP is doing fine; I can load all scripts and assets just fine. It's just the websocket that never finishes receiving those collection update messages.

jhermann commented 6 years ago

Sounds like a job for exponential back-off on the client side, after, say, 2-3 failed websocket reconnects.
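
A rough client-side sketch of that idea in plain JavaScript (illustrative only, not Rocket.Chat's actual reconnect code; the URL and limits are made up):

// Reconnect with exponential back-off plus jitter instead of hammering the server.
function connectWithBackoff(url, attempt = 0) {
  const ws = new WebSocket(url);
  ws.onopen = () => { attempt = 0; };  // reset the back-off after a successful connect
  ws.onclose = () => {
    // 1 s, 2 s, 4 s, ... capped at 60 s, plus up to 1 s of random jitter
    const delay = Math.min(1000 * 2 ** attempt, 60000) + Math.random() * 1000;
    setTimeout(() => connectWithBackoff(url, attempt + 1), delay);
  };
  return ws;
}

connectWithBackoff("wss://chat.example.com/websocket");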

dmoldovan123 commented 6 years ago

Hello, you can try HAProxy and use forever-service to run the nodes: HAProxy -> n nodes -> 1 MongoDB server.

AmShaegar13 commented 6 years ago

@dmoldovan123 As already mentioned in my reply to vynmera, I can't just try various things. I have to maintain a stable service for 1000 active users. So if you could give me a hint as to why HAProxy with forever-service would be better than Apache with pm2, I would be really grateful. That would give me something to justify breaking the service (again) on purpose.

The thing is, I do not see how a different proxy or service manager would reduce the status-change messages over the websocket.

magicbelette commented 6 years ago

I don't think that's the best idea ever, but you can easily test without PM2, directly with systemd. I don't really know PM2, but the fact is that you add another layer and potentially a bottleneck.

Another thing from my experience: be careful with the Apache2 config... My instance was incredibly slow (3 seconds to load each avatar). My Apache2 used mpm_prefork with a dumb copy/paste (MaxRequestsPerChild 1). The servers were consuming a lot of resources forking new processes, with a bad user experience, but there was no system load. Took me a couple of days to figure it out :/

AmShaegar13 commented 6 years ago

I am using pm2 in fork mode, so no extra layer should be present. 3 instances of Rocket.Chat are running, each with its own port.

Cluster mode did not work for some reason.

dmoldovan123 commented 6 years ago

https://rocket.chat/docs/installation/manual-installation/multiple-instances-to-improve-performance/ Use HAProxy, not nginx. It works very fast with HAProxy.

AmShaegar13 commented 6 years ago

@dmoldovan123 This is what we already do. As you can see in the issue description, I am running 8 servers with 3 instances each to utilize CPU cores behind a reverse proxy. I don't see how another proxy would impact the CPU load of the node processes. We are using MongoDB with a replica set, and the instances can communicate with each other because I set INSTANCE_IP.

I am pretty sure this issue is related to this one in the user-presence library that Rocket.Chat uses.

@rodrigok @sampaiodiego Can one of you confirm this?

magicbelette commented 6 years ago

I think that https://github.com/Konecty/meteor-user-presence/issues/17 is the best lead, but did you check your database engine? https://rocket.chat/docs/installation/manual-installation/multiple-instances-to-improve-performance/#database-engine

AmShaegar13 commented 6 years ago

Thanks for all of your suggestions. We were now able to confirm that the UserPresenceMonitor was responsible for the denial of service we faced.

We disabled it on all but two separate instances and can restart the cluster now without causing tons of status updates.

We did so by patching the source to check a USER_PRESENCE_MONITOR environment variable:

--- rocket.chat/programs/server/app/app.js  2018-07-04 18:07:36.917547890 +0200
+++ app.js  2018-07-04 18:10:12.273401726 +0200
@@ -7753,7 +7753,10 @@

   InstanceStatus.registerInstance('rocket.chat', instance);
   UserPresence.start();
-  return UserPresenceMonitor.start();
+
+  if (process.env['USER_PRESENCE_MONITOR']) {
+    return UserPresenceMonitor.start();
+  }
 });
 /////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////

I would still like to have an official fix for this rather than patching the source with every update.

@magicbelette We still use the slow database engine, but we do not observe high CPU load or memory usage on the database servers yet.
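
For anyone replicating this: with pm2 in fork mode, the variable can be scoped to individual instances via an ecosystem file. A rough sketch of how that could look (the names, paths, ports and URLs below are placeholders, not our actual config):

// ecosystem.config.js -- pm2 fork-mode sketch: regular instances run with the
// UserPresenceMonitor disabled (patch above), one dedicated instance enables it.
const common = {
  script: '/opt/Rocket.Chat/main.js',
  exec_mode: 'fork',
};

const env = {
  ROOT_URL: 'https://chat.example.com',
  MONGO_URL: 'mongodb://db1,db2,db3/rocketchat?replicaSet=001-rs',
  INSTANCE_IP: '10.0.0.1',
};

module.exports = {
  apps: [
    { ...common, name: 'rocketchat-1', env: { ...env, PORT: 3001 } },
    { ...common, name: 'rocketchat-2', env: { ...env, PORT: 3002 } },
    // dedicated presence instance, kept out of the load balancer
    { ...common, name: 'rocketchat-presence', env: { ...env, PORT: 3003, USER_PRESENCE_MONITOR: '1' } },
  ],
};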

elie222 commented 6 years ago

I am also surprised that user presence hasn't caused more issues for more people. I raised the issue a while back.

Another question is whether pm2 is handling sessions properly. Meteor uses sticky sessions, and if they are not handled properly your servers may be doing a lot of extra work constantly logging users in again.

I'd look at adding Kadira (Meteor APM) to check app performance. NodeChef offers a solution for 50 dollars per month (as does Meteor Galaxy hosting, but then you have to use their hosting service, which is pricey).

magicbelette commented 6 years ago

I'm not sure, but I have the feeling that this patch causes users to appear offline to some others. As a consequence, users receive email notifications even though they are online.

@AmShaegar13 did you notice this, or am I totally wrong?

AmShaegar13 commented 6 years ago

I think you are not. Usually the status appears to be correct, but I have already had complaints about unnecessary emails. So yes, there is still something wrong with it, and I am still hoping this will be fixed. But for now, we can at least use the chat again.

Currently 1177 users at the peak.

AmShaegar13 commented 6 years ago

The same appears to happen the other way round: no email notification although the user is offline in all clients.

AmShaegar13 commented 6 years ago

This is becoming a major annoyance. More and more users complain about broken notifications. This issue is a major obstacle to acceptance in our company.

geekgonecrazy commented 6 years ago

How many instances are you distributing the load across?

AmShaegar13 commented 6 years ago

We are running 6 servers with 3 instances each without the UserPresenceMonitor (see the patch above) behind a load balancer. Additionally, we run 2 servers with 1 instance each with the UserPresenceMonitor enabled; these are not behind the load balancer, so no users reach them. Those two servers are dedicated to running the UserPresenceMonitor.

This setup keeps the cluster at least stable but causes the aforementioned problems with notifications.

geekgonecrazy commented 6 years ago

Just wanted to follow up here. We are working through another case like this. So this is definitely on our radar.

nmagedman commented 6 years ago

We were having pretty much identical symptoms. CPU pegged at 100%. Packet storm.

We implemented @AmShaegar13’s July 5 patch and (combined with splitting the servers into two Auto-Scaling Groups with different environment variables set) it solved that problem. We then noticed that we were experiencing some of the side-effects mentioned above, including users being marked as Away even when actively using the app. User activity would mark the user as Online for a split-second but then the user would return to Away.

I was concerned that this fix completely broke the User Presence system, but the almost-immediately-Away problem turned out to be something much simpler. Not a runtime failure, but just a configuration bug. As discussed in Issue https://github.com/RocketChat/Rocket.Chat/issues/11309#issuecomment-430816373, releases 0.64.0 and 0.66.0 changed the semantics of the "Idle Time Limit" user config setting, changing the units of the idle timeout from milliseconds to seconds. I don't know if the migrations were broken, or ran twice, or something else, but the end result is that the 300-second idle timeout somehow became 0.3 seconds!

Point being, be aware that there are multiple issues to manage here.

KirbySSmith commented 6 years ago

We see similar CPU issues related to user status when doing blue-green deploys. It seems to be in part related to the activeUsers publication: https://github.com/RocketChat/Rocket.Chat/blob/0.70.4/server/publications/activeUsers.js

When a server goes offline, all the client connections for that server are removed from the DB by other online servers or by the next server to come online: https://github.com/Konecty/meteor-user-presence/blob/master/server/server.js#L82

When the clients reconnect to the new servers, they create new client connections.

Both of these trigger the user records to be updated, status offline then online. Since the activeUsers publication notifies each client about changes to active users, that could be the number of active users x 2 records sent to each client to process. This causes the clients to fall behind in processing user statuses. It also seems to have a snowball effect, because each client will try to report the user's status multiple times as it struggles to sync user statuses. You can see the flood of user status updates by using Chrome dev tools to monitor the WebSocket frames when restarting the server.
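
For illustration, a simplified sketch of what such a publication looks like and why the volume explodes (not the exact Rocket.Chat code; see the linked activeUsers.js for the real query):

import { Meteor } from 'meteor/meteor';

// Simplified shape of the activeUsers publication: every connected client is
// subscribed to all non-offline users, so every status flip fans out to all clients.
Meteor.publish('activeUsers', function () {
  return Meteor.users.find(
    { status: { $exists: true } },
    { fields: { name: 1, username: 1, status: 1, utcOffset: 1 } }
  );
});

// Back-of-envelope for a restart with 1000 active users: each user flips
// offline and then online again (2 changes), and each change is pushed to
// every connected client -> roughly 1000 * 2 * 1000 = 2,000,000 DDP messages.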

tpDBL commented 6 years ago

Question for @geekgonecrazy: if this is on the radar as you say, would it be a bad thing to submit the 4-line workaround from AmShaegar13 mentioned earlier as a pull request? It could be an option in the meantime for groups not using the UserPresenceMonitor.

sampaiodiego commented 6 years ago

@tpDBL I've added that on https://github.com/RocketChat/Rocket.Chat/pull/12353

nmagedman commented 6 years ago

They did. https://github.com/RocketChat/Rocket.Chat/pull/12353 However, they (correctly) changed the semantics of the environment variable from opt-in to opt-out.

cpitman commented 5 years ago

Just chiming in, the symptoms described here are very similar to what I've seen running an instance with ~800 active users. Any restart during the busy period would cause a massive increase in network traffic and unresponsive instances. I was never able to diagnose the issue down to a root cause, so glad to see this progress!

AmShaegar13 commented 5 years ago

Still experiencing problems when lots of clients need to reconnect at once. Just had another downtime of half an hour because of this, even with the presence monitor limited to two of 20 instances. Around 1200 users online.

Looks like we had some network issues which caused a lot of clients to reconnect. The cluster however did not recover on its own. We had to stop it completely.

nmagedman commented 5 years ago

@AmShaegar13, we've been pretty stable since fixing some bad configuration settings a few weeks ago, all having to do with Apple Push Notifications:

/admin/Push
Enable: True
Enable Gateway: True
Gateway: https://gateway.rocket.chat

Running v0.69.2. 1800 users. 10 EC2 instances, 2 of which run UPM. 3 RocketChat daemons per instance.

Our log files are still filled with errors:

Error sending push to gateway (2 try) -> { Error: failed [400] {"code":108,"error":"bad request sent to apn","requestId":"[random UUID]","status":400}

and APNs are unreliable (obviously), but we're stable.

joequilter commented 5 years ago

@AmShaegar13 Not an expert in Rocket.Chat, but I have some experience with broken servers under load and am interested in this problem as we evaluate Rocket.Chat. Have you run https://github.com/netdata/netdata and looked for bottlenecks? Common issues I found were normally related to various OS limits being set wrong (file handles and sockets in particular), overly long timeouts, disk contention (though unlikely to be an issue on AWS), memory, and process locks. Netdata is particularly good at giving an overview of which thing is breaking stuff.

AmShaegar13 commented 5 years ago

We use our own tooling for that, displayed on Grafana dashboards, which gives a pretty good overview as well. I will see if I can post a screenshot of today's outage later. Also, please refer to the graphs I posted back in June.

I will have a look at netdata and see if it yields any further data. However, I don't know when this will happen again, so I won't be posting any new data soon.

By the way, we are now using HAProxy after some internal infrastructure upgrade. This, apparently, did not change anything. I think we have reached another maximum-user mark our setup can support, but I don't know how to fix that.

In my opinion, the problem is the expensive reconnect mechanism that multiplies when lots of clients reconnect at once. The server sends a list of all active users to the connecting client, which takes a while and may be blocking. But I'm only guessing here.

nmagedman commented 5 years ago

In my opinion, the problem is the expensive reconnect mechanism that multiplies when lots of clients reconnect at once.

We worked around that problem by stopping the server and letting it sit for several minutes. After the clients fail to connect, they wait for a progressively longer retry delay. When we finally bring the servers back up, the clients return gradually. Ugly, but it was usually successful.

jhermann commented 5 years ago

Adding explicit back-pressure so that clients enter that back-off mode early during server startup could at least reduce the severity of the problem, and it is a relatively easy fix (a time-limited or load-induced addition of a header, and code to act on it using the existing back-off in the clients).
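
A minimal standalone Node sketch of that kind of back-pressure, shedding new requests with a Retry-After hint during a warm-up window (illustrative only; neither Rocket.Chat nor proxy code):

// Refuse new requests for the first 60 s after startup so that reconnecting
// clients back off instead of all piling on at once.
const http = require('http');

const WARMUP_MS = 60 * 1000;
const startedAt = Date.now();

const server = http.createServer((req, res) => {
  if (Date.now() - startedAt < WARMUP_MS) {
    res.writeHead(503, { 'Retry-After': '30' });  // hint for well-behaved clients
    res.end('warming up, try again later\n');
    return;
  }
  res.writeHead(200);
  res.end('ok\n');
});

server.listen(3000);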

AmShaegar13 commented 5 years ago

We have now limited the rate of newly created sessions in HAProxy. This might prevent the servers from being flooded with reconnects next time.

Here's our monitoring for the recent incident: (screenshot: monitoring dashboard, February 27 2019, 10:23 am)

Abraka commented 5 years ago

Whoa, I'm just in the middle of a Rocket.Chat deployment in our organisation and this does not look good. To start, I plan for about 200-700 active users, and later, if everything works well, there are potentially 2000-3000 users waiting. Nothing fancy like mobile notifications, but now it looks like Rocket.Chat has problems handling even a few hundred users. This makes me nervous... Also, @AmShaegar13, why do you run so many servers with few instances instead of running one or two with 10+ Rocket.Chat instances?

AmShaegar13 commented 5 years ago

We only use VMs with 4 cores and doubled the number of servers when we first faced these issues. That obviously did not help. We use 3 instances per server, as recommended in Running Multiple Instances Per Host To Improve Performance.

Abraka commented 5 years ago

I have everything in one 8-core VM with 16 GB RAM, including MongoDB. The point for me is to have all components together, to get rid of any Rocket.Chat component communication over the network. Currently 4 instances are running with empty channels and 1 connected admin. So far no problems with load :)

AmShaegar13 commented 5 years ago

Looks like our issue has been fixed by #14488, thanks a lot!

introspection3 commented 4 years ago

Looks like our issue has been fixed by #14488, thanks a lot!

@AmShaegar13 Sir, could you tell me how many users one single instance supports now? Can a single instance support more than 5k people (and what is the server's physical config)?

AmShaegar13 commented 4 years ago

Our current setup is 8 servers with one instance each. Downscaling from 3 instances each apparently helped us support more users. The NxN connections between each and every Rocket.Chat instance seem to limit scalability: a full mesh of 24 instances means 276 instance-to-instance links, versus 28 with one instance per server.

Currently, we have 2100 active users. However, Rocket.Chat is far from stable. About once a day, the CPU load of single instances rises to 100 %, slowing the instance down. This increases the response time, which rises to more than 45 s in extreme cases.

If we are fast enough to remove that instance from the load balancer, it eventually recovers. Otherwise, other instances will follow until the whole cluster is unusable and needs to be completely stopped and restarted.

From my point of view, Rocket.Chat will, in its current state, not be able to handle 5000 active users.

introspection3 commented 4 years ago

@AmShaegar13 Did you try the newer version (2.4.11)? (And what is each server's physical config?)

AmShaegar13 commented 4 years ago

We are currently running 2.4.8. Can't tell you much about the physical hardware as the servers are virtual machines in our internal VM cluster. 4 cores, 16 GB RAM. That's all I have at the moment. Also, we stopped using pm2 and use systemd services now. However, this should not have any impact.

introspection3 commented 4 years ago

I suspect the reason is the MongoDB server.

AmShaegar13 commented 4 years ago

No. MongoDB is pretty stable and can even handle high load. Also, our MongoDB instances are on separate hosts.

introspection3 commented 4 years ago

You can try nginx with HTTP/2. Could you tell me: does pm2 work?

AmShaegar13 commented 4 years ago

No, I cannot try anything. I have 2000 users working from home because of COVID-19, relying on a stable service.