SocketCluster / socketcluster

Highly scalable realtime pub/sub and RPC framework
https://socketcluster.io
MIT License

20 seconds delay #410

Open hirbod opened 6 years ago

hirbod commented 6 years ago

Hi everybody,

You may have already seen some of my previous issues, so you might know that we run a real-time trivia app. Our app is getting a lot of attention at the moment and the first issues are arising.

Our trivia app publishes a JSON object to all connected users. It's just a question and 3 answers. This works great; we currently have 2,500 users playing live. From time to time we receive mails from people complaining that "they did not receive a question".

My first question (let's assume they did not disconnect and every watcher is working fine): is it possible that a publish won't be delivered to a user? This is a serious question.

The second and much bigger issue: currently, the user has 10 seconds to pick an answer. After 10 seconds, the user can't emit to the server (blocked inside the app). But the user could choose their answer at 9.99 seconds. On the server (worker.js) we've added another 3,500 ms of "latency" buffer, meaning we will still accept a user's answer when it arrives up to 13.5 seconds after publishing.

Today, we had around 2,500 live users playing (we will probably have 50,000 users in about 30 days). We received angry mails because answers hadn't been accepted. After checking our logs, we realized that the latest message arrived 28 seconds after publishing.

So 2,500 users chose an answer 1-10 seconds after publishing. How is it possible that the last message arrived 28 seconds after emitting? Is this common? Are we doing something wrong?

We've seen that 99% of messages are checked against Redis in time, but around 1% of the answers were processed after 15 seconds, some at 18, and the longest took, as already said, 28 seconds.

So every message arriving more than 13,500 ms after the publish event will be declined.
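
For reference, a minimal sketch of the kind of cutoff described above, as it might look in worker.js. The 'answer' event name and the publish-time bookkeeping are hypothetical, not the app's actual code, and a real multi-worker setup would need to share the publish timestamp (e.g. via Redis):

```js
// Hypothetical sketch of the 13,500 ms answer window described above.
const ANSWER_WINDOW_MS = 10000 + 3500; // 10 s to answer + 3.5 s latency buffer

let questionPublishedAt = null; // per-worker; share via Redis across workers

function publishQuestion(scServer, question) {
  questionPublishedAt = Date.now();
  scServer.exchange.publish('trivia', question);
}

scServer.on('connection', (socket) => {
  socket.on('answer', (data, respond) => {
    const elapsed = Date.now() - questionPublishedAt;
    if (questionPublishedAt === null || elapsed > ANSWER_WINDOW_MS) {
      // Arrived after the window: decline the answer.
      respond(new Error('Answer declined, arrived ' + elapsed + ' ms after publish'));
      return;
    }
    respond(null, {accepted: true, elapsed});
  });
});
```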

What shall we do? We're pretty lost on this. I've seen issues suggesting it might be a good idea to put a setTimeout of around 1-500 ms in the PUBLISH_OUT middleware (see the sketch below), but in our tests, every message published from the server arrived nearly in real time. Speaking for myself, I've never had an issue with delayed answers when playing or testing, but the logs are real and so is the issue.
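
The PUBLISH_OUT idea mentioned above would amount to something like this; a sketch based on the middleware API of that SC era (verify the exact signature against the docs for your version):

```js
// Spread outbound publishes over 0-500 ms of random jitter so that
// thousands of sockets aren't all written to in the same tick.
scServer.addMiddleware(scServer.MIDDLEWARE_PUBLISH_OUT, (req, next) => {
  const delay = Math.floor(Math.random() * 500);
  setTimeout(next, delay); // calling next() lets the message through
});
```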

Any ideas, advice, thoughts?

hirbod commented 6 years ago

Some more facts I forgot to mention.

Installed server/client version: 12.0.0. Server: currently a single, fully dedicated machine (Redis is also a single, fully dedicated server). Specs: AMD Ryzen 7 1700X octa-core "Summit Ridge" (Zen), 64 GB DDR4 RAM, 500 GB SSD.

14 workers, 2 brokers, all on the same machine. We have no code inside broker.js except what the CLI generated via "socketcluster create app". We're using sc-uws.

I should also mention that we have nginx on the same server as SocketCluster, since we're using SSL and we just proxy_pass everything from nginx to Node. We've heavily tuned sysctl.conf and limits.conf to the maximum values for a Linux distro, based on hundreds of tutorials on the net.

The clients connect on port 443 with an https domain to nginx, which passes everything on to Node/SocketCluster. I don't know if this could also be a reason for the delayed arrivals.

happilymarrieddad commented 6 years ago

We have a guy upstairs whose app is experiencing this exact same thing. I've pointed him to this issue. So far, I haven't heard of anything on our app, but who knows. We have a lot of traffic, but much of it is editing things in the DB, where response time isn't important.

jondubois commented 6 years ago

My first question is (let's assume they did not disconnect and every watcher is working fine) Is it possible that a publish won't be delivered to a user? This is a serious question.

If you're running SC on a single machine, the only realistic case where this can happen is if one of the processes crashes (e.g. because of an uncaught error) but that should be pretty obvious from the logs.
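
If crashes are a suspect, a minimal sketch for making them unmissable in the logs (standard Node.js API, placed in worker.js; exiting lets the supervisor restart the process):

```js
// Log fatal worker errors loudly, then exit so the master/forever can restart us.
process.on('uncaughtException', (err) => {
  console.error('Worker crashed with uncaught exception:', err.stack);
  process.exit(1);
});
```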

The second and much bigger issue is: currently, the user has 10 seconds time to pick an answer...

Maybe the user's socket was disconnected for a few seconds and their answer was sent after a delay (when the connection came back). Maybe your server is overloaded since you are running multiple programs on it; what is the CPU usage % on that machine (e.g. with the top command)?

hirbod commented 6 years ago

0.2% CPU; the server is just running nginx and SocketCluster with "forever". MySQL, Redis, etc. are on their own dedicated servers.

jondubois commented 6 years ago

How are Redis and MySQL performing?

hirbod commented 6 years ago

MySQL is not called when answers come in, only Redis, which is a dedicated 8-core Intel i7 with an SSD and 64 GB DDR4 RAM. We've run millions of requests against Redis in under a second; Redis is extremely performant.

MatthewAry commented 6 years ago

I'm having the same problem and I've spent days working on it. I do not, by any means, have the same traffic as the OP; I'm still developing the application that's using SC. However, I've been trying to isolate the problem by measuring the time it takes for the client's request to arrive at the server.

My method of tracking latency is not super accurate, but it's accurate enough for me to determine that the problem does not exist in any code that I've written. My server is also not very powerful, but it doesn't need to be, for multiple reasons:

  1. It's still under development.
  2. At any time, the most users that we have using the application is 5.
  3. SC is being used instead of a regular REST API.
  4. Its traffic isn't anything close to that of a chat application, or a game where state needs to be shared across multiple clients. It's just a simple REST API replacement.

Regardless of these factors, the responsiveness and performance of client/server communication has not been consistent. My investigation has shown that the lag I'm seeing can come from one or both of these situations (a measurement sketch follows the list):

  1. It takes a long time (seconds) for the client's request to reach the event handler in the worker.
  2. It takes a long time (seconds) for the server's response to the client's message to reach the client.
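
A sketch of the timestamp-based measurement described here; the 'probe' event and response fields are made up for illustration, and the one-way numbers are only approximate unless client and server clocks are synchronized:

```js
// Client: stamp each emit and compare against the server's stamps.
const sentAt = Date.now();
socket.emit('probe', {sentAt}, (err, res) => {
  if (err) { console.error(err); return; }
  console.log('client -> server (approx):', res.receivedAt - sentAt, 'ms');
  console.log('server -> client (approx):', Date.now() - res.respondedAt, 'ms');
});

// Server (worker.js): record arrival time and echo both stamps back.
socket.on('probe', (data, respond) => {
  const receivedAt = Date.now();
  respond(null, {sentAt: data.sentAt, receivedAt, respondedAt: Date.now()});
});
```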

The performance problems are intermittent. Under normal load, the server hasn't once gone over 15% of its overall CPU capacity. RAM consumption usually hovers around 17% and doesn't show any wild fluctuations. The SC application runs behind an NGINX reverse proxy, which has been updated, reconfigured, and tested to see if it was responsible for the performance issues; at this point NGINX has been eliminated as the cause of the intermittent slowdowns.

This application has been on SC 9.3.1, 11.1.0, 11.3.1, and now the latest 13.1.3. Only recently has fixing this irregularity in performance become a priority for me, so I've been trying to track down the problem since 11.3.1.

The Server's configuration is as follows:

Typical ping time to the server from client computers that have experienced the slowdown is 70 ms. Apache Benchmark hitting the /health-check route on the server shows the following:

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:      415  934 243.8    901    2136
Processing:    77  190 115.7    144     619
Waiting:       77  155  71.5    134     496
Total:        575 1124 240.3   1063    2235

Lastly, I'm measuring the execution times of all server operations as part of my investigation; they are not responsible for the slowdowns in SC traffic.

jondubois commented 6 years ago

@MatthewAry it looks like your HTTP requests are slow as well. Is it also slow when you have no users on it? Could it be that your cloud provider is throttling your bandwidth? It looks like you're using a small machine.

What happens if you run a plain Node.js HTTP server next to your SC instance and try to connect to it? Is it also slow? Which WebSocket wsEngine are you using, and what version? How does it behave if you use ws as the wsEngine?

jondubois commented 6 years ago

@Hirbod When you say that CPU usage is 0.2%, is that at the peak? I suspect that for your use case you would have very low CPU usage and then a significant peak every 10 seconds or so (when all the answers are sent from clients at almost the same time).

hirbod commented 6 years ago

Watching "top" live, I've never seen the CPU load going over 1%. Memory also 61GB free from 64GB. Using "speedometer" to watch the network-interface, there is an output around 1Mbit and input about 1 Mbit.

The server is basically sleeping and has a lot of headroom.

MatthewAry commented 6 years ago

It's not the cloud provider. I've seen it degrade on ws, sc-uws, and uws. I haven't tried running a plain Node.js HTTP server next to it. Yes, it's a small machine, but what it's doing isn't too demanding either. Google's infrastructure ties instance bandwidth to the number of vCPUs the instance has (we have 1):

The egress throughput cap is dependent on the number of vCPUs that a virtual machine instance has. Each vCPU has a 2 Gbps egress cap for peak performance. Each additional vCPU increases the network cap, up to a theoretical maximum of 16 Gbps for each virtual machine. For example, a virtual machine instance with 4 vCPUs has an intra-zone network throughput cap of 2 Gbps × 4 = 8 Gbps. A virtual machine instance with 8 vCPUs has an intra-zone network throughput cap of 2 Gbps × 8 = 16 Gbps.

Specific workloads and benchmarks are regularly measured:

  • A single stream performance test can achieve a maximum of 8.5 Gbps on average.
  • A multistream performance test can achieve a maximum of 15 Gbps on average.

On the same server, we are serving static files for the front end of the application using NGINX; its performance as tested with Apache Benchmark is as follows:

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:      302  932 338.9    856    5595
Processing:    74  161  88.1    126     399
Waiting:       74  126  51.0    104     321
Total:        412 1093 344.9    999    5793

Keep in mind that while doing my benchmarks, I'm hitting the server with 100 concurrent connections at a time.

I'm currently using sc-uws, version 10.148.0 according to package-lock.json.

happilymarrieddad commented 6 years ago

@Hirbod are you guys on Google? AWS? nginx?

happilymarrieddad commented 6 years ago

Trying to find the common denominator.

hirbod commented 6 years ago

@happilymarrieddad nope, self-hosted on Hetzner: https://www.hetzner.com/dedicated-rootserver/matrix-ax?country=gb

Using the AX60-SSD. I was starting to think that AMD sucks and that we should switch to an Intel Xeon E3...

jondubois commented 6 years ago

@Hirbod Many cloud providers throttle CPU and bandwidth on smaller instances; sometimes it doesn't take much to trigger it. Not sure that's the problem, but it's worth trying a different instance to see if it makes a difference.

MatthewAry commented 6 years ago

@jondubois It's a dedicated instance (not a shared vCPU) with guaranteed memory and computational power. Before, we were on a shared instance where that could have been the case, but we moved to a dedicated instance to eliminate the possibility, even though our resource consumption was not high enough to justify the move.

In any case, @happilymarrieddad works downstairs from me and he's got me trying one more thing with NGINX. I'll let you know how that goes.

hirbod commented 6 years ago

Please share your findings, @MatthewAry, if they help. I also adjusted rlimit_nofile, set "tcp_nodelay" to "off", and added "proxy_buffering off;" and "proxy_request_buffering off;". Tomorrow we will see how it goes.

MatthewAry commented 6 years ago

Hahaha. @happilymarrieddad is hoping that my rlimit_nofile value is not high enough, so that's what I'm testing.

MatthewAry commented 6 years ago

Nope, that didn't fix it.

happilymarrieddad commented 6 years ago

Epic fail... dang it

jondubois commented 6 years ago

In any case, @happilymarrieddad works down stairs

Haha good teamwork ;p

@Hirbod I just did a test on Amazon EC2 with 16K concurrent sockets subscribed to the same channel on an 8-core machine (7 workers, 1 broker), running SocketCluster version 13.1.5 (using sc-uws as the wsEngine).

I published a message to the channel once every 10 seconds:

The machine which I used to simulate the clients had 32 cores and was in the same region (US) as the server machine but in a different availability zone.

You may want to double-check that you're not accidentally publishing more messages than intended; for example if you publish from inside an event handler which is accidentally attached multiple times.
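
For illustration, the kind of accidental re-attachment being warned about looks like this (event and channel names are hypothetical):

```js
// BUG: attaching a handler inside another handler. Every 'ready' event
// adds one more 'answer' handler, so a single answer is published N times.
socket.on('ready', () => {
  socket.on('answer', (data) => {
    scServer.exchange.publish('results', data);
  });
});

// Fix: register the 'answer' handler exactly once, outside 'ready'.
socket.on('answer', (data) => {
  scServer.exchange.publish('results', data);
});
```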

hirbod commented 6 years ago

Did you do all of this behind an nginx proxy using SSL? I would love to know how that would perform. I am sure there are no multiply-attached event handlers; this has been tested very thoroughly inside our app (we had that issue once but fixed it).

@jondubois more to mention: we're using "forever", 14 workers, 2 brokers, a single machine, and we're also doing a lot of logging. Every "received message" on the server is stored into Redis as soon as it has been processed.

MatthewAry commented 6 years ago

I should have mentioned that we're using our NGINX reverse proxy to handle SSL for the connection to the application, too.

jondubois commented 6 years ago

It sounds like this issue could be related to nginx proxy_buffering; see https://stackoverflow.com/questions/9612983/socket-io-slow-response-after-using-nginx
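
For reference, the relevant directives might sit in the WebSocket location block roughly like this (the upstream address and path are assumptions, not a drop-in config):

```nginx
location / {
    proxy_pass http://127.0.0.1:8000;        # SC load balancer / worker port
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;  # allow the WebSocket upgrade
    proxy_set_header Connection "upgrade";
    proxy_buffering off;                     # the setting discussed here
    proxy_request_buffering off;
    proxy_read_timeout 86400s;               # keep long-lived sockets open
}
```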

MatthewAry commented 6 years ago

Well, I can try turning proxy buffering off entirely. @happilymarrieddad has his on and he hasn't seen any of the problems that I've been having.

MatthewAry commented 6 years ago

Correction: @happilymarrieddad doesn't have proxy buffering on. I've turned off proxy buffering on my machines and I'm in the process of collecting data to see if the issue is resolved.

hirbod commented 6 years ago

@jondubois are you using SC without nginx? Would you recommend dropping nginx as a proxy and letting Node.js listen directly on port 443? Could this help increase performance? From my understanding this should be avoided and connections should always be proxied by nginx. Should we use a single dedicated server for proxying connections instead of running nginx on the same machine?

MatthewAry commented 6 years ago

@Hirbod Have you observed the issue since you turned off proxy buffering?

hirbod commented 6 years ago

We had so many other issues with our app today that I didn't have a chance to check it deeply enough, but from a quick look it seems to be working so far.

happilymarrieddad commented 6 years ago

We have a dedicated load balancer (nginx) and everything else in the network is on its own respective instance. We haven't noticed any of these problems. We are also using SCC, if that makes a difference: 1 scc-state server and 2 scc-broker servers.

jondubois commented 6 years ago

@Hirbod I'm pretty sure that a lot of companies use nginx in front of SC; it just seems to be a bit more work to configure properly. I have some experience with HAProxy and node-http-proxy and they were pretty easy to set up with SC/WebSockets.

Running SC without a load balancer is also fine sometimes; one advantage is that it uses less CPU overall, but one downside is that you miss out on features a load balancer might provide in terms of firewall, rate limiting, routing, etc.

I think that many realtime online games don't use load balancers; often rate limiting is handled explicitly in the backend logic along with other logic to prevent players from cheating. Middleware in SC can be useful for this.
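
As a sketch of that idea, per-socket rate limiting of inbound emits could look roughly like this with the middleware API of that SC era (window size and limit are arbitrary assumptions):

```js
const WINDOW_MS = 10000; // 10 s window
const MAX_EMITS = 20;    // arbitrary per-socket budget per window

scServer.addMiddleware(scServer.MIDDLEWARE_EMIT, (req, next) => {
  const now = Date.now();
  const s = req.socket;
  // Start a fresh window if none exists or the current one has expired.
  if (!s.rateState || now - s.rateState.windowStart > WINDOW_MS) {
    s.rateState = {count: 0, windowStart: now};
  }
  if (++s.rateState.count > MAX_EMITS) {
    next(new Error('Rate limit exceeded')); // block this emit
    return;
  }
  next(); // allow the emit through
});
```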

hirbod commented 6 years ago

So, proxy_buffering off seems to have fixed it for me. Furthermore, there was a memory leak inside my app (besides the trivia feature, we have a live chat integrated): 500 incoming messages per second with an immediate DOM append caused very heavy reflow, which pushed Android CPUs to their limit and caused the answers to be transmitted very, very slowly. After fixing these issues, I can't see any delayed messages anymore.
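
The client-side part of such a fix typically amounts to batching DOM writes instead of appending per message; a browser-side sketch (the channel handle, element id, and message shape are assumptions):

```js
// Buffer incoming chat messages and flush them once per animation frame,
// so 500 messages/second cost at most ~60 reflows/second instead of 500.
const pending = [];
let flushScheduled = false;

channel.watch((msg) => {
  pending.push(msg);
  if (!flushScheduled) {
    flushScheduled = true;
    requestAnimationFrame(flush);
  }
});

function flush() {
  const fragment = document.createDocumentFragment();
  for (const msg of pending) {
    const el = document.createElement('div');
    el.textContent = msg.text; // textContent also avoids HTML injection
    fragment.appendChild(el);
  }
  pending.length = 0;
  flushScheduled = false;
  document.getElementById('chat').appendChild(fragment); // single append
}
```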

hirbod commented 6 years ago

@MatthewAry have a look here https://github.com/emqtt/emqttd/wiki/linux-kernel-tuning

The 1 Million Socket sysctl.conf in that wiki looks totally wrong to me, and it also has some security issues IMHO.

I chose @behrad's recommendation, as it mostly fits the other docs I've seen (like https://blog.jayway.com/2015/04/13/600k-concurrent-websocket-connections-on-aws-using-node-js/).

I don't see any issues anymore.

MatthewAry commented 6 years ago

Well, I've got proxy buffering off, but I'm still seeing issues. I'm going to look into this kernel tuning stuff.

hirbod commented 6 years ago

Keep in mind that calling "sysctl -p" won't propagate everything immediately; tw_reuse needs a reboot, for example.

So the best thing is to tweak and reboot.

happilymarrieddad commented 6 years ago

Has this been fixed for everybody?

MatthewAry commented 6 years ago

I don't know yet. It's gonna be a while until I find out for sure. My priorities have changed for the moment.

toredash commented 6 years ago

@MatthewAry do you connect all clients at once, or is the distribution even?

If possible, it would be nice to see how this behaves if you turn off SSL. I suspect it might be related to low entropy available on the machine, meaning new SSL connections can't be made quickly enough.

happilymarrieddad commented 6 years ago

@toredash I work downstairs from him and I can tell you he's using Let's Encrypt, so SSL is turned on. Our servers are handling a lot of users and we haven't seen this issue. We bought our SSL cert from Comodo, but I'm pretty sure he's using a very similar nginx config to mine.

MatthewAry commented 6 years ago

Still seeing the issue. I'm working on some other things but as a side effect of that work I'm still seeing the slowness.

MegaGM commented 6 years ago

:point_up: @matetukacs message in the Gitter room June 5, 2018 6:01 PM

I guess it's possible your issues are related. It's just an assumption, though. It seems like you've already tried many things to find the culprit of the issue. The only question that comes to mind is, did you try to:

  1. redundantly log timestamps somewhere for every step the message goes through (your client-side JS, emit, middlewares, event listeners, etc.)
  2. analyze the data, especially for the clients with 10s+ issues, to see between which components the lag appears?

MatthewAry commented 6 years ago

Yes, I did all of that. I determined that the lag does not exist in my server-side code or my client-side code, but in the mechanisms that facilitate communication between client and server. I added timestamps on the client and server and compared times. I also measured the execution times of my server-side code. The slowdown happens when I hand things off to SocketCluster to send to the client/server; I can see the slowdown in the client's message reaching the server or in the server's message reaching the client. I've also tried to eliminate the possibility that the issue is network-related, but my tests show good ping and jitter values. So in the end, I'm not sure where the holdup is.