Open hirbod opened 6 years ago
Some more facts I forgot to mention.
Installed server/client version: 12.0.0. Server: 1 single server currently, fully dedicated (Redis is also on a single, fully dedicated server). AMD Ryzen 7 1700X octa-core "Summit Ridge" (Zen), 64 GB DDR4 RAM, 500 GB SSD.
14 workers, 2 brokers, all on the same machine. We have no custom code inside broker.js, only what was generated from the CLI using "socketcluster create app". We're using sc-uws.
I should also mention that we run nginx on the same server as SocketCluster, as we're using SSL and we just proxy_pass everything from nginx to Node. We've heavily tuned sysctl.conf and limits.conf toward the maximums a Linux distro allows, based on the many tutorials on the net.
Clients connect on port 443 over an HTTPS domain and everything is passed through to SocketCluster. I don't know whether this could also be a reason for the delayed arrivals.
We have a guy upstairs whose app is experiencing this exact same thing. I've pointed him to this issue. So far, I haven't heard anything about it on our app, but who knows. We have a lot of traffic, but much of it is editing things in the DB where response time isn't important.
My first question (let's assume they did not disconnect and every watcher is working fine): is it possible that a publish won't be delivered to a user? This is a serious question.
If you're running SC on a single machine, the only realistic case where this can happen is if one of the processes crashes (e.g. because of an uncaught error) but that should be pretty obvious from the logs.
The second and much bigger issue is: currently, the user has 10 seconds to pick an answer...
Maybe the user's socket was disconnected for a few seconds and their answer was sent after a delay (when the connection came back).
Maybe your server is overloaded since you are running multiple programs on it; what is the CPU % usage on that machine (e.g. with the top command)?
0.2% CPU; the server is just running nginx and SocketCluster (via "forever"). MySQL, Redis, etc. run on their own dedicated servers.
How is Redis and MySQL performance?
MySQL is not called when answers come in, only Redis, which runs on a dedicated 8-core Intel i7 with an SSD and 64 GB DDR4 RAM. We've run millions of requests against Redis in under a second; Redis is extremely performant.
I'm having the same problem and I've spent days working on it. I do not, by any means, have the same traffic as the OP. I'm still developing the application that uses SC. However, I've been trying to isolate the problem by measuring the time it takes for the client's request to arrive at the server.
My method of tracking latency is not super accurate, but it's accurate enough for me to establish that the problem does not exist in any code that I've written. My server is not very powerful either; it doesn't need to be, for multiple reasons.
Regardless of these factors, the responsiveness and performance of client/server communication has not been consistent. My investigation has shown that the lag I'm seeing can appear in one or both directions: the client's message reaching the server, or the server's message reaching the client.
The performance problems are intermittent. The server, under normal load, hasn't once gone over 15% of its overall CPU capacity. RAM consumption usually hovers around 17% and doesn't show any wild fluctuations. The SC-run application is behind an NGINX reverse proxy, which has been updated, reconfigured, and tested to see if it was responsible for the performance issues; at this point NGINX has been eliminated as the cause of the intermittent slowdowns.
This application has been on SC 9.3.1, 11.1.0, 11.3.1, and now the latest 13.1.3. Only recently has fixing this irregularity in performance become a priority for me, so I've been trying to track down the problem since 11.3.1.
The Server's configuration is as follows:
Typical ping time is 70 ms to the server from client computers that have experienced the slow down. Apache Benchmark hitting the /health-check route on the server shows the following.
Connection Times (ms)
min mean[+/-sd] median max
Connect: 415 934 243.8 901 2136
Processing: 77 190 115.7 144 619
Waiting: 77 155 71.5 134 496
Total: 575 1124 240.3 1063 2235
Lastly, I'm measuring the execution times of all server operations as part of my investigation; they are not responsible for the slowdowns in SC traffic.
@MatthewAry it looks like your HTTP requests are slow as well. Is it also slow when you have no users on it? Could it be that your cloud provider is throttling your bandwidth? It looks like you're using a small machine.
What happens if you run a plain Node.js HTTP server next to your SC instance and try to connect to it? Is it also slow? What WebSocket wsEngine are you using? What version? How does it behave if you use ws as the wsEngine?
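For what it's worth, a plain Node.js HTTP server for that kind of side-by-side test can be as small as this (the port is an arbitrary choice):

```js
// Minimal plain Node.js HTTP server to benchmark next to the SC instance.
const http = require('http');

http.createServer((req, res) => {
  res.writeHead(200, { 'Content-Type': 'text/plain' });
  res.end('ok');
}).listen(8080, () => {
  console.log('plain HTTP test server listening on 8080');
});
```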
@Hirbod When you say that CPU % is 0.2%, is that at the peak? I suspect that for your use case you would have very low CPU usage and then a significant peak every 10 seconds or so (when all the answers are sent from clients at almost the same time).
Watching "top" live, I've never seen the CPU load going over 1%. Memory also 61GB free from 64GB. Using "speedometer" to watch the network-interface, there is an output around 1Mbit and input about 1 Mbit.
The server is basically sleeping and has a lot of headroom.
It's not the cloud provider. I've seen it degrade on ws, sc-uws, and uws. I haven't tried running a plain Node.js HTTP server next to it. Yes, it's a small machine, but what it's doing isn't too demanding either. Google's infrastructure ties instance bandwidth to the number of vCPUs the instance has (we have 1).
The egress throughput cap is dependent on the number of vCPUs that a virtual machine instance has. Each vCPU has a 2 Gbps egress cap for peak performance. Each additional vCPU increases the network cap, up to a theoretical maximum of 16 Gbps for each virtual machine. For example, a virtual machine instance with 4 vCPUs has an intra-zone network throughput cap of 2 Gbps × 4 = 8 Gbps. A virtual machine instance with 8 vCPUs has an intra-zone network throughput cap of 2 Gbps × 8 = 16 Gbps.
Specific workloads and benchmarks are regularly measured:
- A single stream performance test can achieve a maximum of 8.5 Gbps on average.
- A multistream performance test can achieve a maximum of 15 Gbps on average.
On the same server, we are serving up static files for the front end of the application using NGINX; its performance, as tested using Apache Benchmark, is as follows.
Connection Times (ms)
min mean[+/-sd] median max
Connect: 302 932 338.9 856 5595
Processing: 74 161 88.1 126 399
Waiting: 74 126 51.0 104 321
Total: 412 1093 344.9 999 5793
Keep in mind that while doing my benchmarks, I'm hitting the server with 100 concurrent connections at a time.
I'm currently using sc-uws, which is at version 10.148.0 according to package-lock.json.
@Hirbod are you guys on Google? AWS? nginx?
Trying to find the common denominator.
@happilymarrieddad nope, self-hosted, on hetzner. https://www.hetzner.com/dedicated-rootserver/matrix-ax?country=gb
Using the AX60-SSD. I was starting to think that AMD sucks and that we should switch to an Intel Xeon E3...
@Hirbod Many cloud providers throttle CPU and bandwidth on smaller instances; sometimes it doesn't take much to trigger it. Not sure that's the problem but worth trying with a different instance to see if it makes a difference.
@jondubois It's a dedicated instance (not a shared vCPU) with guaranteed memory and computational power. Before we were on a shared instance where that could be the case, but we moved to a dedicated instance to eliminate the possibility even though our resource consumption was not high enough to justify the move.
In any case, @happilymarrieddad works downstairs from me and he's got me trying out one more thing with NGINX. I'll let you know how that goes.
Please share your findings, @MatthewAry, if it helps. I also adjusted rlimit_nofile, set "tcp_nodelay" to "off", and added "proxy_buffering off;" and "proxy_request_buffering off;". Tomorrow we will see how it goes.
Hahaha. @happilymarrieddad is hoping that my rlimit_nofile value is not high enough. So that's what I'm testing.
Nope, that didn't fix it.
Epic fail... dang it
In any case, @happilymarrieddad works downstairs
Haha good teamwork ;p
@Hirbod I just did a test on Amazon EC2 with 16K concurrent sockets subscribed to the same channel on an 8 core machine (7 workers, 1 broker). SocketCluster version 13.1.5 (using sc-uws as the wsEngine).
I published a message to the channel once every 10 seconds:
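Roughly speaking, the test loop looks like this (the channel name and payload are illustrative, not the exact benchmark code):

```js
// In worker.js: publish a timestamped message to one channel every 10 seconds.
setInterval(() => {
  worker.exchange.publish('testChannel', { sentAt: Date.now() });
}, 10000);

// On each simulated client: subscribe and log the delivery delay.
// (The delay is only meaningful if the two machines' clocks are roughly in sync.)
const channel = socket.subscribe('testChannel');
channel.watch((data) => {
  console.log('delivery delay (ms):', Date.now() - data.sentAt);
});
```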
The machine which I used to simulate the clients had 32 cores and was in the same region (US) as the server machine but in a different availability zone.
You may want to double-check that you're not accidentally publishing more messages than intended; for example if you publish from inside an event handler which is accidentally attached multiple times.
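To illustrate the kind of bug being described (the emitter and channel names here are hypothetical):

```js
// BAD: the listener is registered once per connected socket, so a single
// question ends up being published N times (once per attachment).
scServer.on('connection', (socket) => {
  questionEmitter.on('newQuestion', (question) => {
    scServer.exchange.publish('questions', question);
  });
});

// Better: attach the publish logic exactly once, outside any per-connection
// or per-event handler.
questionEmitter.on('newQuestion', (question) => {
  scServer.exchange.publish('questions', question);
});
```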
Did you do all of this behind an nginx proxy using SSL? I would love to know how that performs. I am sure there are no multiply attached event handlers; this has been tested very thoroughly inside our app (we had that issue before but fixed it).
@jondubois one more thing to mention: we're using "forever", 14 workers, 2 brokers, a single machine, and we're also doing a lot of logging. Every received message on the server is stored into Redis as soon as it has been processed.
I should have mentioned that we're using our NGINX reverse proxy to SSL the connection to the application too.
It sounds like this issue could be related to nginx proxy_buffering; see https://stackoverflow.com/questions/9612983/socket-io-slow-response-after-using-nginx
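For anyone hitting the same thing, the relevant part of an nginx WebSocket proxy block looks roughly like this (upstream address and timeout are placeholders, not a definitive config):

```nginx
location / {
    proxy_pass http://127.0.0.1:8000;        # SC instance (placeholder address)
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;  # allow the WebSocket upgrade
    proxy_set_header Connection "upgrade";
    proxy_set_header Host $host;
    proxy_buffering off;                     # the setting discussed above
    proxy_read_timeout 86400s;               # keep long-lived sockets open
}
```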
Well, I can try turning proxy buffering off entirely. @happilymarrieddad has his on and he hasn't seen any of the problems that I've been having.
Correction: @happilymarrieddad doesn't have proxy buffering on. I've turned off proxy buffering on my machines. I'm in the process of collecting data to see if the issue is resolved.
@jondubois are you using SC without nginx? Would you recommend dropping nginx as a proxy and having Node.js listen directly on port 443? Could this help performance? From my understanding, this should be avoided and connections should always be proxied through nginx. Should we use a dedicated server just for proxying connections instead of running nginx on the same machine?
@Hirbod Have you observed the issue since you turned off proxy buffering?
We had so many other issues with our app today that I didn't have any chance to check it deeply enough, but from a quick look it seems like it's working so far.
We have a dedicated loadbalancer (nginx) and everything else in the network is on their own respective instance. We haven't noticed any of these problems. We are also using SCC if that makes a difference. 1 scc-state server and 2 scc-broker servers.
@Hirbod I'm pretty sure that a lot of companies use nginx in front of SC; it just seems to be a bit more work to configure properly. I have some experience with HAProxy and node-http-proxy and they were pretty easy to set up with SC/WebSockets.
Running SC without a load balancer is also good sometimes; one advantage is that it uses less CPU overall but one downside is that you miss out on features that a load balancer might provide in terms of firewall, rate limiting, routing, etc...
I think that many realtime online games don't use load balancers; often rate limiting is handled explicitly in the backend logic along with other logic to prevent players from cheating. Middleware in SC can be useful for this.
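As a rough illustration of what that kind of middleware-based rate limiting might look like (the middleware signature has changed across SC versions; this sketch assumes the request-object form used by recent releases, and the limits are arbitrary):

```js
const WINDOW_MS = 1000;   // measurement window per socket
const MAX_EMITS = 20;     // allowed emits per socket per window

scServer.addMiddleware(scServer.MIDDLEWARE_EMIT, (req, next) => {
  const now = Date.now();
  const state = req.socket.rateState || { windowStart: now, count: 0 };
  if (now - state.windowStart > WINDOW_MS) {
    state.windowStart = now;
    state.count = 0;
  }
  state.count += 1;
  req.socket.rateState = state;

  if (state.count > MAX_EMITS) {
    next(new Error('Rate limit exceeded'));  // block this emit
  } else {
    next();                                  // let it through
  }
});
```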
So, turning off proxy_buffering seemed to fix it for me. Furthermore, there was a memory leak inside my app (besides the trivia feature, we have a live chat integrated): 500 incoming messages per second with an immediate DOM append caused very heavy reflow, which pushed Android CPUs to their limit and made the answers transmit very, very slowly. After fixing these issues, I can't see any delayed messages anymore.
@MatthewAry have a look here https://github.com/emqtt/emqttd/wiki/linux-kernel-tuning
The 1 Million Socket sysctl.conf in that wiki looks totally wrong to me, and it also has some security issues imho.
I chose @behrad's recommendation as it mostly matches other docs I've seen (like https://blog.jayway.com/2015/04/13/600k-concurrent-websocket-connections-on-aws-using-node-js/).
I don't see any issues anymore
Well, I've got Proxy Buffering off, but I'm still seeing issues. I'm going to look into this kernel tuning stuff.
Keep in mind that calling "sysctl -p" won't propagate everything immediately; tw_reuse needs a reboot, for example.
So the best thing is to tweak and reboot.
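For context, the kind of sysctl.conf entries those tuning guides suggest look roughly like this (values are illustrative examples only, not a recommendation):

```
# allow more open file descriptors system-wide
fs.file-max = 1000000
# larger listen backlog for incoming connections
net.core.somaxconn = 65535
# widen the ephemeral port range
net.ipv4.ip_local_port_range = 1024 65535
# clean up closed sockets faster
net.ipv4.tcp_fin_timeout = 30
# the tw_reuse setting mentioned above
net.ipv4.tcp_tw_reuse = 1
```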
Has this been fixed for everybody?
I don't know yet. It's gonna be awhile until I find out for sure. My priorities have changed for the moment.
@MatthewAry do you connect all clients at once, or is the distribution even?
If possible, it would be nice to see how this behaves if you turn off SSL. I suspect it might be related to low entropy available on the machine, so new SSL connections can't be made quickly enough.
@toredash I work downstairs from him and I can tell you he's using letsencrypt so SSL is turned on. Our servers are handling a lot of users and we haven't seen this issue. We bought our SSL from comodo but I'm pretty sure he's using a very similar nginx config to me.
Still seeing the issue. I'm working on some other things but as a side effect of that work I'm still seeing the slowness.
:point_up: @matetukacs message in the Gitter room June 5, 2018 6:01 PM
I guess it's possible your issues are related. It's just an assumption though. I feel like you've already tried many things to find the culprit. The only question that comes to mind - did you try to:
Yes, I did all of that. I determined that the lag does not exist in my server-side code or in my client-side code, but in the mechanisms that facilitate communication between client and server. I added timestamps on the client and the server and compared them. I also measured execution times of my server-side code. The slowdown happens when I hand things off to SocketCluster to send to the client/server. I can see the slowdown in the client's message reaching the server or in the server's message reaching the client. I've also tried to eliminate the possibility that the issue is network related, but my tests show good ping and jitter values. So in the end, I'm not sure where the hold-up is.
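For anyone wanting to reproduce that kind of measurement, a simplified version of the timestamp comparison looks like this (the event name is made up, and the one-way number is only meaningful if client and server clocks are reasonably in sync):

```js
// Client side: send the current time and measure the round trip.
socket.emit('latency-test', { clientSentAt: Date.now() }, (err, reply) => {
  if (err) return console.error(err);
  console.log('round trip (ms):', Date.now() - reply.clientSentAt);
});

// Server side (worker.js): log the one-way delay and echo the timestamps back.
socket.on('latency-test', (data, respond) => {
  console.log('client -> server (ms):', Date.now() - data.clientSentAt);
  respond(null, { clientSentAt: data.clientSentAt, serverReceivedAt: Date.now() });
});
```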
Hi everybody,
you may have already seen some issues from me, and you might know that we run a real-time trivia app. Our app is currently getting a lot of attention and the first issues are arising.
Our trivia app publishes a JSON object to all connected users. It's just a question and 3 answers. This works great; we currently have 2,500 users playing live. From time to time we receive mails from people complaining that "they did not receive a question".
My first question (let's assume they did not disconnect and every watcher is working fine): is it possible that a publish won't be delivered to a user? This is a serious question.
The second and much bigger issue is: currently, the user has 10 seconds to pick an answer. After 10 seconds, the user can't emit to the server (blocked inside the app). But the user could choose their answer at 9.99 seconds. On the server (worker.js) we've added another 3500 ms of "latency" buffer, meaning we will still accept a user's answer when it arrives up to 13.5 seconds after publishing.
Today, we had around 2,500 live users playing (we will probably have 50,000 users in about 30 days). We received angry mails because their answers weren't accepted. After checking our logs, we realized that the latest message arrived 28 seconds later.
So 2,500 users chose an answer 1-10 seconds after publishing. How can it be possible that the last message arrived 28 seconds after emitting? Is this common? Are we doing something wrong?
We've seen that 99% of messages get checked against Redis in time, but around 1% of the answers were processed after 15 seconds, some at 18, and the longest took, as already said, 28 seconds.
So every message that arrives more than 13,500 ms after the publish event will be declined.
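For reference, the cutoff logic described above boils down to something like this (names are made up; this is a simplified sketch, not our actual worker.js code):

```js
const ANSWER_WINDOW_MS = 10000;   // time the client has to answer
const LATENCY_BUFFER_MS = 3500;   // extra grace period for network delay

// questionPublishedAt is recorded when the question is published.
socket.on('answer', (data, respond) => {
  const elapsed = Date.now() - questionPublishedAt;
  if (elapsed > ANSWER_WINDOW_MS + LATENCY_BUFFER_MS) {
    // Arrived more than 13.5 s after the publish: decline it.
    respond('Answer arrived too late');
    return;
  }
  // ...otherwise check the answer against Redis and store it.
  respond();
});
```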
What shall we do? We're pretty lost on this. I've seen issues suggesting it might be a good idea to put a setTimeout of around 1-500 ms in the PUBLISH_OUT middleware, but from our testing, every message published from the server arrived nearly in real time. Speaking for myself, I've never had an issue with delayed answers when playing or testing, but the logs are real and so is the issue.
Any ideas, advice, thoughts?