bsergean opened this issue 4 years ago
My problem is that my cluster seems to be down, I have one node which is in fail state.
b50e18d0b2ae6cd597d351f46734656e4533f22d 172.25.225.42:6379 master,fail
@bsergean Do you have the entire log?
Sorry, this was a terrible and not very actionable bug report. I'm back in business now, and I've restarted the proxy ... so the log is gone. I think I could get to it, but I'm not savvy enough with OpenShift to know how to retrieve previous logs.
FYI I fixed the bad node by having other nodes forget it with
redis-cli -c -h 172.24.244.119 CLUSTER FORGET b50e18d0b2ae6cd597d351f46734656e4533f22d
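For reference, here's a quick sketch of how the failed node's ID can be pulled out of `CLUSTER NODES` output. The node line is inlined from my cluster above so the snippet stands alone; normally you'd pipe `redis-cli -h <node> CLUSTER NODES` in directly.

```shell
# Extract the ID of any node flagged "fail" from CLUSTER NODES output.
# The sample line below is the failed node from my cluster.
nodes='b50e18d0b2ae6cd597d351f46734656e4533f22d 172.25.225.42:6379 master,fail - 0 0 1 connected'
failed_id=$(printf '%s\n' "$nodes" | awk '$3 ~ /fail/ {print $1}')
echo "$failed_id"
```

Note that every remaining node keeps its own view of the cluster, so `CLUSTER FORGET` has to be run against each healthy node, not just one.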
When the cluster was unhealthy, the proxy would try to start and fail like this, which must be the normal behavior.
[2020-03-25 14:46:58.313/M] Starting 8 threads...
[2020-03-25 14:46:58.313/M] Fetching cluster configuration...
FATAL: failed to connect to Redis Cluster
[2020-03-25 14:46:58.316/M] Could not connect to Redis at 172.24.244.119:1379: Connection refused
[2020-03-25 14:46:58.316/M] ERROR: Failed to fetch cluster configuration!
[2020-03-25 14:46:58.316/M] FATAL: failed to create thread 0.
Something which might be interesting too is that I got a 'max clients reached' error at some point. I'm not sure whether this is an application-level bug (there have been a lot in the past), or if a leak can happen in the proxy too.
What do you think would be a good way to monitor this? I know that the INFO command on the proxy does not work, so maybe I can count the network connections externally?
Here is the exact error that the application (Python) received from talking to the proxy: 'ERR max number of clients reached'. It could be an application-level problem combined with the redis-cluster being down.
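One way to count connections externally, as a sketch: each client connection holds an open file descriptor, so counting the proxy process's fds under /proc gives a rough connection count. `self` stands in for the proxy's PID here.

```shell
# Count open file descriptors of a process via /proc; substitute the
# proxy's PID for `self` in practice. Each client connection uses one fd,
# so the count tracks (but slightly exceeds) the number of connections.
fd_count=$(ls /proc/self/fd | wc -l)
echo "$fd_count"
```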
@bsergean I'm currently working on a new feature that lets the proxy start even if the cluster is down. In that case, the proxy would start and clients could connect to it, but they would receive a "CLUSTERDOWN The cluster is down" error. The proxy periodically checks if the cluster goes up and then it would connect to it normally.
As for the ERR max number of clients reached error, it could be the proxy itself, or maybe the cluster. Consider that both the proxy and Redis have their own max-clients property that can be set via the command line or config file, but that is capped by Redis or by the proxy depending on the ulimit of the system. You can check the file limit on your system via the ulimit UNIX command.
Cool. The CLUSTERDOWN thing looks like a great feature. I think the ulimit file descriptor limits are insanely high on those machines, so I believe it's coming from the proxy or the cluster.
Also, on a different note, I hope that everything is fine for you in Italy. The whole world is learning to live from home; I'm watching exercise videos on YouTube because I eat too much :)
I got a similar error, but now I'm seeing a message about the connection pool as well. I can see it with a lot of different IP addresses.
[2020-04-18 22:04:12.902/5] Populate connection pool: failed to install write handler for node 172.26.163.119:6379
[2020-04-18 22:04:12.902/5] Could not create read handler: No such file or directory
I tried to look at one server and it had a lot of open file descriptors, and I don't know if that's the application level creating too many connections.
/proc/1/fd $ find . | wc -l
2359
/proc/1/fd $ ls
0 1095 1192 129 1387 1484 1581 1799 1905 2001 21 2197 26 343 3633 3730 432 53 627 724 821 919
1 1096 1193 1290 1388 1485 1582 18 1906 2002 210 2198 260 344 3634 3731 433 530 628 725 822 92
10 1097 1194 1291 1389 1486 1583 180 1907 2003 2100 2199 261 345 3635 3732
However the machine has a pretty high ulimit, but it's running in Kubernetes, so this could be a shared resource, and the numbers could include the cluster's connections.
~ $ ulimit -n
1048576
The part that is interesting is that I got degraded performance over the weekend, and it seems to have been linked to either a memory or a CPU limit being reached. The proxy runs on Kubernetes and there are resource limits. I might have done a resharding when this happened, and I'm going to experiment with doing more of those to see if they could lead to leaks. Have you played with resharding while the proxy is running, and seen how memory is affected?
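For the leak experiment, a sketch of what I plan to do: poll the proxy's resident memory from /proc before, during, and after a resharding and watch for a step change. `self` stands in for the proxy's PID here.

```shell
# Read a process's resident set size (VmRSS, in kB) from /proc to spot
# memory growth during a resharding; substitute the proxy's PID for `self`.
rss_kb=$(awk '/^VmRSS/ {print $2}' /proc/self/status)
echo "$rss_kb"
```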
I think that the memory graph is really interesting.
When this happened I gave more cpu and memory to the server, and so far so good.
(some picture after the restart with more resources)
The memory going from 5G to 15G all of a sudden is really super suspicious.
The blue line on the CPU graph represents an upper bound, after which CPU usage is throttled by Kubernetes. Some links here; I never tried too hard to understand the difference between requests and limits.
My cluster seems OK, no restarts of any nodes in the cluster. I have enabled the 'multiple-endpoint' feature.
I am at this commit: