bsergean opened this issue 4 years ago
My problem is that my cluster seems to be down, I have one node which is in fail state.
b50e18d0b2ae6cd597d351f46734656e4533f22d 172.25.225.42:6379 master,fail
@bsergean Do you have the entire log?
Sorry, this was a terrible and not very actionable bug report. I'm back in business now, and I've restarted the proxy ... so the log is gone. I think I could get to it, but I'm not savvy enough with OpenShift to know how to retrieve previous logs.
FYI I fixed the bad node by having other nodes forget it with
redis-cli -c -h 172.24.244.119 CLUSTER FORGET b50e18d0b2ae6cd597d351f46734656e4533f22d
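For reference, here's a quick sketch of how the failed node's ID can be pulled out of `CLUSTER NODES` output. The node line is inlined from my cluster above so the snippet stands alone; normally you'd pipe `redis-cli -h <node> CLUSTER NODES` in directly.

```shell
# Extract the ID of any node flagged "fail" from CLUSTER NODES output.
# The sample line below is the failed node from my cluster.
nodes='b50e18d0b2ae6cd597d351f46734656e4533f22d 172.25.225.42:6379 master,fail - 0 0 1 connected'
failed_id=$(printf '%s\n' "$nodes" | awk '$3 ~ /fail/ {print $1}')
echo "$failed_id"
```

Note that every remaining node keeps its own view of the cluster, so `CLUSTER FORGET` has to be run against each healthy node, not just one.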
When the cluster was unhealthy, the proxy would try to start and fail like this, which must be the normal behavior.
[2020-03-25 14:46:58.313/M] Starting 8 threads...
[2020-03-25 14:46:58.313/M] Fetching cluster configuration...
FATAL: failed to connect to Redis Cluster
[2020-03-25 14:46:58.316/M] Could not connect to Redis at 172.24.244.119:1379: Connection refused
[2020-03-25 14:46:58.316/M] ERROR: Failed to fetch cluster configuration!
[2020-03-25 14:46:58.316/M] FATAL: failed to create thread 0.
Something which might be interesting too is that I got a 'max clients reached' error at some point. I'm not sure whether this is an application-level bug (there have been a lot in the past), or if a leak can happen in the proxy too.
What do you think would be a good way to monitor this? I know that the INFO command on the proxy does not work, so maybe I can count the network connections externally?
Here is the exact error that the application (Python) received from talking to the proxy: 'ERR max number of clients reached'. It could be an application-level problem combined with the redis-cluster being down.
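One way to count connections externally, as a sketch: each client connection holds an open file descriptor, so counting the proxy process's fds under /proc gives a rough connection count. `self` stands in for the proxy's PID here.

```shell
# Count open file descriptors of a process via /proc; substitute the
# proxy's PID for `self` in practice. Each client connection uses one fd,
# so the count tracks (but slightly exceeds) the number of connections.
fd_count=$(ls /proc/self/fd | wc -l)
echo "$fd_count"
```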
@bsergean I'm currently working on a new feature that lets the proxy start even if the cluster is down. In that case, the proxy would start and clients could connect to it, but they would receive a "CLUSTERDOWN The cluster is down" error. The proxy periodically checks if the cluster goes up and then it would connect to it normally.
As for the ERR max number of clients reached error, it could be the proxy itself, or maybe the cluster. Consider that both the proxy and Redis have their own max-clients property that can be set via the command line or config file, but that is capped by Redis or by the proxy depending on the ulimit of the system. You can check the file limit on your system via the ulimit UNIX command.
Cool. The CLUSTERDOWN thing looks like a great feature. I think the ulimit file descriptor limits are insanely high on those machines, so I believe it's coming from the proxy or the cluster.
Also, on a different note, I hope that everything is fine for you in Italy. The whole world is learning to live from home; I'm watching exercise videos on YouTube because I eat too much :)
I got a similar error, but now I'm seeing a message about the connection pool as well. I can see it with a lot of different IP addresses.
[2020-04-18 22:04:12.902/5] Populate connection pool: failed to install write handler for node 172.26.163.119:6379
[2020-04-18 22:04:12.902/5] Could not create read handler: No such file or directory
I tried to look at one server and it had a lot of open file descriptors, and I don't know if that's the application level creating too many connections.
/proc/1/fd $ find . | wc -l
2359
/proc/1/fd $ ls
0 1095 1192 129 1387 1484 1581 1799 1905 2001 21 2197 26 343 3633 3730 432 53 627 724 821 919
1 1096 1193 1290 1388 1485 1582 18 1906 2002 210 2198 260 344 3634 3731 433 530 628 725 822 92
10 1097 1194 1291 1389 1486 1583 180 1907 2003 2100 2199 261 345 3635 3732
However the machine has a pretty high ulimit, but it's running in Kubernetes, so this could be a shared resource, and the numbers could include the cluster's connections.
~ $ ulimit -n
1048576
The part that is interesting is that I got degraded performance over the weekend, and it seems to have been linked to either a memory or a CPU limit being reached. The proxy runs on Kubernetes and there are resource limits. I might have done a resharding when this happened, and I'm going to experiment with doing more of those to see if they could lead to leaks. Have you played with resharding while the proxy is running, and seen how memory is affected?
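For the leak experiment, a sketch of what I plan to do: poll the proxy's resident memory from /proc before, during, and after a resharding and watch for a step change. `self` stands in for the proxy's PID here.

```shell
# Read a process's resident set size (VmRSS, in kB) from /proc to spot
# memory growth during a resharding; substitute the proxy's PID for `self`.
rss_kb=$(awk '/^VmRSS/ {print $2}' /proc/self/status)
echo "$rss_kb"
```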
I think that the memory graph is really interesting.
When this happened I gave more cpu and memory to the server, and so far so good.
(some picture after the restart with more resources)
The memory going from 5G to 15G all of a sudden is really super suspicious.
The blue line on the CPU graph represents an upper bound, after which CPU usage is throttled by Kubernetes. Some links here; I never tried too hard to understand the difference between requests and limits.
My cluster seems OK, no restarts of any nodes in the cluster. I have enabled the 'multiple-endpoint' feature.
I am at this commit: