Grokzen / redis-py-cluster

Python cluster client for the official redis cluster. Redis 3.0+.
https://redis-py-cluster.readthedocs.io/
MIT License

Network is unreachable with Kubernetes #469

Closed: abhi314 closed this issue 2 years ago

abhi314 commented 2 years ago

I am running the code below in an EKS pod.

redis-py-cluster = "2.1.3"
----------------------------
from redis import Redis
from rediscluster import RedisCluster

host = "vpce-XXX.us-east-1.vpce.amazonaws.com"
port = 6379


def check_redis():
    # Default both results to 'error' so a failed connection shows up in the response.
    respCluster = 'error'
    respRegular = 'error'

    # Cluster-aware client: discovers the node addresses from the cluster itself.
    try:
        rc = RedisCluster(startup_nodes=[{"host": host, "port": port}],
                          decode_responses=True, skip_full_coverage_check=True)
        respCluster = rc.get('ABC')
    except Exception as e:
        print(e)

    # Plain client: only ever talks to the given host and port.
    try:
        r = Redis(host=host, port=port, decode_responses=True)
        respRegular = r.get('ABC')
    except Exception as e:
        print(e)

    return {"respCluster": respCluster, "respRegular": respRegular}

The Redis cluster itself is hosted in AWS ElastiCache in another account. I am using this method to access it.

The code runs in a Kubernetes cluster in a different account.

The response I am getting is

{'respCluster': 'error', 'respRegular': '123456789'}

And the error I am getting is

redis.exceptions.ConnectionError: Error 101 connecting to XX.XXX.XX.XXX:6379. Network is unreachable
rediscluster.exceptions.ClusterError: TTL exhausted

This is strange, since redis-py connects to the same IP and port without problems. It does not make any sense to me.

Grokzen commented 2 years ago

Hi @abhi314. First, this is technically not an issue but more of a discussion topic, but...

If this is what I suspect, then it is probably one of the common node-routing issues in a Redis cluster. The first thing I recommend is to look into the RedisCluster object after you have created it, once it has successfully run its initialization, talked to your initial cluster nodes, and fetched the cluster configuration. After that you can inspect the variables inside the nodemanager object, especially these https://github.com/Grokzen/redis-py-cluster/blob/master/rediscluster/nodemanager.py#L40, and see what they say. Those variables contain the slots configuration that the client will use to access all the nodes in the cluster.
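As a rough sketch of that inspection, the helper below flattens the node and slot tables into a list of distinct addresses. It assumes the 2.1.x layout, where the NodeManager is reachable as `rc.connection_pool.nodes` and its `nodes` and `slots` attributes hold dicts with `host` and `port` keys. The helper name `advertised_addresses` is made up for illustration, so check the attribute names against your installed version.

```python
def advertised_addresses(nodes, slots):
    """Collect every distinct (host, port) pair the cluster advertised.

    `nodes` is assumed to map a node name to a dict with "host" and "port";
    `slots` is assumed to map a slot number to a list of such dicts.
    """
    seen = set()
    for node in nodes.values():
        seen.add((node["host"], int(node["port"])))
    for slot_nodes in slots.values():
        for node in slot_nodes:
            seen.add((node["host"], int(node["port"])))
    return sorted(seen)


# Hypothetical usage against a live client (attribute path is an assumption):
#   rc = RedisCluster(startup_nodes=[...], skip_full_coverage_check=True)
#   manager = rc.connection_pool.nodes
#   print(advertised_addresses(manager.nodes, manager.slots))
```

If the printed addresses differ from the endpoint you passed in startup_nodes, the client is being told to connect somewhere other than where you pointed it.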

So what I think is going on here is a common cluster routing issue inside redis-server itself. When you create a Redis cluster, you point 2 to n nodes at each other and they do a handshake in order to initialize the cluster and to exchange the IP:port pairs that each node needs in order to reach the other nodes. These values, whether they are ip:port or dns:port, are what redis-server stores and sends back to cluster clients when they connect and send the CLUSTER INFO and CLUSTER SLOTS commands to get the current cluster state. Note that this is not necessarily the IP:port pair that a client needs in order to connect. It is what the cluster nodes think they need in order to reach each other, and that same information is what gets exposed to clients.

So what I think is happening in your wonky setup is that your client gets a set of IPs or DNS names from redis-server that it cannot reach, and then tries to use them to connect to the nodes when you run your command. When that fails about 16 times, it throws the TTL error.

So begin by inspecting your cluster client and its internal variables to see what state the client is really in, and determine whether the ip:port or dns:port pairs exposed to your client are actually reachable from where the client runs.
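One way to run that reachability check from inside the pod is a plain TCP connect against each advertised address, using only the standard library. This is a minimal sketch: the function name `unreachable` and the two-second timeout are arbitrary choices, and a successful TCP connect only shows that the network path is open, not that Redis is healthy.

```python
import socket


def unreachable(addresses, timeout=2.0):
    """Return (host, port, error) for every address that refuses a TCP connect."""
    failures = []
    for host, port in addresses:
        try:
            # create_connection resolves DNS names and applies the timeout.
            with socket.create_connection((host, port), timeout=timeout):
                pass
        except OSError as exc:
            # Covers refused connections, timeouts, DNS failures, and
            # "Network is unreachable" (errno 101).
            failures.append((host, port, str(exc)))
    return failures


# Hypothetical usage with placeholder addresses:
#   for host, port, err in unreachable([("10.0.0.1", 6379)]):
#       print(f"{host}:{port} -> {err}")
```

Any address that shows up in the result is one the cluster client will fail on, even if the startup endpoint itself works.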

If they are not reachable, you either have to build a simpler AWS setup without the janky account jump, or you can look into the feature added in recent releases here https://github.com/Grokzen/redis-py-cluster/blob/3b68c18810c2e8cea20d7e900064b1f8ec811260/docs/client.rst#host-port-remapping where, if you know what you want to rebind from and to, you can rebind whatever redis-server sends back to you. Note that this feature is not super strong: it works mostly for simple rebinds and does not support complex situations.
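A sketch of what such a remap could look like is below. It assumes the `host_port_remap` keyword and the `from_host`, `from_port`, `to_host`, `to_port` rule keys described in the linked docs section, so verify both against your installed version. The helper names, the 10.0.0.x addresses, and the endpoint name are placeholders, not values from this thread.

```python
def build_remap_rules(internal_ips, reachable_host, port=6379):
    """Build one remap rule per internal node IP, all pointing at one reachable host."""
    return [
        {
            "from_host": ip,             # address advertised by the cluster
            "from_port": port,
            "to_host": reachable_host,   # address the client can actually reach
            "to_port": port,
        }
        for ip in internal_ips
    ]


def connect_with_remap(startup_host, rules, port=6379):
    """Create a cluster client that applies the remap rules (keyword is an assumption)."""
    # Imported lazily so the rule builder can be used without the library installed.
    from rediscluster import RedisCluster

    return RedisCluster(
        startup_nodes=[{"host": startup_host, "port": port}],
        decode_responses=True,
        skip_full_coverage_check=True,
        host_port_remap=rules,
    )


# Hypothetical usage with placeholder values:
#   rules = build_remap_rules(["10.0.0.1", "10.0.0.2"], "vpce.example.internal")
#   rc = connect_with_remap("vpce.example.internal", rules)
```

Mapping several nodes onto one host and port only makes sense when that endpoint can serve every slot the client asks for. With more than one shard, each node probably needs its own distinct target.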

Another thing you can do is look at the exceptions or the debug loggers for this lib and see what you get out of them, if you are on the 2.1.x version track.
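A minimal way to turn on those debug loggers with the standard `logging` module might look like this. The logger name `rediscluster` is an assumption based on the usual convention of naming loggers after the package, and the helper name is made up for illustration.

```python
import logging


def enable_cluster_debug_logging(logger_name="rediscluster"):
    """Send DEBUG records from the given logger tree to stderr."""
    # Attaches a stderr handler to the root logger if none is configured yet.
    logging.basicConfig(format="%(asctime)s %(name)s %(levelname)s %(message)s")
    logger = logging.getLogger(logger_name)
    # Child loggers such as "rediscluster.nodemanager" inherit this level.
    logger.setLevel(logging.DEBUG)
    return logger


# Call this before creating the RedisCluster client so that the
# initialization and node-discovery messages are captured as well.
```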

abhi314 commented 2 years ago

@Grokzen Thank you, you were right. The issue was resolved using host port remapping. The code was trying to access the actual IP from account A instead of the DNS name from account B.