RedisCluster fails to connect to an AWS Elasticache Cluster running Redis 7.1.0

trentbitterman commented 7 months ago

Expected Behaviour

Initializing a RedisCluster object using RedisCluster.from_url followed by calling the initialize method on the RedisCluster object should result in a successful connection to the Redis cluster.

Current Behaviour

The initialization fails with the following error message:

coredis.exceptions.RedisClusterException: Redis Cluster cannot be connected. Please provide at least one reachable node. Underlying errors:
--
  | - Unable to use RESP3 due to missing `HELLO` implementation the server. Use `protocol_version=2` when constructing the client. [AWS cluster configuration endpoint]

Steps to Reproduce

conn = RedisCluster.from_url(
        redis_url,
        ssl_context=ssl_context,
        password=password,
        connect_timeout=1.0,
    )
await conn.initialize()

Your Environment

coredis version: 4.16.0
Redis version: Elasticache 7.1, which AWS says is equivalent to OSS Redis 7.0
Operating system: Ubuntu 22.04 x86_64
Python version: 3.11.8

This issue started appearing after upgrading our Redis cluster to version 7.1.0. We were previously using version 6.2.6. Other than the cluster version upgrade, no other configuration changes were made in either the cluster or our clients.

I verified myself that the upgrade cluster does in fact support the HELLO command by communicating with it directly. This was the output I got:

HELLO 3 AUTH default password
%7
$6
server
$5
redis
$7
version
$5
7.1.0
$5
proto
:3
$2
id
:72917
$4
mode
$7
cluster
$4
role
$6
master
$7
modules
*0

I compared that output to running the same command against another Redis cluster of ours that we haven't upgraded yet and received the exact same response, aside from the version and id differing.

That error message seems to be triggered when and UnknownCommandError is raised. Looking through the source I couldn't find any other commands that seemed like they would result in an UnknownCommandError in any of our clusters. I did see CLUSTER SLOTS was deprecated in Redis 7.0.0, but I still got a valid response when I tried it.

Here's the full stacktrace we see. Please let me know if there's any other information I can provide that would help. Thanks!

File "/home/appuser/.local/lib/python3.11/site-packages/our_code/redis_wrapper.py", line 78, in create_redis_cluster_connection
    await conn.initialize()
  File "/home/appuser/.local/lib/python3.11/site-packages/coredis/client/cluster.py", line 623, in initialize
    await super().initialize()
  File "/home/appuser/.local/lib/python3.11/site-packages/coredis/client/basic.py", line 328, in initialize
    await self.connection_pool.initialize()
  File "/home/appuser/.local/lib/python3.11/site-packages/coredis/pool/cluster.py", line 162, in initialize
    await self.nodes.initialize()
  File "/home/appuser/.local/lib/python3.11/site-packages/coredis/pool/nodemanager.py", line 271, in initialize
    raise RedisClusterException(
coredis.exceptions.RedisClusterException: Redis Cluster cannot be connected. Please provide at least one reachable node. Underlying errors:
Unable to use RESP3 due to missing `HELLO` implementation the server. Use `protocol_version=2` when constructing the client. [AWS configuration endpoint]

One other piece of information that might be helpful. I also tried setting up the client with protocol_version set to 2, as the error message suggested, and the client seemed to connect correctly.

trentbitterman commented 7 months ago

One other piece of information that may be useful to know. When we upgraded our redis cluster to 7.1.0 all existing connections continued working. They were able to reconnect to the newer version without issue as the new nodes came up. Only brand-new connections that were created after the upgraded experienced this issue.

Let me know if there's any testing I can do in my environment that would be useful. I'm happy to help in any way possible.

alisaifee commented 6 months ago

@trentbitterman I'll try to setup the environment myself to debug this - however, if it is trivially easy for you to create a throw away test environment for me for a short duration, I wouldn't complain :)

alisaifee commented 6 months ago

Managed to reproduce it - and though the error handling / exception message still need to be improved, this 1be7935547fedaaefb9337d38a08130d73ffe26c fixes it in master.

If you could give that a quick try I can do a quick release.

trentbitterman commented 6 months ago

Great, thanks @alisaifee! I'll give it a try and let you know.

trentbitterman commented 6 months ago

@alisaifee I tried that new version in my environment and no longer ran into this issue. Thanks for fixing it!

alisaifee commented 6 months ago

Unfortunately I'm having some issues with clearing CI but will hopefully have a dot release by EOD.

trentbitterman commented 6 months ago

Sounds good, thanks again!

alisaifee commented 6 months ago

Fix available in 4.17.0

trentbitterman commented 6 months ago

Awesome, thanks!

alisaifee / coredis