dragonflydb / dragonfly

A modern replacement for Redis and Memcached
https://www.dragonflydb.io/

Cluster nodes command does not seem to be 100% redis compatible #2726

Closed: applike-ss closed this issue 7 months ago

applike-ss commented 7 months ago

Describe the bug: We are running Flink applications using the lettuce Redis library. It seems to struggle with Dragonfly's response to the CLUSTER NODES command; somehow there seems to be a difference from what it expects.

Output from dragonfly:

$ redis-cli cluster nodes
807df88cb947566738e9fa4689231944385a353b 127.0.0.1:6379@6379 myself,master - 0 0 0 connected 0-16383
c23888188f5d350b552aa8d8aa7ad40a05765b26 10.37.159.206:6379@6379 slave 807df88cb947566738e9fa4689231944385a353b 0 0 0 connected
c23888188f5d350b552aa8d8aa7ad40a05765b26 10.36.8.23:6379@6379 slave 807df88cb947566738e9fa4689231944385a353b 0 0 0 connected

Output from redis:

$ redis-cli cluster nodes
f7b5cf36a5ffc927a2de8b507b880e8cd60f45d3 10.36.54.105:6379@16379,redis-app-leader-2 master - 1710182388055 1710182385544 3 connected
09a46f18f4a8077ced90f4a394fc4f73ecc206e3 10.38.135.168:6379@16379,redis-app-leader-1 master - 0 1710413134000 2 connected 5461-10922
3991f305a18628b81d61e4040206ade513145f39 10.37.234.11:6379@16379,redis-app-leader-0 master - 0 1710413134727 1 connected 0-5460
2ddedc9d0cf086539b81ba1b22c88a7ec35a4f88 10.38.174.196:6379@16379,redis-app-follower-0 myself,slave 3991f305a18628b81d61e4040206ade513145f39 0 1710413133000 1 connected
273c6030cf5044e27675919b5317e3eeec880e91 10.36.123.79:6379@16379,redis-app-follower-2 master - 0 1710413134000 4 connected 10923-16383
147b1aa03a7f4552098f3139e8fb0924e6a20689 10.37.195.111:6379@16379,redis-app-follower-1 slave 09a46f18f4a8077ced90f4a394fc4f73ecc206e3 0 1710413135028 2 connected

The exception from lettuce:

io.lettuce.core.RedisConnectionException: Unable to establish a connection to Redis Cluster
    at io.lettuce.core.cluster.RedisClusterClient.lambda$assertInitialPartitions$26(RedisClusterClient.java:922)
    at io.lettuce.core.cluster.RedisClusterClient.get(RedisClusterClient.java:941)
    at io.lettuce.core.cluster.RedisClusterClient.assertInitialPartitions(RedisClusterClient.java:921)
    at io.lettuce.core.cluster.RedisClusterClient.connect(RedisClusterClient.java:398)
    at io.lettuce.core.cluster.RedisClusterClient.connect(RedisClusterClient.java:375)
    at io.justtrack.flink.annotators.repositories.StoreRedis.connect(StoreRedis.java:205)
    at io.justtrack.flink.annotators.repositories.StoreRedis.getRedis(StoreRedis.java:173)
    at io.justtrack.flink.annotators.repositories.StoreRedis.put(StoreRedis.java:108)
    ... 7 more
Caused by: io.lettuce.core.RedisException: Cannot parse 29bc106e11efccc5b3db4e3397511a0539c74b10 10.37.102.233:6379@6379 myself,master - 0 0 0 connected 0-16383
d4d9fd811e189f246d8c591fb10652f993933f39 10.36.112.239:6379@6379 slave 29bc106e11efccc5b3db4e3397511a0539c74b10 0 0 0 connected
d4d9fd811e189f246d8c591fb10652f993933f39 10.38.180.168:6379@6379 slave 29bc106e11efccc5b3db4e3397511a0539c74b10 0 0 0 connected

    at io.lettuce.core.cluster.models.partitions.ClusterPartitionParser.parse(ClusterPartitionParser.java:89)
    at io.lettuce.core.cluster.topology.NodeTopologyView.<init>(NodeTopologyView.java:68)
    at io.lettuce.core.cluster.topology.NodeTopologyView.from(NodeTopologyView.java:90)
    at io.lettuce.core.cluster.topology.DefaultClusterTopologyRefresh.getNodeSpecificViews(DefaultClusterTopologyRefresh.java:229)
    at io.lettuce.core.cluster.topology.DefaultClusterTopologyRefresh.lambda$null$1(DefaultClusterTopologyRefresh.java:108)
    at java.base/java.util.concurrent.CompletableFuture$UniApply.tryFire(Unknown Source)
    at java.base/java.util.concurrent.CompletableFuture$Completion.run(Unknown Source)
    at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:173)
    at io.netty.util.concurrent.DefaultEventExecutor.run(DefaultEventExecutor.java:66)
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    ... 1 more
Caused by: java.lang.NumberFormatException: For input string: "16383
"
    at java.base/java.lang.NumberFormatException.forInputString(Unknown Source)
    at java.base/java.lang.Integer.parseInt(Unknown Source)
    at java.base/java.lang.Integer.parseInt(Unknown Source)
    at io.lettuce.core.cluster.models.partitions.ClusterPartitionParser.readSlots(ClusterPartitionParser.java:173)
    at io.lettuce.core.cluster.models.partitions.ClusterPartitionParser.parseNode(ClusterPartitionParser.java:133)
    at io.lettuce.core.cluster.models.partitions.ClusterPartitionParser.parse(ClusterPartitionParser.java:85)
    ... 12 more
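
For reference, the connection path boils down to something like the following sketch. The host/port is a placeholder and this is not the actual StoreRedis code; connect() performs the initial topology refresh, which is where CLUSTER NODES is parsed and the exception above is thrown.

import io.lettuce.core.RedisURI;
import io.lettuce.core.cluster.RedisClusterClient;
import io.lettuce.core.cluster.api.StatefulRedisClusterConnection;

public class ClusterConnectRepro {
    public static void main(String[] args) {
        // Placeholder endpoint; point it at a Dragonfly instance started with --cluster_mode=emulated
        RedisURI uri = RedisURI.create("redis://127.0.0.1:6379");
        RedisClusterClient client = RedisClusterClient.create(uri);
        try {
            // connect() triggers the initial topology refresh, which parses CLUSTER NODES
            // and fails with the RedisConnectionException shown above
            StatefulRedisClusterConnection<String, String> connection = client.connect();
            System.out.println(connection.sync().clusterNodes());
            connection.close();
        } finally {
            client.shutdown();
        }
    }
}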

Expected behavior: Redis and Dragonfly should produce compatible output.


chakaz commented 7 months ago

@applike-ss hey, thanks for the bug report!

(1) Can you confirm that you run Dragonfly in emulated cluster mode (--cluster_mode=emulated), as opposed to a real cluster? (2) Can you maybe create minimal repro instructions? I don't know Flink or lettuce, I'm afraid.

chakaz commented 7 months ago

Looking more closely, I see that somehow Dragonfly has 2 replicas with the same node ID (c23888188f5d350b552aa8d8aa7ad40a05765b26). It is extremely unlikely that these are actually 2 distinct replicas, so I wonder: did something bad happen during that time? Like a replica changing its IP address for some reason? Can you please try and see if this is the reason for the error (i.e. try running your app with a single replica in a good state)?

chakaz commented 7 months ago

Ok, I think I got it. It looks like Redis replies with just 0A (\n) between lines, while Dragonfly replies with 0D0A (\r\n):

Dragonfly:

00000000: 3530 6134 3139 3933 3030 6564 3837 3263  50a4199300ed872c
00000010: 3066 3434 6563 3765 3639 3333 6661 3430  0f44ec7e6933fa40
00000020: 3961 3261 3831 6538 2031 3237 2e30 2e30  9a2a81e8 127.0.0
00000030: 2e31 3a36 3337 3940 3633 3739 206d 7973  .1:6379@6379 mys
00000040: 656c 662c 6d61 7374 6572 202d 2030 2030  elf,master - 0 0
00000050: 2030 2063 6f6e 6e65 6374 6564 2030 2d31   0 connected 0-1
00000060: 3633 3833 0d0a                           6383..

Redis:

00000000: 3230 6337 6535 3939 6432 3665 6638 3133  20c7e599d26ef813
00000010: 3932 3861 3061 6138 6438 3865 3464 6262  928a0aa8d88e4dbb
00000020: 3838 3334 3730 3961 2031 3237 2e30 2e30  8834709a 127.0.0
00000030: 2e31 3a37 3030 3240 3137 3030 3220 6d61  .1:7002@17002 ma
00000040: 7374 6572 202d 2030 2031 3731 3034 3435  ster - 0 1710445
00000050: 3733 3237 3136 2033 2063 6f6e 6e65 6374  732716 3 connect
00000060: 6564 2031 3039 3233 2d31 3633 3833 0a    ed 10923-16383.
[...]

We reply with \r\n both in CLUSTER INFO and in CLUSTER NODES, but for some reason Redis uses \r\n for INFO and only \n for NODES :shrug:

Redis CLUSTER INFO:

00000000: 636c 7573 7465 725f 7374 6174 653a 6f6b  cluster_state:ok
00000010: 0d0a 636c 7573 7465 725f 736c 6f74 735f  ..cluster_slots_
00000020: 6173 7369 676e 6564 3a31 3633 3834 0d0a  assigned:16384..
00000030: 636c 7573 7465 725f 736c 6f74 735f 6f6b  cluster_slots_ok
00000040: 3a31 3633 3834 0d0a 636c 7573 7465 725f  :16384..cluster_
[...]

Anyway, Flink/lettuce is apparently sensitive to that, so we should be compatible here.
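
To illustrate with a simplified sketch (not lettuce's actual ClusterPartitionParser code): if a client splits a \r\n-terminated reply on '\n' alone, each line keeps a trailing '\r', and the upper slot bound no longer parses as an integer:

public class SlotParseDemo {
    public static void main(String[] args) {
        // A single CLUSTER NODES line left over after splitting a \r\n-terminated reply on '\n';
        // the node ID is shortened here for readability
        String line = "807df88c 127.0.0.1:6379@6379 myself,master - 0 0 0 connected 0-16383\r";
        String[] fields = line.split(" ");
        String slotRange = fields[8];            // "0-16383\r"
        String upper = slotRange.split("-")[1];  // "16383\r"
        Integer.parseInt(upper);                 // NumberFormatException, like the one in the stack trace
    }
}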

I'm still curious about the 2 nodes with the same ID situation you got there though.

applike-ss commented 7 months ago

@applike-ss hey, thanks for the bug report!

(1) Can you confirm that you run Dragonfly in emulated cluster mode (--cluster_mode=emulated), as opposed to a real cluster? (2) Can you maybe create minimal repro instructions? I don't know Flink or lettuce, I'm afraid.

Regarding the emulated cluster mode: yes, we do use that in our Dragonfly test setup. Before that we used --cluster_mode=yes, but I didn't see a way to let the operator configure the cluster automatically, so I reverted back to emulated.

I confirmed that it is also really using it by:

The node IDs of the slaves are in fact still identical. I'm not sure how Dragonfly is handling/creating these (e.g. maybe it is intentional, to signal some state such as both being data-wise exactly the same).

I have now also checked six other test clusters I set up yesterday, and every one of them has the same node ID for the two slaves (3-replica setup => 1 master, 2 slaves).

I am using the dragonfly-operator (https://raw.githubusercontent.com/dragonflydb/dragonfly-operator/v1.1.1/manifests/dragonfly-operator.yaml), kustomized to use a different namespace and the image docker.dragonflydb.io/dragonflydb/operator:v1.1.1, to create the Dragonfly clusters.

This is an example resource I used to spawn one of the clusters:

apiVersion: dragonflydb.io/v1alpha1
kind: Dragonfly
metadata:
  name: dragonfly-app
spec:
  image: docker.dragonflydb.io/dragonflydb/dragonfly:v1.15.1
  args:
    - '--cache_mode'
    - '--primary_port_http_enabled=true'
    - '--cluster_mode=emulated'
  snapshot:
    cron: '*/5 * * * *'
    persistentVolumeClaimSpec:
      resources:
        requests:
          storage: 1Gi
      accessModes:
        - ReadWriteOnce
  resources:
    limits:
      cpu: 100m
      memory: 320Mi
    requests:
      cpu: 100m
      memory: 320Mi
  replicas: 3
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - topologyKey: kubernetes.io/hostname
          labelSelector:
            matchExpressions:
              - key: app
                operator: In
                values:
                  - dragonfly-app

Let me know if I can provide any additional information that helps track this down.

romange commented 7 months ago

@applike-ss can you please check whether this package https://github.com/dragonflydb/dragonfly/pkgs/container/dragonfly-weekly/191456274?tag=e8650ed2b4ebd550c966751dd33ebb1ac4f82b1f-ubuntu solves the issue?

(it's built from https://github.com/dragonflydb/dragonfly/pull/2731)
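
For reference, testing it with the operator resource from above would presumably just mean swapping the image field to the weekly package; the ghcr.io path below is an assumption based on how GitHub container packages for this repo are usually referenced:

spec:
  # Assumed registry path for the weekly build linked above
  image: ghcr.io/dragonflydb/dragonfly-weekly:e8650ed2b4ebd550c966751dd33ebb1ac4f82b1f-ubuntu
  args:
    - '--cache_mode'
    - '--primary_port_http_enabled=true'
    - '--cluster_mode=emulated'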

chakaz commented 7 months ago

Looking more closely, I see that somehow Dragonfly has 2 replicas with the same node ID (c23888188f5d350b552aa8d8aa7ad40a05765b26). It is extremely unlikely that these are actually 2 distinct replicas, so I wonder: did something bad happen during that time? Like a replica changing its IP address for some reason? Can you please try and see if this is the reason for the error (i.e. try running your app with a single replica in a good state)?

I filed https://github.com/dragonflydb/dragonfly/issues/2734 for that :arrow_up: issue.

applike-ss commented 7 months ago

@applike-ss can you please check whether this package https://github.com/dragonflydb/dragonfly/pkgs/container/dragonfly-weekly/191456274?tag=e8650ed2b4ebd550c966751dd33ebb1ac4f82b1f-ubuntu solves the issue?

(it's built from #2731)

It now looks to be running with the Redis cluster implementation of the lettuce library.

@chakaz is right that the slave IDs are still the same. It doesn't currently seem to be an issue for us, though.