applike-ss closed this issue 7 months ago.
@applike-ss hey, thanks for the bug report!
(1) Can you confirm that you run Dragonfly in emulated cluster mode (--cluster_mode=emulated), as opposed to a real cluster?
(2) Can you maybe create minimal repro instructions? I don't know Flink nor Lettuce, I'm afraid.
Looking more closely, I see that somehow Dragonfly has 2 replicas with the same node ID (c23888188f5d350b552aa8d8aa7ad40a05765b26). It is extremely unlikely that these are actually 2 distinct replicas, so I wonder: did something bad happen during that time, like a replica changing its IP address for some reason?
Can you please try and see if this is the reason for the error (i.e. try running your app with a single replica in a good state)?
Ok, I think I got it. It looks like Redis replies with just 0A (\n) between lines, while Dragonfly replies with 0D0A (\r\n):
Dragonfly:
00000000: 3530 6134 3139 3933 3030 6564 3837 3263 50a4199300ed872c
00000010: 3066 3434 6563 3765 3639 3333 6661 3430 0f44ec7e6933fa40
00000020: 3961 3261 3831 6538 2031 3237 2e30 2e30 9a2a81e8 127.0.0
00000030: 2e31 3a36 3337 3940 3633 3739 206d 7973 .1:6379@6379 mys
00000040: 656c 662c 6d61 7374 6572 202d 2030 2030 elf,master - 0 0
00000050: 2030 2063 6f6e 6e65 6374 6564 2030 2d31 0 connected 0-1
00000060: 3633 3833 0d0a 6383..
Redis:
00000000: 3230 6337 6535 3939 6432 3665 6638 3133 20c7e599d26ef813
00000010: 3932 3861 3061 6138 6438 3865 3464 6262 928a0aa8d88e4dbb
00000020: 3838 3334 3730 3961 2031 3237 2e30 2e30 8834709a 127.0.0
00000030: 2e31 3a37 3030 3240 3137 3030 3220 6d61 .1:7002@17002 ma
00000040: 7374 6572 202d 2030 2031 3731 3034 3435 ster - 0 1710445
00000050: 3733 3237 3136 2033 2063 6f6e 6e65 6374 732716 3 connect
00000060: 6564 2031 3039 3233 2d31 3633 3833 0a ed 10923-16383.
[...]
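For reference, dumps like the ones above can be produced without any client library. The following is a minimal sketch (not from this thread; the host, port and buffer size are assumptions) that sends CLUSTER NODES as an inline command over a raw socket and prints the reply bytes in hex, which makes the 0A vs 0D0A terminators visible.

import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class DumpClusterNodes {
    public static void main(String[] args) throws Exception {
        try (Socket socket = new Socket("127.0.0.1", 6379)) {
            OutputStream out = socket.getOutputStream();
            // Inline command form; Redis accepts it, and the same is assumed here for Dragonfly.
            out.write("CLUSTER NODES\r\n".getBytes(StandardCharsets.US_ASCII));
            out.flush();

            // A single read is assumed to be enough for a short reply in this sketch.
            // Note that the payload is preceded by a RESP bulk-string header ($<length>\r\n).
            InputStream in = socket.getInputStream();
            byte[] buf = new byte[8192];
            int n = in.read(buf);
            for (int i = 0; i < n; i++) {
                System.out.printf("%02x ", buf[i] & 0xff);
                if ((i + 1) % 16 == 0) {
                    System.out.println();
                }
            }
            System.out.println();
        }
    }
}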
We use \r\n in both CLUSTER INFO and CLUSTER NODES, but for some reason Redis replies with \r\n for INFO and with only \n for NODES :shrug:
Redis CLUSTER INFO:
00000000: 636c 7573 7465 725f 7374 6174 653a 6f6b cluster_state:ok
00000010: 0d0a 636c 7573 7465 725f 736c 6f74 735f ..cluster_slots_
00000020: 6173 7369 676e 6564 3a31 3633 3834 0d0a assigned:16384..
00000030: 636c 7573 7465 725f 736c 6f74 735f 6f6b cluster_slots_ok
00000040: 3a31 3633 3834 0d0a 636c 7573 7465 725f :16384..cluster_
[...]
Anyway, Flink/Lettuce is apparently sensitive to that, so we should be compatible.
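For what it's worth, a client can stay agnostic to the terminator by treating the carriage return as optional. The sketch below is only an illustration of that idea (it is not Lettuce's actual parser): it splits a CLUSTER NODES reply on "\r?\n", so both the Redis-style \n and the Dragonfly-style \r\n are accepted.

public class ClusterNodesLines {
    public static void main(String[] args) {
        // Example reply line taken from the Dragonfly dump above, with a \r\n terminator.
        String reply = "50a4199300ed872c0f44ec7e6933fa409a2a81e8 127.0.0.1:6379@6379 "
                + "myself,master - 0 0 0 connected 0-16383\r\n";

        // "\r?\n" matches both terminator styles, so the same code handles either server.
        for (String line : reply.split("\r?\n")) {
            if (line.isEmpty()) continue;
            String[] fields = line.split(" ");
            System.out.println("node id: " + fields[0] + ", address: " + fields[1]
                    + ", flags: " + fields[2]);
        }
    }
}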
I'm still curious about the 2 nodes with the same ID situation you got there though.
Regarding the emulated cluster mode: yes, we do use that in our Dragonfly test setup. Beforehand we used --cluster_mode=yes, but I didn't see a way to let the operator configure the cluster automatically, so I reverted back to emulated.
I also confirmed that it is really using that mode.
The node IDs of the two slaves are in fact still identical. I'm not sure how Dragonfly handles/creates these (e.g. maybe it is intentional, to signal some state such as both being exactly the same data-wise).
I also checked six other test clusters I set up yesterday, and every one of them has the same node ID for the two slaves (3-replica setup => 1 master, 2 slaves); a quick way to check for such duplicates is sketched below.
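To make that observation easy to reproduce, here is a small sketch (an illustration only, not part of Dragonfly, the operator, or Lettuce) that takes the text of a CLUSTER NODES reply, however it was obtained, and reports any node ID that occurs more than once.

import java.util.HashSet;
import java.util.Set;

public class DuplicateNodeIdCheck {
    // Prints every node ID that appears on more than one line of the reply.
    static void report(String clusterNodesReply) {
        Set<String> seen = new HashSet<>();
        for (String line : clusterNodesReply.split("\r?\n")) {
            if (line.isEmpty()) continue;
            String nodeId = line.split(" ")[0];
            if (!seen.add(nodeId)) {
                System.out.println("duplicate node id: " + nodeId);
            }
        }
    }

    public static void main(String[] args) {
        // Two hypothetical replica lines sharing the node ID quoted earlier in this thread.
        report("c23888188f5d350b552aa8d8aa7ad40a05765b26 10.0.0.2:6379@6379 slave - 0 0 0 connected\n"
             + "c23888188f5d350b552aa8d8aa7ad40a05765b26 10.0.0.3:6379@6379 slave - 0 0 0 connected\n");
    }
}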
I am using the dragonfly-operator (https://raw.githubusercontent.com/dragonflydb/dragonfly-operator/v1.1.1/manifests/dragonfly-operator.yaml), kustomized to use a different namespace and with the image docker.dragonflydb.io/dragonflydb/operator:v1.1.1, to create the Dragonfly clusters.
This is an example resource I used to spawn one of the clusters:
apiVersion: dragonflydb.io/v1alpha1
kind: Dragonfly
metadata:
  name: dragonfly-app
spec:
  image: docker.dragonflydb.io/dragonflydb/dragonfly:v1.15.1
  args:
    - '--cache_mode'
    - '--primary_port_http_enabled=true'
    - '--cluster_mode=emulated'
  snapshot:
    cron: '*/5 * * * *'
    persistentVolumeClaimSpec:
      resources:
        requests:
          storage: 1Gi
      accessModes:
        - ReadWriteOnce
  resources:
    limits:
      cpu: 100m
      memory: 320Mi
    requests:
      cpu: 100m
      memory: 320Mi
  replicas: 3
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - topologyKey: kubernetes.io/hostname
          labelSelector:
            matchExpressions:
              - key: app
                operator: In
                values:
                  - dragonfly-app
Let me know if I can provide any additional information that helps track this down.
@applike-ss can you please check whether this package https://github.com/dragonflydb/dragonfly/pkgs/container/dragonfly-weekly/191456274?tag=e8650ed2b4ebd550c966751dd33ebb1ac4f82b1f-ubuntu solves the issue?
(it's built from https://github.com/dragonflydb/dragonfly/pull/2731)
I filed https://github.com/dragonflydb/dragonfly/issues/2734 for the duplicate node ID issue.
With that package, it looks to be running now with the redis-cluster implementation of the Lettuce library.
@chakaz is right to say that the slave IDs are still the same. That doesn't currently seem to be an issue for us, though.
Describe the bug: We are running Flink applications using the Lettuce Redis library. It seems to struggle with Dragonfly's response to the cluster nodes command; somehow there is a difference from what it expects.
Output from Dragonfly:
Output from Redis:
The exception from Lettuce:
Expected behavior: Redis and Dragonfly should produce compatible output.
Environment (please complete the following information):