Redis replication no-such-master

GSergeevich commented 5 years ago

CentOS 7

pacemaker-libs-1.1.19-8.el7_6.4.x86_64
pacemaker-cluster-libs-1.1.19-8.el7_6.4.x86_64
pacemaker-1.1.19-8.el7_6.4.x86_64
userspace-rcu-0.10.0-3.el7.x86_64
pacemaker-cli-1.1.19-8.el7_6.4.x86_64

When I create the Pacemaker resource:

pcs resource create p_redis ocf:heartbeat:redis op monitor timeout="60s" interval="45s" op monitor role="Master" timeout="60s" interval="20s" op monitor role="Slave" timeout="60s" interval="60s" master config="/etc/redis.conf"

the output of pcs resource show is:

 Master/Slave Set: p_redis-master [p_redis]
     Masters: [ control1 ]
     Slaves: [ control2 control3 ]

but the slave nodes can't connect to the master, because their master host is set to "no-such-master":

34480:S 12 Apr 13:03:08.104 * Connecting to MASTER no-such-master:6379
34480:S 12 Apr 13:03:08.105 # Unable to connect to MASTER: No such file or directory

redis-cli info replication:

control1 | CHANGED | rc=0 >>

role:master
connected_slaves:0
master_repl_offset:0
repl_backlog_active:0
repl_backlog_size:1048576
repl_backlog_first_byte_offset:0
repl_backlog_histlen:0

control2 | CHANGED | rc=0 >>

role:slave
master_host:no-such-master
master_port:6379
master_link_status:down
master_last_io_seconds_ago:-1
master_sync_in_progress:0
slave_repl_offset:1
master_link_down_since_seconds:1555063834
slave_priority:100
slave_read_only:1
connected_slaves:0
master_repl_offset:0
repl_backlog_active:0
repl_backlog_size:1048576
repl_backlog_first_byte_offset:0
repl_backlog_histlen:0

control3 | CHANGED | rc=0 >>

role:slave
master_host:no-such-master
master_port:6379
master_link_status:down
master_last_io_seconds_ago:-1
master_sync_in_progress:0
slave_repl_offset:1
master_link_down_since_seconds:1555063834
slave_priority:100
slave_read_only:1
connected_slaves:0
master_repl_offset:0
repl_backlog_active:0
repl_backlog_size:1048576
repl_backlog_first_byte_offset:0
repl_backlog_histlen:0

/etc/redis.conf :

activerehashing yes
aof-load-truncated yes
aof-rewrite-incremental-fsync yes
appendfilename "appendonly.aof"
appendfsync everysec
appendonly no
auto-aof-rewrite-min-size 64mb
auto-aof-rewrite-percentage 100
bind 10.78.201.16 127.0.0.1
client-output-buffer-limit normal 0 0 0
client-output-buffer-limit pubsub 32mb 8mb 60
client-output-buffer-limit slave 256mb 64mb 60
daemonize no
databases 16
dbfilename dump.rdb
dir /var/lib/redis
hash-max-ziplist-entries 512
hash-max-ziplist-value 64
hll-sparse-max-bytes 3000
hz 10
latency-monitor-threshold 0
list-compress-depth 0
list-max-ziplist-size -2
logfile /var/log/redis/redis.log
loglevel notice
lua-time-limit 5000
no-appendfsync-on-rewrite no
notify-keyspace-events ""
pidfile /var/run/redis_6379.pid
port 6379
protected-mode yes
rdbchecksum yes
rdbcompression yes
repl-disable-tcp-nodelay no
repl-diskless-sync-delay 5
repl-diskless-sync no
save 300 10
save 60 10000
save 900 1
set-max-intset-entries 512
slave-priority 100
slave-read-only yes
slave-serve-stale-data yes
slowlog-log-slower-than 10000
slowlog-max-len 128
stop-writes-on-bgsave-error yes
supervised no
tcp-backlog 511
tcp-keepalive 300
timeout 0
zset-max-ziplist-entries 128
zset-max-ziplist-value 64

maniaque commented 5 years ago

Same story for me. Dual node setup, same problems.

But if I start the resource when only one node is available and then start the second node, everything works perfectly.

GSergeevich commented 4 years ago

Hello! When I create the resource:

pcs resource create p_redis ocf:heartbeat:redis promotable meta master-max=1 notify=true interleave=true client-bin=/usr/bin/redis-cli config=/etc/redis.conf port=6379

# pcs status
Cluster name: debian
Stack: corosync
Current DC: p00sqldb01 (version 2.0.1-9e909a5bdd) - partition with quorum
Last updated: Wed Apr 22 10:51:52 2020
Last change: Wed Apr 22 10:48:20 2020 by root via cibadmin on p00sqldb01

4 nodes configured
5 resources configured

Online: [ p00sqldb01 p00sqldb02 p00sqldb03 ]

Full list of resources:

 Resource Group: vip_group
     vip    (ocf::heartbeat:IPaddr2):   Started p00sqldb01
 Clone Set: p_redis-clone [p_redis] (promotable)
     Masters: [ p00sqldb02 ]
     Slaves: [ p00sqldb01 p00sqldb03 ]

On the slaves I see errors like this:

12796:S 22 Apr 2020 10:45:29.891 * Connecting to MASTER no-such-master:6379
12796:S 22 Apr 2020 10:45:29.942 # Unable to connect to MASTER: Invalid argument

If I add this record to /etc/hosts (p00sqldb02 is the Redis master host):

10.16.190.105 p00sqldb02 no-such-master

everything works fine:

13431:S 22 Apr 2020 10:50:04.492 * Connecting to MASTER no-such-master:6379
13431:S 22 Apr 2020 10:50:04.493 * MASTER <-> REPLICA sync started
13431:S 22 Apr 2020 10:50:04.493 * Non blocking connect for SYNC fired the event.
13431:S 22 Apr 2020 10:50:04.495 * Master replied to PING, replication can continue...
13431:S 22 Apr 2020 10:50:04.496 * Trying a partial resynchronization (request 60331ea41c43ac7e7a47cefa87b450ed31b12bf5:1).
13431:S 22 Apr 2020 10:50:04.499 * Full resync from master: da0129cbf660b929a71248794ed16d927581765d:0
13431:S 22 Apr 2020 10:50:04.499 * Discarding previously cached master state.
13431:S 22 Apr 2020 10:50:04.547 * MASTER <-> REPLICA sync: receiving 193 bytes from master
13431:S 22 Apr 2020 10:50:04.548 * MASTER <-> REPLICA sync: Flushing old data
13431:S 22 Apr 2020 10:50:04.548 * MASTER <-> REPLICA sync: Loading DB in memory
13431:S 22 Apr 2020 10:50:04.548 * MASTER <-> REPLICA sync: Finished with success

The problem is apparently that the script cannot find the hostname of the current master.
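
For anyone who wants to check whether a replica is in the same state, here is a rough sketch of a check (not part of the resource agent). It only assumes the "no-such-master" name seen in the logs above; the function names parse_master_host and is_placeholder are made up for illustration.

```shell
#!/bin/sh
# Sketch: detect a replica that still points at the placeholder master name.
# PLACEHOLDER is the name seen in the replica logs in this thread.
PLACEHOLDER="no-such-master"

# Read `redis-cli info replication` output on stdin and print master_host.
# redis-cli output uses CRLF line endings, so strip the carriage returns.
parse_master_host() {
    tr -d '\r' | awk '/^master_host:/ { sub(/^master_host:/, ""); print }'
}

# Return 0 if the given host is the placeholder, 1 otherwise.
is_placeholder() {
    [ "$1" = "$PLACEHOLDER" ]
}

# Usage on a replica (assumes redis-cli is on PATH and needs no password):
#   host=$(redis-cli info replication | parse_master_host)
#   is_placeholder "$host" && echo "replica still points at $PLACEHOLDER"
#   getent hosts "$host" || echo "$host does not resolve"
```

Note that the /etc/hosts record only makes the placeholder name resolve to one fixed node, so it would presumably point at the wrong host after a failover.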

tayeh commented 2 years ago

@GSergeevich did you fix this?

ClusterLabs / resource-agents

Redis replication no-such-master #1319