Failure Replication Master/Slave on big linux cluster

GoogleCodeExporter commented 9 years ago

What version of Redis are you using, and on what kind of operating system?
Redis server version 2.0.4
on Linux Debian x86_64 GNU/Linux

What is the problem you are experiencing?
We set up a Redis master and 40 slave machines in a private datacenter. I put 
1 million hash entries on the master, and when I started the slaves there was 
no problem. But when I add or delete data on the master, the slaves do not all 
end up with the same data: about 1/4 of our machines are out of sync.

What steps will reproduce the problem?
Adding or removing data on the master.

Do you have an INFO output? Please paste it here.
The output of `/usr/local/redis/redis-cli dbsize` on all the slave machines is 
in the attached file.

If it is a crash, can you please paste the stack trace that you can find in
the log file or on standard output? This is really useful for us!

Please provide any additional information below.

Original issue reported on code.google.com by michaelp...@gmail.com on 23 Feb 2011 at 2:12

Attachments:

redis.txt
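
For anyone who wants to automate the `dbsize` comparison described above, here is a minimal Python sketch. It is not part of the original report: the host names are hypothetical, and the parsing assumes `redis-cli dbsize` prints either a bare number or `(integer) N`, which may differ between redis-cli versions.

```python
import subprocess


def parse_dbsize(output):
    """Extract the key count from redis-cli output such as '(integer) 964105'."""
    return int(output.strip().split()[-1])


def fetch_dbsize(host, port=6379):
    """Run `redis-cli dbsize` against one instance and return its key count."""
    result = subprocess.run(
        ["redis-cli", "-h", host, "-p", str(port), "dbsize"],
        capture_output=True, text=True, check=True,
    )
    return parse_dbsize(result.stdout)


def find_out_of_sync(master_count, slave_counts):
    """Return the slaves whose key count differs from the master's, sorted by name."""
    return sorted(host for host, count in slave_counts.items()
                  if count != master_count)


def report(master_host, slave_hosts):
    """Print every slave whose key count differs from the master's.

    The host names passed in are placeholders (assumptions), e.g.
    report("redis-master", ["redis-slave-01", "redis-slave-02"]).
    """
    counts = {host: fetch_dbsize(host) for host in slave_hosts}
    for host in find_out_of_sync(fetch_dbsize(master_host), counts):
        print(f"{host} is out of sync: {counts[host]} keys")
```

Because replication is asynchronous, a slave can briefly lag behind while the master is taking writes, so the comparison is most meaningful when writes are paused.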

GoogleCodeExporter commented 9 years ago

Are you by any chance using expiration on these keys? If so, keys might be 
expired on some of the slaves and not on others, because expiration is not an 
exact process.

Otherwise: are you seeing strange things in the log of the nodes that end up 
with an incomplete data set?

Original comment by pcnoordh...@gmail.com on 23 Feb 2011 at 2:47

GoogleCodeExporter commented 9 years ago

Forgot to mention: the expiration process for slaves is changed in 2.2. In 2.0, 
keys that have an associated expiry will be expired by all slaves individually 
(which was the reason for not allowing writes against expiring values). In 2.2, 
the master explicitly feeds DEL commands to the slaves for keys that expire. 
Expiry is only done on the master. For this reason, all nodes will have the 
same number of keys even when using expiry.

Original comment by pcnoordh...@gmail.com on 23 Feb 2011 at 2:52
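
The 2.2 behaviour described above (expiry decided only on the master, then replicated as DEL) can be illustrated with a toy model. This is a sketch for intuition only, not Redis source code: the `Master` and `Slave` classes and their methods are hypothetical names invented for this example.

```python
class Slave:
    """Toy slave: it never expires keys itself, it only applies replicated commands."""

    def __init__(self):
        self.data = {}

    def apply(self, command, key, value=None):
        if command == "SET":
            self.data[key] = value
        elif command == "DEL":
            self.data.pop(key, None)


class Master:
    """Toy master: it owns the expiry table and feeds explicit DELs to its slaves."""

    def __init__(self, slaves):
        self.data = {}
        self.expires = {}  # key -> absolute expiry time
        self.slaves = slaves

    def _propagate(self, command, key, value=None):
        for slave in self.slaves:
            slave.apply(command, key, value)

    def set(self, key, value, expire_at=None):
        self.data[key] = value
        if expire_at is None:
            self.expires.pop(key, None)
        else:
            self.expires[key] = expire_at
        self._propagate("SET", key, value)

    def expire_due_keys(self, now):
        """Delete every key whose expiry time has passed and replicate a DEL for it."""
        due = [key for key, when in self.expires.items() if when <= now]
        for key in due:
            del self.data[key]
            del self.expires[key]
            self._propagate("DEL", key)
```

Since every slave sees the same SET and DEL stream, they all hold the same keys as the master once the stream has been applied, regardless of their local clocks.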

GoogleCodeExporter commented 9 years ago

> Are you by any chance using expiration on these keys?

No. We're currently using hashes with no expiration. It's intended to be a 
persistent db.

> Are you seeing strange things in the log of the nodes that end up with an 
> incomplete data set?

This is what I see on a properly working slave:

[25782] 23 Feb 15:53:25 - DB 0: 964104 keys (0 volatile) in 1048576 slots HT.
[25782] 23 Feb 15:53:25 - 1 clients connected (0 slaves), 153544512 bytes in use
[25782] 23 Feb 15:53:30 - DB 0: 964104 keys (0 volatile) in 1048576 slots HT.
[25782] 23 Feb 15:53:30 - 1 clients connected (0 slaves), 153544512 bytes in use
[25782] 23 Feb 15:53:34 - Reading from client: Connection reset by peer
[25782] 23 Feb 15:53:34 * Connecting to MASTER... 
[25782] 23 Feb 15:53:34 * Receiving 82267668 bytes data dump from MASTER
[25782] 23 Feb 15:53:55 * MASTER <-> SLAVE sync succeeded
[25782] 23 Feb 15:53:55 * Background append only file rewriting started by pid 4412
[25782] 23 Feb 15:53:56 - DB 0: 964104 keys (0 volatile) in 1048576 slots HT.
[25782] 23 Feb 15:53:56 - 1 clients connected (0 slaves), 153544512 bytes in use
[25782] 23 Feb 15:54:01 - DB 0: 964104 keys (0 volatile) in 1048576 slots HT.
[25782] 23 Feb 15:54:01 - 1 clients connected (0 slaves), 153544512 bytes in use
[4412] 23 Feb 15:54:04 * SYNC append only file rewrite performed
[25782] 23 Feb 15:54:04 * Background append only file rewriting terminated with success
[25782] 23 Feb 15:54:04 * Parent diff flushed into the new append log file with success (0 bytes)
[25782] 23 Feb 15:54:04 * Append only file successfully rewritten.
[25782] 23 Feb 15:54:04 * The new append only file was selected for future appends.
[25782] 23 Feb 15:54:06 - DB 0: 964104 keys (0 volatile) in 1048576 slots HT.
[25782] 23 Feb 15:54:06 - 1 clients connected (0 slaves), 153544512 bytes in use

On the outdated slaves I can only see this:

[23732] 23 Feb 10:06:08 - DB 0: 960368 keys (0 volatile) in 1048576 slots HT.
[23732] 23 Feb 10:06:08 - 1 clients connected (0 slaves), 152829744 bytes in use

No master-slave information at all.

Original comment by mostros...@gmail.com on 23 Feb 2011 at 3:06

GoogleCodeExporter commented 9 years ago

We also tried with "appendonly" set to both "yes" and "no".
We also compared the up-to-date and outdated slaves with `strace`, but saw no 
difference at all.

Original comment by michaelp...@gmail.com on 23 Feb 2011 at 3:10

GoogleCodeExporter commented 9 years ago

Hello Michael, can you please send INFO output of a master, a slave with the 
desync, and a slave without the desync? This can help us.

Original comment by anti...@gmail.com on 23 Feb 2011 at 3:19

GoogleCodeExporter commented 9 years ago

I'm Daniele, the current admin of these instances.

Master:
redis_version:2.0.4
redis_git_sha1:00000000
redis_git_dirty:0
arch_bits:32
multiplexing_api:epoll
process_id:11151
uptime_in_seconds:3730879
uptime_in_days:43
connected_clients:1
connected_slaves:30
blocked_clients:0
used_memory:153559760
used_memory_human:146.45M
changes_since_last_save:1
bgsave_in_progress:0
last_save_time:1298474330
bgrewriteaof_in_progress:0
total_connections_received:2088446
total_commands_processed:2088466
expired_keys:0
hash_max_zipmap_entries:64
hash_max_zipmap_value:512
pubsub_channels:0
pubsub_patterns:0
vm_enabled:0
role:master
db0:keys=964105,expires=0

Sync slave:
redis_version:2.0.4
redis_git_sha1:00000000
redis_git_dirty:0
arch_bits:32
multiplexing_api:epoll
process_id:25782
uptime_in_seconds:602948
uptime_in_days:6
connected_clients:2
connected_slaves:0
blocked_clients:0
used_memory:153545000
used_memory_human:146.43M
changes_since_last_save:0
bgsave_in_progress:0
last_save_time:1298474350
bgrewriteaof_in_progress:0
total_connections_received:7
total_commands_processed:1600743
expired_keys:0
hash_max_zipmap_entries:64
hash_max_zipmap_value:512
pubsub_channels:0
pubsub_patterns:0
vm_enabled:0
role:slave
master_host:x.x.x.x
master_port:6379
master_link_status:up
master_last_io_seconds_ago:246
db0:keys=964105,expires=0

Out of sync slave:
redis_version:2.0.4
redis_git_sha1:00000000
redis_git_dirty:0
arch_bits:32
multiplexing_api:epoll
process_id:23732
uptime_in_seconds:602840
uptime_in_days:6
connected_clients:2
connected_slaves:0
blocked_clients:0
used_memory:152830064
used_memory_human:145.75M
changes_since_last_save:0
bgsave_in_progress:0
last_save_time:1298062237
bgrewriteaof_in_progress:0
total_connections_received:11
total_commands_processed:1555860
expired_keys:0
hash_max_zipmap_entries:64
hash_max_zipmap_value:512
pubsub_channels:0
pubsub_patterns:0
vm_enabled:0
role:slave
master_host:x.x.x.x
master_port:6379
master_link_status:up
master_last_io_seconds_ago:412446
db0:keys=960368,expires=0

Original comment by mostros...@gmail.com on 23 Feb 2011 at 3:24

GoogleCodeExporter commented 9 years ago

OK, what happens here is that one of the servers is actually disconnected:

master_last_io_seconds_ago:412446

This is due to the lame link-status detection in 2.0: if the socket does not 
close, the slave remains in this state.

2.2 (now stable) introduced an explicit ping on the master -> slave link. 
Upgrading is probably the best thing to do.
Otherwise you should check with a script when this happens, and reissue a 
SLAVEOF command to that slave to force a reconnection.

Cheers,
Salvatore

Original comment by anti...@gmail.com on 23 Feb 2011 at 3:35
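
A rough sketch of the kind of check script suggested above, for setups that have to stay on 2.0. It is only an illustration, not an official tool: the function names and the one-hour threshold are assumptions. Since 2.0 has no ping, `master_last_io_seconds_ago` also grows on a healthy slave while the master is idle (246 seconds in the healthy INFO above), so the threshold has to be longer than the longest expected gap between writes on the master.

```python
import subprocess

# Assumed threshold: must exceed the longest quiet period between master writes.
STALE_AFTER_SECONDS = 3600


def parse_info(text):
    """Parse the 'field:value' lines of INFO output into a dict of strings."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        field, value = line.split(":", 1)
        info[field] = value
    return info


def link_is_stale(info, stale_after=STALE_AFTER_SECONDS):
    """True for a slave that reports the link as up but has heard nothing for too long."""
    if info.get("role") != "slave" or info.get("master_link_status") != "up":
        return False
    return int(info.get("master_last_io_seconds_ago", "0")) > stale_after


def fetch_info(host, port=6379):
    """Run `redis-cli info` against one instance and return the parsed fields."""
    result = subprocess.run(
        ["redis-cli", "-h", host, "-p", str(port), "info"],
        capture_output=True, text=True, check=True,
    )
    return parse_info(result.stdout)


def reconnect_if_stale(slave_host, master_host, master_port=6379, slave_port=6379):
    """Reissue SLAVEOF on a stale slave; returns True if the command was sent."""
    if not link_is_stale(fetch_info(slave_host, slave_port)):
        return False
    subprocess.run(
        ["redis-cli", "-h", slave_host, "-p", str(slave_port),
         "slaveof", master_host, str(master_port)],
        check=True,
    )
    return True
```

Something like `reconnect_if_stale("redis-slave-07", "redis-master")` (hypothetical host names) could then be run from cron for each slave. Note that reissuing SLAVEOF triggers a full resync of that slave.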

GoogleCodeExporter commented 9 years ago

Thank you very much.

I tried
redis-cli slaveof $master $port
and got OK.

Then, checking with
redis-cli info
it took 5 seconds to get the 1 million entries.

Thanks Salvatore

Original comment by michaelp...@gmail.com on 23 Feb 2011 at 3:45

bestvivi / redis

Failure Replication Master/Slave on big linux cluster #469