Lachim / redis

Automatically exported from code.google.com/p/redis

Master/Slave connection flapping #522

Closed: GoogleCodeExporter closed this issue 8 years ago

GoogleCodeExporter commented 8 years ago
What version of Redis are you using, and on what kind of operating system?
Redis 2.2.4, master and slave
Slave uses TCMALLOC, Master does not
Linux
Dedicated hardware

What is the problem you are experiencing?
Very strange behavior where the master-slave connection times out and 
reconnects repeatedly, forcing repeated BGSAVEs for SYNC.  From what we can 
tell, the network is solid.  Related to this, a client connected to the 
master has a keepalive ping that occasionally takes exactly 107.3 seconds to 
return (it does eventually return), though the interval between events is 
inconsistent -- anywhere between 5 and 14 minutes.
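
A minimal monitoring sketch that makes these stalls visible (10.0.0.5:6379 is 
the master address taken from the slave's INFO below; the 5-second interval is 
arbitrary): log the wall-clock duration of each PING and watch for the 
~107-second outliers.

while true; do
  start=$(date +%s.%N)
  redis-cli -h 10.0.0.5 -p 6379 PING > /dev/null
  end=$(date +%s.%N)
  # timestamp plus the round-trip time of this PING, in seconds
  echo "$(date '+%H:%M:%S') ping_seconds=$(echo "$end - $start" | bc)"
  sleep 5
done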

What steps will reproduce the problem?
Great question!  Not sure yet, but still digging.
Do you have an INFO output? Please paste it here.
INFO captured during a period when the client was not immediately receiving a 
response from the master.

Master:
redis_version:2.2.4
redis_git_sha1:00000000
redis_git_dirty:0
arch_bits:64
multiplexing_api:epoll
process_id:12013
uptime_in_seconds:43962
uptime_in_days:0
lru_clock:256199
used_cpu_sys:1.67
used_cpu_user:2.54
used_cpu_sys_childrens:0.68
used_cpu_user_childrens:0.26
connected_clients:3
connected_slaves:1
client_longest_output_list:0
client_biggest_input_buf:0
blocked_clients:0
used_memory:1752264
used_memory_human:1.67M
used_memory_rss:2899968
mem_fragmentation_ratio:1.65
use_tcmalloc:0
loading:0
aof_enabled:0
changes_since_last_save:1
bgsave_in_progress:0
last_save_time:1302795404
bgrewriteaof_in_progress:0
total_connections_received:198
total_commands_processed:2693
expired_keys:1
evicted_keys:0
keyspace_hits:733
keyspace_misses:23
hash_max_zipmap_entries:64
hash_max_zipmap_value:512
pubsub_channels:0
pubsub_patterns:0
vm_enabled:0
role:master
allocation_stats:2=204,6=1,7=1,8=2798,9=696,10=5798,11=212,12=590,13=2737,14=188
,15=6540,16=27115,17=866,18=1740,19=149,20=167,21=32,22=293,23=154,24=8117,25=7,
26=5,27=105,28=6,29=6,30=10,31=4,32=2047,34=5,35=6,36=9,37=52,38=27,39=66,40=122
,41=6,42=170,43=4,44=2,45=7,46=4,47=43,48=810,49=2,50=25,51=134,52=328,53=3,55=1
,56=22,57=7,58=4,59=87,60=159,61=1,62=1,63=1,64=958,65=10,66=1,67=3,68=4,69=4,70
=24,71=8,72=51,73=4,74=3,75=13,76=3,77=3,78=7,79=7,80=13,81=18,82=20,83=33,84=25
,85=18,86=16,87=12,88=1153,89=9,90=3,91=17,99=2,100=22,101=1,107=2,108=16,114=6,
115=2,117=16,123=2,127=16,128=10,129=8,131=2,139=2,141=16,147=3,155=2,157=2,163=
2,165=1,167=1,171=2,176=16,179=2,181=10,182=6,187=2,195=2,203=2,211=2,213=1,219=
2,223=1,227=2,233=1,235=2,243=3,244=16,251=8,>=256=1296
db0:keys=1153,expires=0

Slave:
redis_version:2.2.4
redis_git_sha1:00000000
redis_git_dirty:0
arch_bits:64
multiplexing_api:epoll
process_id:8561
uptime_in_seconds:822
uptime_in_days:0
lru_clock:256198
used_cpu_sys:0.07
used_cpu_user:0.14
used_cpu_sys_childrens:0.00
used_cpu_user_childrens:0.00
connected_clients:3
connected_slaves:0
client_longest_output_list:0
client_biggest_input_buf:0
blocked_clients:0
used_memory:1580400
used_memory_human:1.51M
used_memory_rss:3108864
mem_fragmentation_ratio:1.97
use_tcmalloc:1
loading:0
aof_enabled:0
changes_since_last_save:1
bgsave_in_progress:0
last_save_time:1302795403
bgrewriteaof_in_progress:0
total_connections_received:5
total_commands_processed:192
expired_keys:0
evicted_keys:0
keyspace_hits:1
keyspace_misses:0
hash_max_zipmap_entries:512
hash_max_zipmap_value:64
pubsub_channels:0
pubsub_patterns:0
vm_enabled:0
role:slave
master_host:10.0.0.5
master_port:6379
master_link_status:up
master_last_io_seconds_ago:34
master_sync_in_progress:0
allocation_stats:2=408,6=1,7=2,8=2092,9=65,10=11412,11=304,12=1022,13=495,14=158
,15=4264,16=33369,17=271,18=3441,19=206,20=95,21=58,22=486,23=89,24=15069,25=9,2
6=5,27=12,28=6,29=6,30=16,31=2,32=2244,34=6,35=9,36=10,37=5,38=45,39=116,40=105,
41=10,42=338,43=5,44=4,45=11,46=5,47=1,48=48,49=2,50=93,51=265,52=655,53=2,56=8,
57=12,58=7,59=169,60=318,61=1,62=1,63=1,64=1898,65=15,66=1,67=5,68=4,69=4,70=46,
71=11,72=98,73=4,74=3,75=5,76=3,77=3,78=7,79=7,80=13,81=18,82=20,83=57,84=25,85=
18,86=19,87=12,88=1900,89=9,90=3,91=6,99=6,100=44,101=2,107=4,108=32,112=2,114=1
2,115=4,117=32,123=4,125=2,127=32,128=5,131=5,138=2,139=4,141=32,147=6,151=2,155
=4,157=4,163=4,164=2,167=2,171=4,176=32,177=2,179=4,182=12,187=4,190=2,195=4,203
=6,211=4,213=2,217=2,219=4,223=2,227=4,231=2,233=2,235=4,243=6,244=32,245=2,251=
16,>=256=2355
db0:keys=1153,expires=0

If it is a crash, can you please paste the stack trace that you can find in
the log file or on standard output? This is really useful for us!

Please provide any additional information below.
Using loglevel DEBUG on both master and slave, there is no log entry at the 
time that the "long ping" is initiated by the client, and there are log entries 
within a few seconds on either side of that time on both master and slave.  

At one point when the client was waiting for a response, I was able to 
immediately connect to both master and slave via redis-cli and get the info 
dumps, so neither process was blocked.  I've also been watching top and vmstat 
on both boxes -- no high cpu, no swapping, nothing out of the ordinary during 
the event.

Looking at packets on the machine the master is running on, the ping packets 
simply stop during the event (which is expected, since the ping is blocking on 
the client side).
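
A sketch of the kind of capture that shows this (the interface name is an 
assumption for this setup; the filter simply follows port 6379):

# -tttt prints absolute timestamps so gaps line up with the Redis logs;
# -nn skips DNS and port-name lookups
tcpdump -i eth0 -nn -tttt 'tcp port 6379'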

From the logs below, it doesn't appear that the master notices that the 
connection to the slave is down, but the slave detects the drop and reconnects:

Master:
[19584] 14 Apr 10:41:47 - 1 clients connected (1 slaves), 1740336 bytes in use
[19584] 14 Apr 10:41:52 - DB 0: 1152 keys (0 volatile) in 2048 slots HT.
[19584] 14 Apr 10:41:52 - 1 clients connected (1 slaves), 1740336 bytes in use
[19584] 14 Apr 10:41:53 - Accepted 10.0.0.6:56416
[19584] 14 Apr 10:41:53 * Slave ask for synchronization
[19584] 14 Apr 10:41:53 * Starting BGSAVE for SYNC
[19584] 14 Apr 10:41:53 * Background saving started by pid 19876
[19876] 14 Apr 10:41:53 * DB saved on disk
[19584] 14 Apr 10:41:53 * Background saving terminated with success
[19584] 14 Apr 10:41:53 * Synchronization with slave succeeded
[19584] 14 Apr 10:41:53 - Client closed connection
[19584] 14 Apr 10:41:57 - DB 0: 1152 keys (0 volatile) in 2048 slots HT.
[19584] 14 Apr 10:41:57 - 1 clients connected (1 slaves), 1740336 bytes in use
[19584] 14 Apr 10:42:02 - DB 0: 1152 keys (0 volatile) in 2048 slots HT.

Slave:
[8561] 14 Apr 10:41:46 - 2 clients connected (0 slaves), 1572280 bytes in use
[8561] 14 Apr 10:41:51 - DB 0: 1152 keys (0 volatile) in 3072 slots HT.
[8561] 14 Apr 10:41:51 - 2 clients connected (0 slaves), 1572280 bytes in use
[8561] 14 Apr 10:41:52 # MASTER time out: no data nor PING received...
[8561] 14 Apr 10:41:52 * Connecting to MASTER...
[8561] 14 Apr 10:41:52 * MASTER <-> SLAVE sync started: SYNC sent
[8561] 14 Apr 10:41:52 * MASTER <-> SLAVE sync: receiving 147029 bytes from master
[8561] 14 Apr 10:41:52 * MASTER <-> SLAVE sync: Loading DB in memory
[8561] 14 Apr 10:41:52 . LOADING OBJECT 2 (at 22)
[8561] 14 Apr 10:41:52 . LOADING OBJECT 2 (at 44)

Sorry for the long post, but this is a real head-scratcher.  No VM on either 
box.  The only other non-standard thing is that I have activerehashing turned 
off on the slave; slave-serve-stale-data is set to yes.
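
For completeness, those two settings as they would appear in the slave's 
redis.conf (values as described above):

activerehashing no
slave-serve-stale-data yes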

What is the connection timeout for the master/slave link, and can it be 
adjusted?  What is the slave's threshold for deciding the master is 
disconnected -- a single missed ping, or several?
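
For reference, later Redis releases expose these knobs in redis.conf (shown 
here with their usual defaults); on 2.2 the equivalent values appear to be 
compile-time constants (roughly a 10-second master ping and a 60-second slave 
timeout), so treat this as a sketch rather than something tunable on this 
install:

repl-ping-slave-period 10   # how often the master pings slaves on an idle link
repl-timeout 60             # slave declares the link dead after this much silence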

Cheers!

-Dean

Original issue reported on code.google.com by deanwba...@gmail.com on 14 Apr 2011 at 6:14

GoogleCodeExporter commented 8 years ago
One more piece of data -- in reference to the logs at the end of the original
report, the "long ping" was issued at 10:40:44 and returned at 10:42:31.
Both processes appeared to be up and happy during that time.

Thanks for any insights!

-Dean 

Original comment by deanwba...@gmail.com on 14 Apr 2011 at 7:44

GoogleCodeExporter commented 8 years ago
Still looking into the root cause of this issue.  While using redis-cli, I 
noticed that a simple DEL command appeared to hang, so I quickly fired up 
another redis-cli connection to the same server to get an INFO dump.  The 
second session connected normally and returned INFO results without delay 
(see below).

So what I'm seeing is that the master just ignores one connected client for a 
long period of time (107.3 seconds, consistently), during which other clients 
are not blocked (at least clients that connect after the blocking starts).  
There is nothing suspicious in the master's logfile (set to DEBUG level 
logging).

I'm less convinced that this is related to the master/slave configuration, but 
I'm still digging... any suggestions as to what else to look for are welcome!
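
One generic way to check whether a stall like this is in the network rather 
than in redis-server is to watch the kernel's TCP retransmission counters on 
both boxes while the hang is happening:

# a "segments retransmitted" count that climbs during the stall points at
# packet loss on the link rather than a blocked redis-server
watch -n 5 'netstat -s | grep -i retrans'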

INFO from Master:

redis_version:2.2.4
redis_git_sha1:00000000
redis_git_dirty:0
arch_bits:64
multiplexing_api:epoll
process_id:19584
uptime_in_seconds:428064
uptime_in_days:4
lru_clock:299559
used_cpu_sys:23.18
used_cpu_user:41.58
used_cpu_sys_childrens:18.78
used_cpu_user_childrens:6.35
connected_clients:26
connected_slaves:1
client_longest_output_list:0
client_biggest_input_buf:0
blocked_clients:1
used_memory:4056784
used_memory_human:3.87M
used_memory_rss:5447680
mem_fragmentation_ratio:1.34
use_tcmalloc:0
loading:0
aof_enabled:0
changes_since_last_save:0
bgsave_in_progress:0
last_save_time:1303229826
bgrewriteaof_in_progress:0
total_connections_received:7505
total_commands_processed:112278
expired_keys:391
evicted_keys:0
keyspace_hits:78584
keyspace_misses:16666
hash_max_zipmap_entries:64
hash_max_zipmap_value:512
pubsub_channels:0
pubsub_patterns:0
vm_enabled:0
role:master
allocation_stats:2=2035,6=1,7=1,8=28229,9=358775,10=34286,11=11109,12=38353,13=1
02671,14=15433,15=60355,16=1236807,17=382380,18=18275,19=49790,20=4894,21=4904,2
2=6028,23=4091,24=64686,25=2708,26=4169,27=1133,28=6,29=2173,30=421,31=13,32=786
09,33=17,34=5,35=240,36=28,37=465,38=848,39=2372,40=712,41=4741,42=170,43=4144,4
4=2,45=55,46=19,47=300,48=38067,49=6172,50=1768,51=1139,52=328,53=14,55=2024,56=
3501,57=19,58=54,59=261,60=161,61=34,62=1,63=12,64=1117,65=13,66=1,67=224,68=113
,69=1355,70=1750,71=337,72=1798,73=26,74=3,75=246,76=3,77=21,78=7,79=176,80=50,8
1=44,82=711,83=1139,84=105,85=23,86=86,87=45,88=8461,89=9,90=10,91=339,93=13,96=
1359,97=13,98=10,99=889,100=1240,101=206,102=17,103=2225,104=9,105=3,106=577,107
=905,108=179,109=129,110=12,111=17,112=503,113=525,114=851,115=183,116=29,117=34
3,118=46,119=11,120=499,121=373,122=118,123=95,124=23,125=33,126=300,127=211,128
=256,129=48,130=56,131=282,132=109,133=1360,134=10,135=155,136=261,137=66,138=12
42,139=104,140=18,141=50,142=29,143=297,144=42,145=33,146=12,147=49,148=54,149=4
8,150=11,151=56,152=20,153=10,154=12,155=52,156=8,157=12,158=8,159=7,160=13,161=
36,162=191,163=198,164=17,165=21,166=4,167=37,168=1,169=2,170=111,171=33,172=1,1
73=5,174=7,175=36,176=91,177=148,178=33,179=50,180=5,181=80,182=33,183=18,184=72
,185=12,186=24,187=40,188=11,189=7,190=18,191=12,192=158,193=13,194=4,195=52,198
=1,199=1941,200=1,203=65,205=11,206=160,207=9,208=2,209=66,210=344,211=77,212=10
,213=11,214=4,215=15,216=29,217=36,218=9,219=91,220=14,221=14,222=37,223=28,224=
166,225=134,226=7,227=72,228=11,229=71,230=95,231=287,232=453,233=23,234=26,235=
94,236=41,237=79,238=41,239=189,240=5,241=7,242=4,243=67,244=60,245=48,246=59,24
7=21,248=32,249=61,250=12,251=59,252=32,253=30,254=171,255=19,>=256=56742
db0:keys=2981,expires=0

Original comment by deanwba...@gmail.com on 19 Apr 2011 at 5:18

GoogleCodeExporter commented 8 years ago
I've traced this back to a packet loss issue between nodes.  It has nothing to 
do with Redis.  Please close the ticket.

Original comment by deanwba...@gmail.com on 21 Apr 2011 at 4:01

GoogleCodeExporter commented 8 years ago
Dean, sorry for not replying earlier here. Good to see you were able to resolve 
this.

Original comment by pcnoordh...@gmail.com on 21 Apr 2011 at 7:25