Seeks times increasing on redis slave after executing large scale deletes on master

GoogleCodeExporter commented 8 years ago

What version of Redis you are using, in what kind of Operating System?
2.2.2, FC8
What is the problem you are experiencing?
I have a master that is running Redis 2.2.2. The slave replicates over the 
internet. The slave has 3X master's read volume but naturally no write volume. 
Before the deletes both master and slave use barely 1-2% of the CPU and seek 
times are typically 1-2ms. 

I run weekly deletes of data on the master. The deletes remove anywhere from 
10% to 30% of the keys from the master. After and during the deletes everything 
on the master is normal. On the slave redis instance hell breaks loose. The 
slave redis instance starts using more and more CPU (see sar output below). 
Response times increase to 20/30ms and more. The redis instance is seen on the 
top of the top output with spikes upto 50% cpu utilization. The only way to 
stop this behavior is to take a bgsave and restart redis on the slave.

What steps will reproduce the problem?

Do you have an INFO output? Please past it here.
redis_version:2.2.2
redis_git_sha1:00000000
redis_git_dirty:0
arch_bits:64
multiplexing_api:epoll
uptime_in_seconds:1287972
uptime_in_days:14
lru_clock:771807
used_cpu_sys:1350.63
used_cpu_user:73.74
used_cpu_sys_childrens:0.00
used_cpu_user_childrens:0.00
connected_clients:302
connected_slaves:0
client_longest_output_list:0
client_biggest_input_buf:0
blocked_clients:0
used_memory:5815509712
used_memory_human:5.42G
used_memory_rss:8861392896
mem_fragmentation_ratio:1.52
use_tcmalloc:0
loading:0
aof_enabled:0
changes_since_last_save:31837945
bgsave_in_progress:0
last_save_time:1306664339
bgrewriteaof_in_progress:0
total_connections_received:7566
total_commands_processed:151842950
expired_keys:0
evicted_keys:0
keyspace_hits:107550117
keyspace_misses:14122637
hash_max_zipmap_entries:64
hash_max_zipmap_value:512
pubsub_channels:0
pubsub_patterns:0
vm_enabled:1
role:slave
master_link_status:up
master_last_io_seconds_ago:0
master_sync_in_progress:0
vm_conf_max_memory:8589934592
vm_conf_page_size:256
vm_conf_pages:419430400
vm_stats_used_pages:0
vm_stats_swapped_objects:0
vm_stats_swappin_count:0
vm_stats_swappout_count:0
vm_stats_io_newjobs_len:0
vm_stats_io_processing_len:0
vm_stats_io_processed_len:0
vm_stats_io_active_threads:0
vm_stats_blocked_clients:0
allocation_stats:2=51518892,6=1,8=22646776,9=84299373,10=90095763,11=87982998,12
=250680766,13=138274804,14=81987037,15=22645168,16=1745905291,17=832412,18=12851
5129,19=102732872,20=13882592,21=8812855,22=6177630,23=34541680,24=250871084,25=
37097608,26=9153951,27=36439842,28=28203110,29=41292448,30=8803399,31=21356776,3
2=4196906,33=1270799,34=859233,35=377999,36=9198662,37=9170534,38=6787743,39=191
7785,40=1955547,41=61543797,42=1138991,43=24520077,44=8836616,45=21727355,46=264
7811,47=197221944,48=2051516,49=3890596,50=1662424,51=6704474,52=972887,53=25880
00,54=13568958,55=4929678,56=84935887,57=2395572,58=1503424,59=7101657,60=772599
,61=3066338,62=1109211,63=1745568,64=1665070,65=5129407,66=1203733,67=9751622,68
=5747999,69=3176532,70=949205,71=1756731,72=1922718,73=1716864,74=1351982,75=586
9199,76=723826,77=7240456,78=546169,79=13434830,80=5928541,81=5994961,82=1622704
,83=13651424,84=959753,85=3163329,86=700824,87=1770241,88=876551,89=2139317,90=6
65539,91=4113337,92=930398,93=5450432,94=869837,95=6389973,96=1059695,97=2602933
,98=2307071,99=1958152,100=938757,101=1498332,102=756505,103=1894773,104=611156,
105=2565250,106=1826202,107=3148618,108=554464,109=1363401,110=528405,111=248329
3,112=8855142,113=4076227,114=681769,115=2773998,116=475665,117=793407,118=28268
5,119=1127102,120=403741,121=1175345,122=569492,123=2496963,124=665185,125=98742
6,126=370684,127=717977,128=5200342,129=1570947,130=625076,131=1962842,132=73970
5,133=1229267,134=490789,135=1808441,136=453583,137=1097844,138=350626,139=94514
2,140=319875,141=742856,142=441874,143=1202964,144=556018,145=771843,146=654819,
147=5885237,148=437469,149=4404280,150=330096,151=3526973,152=422738,153=6901419
,154=589142,155=3576540,156=308445,157=2793765,158=364259,159=2472412,160=788542
,161=1870771,162=421889,163=5032194,164=480833,165=1433453,166=515142,167=427516
5,168=400122,169=1492011,170=436845,171=1820935,172=395695,173=1761009,174=68961
2,175=1865548,176=1541010,177=1482874,178=458997,179=1920282,180=292539,181=2742
286,182=355745,183=2145951,184=552151,185=5106225,186=329086,187=1330095,188=263
503,189=1277417,190=239726,191=788729,192=546100,193=775056,194=450944,195=64656
5,196=342125,197=791019,198=324149,199=1014532,200=270133,201=759403,202=304050,
203=1201880,204=551188,205=1150245,206=358893,207=1250594,208=386949,209=1112466
,210=282689,211=1121234,212=323022,213=1430293,214=502148,215=1425293,216=256845
,217=1073619,218=244791,219=1458284,220=315367,221=1362675,222=250588,223=159471
7,224=318725,225=929032,226=382977,227=732288,228=510825,229=1025491,230=720858,
231=517497,232=353083,233=494538,234=490697,235=505269,236=840444,237=455152,238
=848740,239=833449,240=264145,241=798570,242=294188,243=724527,244=943349,245=68
4854,246=263169,247=1000450,248=176344,249=761662,250=182371,251=1191792,252=180
804,253=1021807,254=201698,255=959616,>=256=291645850
db0:keys=20005536,expires=0

If it is a crash, can you please paste the stack trace that you can find in
the log file or on standard output? This is really useful for us!
-

Please provide any additional information below.
00:00:01          CPU     %user     %nice   %system   %iowait    %steal     
%idle
00:10:01          all      0.00      0.00      0.00      0.00      0.03     
99.96
00:20:01          all      0.00      0.00      0.01      0.00      0.04     
99.95
00:30:01          all      0.00      0.00      0.00      0.00      0.03     
99.97
00:40:01          all      0.00      0.00      0.00      0.00      0.03     
99.97
00:50:01          all      0.00      0.00      0.00      0.00      0.03     
99.97
01:00:01          all      0.00      0.00      0.00      0.00      0.02     
99.97
01:10:01          all      0.00      0.00      0.00      0.00      0.02     
99.97
01:20:01          all      0.00      0.00      0.00      0.00      0.02     
99.97
01:30:01          all      0.00      0.00      0.00      0.00      0.02     
99.98
01:40:01          all      0.00      0.00      0.00      0.00      0.02     
99.98
01:50:01          all      0.00      0.00      0.00      0.00      0.02     
99.98
02:00:01          all      0.00      0.00      0.00      0.00      0.02     
99.98
02:10:01          all      0.05      0.00      0.02      0.00      0.12     
99.81 -- deletes started
02:20:01          all      0.60      0.00      0.04      0.00      0.15     
99.20
02:30:01          all      1.45      0.00      0.03      0.00      0.08     
98.44
02:40:01          all      2.56      0.00      0.03      0.00      0.06     
97.34
02:50:01          all      2.90      0.00      0.03      0.00      0.03     
97.04
03:00:01          all      2.91      0.00      0.02      0.00      0.02     
97.05
03:10:01          all      3.83      0.00      0.03      0.00      0.02     
96.11
03:20:01          all      4.02      0.00      0.03      0.00      0.01     
95.94
03:30:01          all      4.49      0.00      0.03      0.00      0.01     
95.47
03:40:01          all      4.83      0.00      0.03      0.00      0.02     
95.12
03:50:01          all      5.15      0.00      0.03      0.00      0.01     
94.81
04:00:01          all      5.82      0.00      0.03      0.00      0.01     
94.14
Average:          all      1.61      0.00      0.02      0.00      0.04     
98.34

Original issue reported on code.google.com by gauravk...@gmail.com on 13 Jun 2011 at 8:12

GoogleCodeExporter commented 8 years ago

Forgot to mention that the sar snapshot was taking during the execution of the 
deletes. CPU usage will spike upto 50/60% in the morning when the traffic to 
the website returns.

Thanks!

Original comment by gauravk...@gmail.com on 13 Jun 2011 at 8:43

GoogleCodeExporter commented 8 years ago

Can you be more specific about "all hell breaks loose"? The stats output shows 
1%-5% CPU usage over ~2 minutes, which isn't a whole lot. You mention a ~25ms 
latency when the DELs are being fed to the slave. Can you specify how fast you 
are issuing those deletes (ops/sec)? Do the machines running the master and the 
slave have identical hardware, and are they 100% dedicated to Redis?

Original comment by pcnoordh...@gmail.com on 13 Jun 2011 at 9:27

GoogleCodeExporter commented 8 years ago

The "all hell breaks loose" is what it means for the business that we are 
supporting using redis :) The tolerated latency from redis is < 10ms.

Answers to your specific questions:
- There is no latency between master and slave during the deletes.
- The slave response times start deteriorating going to ~20-30ms and becomes 
worse even after the deletes on the master have stopped.
- The deletes are being issued at whatever rate the master can support it, 
which is typically 1 delete/millisecond.
- The master and slave are dedicated redis servers. Each run 2 instances of 
redis and we have 6 such boxes.
- The masters are running on physical hardware boxes v/s slaves are virtual EC2 
instances.
- Master is running CentOS and EC2 is running FC8.
- Similar CPU/Memory configuration.

- Before deletes the slaves operate at a higher efficiency than master as noted 
in the report. They typically handle 3-4X read volumes and average latency is 
1ms on the masters as well as slaves.
- Master and slaves running identical configuration.

Original comment by gauravk...@gmail.com on 13 Jun 2011 at 1:39

GoogleCodeExporter commented 8 years ago

It is very hard to reason about this without more specifics. Can you tell what 
kind of EC2 instances you are using (most important: the amount of RAM), and 
what kind of cpu/memory config you have on the masters?

Does the latency spike disappear after some time? This could lead towards the 
replication feed saturating the network link (unlikely when you are only doing 
DEL and have reasonably small keys).

On the slave you have the VM enabled, while it is not storing anything. If this 
is not strictly necessary, I highly recommend disabling it, even if only for 
ruling out possible causes for the increased latency.

Are there other things happening as far as you know when the stream of DELs is 
being replicated? For instance: if there is a BGSAVE going on and you are near 
the physical memory limit of that machine, COW will kick in and the machine may 
start swapping (which leads to another question: do the slaves have a swap 
partition at all?).

With more information, such as these things, we might be able to corner the 
cause of this issue.

Cheers,
Pieter

Original comment by pcnoordh...@gmail.com on 13 Jun 2011 at 2:03

GoogleCodeExporter commented 8 years ago

The EC2 instance is what is termed as a high memory instance, specifically we 
are running the slaves on a "Double Extra Large" instance: 
http://aws.amazon.com/ec2/#instance. These have 34G RAM.

This is uname -a output
- Linux uds-app-server-1 2.6.21.7-2.fc8xen #1 SMP Fri Feb 15 12:34:28 EST 2008 
x86_64 x86_64 x86_64 GNU/Linux

This is from /proc/version
Linux version 2.6.21.7-2.fc8xen (mockbuild@xenbuilder4.fedora.phx.redhat.com) 
(gcc version 4.1.2 20070925 (Red Hat 4.1.2-33)) #1 SMP Fri Feb 15 12:34:28 EST 
2008

The latency spike did not disappear after 12 hours which is how I noticed it in 
the first place. The slave has fully caught up with the master as per info 
output and the RECQ in netstat output is empty.

VM is enabled to ensure that we don't have data loss. I am aware of the peril 
of using the redis VM feature so we maintain data by keeping only the recently 
created data in redis. I could experiment by starting a second slave with VM 
disabled. 

Nothing else is happening. No BGSAVE, no BGREWRITE, nothing, no new activity. 
The vmstat output shows no swapping activity. The slaves don't have a swap 
partition but under these conditions this should not matter.

This seems specifically something that is going wrong after the deletes. I 
cannot rule out the fact that the OS is different on the slave or for that 
matter the virtual nature of the server. 

Let me know if you need more information. I may be able to run some experiments 
on a second slave but only during the low peak hours, that is 9PM US EST.

Thanks,
Gaurav.

Original comment by gauravk...@gmail.com on 13 Jun 2011 at 2:23

GoogleCodeExporter commented 8 years ago

Pieter,

I was able to make some modifications to the redis instance and do a 
comprehensive benchmarking of the problem.

To recall
- We have a several redis instances running in a master slave combination
- Slaves are running on EC2
- After doing deletes on the master the seek times on the slave increase upto 8 
fold
- Masters and Slaves are running with redis virtual memory configuration
- We pretty much use hashes throughout the application. The only three 
operations used are hmset, hgetall and hmget (not mentioned earlier)

On your recommendation comment #4, I created a slave with no VM. The 
performance degradation was 2 to 3 fold and less than with the slaves with VM 
enabled.

I turned off replication on this slaves and did some benchmarking. This slave 
was now out of rotation from serving any web traffic too, so this slave was 
completely idle without any writes and reads.

The before state of the instance is when about 10% of 20 million keys had been 
deleted. The after state of the instance is after restarting the slave from a 
save file.

First benchmark: I took about 3200 keys in the database and ran hgetall on them.
http://pastebin.com/cWgHvTWu

Second benchmark: Ran redis_benchmark binary with the following options
./redis-benchmark -n 50 -p 6382 -n 1000 -d 1500 -k 1 
http://pastebin.com/C8Y6UtwT

Third benchmark: Ran redis_benchmark binary with the following options
./redis-benchmark -n 50 -p 6382 -n 10000 -d 150 -k 1 
http://pastebin.com/U1h1Pw1B

Please let me know if you would like to get any more data. This seems like a 
problem that should be happening fairly commonly.

Please help.

Thanks,
Gaurav.

Original comment by gauravk...@gmail.com on 27 Jun 2011 at 8:17

GoogleCodeExporter commented 8 years ago

I am able to reproduce the behavior consistently. At this time I am trying to 
run the program with gprof to figure out where the additional time is being 
spent. For some reason the gprof output does not contain any timing data. All 
cumulative seconds, self seconds, etc are reported as 0s.

Here is what I am doing:
- make gprof install
- redis-server redis.conf
- redis-cli < hgetAll-for-1Million-keys.txt
- redis-cli shutdown

The gmon.out does not contain any timing data. Is there any other way to make 
the server shutdown cleanly?

Original comment by gauravk...@gmail.com on 9 Jul 2011 at 12:32

GoogleCodeExporter commented 8 years ago

Gaurav, I cannot comment specifically about your issue,
but I think your expectations are too high ...

>> The tolerated latency from redis is < 10ms.

10ms is close to the operating system scheduler time
slice. On a physical machine, it would be difficult to
guarantee that the *maximum* latency time is 10ms
without using a low latency kernel.

On a VM, it even harder (if not impossible).

If the issue on the slaves occurs only after deletions,
I would suggest to try compiling Redis against tcmalloc
or jemalloc to reduce memory fragmentation. 

Regards,
Didier.

Original comment by didier...@gmail.com on 10 Jul 2011 at 8:19

GoogleCodeExporter commented 8 years ago

Didier,

Thanks for the response.

I disagree. It is not unreasonable to expect sub-millisecond response from a 
in-memory database.

Your last comment seems to suggest performance degradation on a virtual machine 
due to memory fragmentation. Can you explain this a bit more? Why should *read* 
performance degrade due to memory fragmentation. I can understand write 
performance degrading due to fragmentation. I can easily compile using the 
tc/je malloc branch and see if what you say works. 

I would love to dig into the code and figure out what is going on. I found some 
links on the redis discussion boards and am going thru it. In the meanwhile it 
would be great to get some perf numbers from gperf.

Anyone know what I am doing wrong with gperf? I get the call graph, etc. in the 
gmon.out but no timing counts.

Thanks,
Gaurav.

Original comment by gauravk...@gmail.com on 10 Jul 2011 at 9:46

GoogleCodeExporter commented 8 years ago


On a VM, the locality of the data in memory is more critical
to get good performance than on a physical machine. The hypervisor
(or the host) can do plenty of fancy tricks with the guest memory
pages (deduplication, ballooning, swapping, etc ...) even if the
guest OS is not swapping memory. 

EC2 uses Xen and not VMWare but you may find the following
paper useful: http://j.mp/qkC3hs

I admit it is pure speculation, and it depends on your access
pattern, but we may imagine that recently added keys are stored
in the same pages (nicely grouped in memory), until memory
gets fragmented by the delete operations. Additional keys
are then distributed randomly. When locality is bad, the
probability to experience latency for read operations gets
higher.

Regarding gprof, you need to keep in mind that due to the low
sampling rate (.01 s), it is difficult to get meaningful
information from it. You need to accumulate of lot of activity
in the Redis server process before getting non nul timings ...
1 million hgetall is enough on my (slow) machine though ...

You will probably be better served by oprofile,
or callgrind if you can support the performance overhead.

If you still want to use gprof, make sure to compile with
make PROF="-pg" V=on ; and check that -pg is correctly
used on all command lines. Also deactivate bgsave
(since the gmon.out file tends to be trashed by forked
child processes), and avoid to launch client commands
in the same directory as the gmon.out file generated by
the server (for the same reason).

Regards,
Didier.

Original comment by didier...@gmail.com on 10 Jul 2011 at 12:57

GoogleCodeExporter commented 8 years ago

Didier,

Thanks for your explanation. That does make sense.

Looks like the redis-server using malloc as the mem_allocator is causing 
fragmentation that is leading to the degradation in performance. I ran some 
simple tests to time hgetall for about 1 million keys in the db before and 
after deleting 1 million other keys. The tests are fairly simple

- Start a redis-server to load from a saved DB file having 20 million keys
- info
- time hgetall for 1 million keys
- time hgetall + del for 1 million keys not in the fetch set
- time hgetall for the same 1 million keys from above
- info
- shutdown

Initial fetchs take about the same time for all three allocators. The deletes 
take much less time for the server using jemalloc and tcmalloc. The subsequent 
hgetall on the original 1 million keys took the exact same time as before for 
je/tc malloc. The delete and the subsequent gets took 2 to 2.5 times longer 
when the redis-server was using libc as the mem_allocator.

Here are the captured output from the tests:
server using malloc: http://pastebin.com/B6NGgYhD
server using tcmalloc: http://pastebin.com/10nc0XWw
server using jemalloc: http://pastebin.com/L0BNaf0X

This should solve the problem we are having, although the access patterns in 
production are slightly different.

Do you know if jemalloc is ever going to be merged into the 2.2 branch. It 
seems like Salvatore has some significant code fixes in the 2.2.9 and greater 
releases that the 2.2.jemalloc branch seems to be missing.

Thanks again.
Gaurav.

Original comment by gauravk...@gmail.com on 11 Jul 2011 at 10:59

GoogleCodeExporter commented 8 years ago

According to a recent mail from Salvatore:
"jmalloc is the default in Redis 2.4, while the 2.2-jemalloc 
branch will no longer be supported as 2.4 is entering release 
candidate soon." so I don't think the jemalloc stuff will be
backported to the main 2-2 branch.

Regards,
Didier.

Original comment by didier...@gmail.com on 12 Jul 2011 at 3:35

Lachim / redis

Seeks times increasing on redis slave after executing large scale deletes on master #584