Inconsistent response times when querying server

GoogleCodeExporter commented 8 years ago

Hi, we are having problems in both production and development environments 
relating to Redis server response times.  If I issue a simple command to get a 
key multiple times with the redis-cli client, I get very different response 
times.  While most of the replies take 1ms or less, some take 6, 20, 30 and up 
to 450ms.

Over 5 milliseconds is way too much, as for example a web script might issue 
more than 100 requests to a Redis server to serve a single page.  The client 
and server are on the same machine, so no network latency is present.  Tested on 
high end production servers with moderate use (500 clients) and a local server 
with about 10 clients.

Tested redis servers: 2.2.2 and 2.0.1
Platform: Debian Linux 2.6.32-5-amd64

To reproduce:

watch -n1 "time redis-cli get ANY_key"

and check the varying response times. It doesn't seem to be related to 
background saves.

As a workaround, every time we query a Redis server, we now use a very small 
timeout (5ms) and retry in case of a timeout (e.g. 5 times). Since we put this 
into production, we haven't had any request time out on all 5 attempts. 
However, if we just use a single timeout of 200ms, we get multiple timeout 
errors per minute.
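
For reference, the workaround can be sketched in Python roughly as follows. This is only an illustration, not our production code: the names `get_once` and `with_retries` are made up for this example, the host and port are placeholders, and opening a fresh connection per attempt is an assumption (after a timeout, a late reply could otherwise arrive on the old connection and be mistaken for the answer to the next request).

```python
import socket


def get_once(key, timeout_s=0.005, host="127.0.0.1", port=6379):
    """One GET attempt on a fresh connection with a short (5 ms) timeout."""
    key_bytes = key.encode()
    request = b"*2\r\n$3\r\nGET\r\n$%d\r\n%s\r\n" % (len(key_bytes), key_bytes)
    with socket.create_connection((host, port), timeout=timeout_s) as sock:
        sock.sendall(request)
        # Returns the raw protocol reply; assumes it fits in a single read.
        return sock.recv(65536)


def with_retries(operation, attempts=5):
    """Call operation() until it succeeds or `attempts` tries have failed."""
    last_error = None
    for _ in range(attempts):
        try:
            return operation()
        except (socket.timeout, ConnectionError) as error:
            last_error = error  # remember the failure and try again
    raise last_error


# Usage, assuming a server on localhost:6379:
#   reply = with_retries(lambda: get_once("ANY_key"))
```

In the worst case this waits about 5 x 5ms = 25ms before giving up, instead of 200ms for a single long timeout.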

Any ideas what could be causing this? Is it a known issue?

At least someone else reports something similar: 
http://groups.google.com/group/redis-db/browse_thread/thread/b490bb7b57f7ba95

Original issue reported on code.google.com by npm...@gmail.com on 4 Apr 2011 at 2:38

GoogleCodeExporter commented 8 years ago

As a preliminary remark, Linux is not a real-time operating system.
When synchronous IPCs are done between clients and servers, there is
absolutely no guarantee that the kernel scheduler will enforce
sub-millisecond latency for ALL roundtrips. On the contrary, the
average latency may be quite good while the maximum latency is
quite bad. This is not specific to Redis ...

Now, there are some factors that can make this situation even worse.
For instance:
   - if you use Redis VM
   - if your machine swaps (at the OS level)
   - if CPU consumption is significant
   - if all your 500 clients send queries to Redis at the same exact time.

Some remarks:

>> a web script might issue more than 100 requests to a Redis server

Your script should not perform 100 synchronous accesses to Redis,
but rather pipeline the queries. You would probably get much better
response times this way.
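
To illustrate what pipelining means here, a minimal Python sketch follows. It
writes all the GET commands in one batch and then reads the replies back in
order. It talks to the server over a plain socket so it does not depend on any
client library; the helper names (encode_command, read_replies, pipelined_get)
are made up for this example, and multi-bulk replies are not handled.

```python
import socket


def encode_command(*args):
    """Encode one command in the Redis multi-bulk request format."""
    parts = [b"*%d\r\n" % len(args)]
    for arg in args:
        data = arg if isinstance(arg, bytes) else str(arg).encode()
        parts.append(b"$%d\r\n%s\r\n" % (len(data), data))
    return b"".join(parts)


def read_replies(reader, count):
    """Read `count` replies (bulk, integer, status or error) from a file object."""
    replies = []
    for _ in range(count):
        line = reader.readline()
        kind, body = line[:1], line[1:-2]  # strip the type byte and CRLF
        if kind == b"$":
            length = int(body)
            if length == -1:
                replies.append(None)  # key does not exist
            else:
                replies.append(reader.read(length + 2)[:-2])
        elif kind == b":":
            replies.append(int(body))
        else:
            replies.append(body)  # "+" status or "-" error, returned as raw text
    return replies


def pipelined_get(sock, keys):
    """Send every GET in a single write, then collect all the replies."""
    sock.sendall(b"".join(encode_command("GET", key) for key in keys))
    with sock.makefile("rb") as reader:
        return read_replies(reader, len(keys))


# Usage, assuming a server on localhost:6379:
#   with socket.create_connection(("127.0.0.1", 6379)) as sock:
#       values = pipelined_get(sock, ["key1", "key2"])
```

With 100 keys, the client then waits for roughly one roundtrip instead of 100,
so there are far fewer occasions for a scheduling delay to hit the page.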

>> watch -n1 "time redis-cli get ANY_key"

It seems a poor way to measure latency: the cost of forking and
launching redis-cli is probably higher than the roundtrip you are
trying to measure.
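
A measurement that leaves out the process startup cost could look like the
following Python sketch: it opens one connection, sends the same GET many
times, and times only the roundtrips. The function names, the default host and
port, and the choice of statistics are assumptions made for this example, and
the percentiles are computed in a deliberately simple way.

```python
import socket
import time


def summarize(samples_ms):
    """Return min, approximate median, approximate 99th percentile and max."""
    ordered = sorted(samples_ms)

    def percentile(fraction):
        index = min(len(ordered) - 1, int(fraction * len(ordered)))
        return ordered[index]

    return {
        "min": ordered[0],
        "p50": percentile(0.50),
        "p99": percentile(0.99),
        "max": ordered[-1],
    }


def measure_get(key, count=1000, host="127.0.0.1", port=6379):
    """Time `count` GET roundtrips over a single persistent connection."""
    key_bytes = key.encode()
    request = b"*2\r\n$3\r\nGET\r\n$%d\r\n%s\r\n" % (len(key_bytes), key_bytes)
    samples_ms = []
    with socket.create_connection((host, port)) as sock:
        # Disable Nagle's algorithm so small requests are sent immediately.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        for _ in range(count):
            start = time.perf_counter()
            sock.sendall(request)
            sock.recv(65536)  # assumption: small value, reply fits in one read
            samples_ms.append((time.perf_counter() - start) * 1000.0)
    return summarize(samples_ms)


# Usage, assuming a server on localhost:6379:
#   print(measure_get("ANY_key"))
```

Comparing the "p50" and "max" values shows directly how far the worst
roundtrips are from the typical one.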

Regards,
Didier.

Original comment by didier...@gmail.com on 4 Apr 2011 at 4:38

GoogleCodeExporter commented 8 years ago

Thanks for the reply Didier. I understand this is not a perfect benchmark, but 
I wanted to report this issue for other people.

One thing though, I ran the same test on MySQL and got a much smoother RT, 
between 10 and 15ms over the network.  The max I have seen was 22ms for the 
same "Select time();" query.  On Redis, by contrast, response time varies greatly. 

We have VM enabled, but the info client reports it's not being used. CPU 
utilization is below 10%. We pipeline queries to Redis for sequential parts of 
the code, but different parts can issue their own.

What do you think of the timeout-retry strategy? Having timeouts at 5ms and 
retrying 5 times seems to be much better than having 200ms as a timeout.  We 
use such small timeouts because a Redis server might be down, and we don't want 
requests to be waiting for it (we use it as a caching system).  We also had 
the case of a "dead" server that kept the Redis port open but didn't reply 
to any queries until it was restarted, and each connection waited until the max 
timeout.

Thanks for your comments

Original comment by npm...@gmail.com on 4 Apr 2011 at 6:00

GoogleCodeExporter commented 8 years ago

Hi again,

sorry, I can only offer speculations (to be taken with a grain
of salt).

MySQL spawns one thread per connection. With queries such as
"select time()", there is no contention since no real data is
accessed. The threads are therefore very responsive and because
they are distributed on all the cores, response time variation
is limited.

Redis, on the other hand, runs in one thread. All the queries are
serialized. So if you have 10 queries whose processing time is
1 ms, response time for the first query will be around 1 ms, but
response time for the last one will be at least 10 ms.
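
The arithmetic can be checked with a tiny Python sketch. It is a simplified
model that assumes all the queries arrive at the same instant and are processed
strictly one after the other; the function name is made up for the example.

```python
from itertools import accumulate


def completion_times_ms(processing_times_ms):
    """Time at which each query completes when a single thread serves them in order."""
    return list(accumulate(processing_times_ms))


times = completion_times_ms([1] * 10)
print(times[0], times[-1])  # prints: 1 10
```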

When there is no contention, I would say a single-threaded server
tends to generate a more volatile response time than a
one-thread-per-connection server.

>> We have VM enabled, but the info client reports it's not being used

If it is not used, why not try starting Redis with no VM at all?
At least you will validate whether the VM has an impact or not.

>> What do you think of the timeout-retry strategy?

I'm rather surprised you have good results with a 5 ms timeout.
It is really close to the typical kernel scheduler time slice.
I'm probably too old and too conservative, but I never use
communication timeouts below 2 seconds on my Unix/Linux systems.

If you suspect scheduling issues, you may want to try to isolate
Redis on its own core (taskset, numactl), or run Redis with a 
real-time priority under SCHED_RR or SCHED_FIFO (chrt). Be sure
to deactivate bgsave before trying this.
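
If a script is more convenient than the command line tools, the same two
experiments can be sketched with the Linux-only scheduling calls of the Python
os module. The function names are made up for this example, the pid and core
number have to be supplied by you, and switching to a real-time policy
normally requires root privileges.

```python
import os


def pin_to_core(pid, core):
    """Restrict `pid` to one CPU core, roughly what `taskset -cp CORE PID` does."""
    os.sched_setaffinity(pid, {core})
    return os.sched_getaffinity(pid)


def set_round_robin(pid, priority=None):
    """Switch `pid` to SCHED_RR, roughly what `chrt -r -p PRIO PID` does."""
    if priority is None:
        # Start with the lowest real-time priority the policy allows.
        priority = os.sched_get_priority_min(os.SCHED_RR)
    os.sched_setscheduler(pid, os.SCHED_RR, os.sched_param(priority))


# Usage, with a hypothetical redis-server pid of 1234, run as root:
#   pin_to_core(1234, 3)
#   set_round_robin(1234)
```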

Regards,
Didier.

Original comment by didier...@gmail.com on 4 Apr 2011 at 11:09
