Lachim / redis

Automatically exported from code.google.com/p/redis

100% CPU usage, Server hangs #648

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
What version of Redis are you using, and on what kind of operating system?
 Redis 2.2.11, on SUSE Linux (x86_64)

What is the problem you are experiencing?
1. Redis hangs: redis-cli can still connect (on localhost), but it cannot process any command such as "info".
2. Redis runs on port 6666; lsof -i :6666 shows
 redis-ser 32193 root    4u  IPv4  794918000       TCP 192.168.1.1:6666 (LISTEN)
 redis-ser 32193 root    5u  IPv4 1089726413       TCP 192.168.1.1:6666->192.168.1.1:36121 (CLOSE_WAIT)
3."strace -p" keeps showing "Process 32193 attached - interrupt to quit"
4.Memory usage is almost 6G , CPU usage stays 100%

What steps will reproduce the problem?
1.Run redis (vm disabled,max_memory set to 28G,etc)
2.whenever memory usage raised to almost 6G or CPU usage raised to 100%

Original issue reported on code.google.com by zyfu...@gmail.com on 29 Aug 2011 at 2:47

GoogleCodeExporter commented 8 years ago
Please attach a debugger to the Redis server and dump the stack
(the bt command in gdb).
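
For example (using the PID 32193 from the lsof output above; adjust it to your redis-server process):

gdb -p 32193
(gdb) bt
(gdb) detach
(gdb) quit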

Regards,
Didier.

Original comment by didier...@gmail.com on 29 Aug 2011 at 8:41

GoogleCodeExporter commented 8 years ago
#0  0x00002ae0233bf490 in _int_malloc () from /lib64/libc.so.6
#1  0x00002ae0233c11b6 in malloc () from /lib64/libc.so.6
#2  0x0000000000410e97 in zmalloc ()
#3  0x0000000000416ab2 in createClient ()
#4  0x0000000000416c45 in ?? ()
#5  0x0000000000416d6f in acceptTcpHandler ()
#6  0x000000000040ab5f in aeProcessEvents ()
#7  0x000000000040ad5e in aeMain ()
#8  0x000000000040f97d in main ()

Original comment by zyfu...@gmail.com on 29 Aug 2011 at 10:24

GoogleCodeExporter commented 8 years ago
[deleted comment]
GoogleCodeExporter commented 8 years ago
Thanks.

Unfortunately, the stack shows a probable infinite loop
in the memory allocator. It means the memory got
corrupted before a connection accept operation, but
not much can be said about the root cause of the
corruption.

Regards,
Didier.

Original comment by didier...@gmail.com on 29 Aug 2011 at 12:30

GoogleCodeExporter commented 8 years ago
When you say the memory usage increased to 6G, do you mean 6G as per output of 
INFO, or 6G in RSS? If it is 6G RSS, there should be enough memory on the 
machine to satisfy allocation requests, unless there are other active processes 
that require a lot of memory, or as Didier says, the memory is corrupted.

Once Redis hits 100% CPU and hangs, is the stack trace consistently 
reproducible?

Original comment by pcnoordh...@gmail.com on 30 Aug 2011 at 5:06

GoogleCodeExporter commented 8 years ago
I suggest taking multiple stack traces with gdb (the bt command) so that we can
understand whether it is a loop in malloc (not impossible, but uncommon) or
something at a higher level.

Thanks!

Original comment by anti...@gmail.com on 14 Sep 2011 at 1:22

GoogleCodeExporter commented 8 years ago
Hi there.

I too am seeing periodic 100% CPU usage in my redis-server process.  

My recent gdb bt:
>>>
Loaded symbols for /lib64/ld-linux-x86-64.so.2
0xffffffffff60043b in ?? ()
(gdb) bt
#0  0xffffffffff60043b in ?? ()
#1  0x00000000024002a0 in ?? ()
#2  0x00007fca4e518560 in ?? () from /lib/libc.so.6
#3  0x0000000000411d7c in redisLog (level=3, fmt=0x449f08 "Accepting client connection: %s") at redis.c:203
#4  0x000000000041ae96 in acceptTcpHandler (el=<value optimized out>, fd=<value optimized out>, privdata=<value optimized out>, mask=<value optimized out>) at networking.c:436
#5  0x000000000040d5f3 in aeProcessEvents (eventLoop=0x7fca4d847000, flags=<value optimized out>) at ae.c:342
#6  0x000000000040d83e in aeMain (eventLoop=0x7fca4d847000) at ae.c:387
#7  0x0000000000412f37 in main (argc=2, argv=<value optimized out>) at redis.c:1734
>>>

For what it's worth, this is a Redis master that is configured not to flush to
disk at all (that responsibility has been shifted to the slave).

Cheers.

Original comment by findch...@gmail.com on 18 Apr 2012 at 1:06

GoogleCodeExporter commented 8 years ago
One more backtrace that I think I caught while CPU was at 100%:
>>>

0x00007fca4e596d03 in epoll_wait () from /lib/libc.so.6
(gdb) bt
#0  0x00007fca4e596d03 in epoll_wait () from /lib/libc.so.6
#1  0x000000000040d503 in aeApiPoll (eventLoop=0x7fca4d847000, flags=<value optimized out>) at ae_epoll.c:70
#2  aeProcessEvents (eventLoop=0x7fca4d847000, flags=<value optimized out>) at ae.c:330
#3  0x000000000040d83e in aeMain (eventLoop=0x7fca4d847000) at ae.c:387
#4  0x0000000000412f37 in main (argc=2, argv=<value optimized out>) at redis.c:1734
>>>

Hope that helps.
-Chris

Original comment by findch...@gmail.com on 18 Apr 2012 at 1:10

GoogleCodeExporter commented 8 years ago
Here is a long sampling using the Poor Man's Debugger, while CPU is pegged at 
100%:
https://gist.github.com/2415544

-Chris

Original comment by findch...@gmail.com on 18 Apr 2012 at 6:17

GoogleCodeExporter commented 8 years ago
Hi, everything looks OK from the trace. What is the output of INFO? Maybe you
are seeing it saving when the slave disconnects? Could you share the Redis logs
from around the moment the CPU reaches 100%? Does it slow down when this
happens, or does it serve connections just fine?

Cheers,
Salvatore

Original comment by anti...@gmail.com on 19 Apr 2012 at 7:43

GoogleCodeExporter commented 8 years ago
Thanks for looking into this.

When the CPU is pegged at 100%, running `redis-cli info` hangs. There is no
response until the CPU usage returns to normal.

Again, this is the output from the Poor Man's Debugger during this period:
>>>
root@domU-12-31-39-14-09-3D:~# /usr/local/bin/poor.sh 
      4 
      2 pthread_cond_wait@@GLIBC_2.3.2,bioProcessBackgroundJobs,start_thread,clone,??
      1 ??,??,??,redisLog,acceptTcpHandler,aeProcessEvents,aeMain,main
root@domU-12-31-39-14-09-3D:~# redis-cli info
^C
root@domU-12-31-39-14-09-3D:~# /usr/local/bin/poor.sh 
      4 
      2 pthread_cond_wait@@GLIBC_2.3.2,bioProcessBackgroundJobs,start_thread,clone,??
      1 ??,??,??,redisLog,acceptTcpHandler,aeProcessEvents,aeMain,main
root@domU-12-31-39-14-09-3D:~# redis-cli info
^C
>>>

Cheers,
Chris

Original comment by findch...@gmail.com on 19 Apr 2012 at 8:20

GoogleCodeExporter commented 8 years ago
One other thing: I have not seen a new log entry since I told Redis to stop
flushing to disk with this command: `redis-cli config set save ""`. Should
there be?

The only log entries prior to that looked like:
>>
[18350] 13 Apr 14:26:15 * 100 changes in 30 seconds. Saving...
[18350] 13 Apr 14:26:15 * Background saving started by pid 21107
[21107] 13 Apr 14:26:37 * DB saved on disk
[18350] 13 Apr 14:26:37 * Background saving terminated with success
>>

-Chris

Original comment by findch...@gmail.com on 19 Apr 2012 at 8:24

GoogleCodeExporter commented 8 years ago
And here's a dump of INFO (when not pegged at 100%):  
https://gist.github.com/2424421

Original comment by findch...@gmail.com on 19 Apr 2012 at 9:46

GoogleCodeExporter commented 8 years ago
I'm getting 100% CPU usage once or twice a day.

Every time it happens I get a 'DB saved on disk' message without a matching
'Background saving started by pid ???' line.

* Happened with v2.2.7, and still happens after I upgraded to v2.4.11.
* It didn't use to happen; it started after I loaded a big hash of about 540k+ items into one key.
* It recovers by itself after an hour and doesn't crash.
* But Redis is very slow or cannot be connected to while it's happening.

Below are the logs; notice that there is no 'Background saving started' line for pid 9668.

[15126] 28 Apr 08:01:15 * Background saving terminated with success
[15126] 28 Apr 08:16:16 * 10000 changes in 900 seconds. Saving...
[15126] 28 Apr 08:16:16 * Background saving started by pid 28646
[28646] 28 Apr 08:16:17 * DB saved on disk
[15126] 28 Apr 08:16:17 * Background saving terminated with success
[15126] 28 Apr 09:00:20 * 10000 changes in 900 seconds. Saving...
[15126] 28 Apr 09:00:20 * Background saving started by pid 334
[334] 28 Apr 09:00:20 * DB saved on disk
[15126] 28 Apr 09:00:20 * Background saving terminated with success
[15126] 28 Apr 09:15:21 * 10000 changes in 900 seconds. Saving...
[15126] 28 Apr 09:15:21 * Background saving started by pid 3474
[3474] 28 Apr 09:15:21 * DB saved on disk
[15126] 28 Apr 09:15:21 * Background saving terminated with success
[9668] 28 Apr 10:15:22 * DB saved on disk
[15126] 28 Apr 11:15:23 * 10 changes in 3600 seconds. Saving...
[15126] 28 Apr 11:15:23 * Background saving started by pid 14775
[14775] 28 Apr 11:15:23 * DB saved on disk
[15126] 28 Apr 11:15:23 * Background saving terminated with success
[15126] 28 Apr 12:15:24 * 10 changes in 3600 seconds. Saving...
[15126] 28 Apr 12:15:24 * Background saving started by pid 21453
[21453] 28 Apr 12:15:24 * DB saved on disk

Original comment by nik...@gmail.com on 28 Apr 2012 at 5:18

GoogleCodeExporter commented 8 years ago
I've figured out my problem: I had too many connections and Redis ran out of
file descriptors.

In ae_epoll.c, aeApiPoll() calls epoll_wait(), but when epoll_wait() returns an
error nothing gets printed or handled; the function just returns. So it gets
called again immediately, in an endless loop that uses up 100% of the CPU.
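
Something along these lines would at least surface the error (a minimal sketch only, not the actual Redis code; the function name and the MAX_EVENTS constant are made up for illustration):

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/epoll.h>

#define MAX_EVENTS 1024

/* Wait for events once; if epoll_wait() fails for any reason other than a
 * signal interruption, log the errno instead of silently returning and
 * letting the event loop call us again in a tight loop. */
static int poll_once(int epfd, struct epoll_event *events, int timeout_ms) {
    int numevents = epoll_wait(epfd, events, MAX_EVENTS, timeout_ms);
    if (numevents == -1 && errno != EINTR) {
        fprintf(stderr, "epoll_wait error: %s\n", strerror(errno));
    }
    return numevents;
}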

Original comment by nik...@gmail.com on 1 May 2012 at 1:34

GoogleCodeExporter commented 8 years ago
Hi there.

How did you figure this out, and how did you resolve it?

Cheers.

Original comment by ch...@socialvibe.com on 1 May 2012 at 8:50

GoogleCodeExporter commented 8 years ago
Run 
sudo netstat -nap | grep redis

It should only print a few lines with your connections, not hundreds or 
thousands.

It was a problem with my program, not Redis, but Redis should really print an
error message instead of going into an endless loop.

Original comment by nik...@gmail.com on 1 May 2012 at 10:23

GoogleCodeExporter commented 8 years ago
@niknah: this problem is no longer present in Redis 2.6: the available file
descriptors are now checked at startup and the maxclients limit is enforced as
needed. Cheers.

P.S. Please report any further issues using the GitHub issue system; we no
longer use this one. Thanks.

Original comment by anti...@gmail.com on 1 May 2012 at 10:32

GoogleCodeExporter commented 8 years ago
@niknah Looks like when my CPU was hitting 100%, the number of connections was
high.

# sudo netstat -nap | grep redis |wc -l
1020

So:
(1) @antirez:  When you say it's not a problem in 2.6, does that imply that 
connections beyond max_clients are closed automatically to make another 
connection available?
(2) @niknah - What exactly did you change in your code to fix this?  I most 
likely have the same issue, and I'll try to apply your fix.

Cheers!

Original comment by ch...@socialvibe.com on 1 May 2012 at 11:05

GoogleCodeExporter commented 8 years ago
New connections will get an error once you've reached the maximum number of
connections.

I am using Node.js; for every connection you have to send 'quit' when you're
finished. For example:

var redis = require('redis');

var client = redis.createClient();
try {
    // ... use the client here ...
} finally {
    client.quit();  // close the connection once you're finished with it
}

But if you've stored the client in an object, you have to be careful to close
it whenever the object goes away.

Original comment by nik...@gmail.com on 2 May 2012 at 1:15

GoogleCodeExporter commented 8 years ago
We've hit connection problems here as well. Be aware that the common Linux
configuration limits the number of open file descriptors to 1024 (see
ulimit -n). On top of that, Redis has a built-in hard limit of 10240, which we
removed on our end.

When you run out of FDs, Redis keeps trying to grab them and the CPU shoots up
to 100%.

Original comment by ma...@pinterest.com on 2 May 2012 at 1:30

GoogleCodeExporter commented 8 years ago
Thanks so much for the help, guys.

This was definitely related to the FD limit. I started redis-server with
`ulimit -n 10240`, and there is no more CPU at 100%. Great stuff.

I think I will need to ensure my connections get closed properly. I am using
redis-rb with Passenger.

Original comment by ch...@socialvibe.com on 2 May 2012 at 1:57

GoogleCodeExporter commented 8 years ago
This is the new behavior of Redis 2.6:

1) At startup it checks the current maxclients setting (10000 by default), then
tries to raise the file descriptor limit for the process to maxclients+32 (at
most 32 extra file descriptors are used internally).

2) If it is not able to obtain as many file descriptors as maxclients asks for,
it looks up the actual limit for the process and sets the maxclients directive
to that limit minus 32 (see the sketch below).

3) So maxclients is now always active, AND it matches the actual process limit.

4) If maxclients is reached, the newly connected client receives an error and
the connection is closed.
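
A minimal sketch of steps 1, 2 and 4 (illustrative only, not the actual Redis 2.6 source; the function names, the RESERVED_FDS constant and the error text are made up here), assuming the POSIX getrlimit()/setrlimit() calls:

#include <stdio.h>
#include <string.h>
#include <sys/resource.h>
#include <unistd.h>

#define RESERVED_FDS 32  /* descriptors kept aside for internal use */

/* Steps 1-2: try to raise the open-files limit to maxclients+32; if that is
 * not possible, shrink maxclients to fit the limit the process actually has. */
static int adjust_max_clients(int maxclients) {
    struct rlimit limit;
    rlim_t needed = (rlim_t)maxclients + RESERVED_FDS;

    if (getrlimit(RLIMIT_NOFILE, &limit) == -1) return maxclients;
    if (limit.rlim_cur >= needed) return maxclients;

    limit.rlim_cur = needed;
    if (limit.rlim_max < needed) limit.rlim_max = needed; /* may need privileges */
    if (setrlimit(RLIMIT_NOFILE, &limit) == 0) return maxclients;

    /* Could not raise the limit: report and fall back. */
    getrlimit(RLIMIT_NOFILE, &limit);
    fprintf(stderr, "maxclients reduced to %d to match the open-files limit\n",
            (int)limit.rlim_cur - RESERVED_FDS);
    return (int)limit.rlim_cur - RESERVED_FDS;
}

/* Step 4: when the limit is hit, the new client gets an error and is closed. */
static void reject_client(int fd) {
    const char *err = "-ERR max number of clients reached\r\n";
    if (write(fd, err, strlen(err)) < 0) { /* best effort; ignore errors */ }
    close(fd);
}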

Cheers,
Salvatore

Original comment by anti...@gmail.com on 2 May 2012 at 9:44

GoogleCodeExporter commented 8 years ago
Thanks for the clarification, Salvatore.

So when do connections get closed? Obviously the client can explicitly close
the connection, but is there a timeout after which Redis closes old
connections, and if so, where is that timeout configured?

BTW, I think we can change the status of this ticket from New->Resolved.

Cheers.

Original comment by ch...@socialvibe.com on 2 May 2012 at 7:42

GoogleCodeExporter commented 8 years ago
As per the configuration file:
# Close the connection after a client is idle for N seconds (0 to disable)
timeout 0

Original comment by josiah.c...@gmail.com on 2 May 2012 at 10:35