Supervisor process control system for Unix (supervisord)
http://supervisord.org

100% CPU usage (maybe caused by new poll implementation?) (help wanted) #807

Open timonwong opened 8 years ago

timonwong commented 8 years ago

We were using supervisor 3.3.0 on Ubuntu 14.04 LTS.

Recently, on some of our nodes on AWS, we spotted very high CPU usage from supervisord. To work around it we have to reload supervisord, but the problem can come back within a day.

Reading the strace output, we saw excessive calls to both 'gettimeofday' and 'poll', so we chose to downgrade supervisor to 3.2.3.

I see there was #581, but I think it's unrelated here. Our wild guess is that this is caused by the new poll implementation introduced in 3.3.0 (and maybe triggered by simultaneous log output?)...

Thanks in advance!

mnaberez commented 8 years ago

Recently, on some of our nodes on AWS, we spotted very high CPU usage from supervisord. To work around it we have to reload supervisord, but the problem can come back within a day.

@hathawsh also saw high CPU usage from the new poller implementation and submitted #589. That was merged before 3.3.0 was released.

Reading the strace output, we saw excessive calls to both 'gettimeofday' and 'poll', so we chose to downgrade supervisor to 3.2.3.

It's possible this is the main loop spinning, since poller.poll() does a poll and, just below it, tick() does a gettimeofday.
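
For readers who haven't looked at the code, the loop being described has roughly this shape (a simplified sketch, not the actual supervisord source; poller, dispatchers, and tick are stand-ins). If poll() keeps reporting a fd as ready but the handler never drains it, every iteration returns immediately and the loop spins:

import time

def tick():
    # Stands in for supervisord's periodic bookkeeping; time.time() is what
    # shows up as gettimeofday() in the strace output.
    time.time()

def runforever(poller, dispatchers, timeout_ms=1000):
    # poller: e.g. a select.poll() object; dispatchers: {fd: callable}
    while True:
        ready = poller.poll(timeout_ms)     # the poll(...) calls in the strace
        for fd, _event in ready:
            handler = dispatchers.get(fd)
            if handler is not None:
                handler()                   # must actually read/write the fd
        tick()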

Can you run 3.3.0 with loglevel=blather? That will produce as much debugging information as possible. Perhaps it will contain some clues as to why this is happening.

and maybe triggered by simultaneous log output?

There weren't any changes to logging between 3.2.3 and 3.3.0. Since you said 3.2.3 works for you, I wouldn't suspect logging.

timonwong commented 8 years ago

Hi! Today I spotted the problem again. Here is the gzipped strace log (duration ~1s): strace.log.tar.gz. And here are the open fds:

0 -> /dev/null
1 -> /dev/null
10 -> /data/supervisor_log/bg_task_01-stdout.log.10 (deleted)
100 -> pipe:[32356299]
101 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
102 -> pipe:[32356298]
103 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
104 -> pipe:[32356300]
105 -> /data/supervisor_log/bg_task_00-stderr.log
106 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
107 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
108 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
109 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
11 -> pipe:[3024]
110 -> /data/supervisor_log/bg_task_00-stdout.log
111 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
112 -> /data/supervisor_log/bg_task_00-stderr.log
113 -> pipe:[32356302]
114 -> pipe:[32356301]
115 -> pipe:[32356303]
116 -> /data/supervisor_log/bg_task_00-stdout.log.1
117 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
118 -> /data/supervisor_log/bg_task_00-stderr.log
119 -> /data/supervisor_log/bg_task_00-stdout.log.8
12 -> pipe:[17721748]
121 -> pipe:[32356304]
124 -> /data/supervisor_log/bg_task_00-stderr.log
13 -> pipe:[17721749]
14 -> pipe:[11775]
15 -> pipe:[3025]
16 -> pipe:[11776]
17 -> /data/supervisor_log/bg_task_02-stdout.log
18 -> /data/supervisor_log/bg_task_02-stderr.log
19 -> pipe:[17721751]
2 -> /dev/null
20 -> pipe:[11777]
21 -> pipe:[17721750]
22 -> /data/supervisor_log/bg_task_03-stdout.log
23 -> /data/supervisor_log/bg_task_03-stderr.log
24 -> pipe:[11827]
25 -> pipe:[17721752]
26 -> pipe:[11828]
27 -> /data/supervisor_log/bg_task_01-stdout.log.1
28 -> /data/supervisor_log/bg_task_01-stderr.log.2
29 -> pipe:[17721754]
3 -> /data/supervisor_log/supervisord.log
30 -> pipe:[11829]
31 -> pipe:[17721753]
32 -> /data/supervisor_log/bg_task_04-stdout.log
33 -> /data/supervisor_log/bg_task_04-stderr.log
34 -> pipe:[17721755]
35 -> /data/supervisor_log/bg_task_01-stdout.log
36 -> /data/supervisor_log/bg_task_01-stderr.log.1
37 -> pipe:[17721757]
38 -> pipe:[17721756]
39 -> pipe:[17721758]
4 -> socket:[13073]
40 -> /data/supervisor_log/bg_task_01-stdout.log.3
41 -> /data/supervisor_log/bg_task_01-stderr.log
42 -> /data/supervisor_log/bg_task_01-stdout.log.10 (deleted)
43 -> /data/supervisor_log/bg_task_01-stdout.log.10 (deleted)
44 -> pipe:[17721759]
45 -> /data/supervisor_log/bg_task_01-stdout.log.10 (deleted)
46 -> pipe:[17719642]
47 -> /data/supervisor_log/bg_task_01-stdout.log.10 (deleted)
48 -> pipe:[17719643]
49 -> /data/supervisor_log/bg_task_01-stdout.log.2
5 -> /data/supervisor_log/bg_task_01-stdout.log.10 (deleted)
50 -> pipe:[17719644]
51 -> /data/supervisor_log/bg_task_01-stdout.log.10 (deleted)
52 -> /data/supervisor_log/bg_task_01-stderr.log.3
53 -> /data/supervisor_log/bg_task_05-stdout.log
54 -> /data/supervisor_log/bg_task_05-stderr.log
55 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
56 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
57 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
58 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
59 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
6 -> /data/supervisor_log/bg_task_01-stdout.log.10 (deleted)
60 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
61 -> pipe:[30456289]
62 -> pipe:[30456290]
63 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
64 -> /data/supervisor_log/node_exporter.log
65 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
66 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
67 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
68 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
69 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
7 -> /data/supervisor_log/bg_task_01-stdout.log.10 (deleted)
70 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
71 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
72 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
73 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
74 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
75 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
76 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
77 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
78 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
79 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
8 -> /data/supervisor_log/bg_task_01-stdout.log.10 (deleted)
80 -> /data/supervisor_log/bg_task_00-stderr.log
81 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
82 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
83 -> /data/supervisor_log/bg_task_00-stderr.log
84 -> /data/supervisor_log/bg_task_00-stdout.log.2
85 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
86 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
87 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
88 -> /data/supervisor_log/bg_task_00-stderr.log
89 -> /data/supervisor_log/bg_task_00-stderr.log
9 -> pipe:[3023]
90 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
91 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
92 -> /data/supervisor_log/bg_task_00-stdout.log.8
93 -> pipe:[32356293]
94 -> pipe:[32356294]
95 -> pipe:[32356296]
96 -> pipe:[32356295]
97 -> pipe:[32356297]
98 -> /data/supervisor_log/bg_task_00-stdout.log.3
99 -> /data/supervisor_log/bg_task_00-stderr.log

I haven't enabled the blather log level yet. I'll try it now and see what's going on.

timonwong commented 8 years ago

Hi, me again. Here is the supervisord log with the blather level enabled: supervisord.log.tar.gz

blusewang commented 7 years ago

Me too. Every time I log in over HTTP and restart a program, my server's CPU immediately goes to 100%. I've wasted so much time on this bug! I use supervisor to manage my Laravel queue. I hate it, but I have no choice!

gonesurfing commented 7 years ago

This is 100% reproducible on 3.3.1 (Anaconda package on OS X):
- Start supervisord with the HTTP server on localhost:9001.
- Open a new IPython console, import xmlrpclib, and type server = xmlrpclib.Server('http://localhost:9001').
- Call any method on server; the CPU instantly goes to 100% and stays there.
- del(server) and the CPU goes back to normal.
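
In script form, those steps are roughly the following (Python 2, since the report uses xmlrpclib; the URL comes from the steps above, and getState is just one example call):

import xmlrpclib  # Python 2; on Python 3 this module is xmlrpc.client

# Connect to supervisord's inet_http_server as described in the steps above.
server = xmlrpclib.Server('http://localhost:9001')

# Per the report, any RPC call over this connection pegged supervisord's CPU...
print(server.supervisor.getState())

# ...and dropping the proxy (closing the connection) brought it back to normal.
del server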

Bummer... this is a showstopper for what I need. Let me know if any other info is needed. I'm going to have to downgrade and see if an older version doesn't show this behavior.

Update: I downgraded to 3.1.3 and this behavior is not present. Something definitely broke between those two versions, but I don't have any of the 3.2.x branch available to test easily.

cenkalti commented 7 years ago

I can confirm this. Very frustrating 😞

mnaberez commented 7 years ago

@igorsobreira A number of users have reported high CPU usage from supervisord and believe it was introduced by your patch in #129. Another case of high CPU usage was found in #589 and confirmed to be caused by the patch. Could you please look into this? There are reproduction instructions above in https://github.com/Supervisor/supervisor/issues/807#issuecomment-269161194.

plockaby commented 7 years ago

I want to add that of the 76 servers I have running supervisord, I had to stop/start supervisord on seven of them in the past four days because of this bug.

plockaby commented 7 years ago

Now an eighth. A bit more detail this time. I am running supervisor 3.3.0. I installed it last May as soon as it came out, because it had my patch in it that I really wanted, and I restarted supervisord on all my hosts to get the new version; it had been going just fine. Last week I made a global configuration change: I changed minfds to 100000 (because a specific process wanted it) and I set childlogdir, which had not been set before. I installed the new configuration on all my hosts and went around restarting supervisord. Now hosts are going into this poll spin loop at random. This is the strace log. They are spinning like crazy, eating up a whole CPU.

poll([{fd=4, events=POLLIN|POLLPRI|POLLHUP}, {fd=8, events=POLLIN|POLLPRI|POLLHUP}, {fd=10, events=POLLIN|POLLPRI|POLLHUP}, {fd=11, events=POLLIN|POLLPRI|POLLHUP}, {fd=15, events=POLLIN|POLLPRI|POLLHUP}, {fd=16, events=POLLIN|POLLPRI|POLLHUP}, {fd=20, events=POLLIN|POLLPRI|POLLHUP}, {fd=21, events=POLLIN|POLLPRI|POLLHUP}, {fd=22, events=POLLIN|POLLPRI|POLLHUP}, {fd=25, events=POLLIN|POLLPRI|POLLHUP}, {fd=26, events=POLLIN|POLLPRI|POLLHUP}, {fd=30, events=POLLIN|POLLPRI|POLLHUP}, {fd=31, events=POLLIN|POLLPRI|POLLHUP}, {fd=35, events=POLLIN|POLLPRI|POLLHUP}, {fd=36, events=POLLIN|POLLPRI|POLLHUP}, {fd=40, events=POLLIN|POLLPRI|POLLHUP}, {fd=41, events=POLLIN|POLLPRI|POLLHUP}, {fd=45, events=POLLIN|POLLPRI|POLLHUP}], 18, 1000) = 1 ([{fd=20, revents=POLLIN}])
gettimeofday({1492359703, 190995}, NULL) = 0
gettimeofday({1492359703, 191062}, NULL) = 0
gettimeofday({1492359703, 194873}, NULL) = 0
gettimeofday({1492359703, 194973}, NULL) = 0
gettimeofday({1492359703, 195072}, NULL) = 0
gettimeofday({1492359703, 195108}, NULL) = 0
gettimeofday({1492359703, 195153}, NULL) = 0
gettimeofday({1492359703, 195224}, NULL) = 0
gettimeofday({1492359703, 195254}, NULL) = 0
gettimeofday({1492359703, 195299}, NULL) = 0
gettimeofday({1492359703, 195327}, NULL) = 0
gettimeofday({1492359703, 195378}, NULL) = 0
gettimeofday({1492359703, 195446}, NULL) = 0
wait4(-1, 0x7ffea7758d04, WNOHANG, NULL) = 0
gettimeofday({1492359703, 195526}, NULL) = 0
poll([{fd=4, events=POLLIN|POLLPRI|POLLHUP}, {fd=8, events=POLLIN|POLLPRI|POLLHUP}, {fd=10, events=POLLIN|POLLPRI|POLLHUP}, {fd=11, events=POLLIN|POLLPRI|POLLHUP}, {fd=15, events=POLLIN|POLLPRI|POLLHUP}, {fd=16, events=POLLIN|POLLPRI|POLLHUP}, {fd=20, events=POLLIN|POLLPRI|POLLHUP}, {fd=21, events=POLLIN|POLLPRI|POLLHUP}, {fd=22, events=POLLIN|POLLPRI|POLLHUP}, {fd=25, events=POLLIN|POLLPRI|POLLHUP}, {fd=26, events=POLLIN|POLLPRI|POLLHUP}, {fd=30, events=POLLIN|POLLPRI|POLLHUP}, {fd=31, events=POLLIN|POLLPRI|POLLHUP}, {fd=35, events=POLLIN|POLLPRI|POLLHUP}, {fd=36, events=POLLIN|POLLPRI|POLLHUP}, {fd=40, events=POLLIN|POLLPRI|POLLHUP}, {fd=41, events=POLLIN|POLLPRI|POLLHUP}, {fd=45, events=POLLIN|POLLPRI|POLLHUP}], 18, 1000) = 1 ([{fd=20, revents=POLLIN}])
gettimeofday({1492359703, 195874}, NULL) = 0
gettimeofday({1492359703, 195936}, NULL) = 0
gettimeofday({1492359703, 196000}, NULL) = 0
gettimeofday({1492359703, 196092}, NULL) = 0
gettimeofday({1492359703, 196166}, NULL) = 0
gettimeofday({1492359703, 196256}, NULL) = 0
gettimeofday({1492359703, 196336}, NULL) = 0
gettimeofday({1492359703, 196380}, NULL) = 0
gettimeofday({1492359703, 196520}, NULL) = 0
gettimeofday({1492359703, 196557}, NULL) = 0
gettimeofday({1492359703, 196599}, NULL) = 0
gettimeofday({1492359703, 196643}, NULL) = 0
gettimeofday({1492359703, 196689}, NULL) = 0
wait4(-1, 0x7ffea7758d04, WNOHANG, NULL) = 0
gettimeofday({1492359703, 196787}, NULL) = 0

On the particular host that is spinning right now, these are the fds:

total 0
lr-x------ 1 root root 64 Apr 11 15:42 0 -> /dev/null
lrwx------ 1 root root 64 Apr 11 15:42 1 -> socket:[74731265]
lr-x------ 1 root root 64 Apr 11 15:42 10 -> pipe:[74731311]
lr-x------ 1 root root 64 Apr 11 15:42 11 -> pipe:[74731313]
l-wx------ 1 root root 64 Apr 11 15:42 12 -> /data/logs/supervisor/supermon.log
l-wx------ 1 root root 64 Apr 11 15:42 14 -> pipe:[77877459]
lr-x------ 1 root root 64 Apr 11 15:42 15 -> pipe:[74731314]
lr-x------ 1 root root 64 Apr 11 15:42 16 -> pipe:[77877460]
l-wx------ 1 root root 64 Apr 11 15:42 17 -> /data/logs/supervisor/supercron.log
l-wx------ 1 root root 64 Apr 11 15:42 18 -> /data/logs/supervisor/supercron.err
l-wx------ 1 root root 64 Apr 11 15:42 19 -> pipe:[74731318]
lrwx------ 1 root root 64 Apr 11 15:42 2 -> socket:[74731265]
l-wx------ 1 root root 64 Apr 11 15:42 20 -> /data/logs/supervisor/dart-agent.err
lr-x------ 1 root root 64 Apr 11 15:42 21 -> pipe:[74731319]
lr-x------ 1 root root 64 Apr 11 15:42 22 -> pipe:[77877461]
l-wx------ 1 root root 64 Apr 11 15:42 24 -> pipe:[74731321]
lr-x------ 1 root root 64 Apr 11 15:42 25 -> pipe:[74731320]
lr-x------ 1 root root 64 Apr 11 15:42 26 -> pipe:[74731322]
l-wx------ 1 root root 64 Apr 11 15:42 27 -> /data/logs/supervisor/statsd.log
l-wx------ 1 root root 64 Apr 11 15:42 28 -> /data/logs/supervisor/statsd.err
l-wx------ 1 root root 64 Apr 16 09:27 29 -> pipe:[74731324]
l-wx------ 1 root root 64 Apr 11 15:42 3 -> /data/logs/supervisord.log
lr-x------ 1 root root 64 Apr 16 09:27 30 -> pipe:[74731323]
lr-x------ 1 root root 64 Apr 16 09:27 31 -> pipe:[74731325]
l-wx------ 1 root root 64 Apr 16 09:27 32 -> /data/logs/supervisor/redis.log
l-wx------ 1 root root 64 Apr 16 09:27 33 -> /data/logs/supervisor/redis.err
l-wx------ 1 root root 64 Apr 16 09:27 34 -> pipe:[74731327]
lr-x------ 1 root root 64 Apr 16 09:27 35 -> pipe:[74731326]
lr-x------ 1 root root 64 Apr 16 09:27 36 -> pipe:[74731328]
l-wx------ 1 root root 64 Apr 11 15:42 37 -> /data/logs/supervisor/dmca-admin-web.log
l-wx------ 1 root root 64 Apr 11 15:42 38 -> /data/logs/supervisor/dmca-admin-web.err
l-wx------ 1 root root 64 Apr 16 09:27 39 -> pipe:[74731330]
lrwx------ 1 root root 64 Apr 11 15:42 4 -> socket:[74731287]
lr-x------ 1 root root 64 Apr 16 09:27 40 -> pipe:[74731329]
lr-x------ 1 root root 64 Apr 11 15:42 41 -> pipe:[74731331]
l-wx------ 1 root root 64 Apr 11 15:42 42 -> /data/logs/supervisor/pgcheck.log
l-wx------ 1 root root 64 Apr 11 15:42 43 -> /data/logs/supervisor/pgcheck.err
l-wx------ 1 root root 64 Apr 11 15:42 44 -> /data/logs/supervisor/dart-agent.log
lr-x------ 1 root root 64 Apr 11 15:42 45 -> pipe:[74731332]
l-wx------ 1 root root 64 Apr 11 15:42 47 -> /data/logs/supervisor/pgwatch.log
l-wx------ 1 root root 64 Apr 11 15:42 48 -> /data/logs/supervisor/pgwatch.err
l-wx------ 1 root root 64 Apr 11 15:42 5 -> /data/logs/supervisor/supermon.err
l-wx------ 1 root root 64 Apr 11 15:42 6 -> pipe:[74731309]
lr-x------ 1 root root 64 Apr 11 15:42 7 -> /dev/urandom
lr-x------ 1 root root 64 Apr 11 15:42 8 -> pipe:[74731310]
l-wx------ 1 root root 64 Apr 11 15:42 9 -> pipe:[74731312]
mnaberez commented 7 years ago

@plockaby We're testing a fix for this over in #875. Are you able to try that patch?

plockaby commented 7 years ago

I have now upgraded about two-thirds of my servers to 3.3.1 and installed the patch. If it takes over a CPU again I'll let you know, but it might be a few days. Sorry, I didn't notice the conversation in the pull request.

plockaby commented 7 years ago

That patch did not solve my problem. Almost immediately, one of the updated systems (not the same one as before) went into the same polling frenzy:

poll([{fd=4, events=POLLIN|POLLPRI|POLLHUP}, {fd=5, events=POLLIN|POLLPRI|POLLHUP}, {fd=8, events=POLLIN|POLLPRI|POLLHUP}, {fd=10, events=POLLIN|POLLPRI|POLLHUP}, {fd=11, events=POLLIN|POLLPRI|POLLHUP}, {fd=15, events=POLLIN|POLLPRI|POLLHUP}, {fd=16, events=POLLIN|POLLPRI|POLLHUP}, {fd=20, events=POLLIN|POLLPRI|POLLHUP}, {fd=21, events=POLLIN|POLLPRI|POLLHUP}, {fd=25, events=POLLIN|POLLPRI|POLLHUP}, {fd=26, events=POLLIN|POLLPRI|POLLHUP}, {fd=30, events=POLLIN|POLLPRI|POLLHUP}], 12, 1000) = 1 ([{fd=5, revents=POLLIN}])
gettimeofday({1492367896, 593446}, NULL) = 0
gettimeofday({1492367896, 593515}, NULL) = 0
gettimeofday({1492367896, 593584}, NULL) = 0
gettimeofday({1492367896, 593637}, NULL) = 0
gettimeofday({1492367896, 593719}, NULL) = 0
gettimeofday({1492367896, 593788}, NULL) = 0
gettimeofday({1492367896, 593826}, NULL) = 0
gettimeofday({1492367896, 593863}, NULL) = 0
gettimeofday({1492367896, 593886}, NULL) = 0
gettimeofday({1492367896, 593906}, NULL) = 0
gettimeofday({1492367896, 593948}, NULL) = 0
gettimeofday({1492367896, 593987}, NULL) = 0
gettimeofday({1492367896, 594031}, NULL) = 0
wait4(-1, 0x7ffe0f36f7e4, WNOHANG, NULL) = 0
gettimeofday({1492367896, 594103}, NULL) = 0
poll([{fd=4, events=POLLIN|POLLPRI|POLLHUP}, {fd=5, events=POLLIN|POLLPRI|POLLHUP}, {fd=8, events=POLLIN|POLLPRI|POLLHUP}, {fd=10, events=POLLIN|POLLPRI|POLLHUP}, {fd=11, events=POLLIN|POLLPRI|POLLHUP}, {fd=15, events=POLLIN|POLLPRI|POLLHUP}, {fd=16, events=POLLIN|POLLPRI|POLLHUP}, {fd=20, events=POLLIN|POLLPRI|POLLHUP}, {fd=21, events=POLLIN|POLLPRI|POLLHUP}, {fd=25, events=POLLIN|POLLPRI|POLLHUP}, {fd=26, events=POLLIN|POLLPRI|POLLHUP}, {fd=30, events=POLLIN|POLLPRI|POLLHUP}], 12, 1000) = 1 ([{fd=5, revents=POLLIN}])
gettimeofday({1492367896, 594331}, NULL) = 0
gettimeofday({1492367896, 594415}, NULL) = 0
gettimeofday({1492367896, 594491}, NULL) = 0
gettimeofday({1492367896, 594561}, NULL) = 0
gettimeofday({1492367896, 594642}, NULL) = 0
gettimeofday({1492367896, 594663}, NULL) = 0
gettimeofday({1492367896, 594699}, NULL) = 0
gettimeofday({1492367896, 594739}, NULL) = 0
gettimeofday({1492367896, 594769}, NULL) = 0
gettimeofday({1492367896, 594808}, NULL) = 0
gettimeofday({1492367896, 594836}, NULL) = 0
gettimeofday({1492367896, 594887}, NULL) = 0
gettimeofday({1492367896, 594934}, NULL) = 0
wait4(-1, 0x7ffe0f36f7e4, WNOHANG, NULL) = 0
gettimeofday({1492367896, 595005}, NULL) = 0
lr-x------ 1 root root 64 Apr 16 10:37 0 -> /dev/null
lrwx------ 1 root root 64 Apr 16 10:37 1 -> socket:[36606016]
lr-x------ 1 root root 64 Apr 16 10:37 10 -> pipe:[36606043]
lr-x------ 1 root root 64 Apr 16 10:37 11 -> pipe:[36606045]
l-wx------ 1 root root 64 Apr 16 10:37 12 -> /data/logs/supervisor/supermon.log
l-wx------ 1 root root 64 Apr 16 10:37 13 -> /data/logs/supervisor/supermon.err
l-wx------ 1 root root 64 Apr 16 10:37 14 -> pipe:[36606047]
lr-x------ 1 root root 64 Apr 16 10:37 15 -> pipe:[36606046]
lr-x------ 1 root root 64 Apr 16 10:37 16 -> pipe:[36606048]
l-wx------ 1 root root 64 Apr 16 10:37 17 -> /data/logs/supervisor/supercron.log
l-wx------ 1 root root 64 Apr 16 10:37 18 -> /data/logs/supervisor/supercron.err
l-wx------ 1 root root 64 Apr 16 10:37 19 -> pipe:[36606050]
lrwx------ 1 root root 64 Apr 16 10:37 2 -> socket:[36606016]
lr-x------ 1 root root 64 Apr 16 10:37 20 -> pipe:[36606049]
lr-x------ 1 root root 64 Apr 16 10:37 21 -> pipe:[36606051]
l-wx------ 1 root root 64 Apr 16 10:37 22 -> /data/logs/supervisor/dart-agent.log
l-wx------ 1 root root 64 Apr 16 10:37 24 -> pipe:[36606053]
lr-x------ 1 root root 64 Apr 16 10:37 25 -> pipe:[36606052]
lr-x------ 1 root root 64 Apr 16 10:37 26 -> pipe:[36606054]
l-wx------ 1 root root 64 Apr 16 10:37 27 -> /data/logs/supervisor/pgcheck.log
l-wx------ 1 root root 64 Apr 16 10:37 28 -> /data/logs/supervisor/pgcheck.err
l-wx------ 1 root root 64 Apr 16 10:37 3 -> /data/logs/supervisord.log
lr-x------ 1 root root 64 Apr 16 10:37 30 -> pipe:[36606055]
l-wx------ 1 root root 64 Apr 16 10:37 32 -> /data/logs/supervisor/pgwatch.log
l-wx------ 1 root root 64 Apr 16 10:37 33 -> /data/logs/supervisor/pgwatch.err
lrwx------ 1 root root 64 Apr 16 10:37 4 -> socket:[36606038]
l-wx------ 1 root root 64 Apr 16 10:37 5 -> /data/logs/supervisor/dart-agent.err
l-wx------ 1 root root 64 Apr 16 10:37 6 -> pipe:[36606041]
lr-x------ 1 root root 64 Apr 16 10:37 7 -> /dev/urandom
lr-x------ 1 root root 64 Apr 16 10:37 8 -> pipe:[36606042]
l-wx------ 1 root root 64 Apr 16 10:37 9 -> pipe:[36606044]
mnaberez commented 7 years ago

@plockaby I posted this over in #875. Can you please work with us there to reproduce?

plockaby commented 7 years ago

Can do. I'll move to commenting over there instead.

plockaby commented 7 years ago

I hate to harp on this but #875 didn't solve my problem and I'm still restarting supervisord every few days on random hosts. I just wanted to bring up that this is still happening.

blusewang commented 7 years ago

@plockaby I suggest you try systemd instead of supervisor!

mnaberez commented 7 years ago

I hate to harp on this but #875 didn't solve my problem and I'm still restarting supervisord every few days on random hosts.

@plockaby I was able to reproduce one cause of high CPU and observe that #875 did fix it, which is why I merged #875. I haven't observed another cause of high CPU usage yet. If you are still experiencing problems with #875 applied, we are going to need reproduction instructions that show us how to get supervisord into that state so we can work on it.

mnaberez commented 7 years ago

Supervisor 3.3.2 was released to PyPI and it includes the poller fixes (#875, #885).

The only high CPU problem that I was able to reproduce is fixed by this release.

If you are experiencing a high CPU problem on Supervisor 3.3.2, please open a new issue and include instructions to reproduce it.

plockaby commented 7 years ago

I've been on vacation so I'm just getting to this now.

I tried 3.3.2 and it didn't solve the CPU spikes. I actually rolled all of my hosts back to 3.2.3 under the assumption that it was a problem with the new polling code. I'm still experiencing supervisord going into polling loops that use 100% CPU.

I understand that you can't fix a problem that you can't reproduce. If I had any way to reproduce this consistently, I would show you. All I can really give you is situations where it is happening. Right now I can tell you that this is not caused by changes in 3.3.x. I can also tell you that it is (probably) not caused by multiple processes writing to their logs simultaneously, as this doesn't happen with any more frequency on our heavily loaded hosts than on our lightly loaded hosts. In fact, it has happened on hosts that have just one process controlled by supervisord just as often as it has happened on hosts that have forty processes.

If I ever do figure out how to reproduce it then I will create a new ticket. Would you mind if I continue to post information about when I experience the issue?

mnaberez commented 7 years ago

Would you mind if I continue to post information about when I experience the issue?

Of course not. Please do that.

plockaby commented 7 years ago

Interestingly, I've discovered that it often recovers from the polling problem on its own. I identified one of my hosts that had started spinning CPU on polling, but it wasn't convenient to restart supervisord at the time, so I just left top running in a window on the side. Six days of 100% CPU, and it stopped as suddenly as it started. Another host that I identified ran at 100% CPU for nearly five days before it stopped, but then it started again six hours later. (I've identified all of these by looking at collectd CPU graphs.)

I have an strace running on one of the hosts where the problem has cropped up several times and where I don't mind if the disk fills. When it starts on that host then I should have a bit more information about what is starting and/or stopping the problem.

I have also noticed that it is not happening on CentOS 7 hosts; so far it has only happened on CentOS 6 hosts. Unfortunately most of my hosts are still on CentOS 6. :)

plockaby commented 7 years ago

I've narrowed the problem down further: it is specific to the interaction between supervisord and an event listener on the communications channel between the two (i.e. my event listener's stdout). The file descriptor that is generating all of the poll events is my event listener's stdout.

This explains why the problem would disappear: my event listener is configured to restart itself at random intervals, and indeed when I restart it the 100% CPU spike goes away. I'm still examining the strace to figure out why the problem starts in the first place. The log itself isn't blowing up in size with an insane number of writes.

plockaby commented 7 years ago

I was going to write a big long description with quotes from my strace, but I think I discovered the problem: it happens when my event listener's log is rotated. Because it is an event listener and I really don't care about stdout for event listeners, my configuration looks like this:

stdout_logfile = /data/logs/supervisor/dart-agent.log
stdout_logfile_maxbytes = 1MB
stdout_logfile_backups = 0
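
For anyone trying to reproduce this, a hypothetical listener section along the same lines might look as follows (the program name and command are made up; the small maxbytes just makes the stdout log rotate often):

[eventlistener:dummy-listener]
command = python -u my_listener.py
events = PROCESS_STATE,TICK_60
stdout_logfile = /data/logs/supervisor/dummy-listener.log
stdout_logfile_maxbytes = 50KB
stdout_logfile_backups = 0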

At the point in the strace where my program starts going crazy, this appears:

open("/data/logs/supervisor/dart-agent.log", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 5
fstat(5, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0

That is the log file for my event listener, and 5 is the file descriptor for that log file, which supervisord is currently spinning on in poll:

poll([{fd=4, events=POLLIN|POLLPRI|POLLHUP}, {fd=5, events=POLLIN|POLLPRI|POLLHUP}, {fd=8, events=POLLIN|POLLPRI|POLLHUP}, {fd=11, events=POLLIN|POLLPRI|POLLHUP}, {fd=12, events=POLLIN|POLLPRI|POLLHUP}, {fd=15, events=POLLIN|POLLPRI|POLLHUP}, {fd=16, events=POLLIN|POLLPRI|POLLHUP}, {fd=20, events=POLLIN|POLLPRI|POLLHUP}, {fd=21, events=POLLIN|POLLPRI|POLLHUP}, {fd=25, events=POLLIN|POLLPRI|POLLHUP}, {fd=26, events=POLLIN|POLLPRI|POLLHUP}, {fd=30, events=POLLIN|POLLPRI|POLLHUP}, {fd=31, events=POLLIN|POLLPRI|POLLHUP}, {fd=35, events=POLLIN|POLLPRI|POLLHUP}], 14, 1000) = 1 ([{fd=5, revents=POLLIN}])
gettimeofday({1498848417, 352015}, NULL) = 0
gettimeofday({1498848417, 352079}, NULL) = 0
gettimeofday({1498848417, 352141}, NULL) = 0
gettimeofday({1498848417, 352204}, NULL) = 0
gettimeofday({1498848417, 352268}, NULL) = 0
gettimeofday({1498848417, 352331}, NULL) = 0
gettimeofday({1498848417, 352483}, NULL) = 0
gettimeofday({1498848417, 352590}, NULL) = 0
gettimeofday({1498848417, 352769}, NULL) = 0
gettimeofday({1498848417, 352858}, NULL) = 0
wait4(-1, 0x7fff3600b334, WNOHANG, NULL) = 0
gettimeofday({1498848417, 352992}, NULL) = 0

I hope that is enough to go on?

plockaby commented 7 years ago

I have just discovered that this is not specific to event listeners: a non-event-listener program just triggered the same problem. However, that's the first time that has happened, as far as I can tell.

lambdaq commented 7 years ago

I once deleted a few log files and the CPU got stuck at 100%. Is that relevant?

mumutu66 commented 6 years ago

Got the same problem too, on CentOS 7 with 3.3.1. Is it confirmed that 3.3.2 solves this?

plockaby commented 6 years ago

FWIW I wanted to provide an update. This is still a problem. It is specific to event listeners and it happens on 3.2 and 3.3 (I have not tested 3.1 or master). The specific problem is related to when the stdout log file rotates. I worked around it by completely disabling the stdout log file with stdout_logfile = NONE in my event listener's configuration. Since stdout on event listeners is mostly useless, this was not a big loss. Plus, it completely solved my problem, so there you go.
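
For reference, the workaround described above amounts to something like this in the event listener's section (the section name and command are placeholders):

[eventlistener:my-listener]
command = my_listener_command
events = PROCESS_STATE
stdout_logfile = NONE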

NKYB commented 6 years ago

FWIW I can confirm that I've experienced almost exactly the same problem as described by @plockaby

plockaby commented 6 years ago

My solution of setting stdout_logfile = NONE does not completely solve the problem. Now it spins at 100% on POLLIN for the stderr log file when it rotates. I don't need to reboot the host, though: I just restart the event listener and the problem clears until the next time the log rotates. I suppose I could just not keep any logs from my event listeners?

engin commented 6 years ago

I think mine was related to stderr redirection. On one box with ~120 jobs watched by supervisord, the CPU usage of supervisord was around 99%.

# supervisorctl version
3.3.4
# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 14.04.2 LTS
Release:    14.04
Codename:   trusty
find /etc/supervisor/conf.d -name "*.conf" -exec sh -c 'echo "redirect_stderr=true" >> "$0" ' {} \;

After running the above to append redirect_stderr=true to every program config, supervisord's CPU usage dropped to below 2%.

devspyrosv commented 6 years ago

Hello! I had a similar experience with supervisord 3.3.4

lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 16.04.4 LTS
Release:        16.04
Codename:       xenial

strace -p

Shows tens of messages per second like the following:

poll([{fd=4, events=POLLIN|POLLPRI|POLLHUP}, {fd=5, events=POLLIN|POLLPRI|POLLHUP}, {fd=9, events=POLLIN|POLLPRI|POLLHUP}, {fd=11, events=POLLIN|POLLPRI|POLLHUP}, {fd=12, events=POLLIN|POLLPRI|POLLHUP}, {fd=18, events=POLLIN|POLLPRI|POLLHUP}, {fd=19, events=POLLIN|POLLPRI|POLLHUP}, {fd=21, events=POLLIN|POLLPRI|POLLHUP}, {fd=28, events=POLLIN|POLLPRI|POLLHUP}, {fd=32, events=POLLIN|POLLPRI|POLLHUP}, {fd=38, events=POLLIN|POLLPRI|POLLHUP}, {fd=39, events=POLLIN|POLLPRI|POLLHUP}, {fd=40, events=POLLIN|POLLPRI|POLLHUP}, {fd=47, events=POLLIN|POLLPRI|POLLHUP}, {fd=54, events=POLLIN|POLLPRI|POLLHUP}, {fd=55, events=POLLIN|POLLPRI|POLLHUP}, {fd=61, events=POLLIN|POLLPRI|POLLHUP}, {fd=65, events=POLLIN|POLLPRI|POLLHUP}], 18, 1000) = 0 (Timeout)

Supervisord has been running for ~1.5 months and is managing 10 processes.

supervisorctl status all

Shows the same process IDs (no process is constantly restarting, etc.).

Having read the comments above, I decided to do a little experiment: I started stopping the services supervisord is managing one by one, through the web interface.

Among the custom processes I am running, I have crashmailbatch and fatalmailbatch from the superlance package.

I kept seeing the same frequency of polls in strace until I stopped fatalmailbatch.

Fatalmailbatch has the following config:

[eventlistener:fatalmailbatch]
command=fatalmailbatch --toEmail= --fromEmail= --smtpHost= --userName=... --password=... --subject="Alert - MYDOMAIN has crashed processes!" --interval=5.0
events=PROCESS_STATE,TICK_60
autostart=true
autorestart=true

The listener process had been running for as long as supervisord itself, without ever stopping.

I hope it helps somehow.

plockaby commented 6 years ago

I know that earlier I stated this was being caused by an event listener, and that modifying the event listener's configuration to not log made the problem go away. The fact that it was an event listener was apparently a red herring, because I'm experiencing this problem yet again: supervisord is spinning like crazy on POLLIN on a log file. The only commonalities I'm seeing across all of the affected systems are that the file handle is consistently numbered 5 on every server, that it happens after a log rotation, and that when I restart the process whose log file is being polled, the spinning stops. It's possible that this is also related to log events, which I also have enabled. I'd love to get some traction on this bug, as it has been plaguing me for almost two years. If the maintainer could give me some pointers on how I can assist in getting this addressed, I'd be grateful.

mnaberez commented 6 years ago

If the maintainer could give me some pointers on how I can assist in getting this addressed, I'd be grateful.

I am one of the people with commit access to this repository, but I was not involved with any of the code in question and am not very familiar with it. The problem seems to have been introduced by #129, which was written by @igorsobreira and merged by @mcdonc. I attempted to contact @igorsobreira after the reports of problems, but he did not reply. Another contributor, @jaredsuttles, put some effort into the high CPU issue and submitted #875, which I merged. I appreciate all your follow-up, @plockaby, and would like to see this resolved also. Help is needed.

p01ymath commented 6 years ago

In my case, it started to happen a few days back. Now I have to restart supervisor once every day, otherwise it creates unusual CPU spikes. I am using Supervisor 3.2.0 and the issue is still there. Is there any way we can solve this issue permanently? I didn't install it from a git repository.

hathawsh commented 6 years ago

@p01ymath: This bug has been tricky because we don't know how to reproduce it, but it sounds like you're running into it consistently. Your case could be key to solving it. Can you describe what programs Supervisor is running?

Some types of child processes do unusual things like closing the stdout file descriptor. Unusual behaviors can influence the operation of Supervisor and reveal issues.

p01ymath commented 6 years ago

I've run into it twice so far, and it happens at random times. I am running Laravel queue workers under supervisor. The first time I faced this issue, I turned off logging completely and restarted supervisor. That worked, so I thought logging was the issue.

After 2-3 days it happened again, and this time logs were already turned off, so I didn't restart supervisor right away. From htop I found the worker that was using the majority of my CPU and turned off that individual process. My CPU usage was still 100%, so I checked htop again and found that another worker in the same program group had started using 100% of the CPU. Then I turned off the whole process group and found that a worker in another process group started using all the CPU.

Then I restarted supervisor and it worked. This happened about 3 days ago, and I haven't faced any issue since. I am using supervisor 3.2.0 and I have 5 programs with 12 queue workers.

hathawsh commented 6 years ago

@p01ymath, here's a guess: is it possible that some of your child processes are sending a large amount of log data through Supervisor? I'm not sure how much log data Supervisor can reasonably handle, but I suspect it would choke on a lot of logging data, especially if the child process happens to send only a character at a time.
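
As a way to test that guess, a trivial (hypothetical) stress program one could run under supervisor is sketched below; it emits stdout one character at a time so supervisord has to service a very large number of tiny reads:

import sys
import time

# Hypothetical stress test for the guess above: write log output one character
# at a time, forcing supervisord to handle many small reads on the stdout pipe.
while True:
    sys.stdout.write("x")
    sys.stdout.flush()
    time.sleep(0.0005)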

hathawsh commented 6 years ago

There's a new tool available called py-spy, and I think it will greatly help anyone struggling with high CPU usage in Supervisor or any other Python process. If you're using a virtualenv, install it with 'bin/pip install py-spy' and run it using 'sudo bin/py-spy --pid PID', where PID is the PID of the supervisord process. It will show a 'top'-like breakdown of what the process is doing. Use Ctrl-S to pause its output (for copy/paste) and Ctrl-Q to resume.
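
Collected into copy-pasteable form (paths as described above, assuming a virtualenv in the current directory; PID is the supervisord process id):

bin/pip install py-spy
sudo bin/py-spy --pid PID
# newer py-spy releases moved to subcommands:
sudo bin/py-spy top --pid PID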

AlirezaSadeghi commented 5 years ago

I'm also seeing 100% CPU usage from supervisord when the supervised processes are themselves at 100% CPU. Here's the top-style output of py-spy:

 50.50%  50.50%    3.14s     3.14s   select (dist-packages/supervisor/options.py:1156)
  4.50%   9.50%   0.185s    0.355s   get_process_map (dist-packages/supervisor/supervisord.py:138)
  4.00%   5.00%   0.250s    0.305s   runforever (dist-packages/supervisor/supervisord.py:212)
  3.50%  11.00%   0.335s    0.655s   runforever (dist-packages/supervisor/supervisord.py:192)
  3.00%   3.00%   0.150s    0.150s   get_dispatchers (dist-packages/supervisor/process.py:691)
  3.00%   4.00%   0.225s    0.255s   runforever (dist-packages/supervisor/supervisord.py:214)
  3.00%   3.00%   0.140s    0.140s   get_dispatchers (dist-packages/supervisor/process.py:692)
  2.50%   3.00%   0.080s    0.105s   runforever (dist-packages/supervisor/supervisord.py:211)
  1.50%   1.50%   0.035s    0.035s   transition (dist-packages/supervisor/process.py:522)

Supervisor is taking a constant 80-100% CPU on my hosts; it's managing about 30 Node.js processes on a 32-core machine.

(I'm using V3.3.4)

Any idea where that comes from, or whether it's expected?

dexterbt1 commented 5 years ago

I recently deployed v3.3.4 across 6 servers running Ubuntu 18.04. Each server supervises as few as 5 and up to 70+ processes.

After a few days, almost all of the servers exhibited high CPU usage. Restarting/rebooting does not help, as the problem recurs immediately. A server with only 5 supervised child processes still experiences 100% CPU, the same as the 70-process server.

I downgraded all these servers to v3.1.0 and the problem disappeared! Right now, it's been 2 weeks with no issues at all.

I suspect the problem is related to the stdout/stderr handling of supervised processes, as the CPU usage goes down if I disable or minimize the stdout/stderr output.

I'm reading a lot of comments re: the select/poll implementation and I say this is right on track.

devspyrosv commented 5 years ago

Caught one of my supervisord instances doing the POLL thing with 100% CPU usage. The logs of the processes were empty - no process was logging at the time. In past encounters with this bug I had tried several workarounds without success.

This time when I found out that there was a CPU spike, I started strace and just opened the Web Interface.

The continuous polling stopped and the CPU usage became normal again.

This is no proof of course but it got me thinking.

hathawsh commented 5 years ago

Great clues, guys. Keep them coming if you can.

devspyrosv commented 5 years ago

I have 3 servers with identical specs running different counts of the same list of processes.

So let's say the servers are s1, s2, and s3, the processes are A, B, C, D, E, and the counts of the running processes are shown in the following table:

    A  B  C  D
s1  5  2  1  4
s2  2  2  4  4
s3  3  3  3  3

All the processes have supervisord logging disabled (they log to their own separate log files). All the servers have exactly the same packages and updates (Ubuntu 16.04). All the processes use the same Python version.

Server s1 has not had any problems with high CPU usage from supervisord. Servers s2 and s3 have the high CPU usage problem due to continuous polling (shown by strace -p pid_of_supervisord). Usually I restart the processes one by one while watching strace, up to the point where the continuous polling stops. Sometimes restarting the first process fixes the problem; sometimes it takes 4-5 restarted processes. It's not always the same process whose restart stops the polling, and it's not always after the same amount of uptime. Last but not least, the processes have nothing in common as to what they do, which non-builtin libraries they use, etc. The only common thing is that they are all written in Python 3 and run under Python 3.5.x.

The restarts I mention are made through the web UI.

justinpryzby commented 5 years ago

This is affecting us on supervisor 3.3.5.

(gdb) py-bt
Traceback (most recent call first):
  File "/usr/lib/python2.7/site-packages/supervisor/poller.py", line 91, in register_readable
    self._poller.register(fd, self.READ)
  File "/usr/lib/python2.7/site-packages/supervisor/supervisord.py", line 208, in runforever
    self.options.poller.register_readable(fd)
  File "/usr/lib/python2.7/site-packages/supervisor/supervisord.py", line 94, in run
    self.runforever()
  File "/usr/lib/python2.7/site-packages/supervisor/supervisord.py", line 77, in main
    self.run()
  File "/usr/lib/python2.7/site-packages/supervisor/supervisord.py", line 371, in go
    d.main()

Spinning continuously doing this:

wait4(-1, 0x7ffc8bb32e94, WNOHANG, NULL) = 0
poll([ [many FDs elided here...], {fd=114, events=POLLIN|POLLPRI|POLLHUP}], 42, 1000) = 1 ([{fd=114, revents=POLLIN}])

It shows data available to read, but supervisord doesn't actually read(); it immediately poll()s again, which still shows the same data available to be read.

$ sudo ls -l /proc/1134/fd/114
l-wx------. 1 root root 64 Mar 21 09:53 /proc/1134/fd/114 -> /home/rina-main/server/log/log.watch_rinaenm1

That's a stdout logfile (with redirect_stderr=true), related to this process:

alarms_watchers:watch_rinaenm1   RUNNING   pid 32005, uptime 2:10:49

I'm able to tail it fine, showing recent data:

TelsaSupervisor> tail -44 alarms_watchers:watch_rinaenm1
(2019-03-21 09:57:20.992811) 0:00:04.332918

I "signalled" the process and issue appears to be worked around for now.

Edit: it just occurred to me: why is it polling a logfile for reading, anyway?

Edit: it just happened again. This time it's getting POLLOUT rather than POLLIN, but still on a stdout logfile, and it's the last FD in poll's set. This time the process would never have written anything, ever (but the logfile is ~36MB from a previous command which DID write stuff).

poll([ ... {fd=62, events=POLLOUT}], 22, 1000) = 1 ([{fd=62, revents=POLLOUT}])
l-wx------. 1 root root 64 Mar 24 00:40 /proc/77842/fd/62 -> /home/telsa-inet/server/log/log.eric_eutran

[pryzbyj@appserver ~]$ sudo lsof -n -p 77842 |grep 62
superviso 77842 root 62w REG 253,1 37056758 454849 /home/telsa-inet/server/log/log.eric_eutran

-rw-r--r--. 1 root root 36M Feb 26 13:25 /home/telsa-inet/server/log/log.eric_eutran

justinpryzby commented 5 years ago

Would you let us know what else we could provide to help resolve the misbehavior?

justinpryzby commented 5 years ago

I didn't hear back, but one issue appears to be that FDs aren't necessarily unregistered from the poller when they're closed, so a file opened later can be assigned an FD that is still included in polls indefinitely into the future.
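
For background on why such a stale registration spins: poll() on a regular file always reports it as ready, so once a log file's fd is left registered, every poll() call returns immediately. A small standalone illustration of that OS behavior (not supervisor code):

import os
import select
import tempfile

# Register an ordinary file for reading, the way a stale poller registration
# would leave a reused fd registered after it becomes a log file.
fd, path = tempfile.mkstemp()
poller = select.poll()
poller.register(fd, select.POLLIN | select.POLLPRI | select.POLLHUP)

# Regular files are always "ready", so this returns immediately with POLLIN
# even though nothing new will ever arrive -- a loop doing this busy-spins.
print(poller.poll(1000))

poller.unregister(fd)
os.close(fd)
os.unlink(path)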

I'm able to reproduce that portion of the issue using 9 processes and variously signalling a couple of them: signal hup c4 c5. It may take a few tries to trigger, but with good consistency it adds to the "r" or "w" list an FD that was previously in socket_map for the supervisorctl client (I'm connecting over a unix socket) but is later used as a stdout logfile (and is never removed from the list of FDs to be polled).

[program:c1]
stdout_logfile = log/log.%(program_name)s
redirect_stderr = true
autorestart = true
startretries = 999999999
command = sh -c 'echo; sleep 9; date'

I'm able to avoid the issue with this change:

if eventmask & (select.POLLNVAL|select.POLLHUP):

One open question in my mind is whether a missing key in combined_map could ever legitimately happen, or whether missing keys are only due to bugs and should instead be asserted on (or allowed to crash):

if fd in combined_map:

@igorsobreira , would you comment ?

Ruby0510 commented 5 years ago

Hi @hathawsh, any update from your team about this issue? I encountered the same problem: supervisord gets stuck in a polling loop when starting multiple child processes. It happened on my side after upgrading from 3.2.3 to the latest version. The workaround of disabling child-process logging brings supervisor's CPU usage back to normal. It would be great if you could prioritize this issue, or, if it has already been prioritized, share the progress. Thanks!

BR Ruby

victor-ponamariov commented 4 years ago

I'm using supervisor on Ubuntu 18.04. From time to time I see CPU usage go to 100%, and htop shows that it is supervisord taking all the CPU.

Previously I had the same issue with 3.3.0, and I found out that there was a fix in 3.3.2 regarding high CPU usage. So I updated to 4.2.0, and today I got the same thing on one of my nodes.

strace shows the same thing it showed on 3.3.0:

poll([{fd=4, events=POLLIN|POLLPRI|POLLHUP}, {fd=8, events=POLLIN|POLLPRI|POLLHUP}, {fd=9, events=POLLIN|POLLPRI|POLLHUP}, {fd=10, events=POLLIN|POLLPRI|POLLHUP}, {fd=12, events=POLLIN|POLLPRI|POLLHUP}, {fd=15, events=POLLIN|POLLPRI|POLLHUP}, {fd=18, events=POLLIN|POLLPRI|POLLHUP}, {fd=21, events=POLLIN|POLLPRI|POLLHUP}, {fd=24, events=POLLIN|POLLPRI|POLLHUP}, {fd=27, events=POLLIN|POLLPRI|POLLHUP}, {fd=30, events=POLLIN|POLLPRI|POLLHUP}, {fd=33, events=POLLIN|POLLPRI|POLLHUP}, {fd=36, events=POLLIN|POLLPRI|POLLHUP}, {fd=39, events=POLLIN|POLLPRI|POLLHUP}, {fd=42, events=POLLIN|POLLPRI|POLLHUP}, {fd=45, events=POLLIN|POLLPRI|POLLHUP}, {fd=48, events=POLLIN|POLLPRI|POLLHUP}, {fd=51, events=POLLIN|POLLPRI|POLLHUP}, {fd=54, events=POLLIN|POLLPRI|POLLHUP}, {fd=57, events=POLLIN|POLLPRI|POLLHUP}, {fd=60, events=POLLIN|POLLPRI|POLLHUP}, {fd=63, events=POLLIN|POLLPRI|POLLHUP}, {fd=66, events=POLLIN|POLLPRI|POLLHUP}, {fd=69, events=POLLIN|POLLPRI|POLLHUP}, {fd=72, events=POLLIN|POLLPRI|POLLHUP}, {fd=75, events=POLLIN|POLLPRI|POLLHUP}, {fd=78, events=POLLIN|POLLPRI|POLLHUP}, {fd=81, events=POLLIN|POLLPRI|POLLHUP}, {fd=84, events=POLLIN|POLLPRI|POLLHUP}, {fd=87, events=POLLIN|POLLPRI|POLLHUP}, {fd=90, events=POLLIN|POLLPRI|POLLHUP}, {fd=93, events=POLLIN|POLLPRI|POLLHUP}, ...], 46, 1000) = 1 ([{fd=12, revents=POLLIN}])

Version:

supervisord --version
4.2.0

htop output:

(htop screenshot)

supervisor config:

[program:queue-dedicated]
process_name=%(program_name)s_%(process_num)02d
command=php /home/pinger/pinger/artisan queue:work redis-dedicated --queue=toronto --sleep=3 --tries=3
autostart=true
autorestart=true
user=pinger
numprocs=2
redirect_stderr=true
stdout_logfile=/home/pinger/pinger/worker.log
stopwaitsecs=30

[program:queue-local]
process_name=%(program_name)s_%(process_num)02d
command=php /home/pinger/pinger/artisan queue:work redis-local --sleep=3 --tries=3
autostart=true
autorestart=true
user=pinger
numprocs=40
redirect_stderr=true
stdout_logfile=/home/pinger/pinger/worker.log

[program:checks-push]
process_name=%(program_name)s_%(process_num)02d
command=php /home/pinger/pinger/artisan checks:push
autostart=true
autorestart=true
user=pinger
numprocs=1
redirect_stderr=true
stdout_logfile=/home/pinger/pinger/checksPushErrors.log

Does anyone see the same thing? Maybe I've misconfigured something.

Ruby0510 commented 4 years ago

Any plans for this issue?

ollofx commented 3 years ago

@victor-ponamariov, I had almost the same issue with supervisord entering a 100% CPU state, and I finally got to the bottom of it for my use case. Maybe it is not exactly the same for you:

When you set stopwaitsecs=600 for [program:test_cpu] and then call supervisorctl stop test_cpu while the process is busy and slow to stop,

there is a flood of socket events between the supervisorctl process (client) and supervisord (server), I guess to continuously check whether the status of the underlying program has changed (exited).

The supervisord server logs the following continuously until the underlying job has stopped:

2021-08-11 17:28:54,040 BLAT write event caused by <supervisor.http.deferring_http_channel connected '' at 0x7fd737b8e7a0 channel#: 1 requests:1>....

This continuous processing puts the CPU at 100%.

supervisorctl can be killed after the stop request; the connection with supervisord is then closed, the flooding stops, and the CPU comes back down. That is a workaround until this specific bug is solved.
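
A minimal sketch of that scenario, assuming the program ignores SIGTERM so the stop request has to wait out stopwaitsecs (the program name comes from the comment above; the command is a made-up stand-in):

[program:test_cpu]
command = bash -c 'trap "" TERM; while true; do sleep 1; done'
stopwaitsecs = 600
autorestart = true

Per the report above, running supervisorctl stop test_cpu and leaving supervisorctl attached while the process refuses to exit should show supervisord's CPU climb and the BLAT "write event" log lines flood in until the process is finally killed.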