Supervisor / supervisor

Supervisor process control system for Unix (supervisord)
http://supervisord.org

100% CPU usage (maybe caused by new poll implementation?) (help wanted) #807

Open timonwong opened 8 years ago

timonwong commented 8 years ago

We were using supervisor 3.3.0 on Ubuntu 14.04 LTS.

Recently, on some of our AWS nodes, we have seen very high CPU usage from supervisord. To work around it we have to reload supervisord, but the problem can reappear within a day.

Reading the strace output, we saw excessive calls to both 'gettimeofday' and 'poll', so we chose to downgrade supervisor to 3.2.3.

I saw #581, but I think it's unrelated here. Our guess is that this is caused by the new poll implementation introduced in 3.3.0 (and perhaps triggered by simultaneous log output?)...

Thanks in advance!

ollofx commented 3 years ago

After debugging supervisord, the issue is pretty clear: it is due to the poll implementation. A poll is done on all fds open by supervisord, some of which are registered for both read and write. The fd for the `supervisorctl stop test_cpu` request is hanging, waiting for a response, and supervisord has nothing to write yet because test_cpu hasn't exited. So it loops continuously: (1) no exit event yet, (2) the write fd is reported as ready for writing, (3) there is actually nothing to write yet, then back to (1). Nothing breaks this continuous loop; we could avoid it by, for instance, enforcing a minimum sleep time per iteration. Here is a quick fix until I have time to look at it further, or a maintainer is able to make a proper pull request to fix it: supervisor/poller.py:113 (supervisor version 4.1.0):

    import time  # goes at the top of supervisor/poller.py

    # PollPoller.poll(), patched with a short sleep per iteration
    def poll(self, timeout):
        fds = self._poll_fds(timeout)
        # avoids 100% CPU when a program is in STOPPING state and a
        # supervisorctl client's write socket is open, waiting
        time.sleep(1)
        readables, writables = [], []
        for fd, eventmask in fds:
            if self._ignore_invalid(fd, eventmask):
                continue
            if eventmask & self.READ:
                readables.append(fd)
            if eventmask & self.WRITE:
                writables.append(fd)
        return readables, writables
justinpryzby commented 2 years ago

Still an issue - just hit it on 4.1.0. In this case, I cannot start supervisorctl:

    supervisorctl -c .../supervisord.conf
    error: <class 'ConnectionResetError'>, [Errno 104] Connection reset by peer: file: /usr/lib64/python3.6/socket.py line: 586

    [pryzbyj@telsasoft ~]$ while sleep 1; do ps h -ww -o time,comm 27978; done
    07:47:51 supervisord
    07:47:52 supervisord
    07:47:53 supervisord
    07:47:54 supervisord

kewde commented 2 years ago

POLLOUT will return immediately if you're connecting to an unresponsive peer - causing heavy CPU usage.
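
A rough standalone illustration of that (a socketpair with one end closed stands in for an unresponsive or gone peer; this is not from the comment above): a socket registered for POLLOUT whose peer is unavailable makes poll() return at once instead of blocking for the timeout, so a loop built around that poll spins.

    import select
    import socket

    a, b = socket.socketpair()
    b.close()  # the peer goes away

    p = select.poll()
    p.register(a.fileno(), select.POLLOUT)

    # Returns immediately (POLLOUT, usually with POLLERR/POLLHUP) instead
    # of waiting out the 1000 ms timeout; re-polling without consuming the
    # event is what produces the busy loop.
    print(p.poll(1000))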

mnaberez commented 2 years ago

https://github.com/Supervisor/supervisor/issues/1515#issue-1279980023

We found high CPU consumption (our Java application 70%, supervisor 30%) when using supervisor. After running strace, it turned out that this consumption is associated with redirecting the application's stdout to a log file - supervisor uses the read/write/select system calls on Python 2 and poll on Python 3 - and apparently most of the CPU time is spent in the select syscall. Also, we can't use kqueue since we use Ubuntu.

We need the logs, so we want to reduce CPU consumption without turning them off. So far, the only thing that comes to mind is using epoll instead of select/poll to wait on descriptors, in the hope that it will help.

Are there any plans to add epoll to the pollers list? If so, when will it happen? And if not, is it worth adding support for it on my own? Are there any pitfalls here?

Or maybe you can suggest how to solve this problem in some other way?
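
For context on what such a poller could look like: Python's standard library exposes epoll through select.epoll, and a minimal sketch mirroring the register/poll style seen elsewhere in this thread might be the following (a hypothetical illustration, not an existing supervisor class):

    import select

    class EpollPoller:
        # Hypothetical epoll-based poller sketch; not part of supervisor.

        def __init__(self):
            self._epoll = select.epoll()
            self._fds = {}  # fd -> registered event mask

        def register_readable(self, fd):
            self._register(fd, select.EPOLLIN)

        def register_writable(self, fd):
            self._register(fd, select.EPOLLOUT)

        def _register(self, fd, event):
            mask = self._fds.get(fd, 0) | event
            if fd in self._fds:
                self._epoll.modify(fd, mask)
            else:
                self._epoll.register(fd, mask)
            self._fds[fd] = mask

        def unregister(self, fd):
            if self._fds.pop(fd, None) is not None:
                self._epoll.unregister(fd)

        def poll(self, timeout):
            # epoll.poll() takes a timeout in seconds
            readables, writables = [], []
            for fd, mask in self._epoll.poll(timeout):
                if mask & (select.EPOLLIN | select.EPOLLHUP):
                    readables.append(fd)
                if mask & select.EPOLLOUT:
                    writables.append(fd)
            return readables, writables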

aleph0naught commented 2 years ago

@mnaberez

Can you please answer the question about epoll that I asked in the closed issue?

So far, the only thing that comes to mind is using epoll instead of select/poll to wait for descriptors in the hope that it will help

Are there any plans to add epoll to pollers list? If so, when will it happen? And if not, is it worth adding its support on my own? Are there any pitfalls here?

mnaberez commented 2 years ago

Hi, I don't have any answers to that, sorry, I only closed it to keep all the comments on this issue in one place.

aleph0naught commented 2 years ago

Hi, I don't have any answers to that, sorry, I only closed it to keep all the comments on this issue in one place.

Ok! I opened a new issue with a more specific title. I think this is an independent task from supervisor's point of view (adding an epoll poller), even though it is related to this issue.

ollofx commented 2 years ago

Indeed it is a poll issue. This is the patch that I applied to avoid the continuous looping, though it extends the minimum response time a bit (sleep 1s, but you could use much less). I have used this patch in production for more than a year.

I added time.sleep(1); you might choose a smaller value - its only purpose is to avoid the continuous loop.

/usr/local/lib/python3.7/site-packages/supervisor/poller.py:116 /usr/lib/python2.7/dist-packages/supervisor/poller.py:116

    import time  # needed at the top of poller.py for the sleep below

    def poll(self, timeout):
        fds = self._poll_fds(timeout)
        readables, writables = [], []
        time.sleep(1)  # avoid the busy loop; a smaller value also works
        for fd, eventmask in fds:
            if self._ignore_invalid(fd, eventmask):
                continue
            if eventmask & self.READ:
                readables.append(fd)
            if eventmask & self.WRITE:
                writables.append(fd)
        return readables, writables
mnaberez commented 2 years ago

A patch was proposed three times identically (originally here, the other two marked as duplicate):

   def poll(self, timeout):
        fds = self._poll_fds(timeout)
        readables, writables = [], []
        time.sleep(1)  # it was proposed to add this sleep()

I don't recommend users apply this patch because it is probably just masking an underlying problem. This may appear to work, but it may also cause supervisord to become sluggish. For example, supervisorctl commands may take 1 second longer to respond. Log files may appear to freeze and then dump a lot of data at once after 1 second. In the worst case, processes running under supervisord may hang or crash if their pipes fill up because supervisord can't empty them fast enough. The interval could be lowered, e.g. time.sleep(0.1), but supervisord is a single-threaded event loop that's not supposed to block. I don't think we could merge a patch like that for these reasons.

poll_fds is intended to return only when there is a change on one of the file descriptors it monitors, or when a timeout occurs. If the timeout occurs, supervisord does some very minimal processing and immediately goes back into poll_fds. When supervisord spins on poll_fds, it means poll_fds is always returning immediately. It's not supposed to do that. The fix would be to figure out why it is always returning immediately and solve that.
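
As an illustration of that intent (a generic sketch, not supervisord's actual main loop; the poller is assumed to be a select.poll()-style object and the dispatchers a plain dict of objects with handle_read_event/handle_write_event methods): the poll timeout is the only pause in the loop, so the loop only spins when poll keeps reporting events that nothing consumes.

    import select

    def run_event_loop(poller, dispatchers, timeout=1.0):
        # Generic single-threaded event-loop shape: poll() is the only
        # place the loop blocks, for at most `timeout` seconds, so no
        # sleep() is needed anywhere.
        while True:
            ready = poller.poll(int(timeout * 1000))  # select.poll() uses ms
            for fd, mask in ready:
                dispatcher = dispatchers.get(fd)
                if dispatcher is None:
                    # A stale fd that keeps reporting events but is never
                    # consumed is exactly what makes poll() return
                    # immediately on every pass and burn CPU.
                    continue
                if mask & (select.POLLIN | select.POLLPRI):
                    dispatcher.handle_read_event()
                if mask & select.POLLOUT:
                    dispatcher.handle_write_event()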

Reading through this thread, it seems that there have been several different causes of this issue, at least one of which was solved with a linked commit. I don't think any current maintainers know how to reproduce it. It would be extremely helpful if someone could send a minimal reproduction case that causes supervisord to spin at 100% CPU. For example: with this minimal configuration file, run supervisord on this operating system, and you will see it spin. This problem doesn't always occur, e.g. it has never been seen in pre-release testing or on CI, which is why reproduction instructions are needed.

ollofx commented 2 years ago

Thanks for the moderation, and for the clear explanation of the delay introduced by this quick patch (which is not meant to solve the real issue).

I forgot that it was me who posted, a year ago, how I avoided the CPU spike; my objective was only to point out the issue to a maintainer who might be able to really fix it.

I posted the sleep fix to let people who are in the same situation as me, with production constraints, avoid the 100% CPU and keep using supervisord, at the cost of some delay for supervisorctl commands. I put 1s precisely so people realize it has this cost.

Regarding your last paragraph: I'm not sure all the CPU spikes have the same cause, but in my case it happened when supervisorctl requested the stop of a process that caught the signal and needed to wait for some tasks to finish before quitting - a database shutdown, or a web server draining active requests (it was basically catching the signal and not exiting as quickly as supervisor expected).

This is actually how the problem can be replicated, and it is systematic. It was mentioned in the post a year ago but was not clear enough; I thought a maintainer would find some technical description useful.

Basically, if you want to replicate the issue, just make the process you ask to stop ignore the stop signal.

supervisorctl will then be continuously calling supervisord, which responds continuously without any break, creating a loop that eats 100% of a CPU until the underlying process decides to quit (if you have many CPUs, it is not 100% overall).
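
A minimal way to reproduce that, based on the description above (the script below is a hypothetical stand-in, not one of the programs from this report): run it under supervisord with a large stopwaitsecs, issue supervisorctl stop for it, and watch supervisord's CPU while the process lingers in STOPPING.

    import signal
    import time

    # Ignore SIGTERM so that a "supervisorctl stop" request leaves this
    # process running; supervisord keeps it in STOPPING until stopwaitsecs
    # expires, which is the window in which the CPU usage climbs.
    signal.signal(signal.SIGTERM, signal.SIG_IGN)

    while True:
        time.sleep(1)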

Maybe the only reason not many people complain about this is that not many people manage processes that take more than 30 minutes to quit. The setup I'm dealing with is meant to allow processes that long to shut down, and I can't afford to have one CPU pegged during all that time.

[program:pgsql]
# if error then wait 10s before next retry
command=/supervisor/once_10.sh '/pgsql.sh postgres'
stdout_logfile=/dev/stdout
stdout_logfile_maxbytes=0
stderr_logfile=/dev/stdout
stderr_logfile_maxbytes=0
autostart=true
autorestart=true
# only considers agent started after 10s
startsecs=10
# restart indefinitely
startretries=999999999999
stopwaitsecs=%(ENV_pod_stop_timeout)s
killasgroup=true
# https://www.postgresql.org/docs/9.4/server-shutdown.html
stopsignal=TERM

[program:uwsgi]
# if error then wait 10s before next retry
command=/supervisor/%(ENV_P_UWSGI_POLICY)s_10.sh /uwsgi.sh uwsgi ${P_UWSGI_CMD}
stdout_logfile=/dev/stdout
stdout_logfile_maxbytes=0
stderr_logfile=/dev/stdout
stderr_logfile_maxbytes=0
autostart=true
autorestart=true
# only considers agent started after 10s
startsecs=10
# restart indefinitely
startretries=999999999999
stopwaitsecs=%(ENV_pod_uwsgi_stop_timeout)s
killasgroup=true
LinJinghua commented 2 years ago

I have some guesses; I don't know if they are correct. Can you help take a look? @mnaberez

Some cases of supervisord 100% CPU in Linux may be caused by the reuse of file descriptors.

supervisorctl uses a socket to communicate with supervisord. Usually the socket file descriptor in supervisord is the lowest-numbered file descriptor not currently open for the process. Soon afterwards supervisorctl quits and the socket file descriptor is closed. But combined_map has added this fd (combined_map.update(socket_map)) and it was registered in the poller (self.options.poller.register_readable(fd)). The code is https://github.com/Supervisor/supervisor/blob/3.4.0/supervisor/supervisord.py#L185..L208 .

If supervisorctl then asks to restart a process, the process will open a log file as stdout. At this point, this log file descriptor (fd) most likely reuses the number held by the previous socket. Because this file descriptor is valid, it will never be removed by _ignore_invalid (eventmask & select.POLLNVAL will always be false). PollPoller.readables will always contain this file descriptor and poll it for read. But at this point combined_map no longer contains this file descriptor (dispatcher.del_channel will have deleted it from socket_map). The event corresponding to this file descriptor will never be processed.

In the loop, poll returns quickly (because there really is data to process) and CPU usage increases to 100% (busy wait).

I get this a lot in supervisor 3.4.0. Maybe the master branch also has this problem.
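
A small standalone sketch of the fd-number reuse described above (illustrative only, not supervisord code): once the socket is closed, the next file opened typically receives the same descriptor number, so anything still registered under that number now points at the new file.

    import socket
    import tempfile

    sock = socket.socket()
    old_fd = sock.fileno()
    sock.close()

    # The kernel hands out the lowest free descriptor number again, so the
    # newly opened (log-like) file usually lands on the old socket's number.
    logfile = tempfile.TemporaryFile()
    print(old_fd, logfile.fileno())  # typically prints the same number twice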

Here are some traces:

~ supervisord --version
3.4.0
~ cat /etc/supervisord.d/subprocess.ini
[program:subprocess]
command=bash -c "while true; do date; ls -lh /proc/self/fd; sleep 1; done"
autostart=true
autorestart=unexpected
startsecs=0
exitcodes=0
redirect_stderr=true
stdout_logfile_maxbytes=128KB
stdout_logfile_backups=0
stderr_logfile_maxbytes=128KB
stderr_logfile_backups=0
...

~ ls -lh /proc/`supervisorctl pid`/fd
total 0
lrwx------ 1 root root 64 Jul 20 17:36 0 -> /dev/null
l-wx------ 1 root root 64 Jul 20 17:36 1 -> pipe:[64823090]
l-wx------ 1 root root 64 Jul 20 17:37 13 -> pipe:[244841693]
lr-x------ 1 root root 64 Jul 20 18:07 18 -> pipe:[244841694]
l-wx------ 1 root root 64 Jul 20 17:36 2 -> pipe:[64823091]
l-wx------ 1 root root 64 Jul 20 17:37 3 -> /var/tmp/log/supervisor/supervisord.log
lrwx------ 1 root root 64 Jul 20 17:37 4 -> socket:[64899115]
lr-x------ 1 root root 64 Jul 20 17:37 7 -> /dev/urandom
l-wx------ 1 root root 64 Jul 20 17:37 9 -> /tmp/subprocess-stdout---supervisor-ChnhnK.log
...

~ strace -Ttt -f -p `supervisorctl pid`
strace: Process 96368 attached
strace: [ Process PID=96368 runs in x32 mode. ]
strace: [ Process PID=96368 runs in 64 bit mode. ]
17:53:54.026947 poll([{fd=4, events=POLLIN|POLLPRI|POLLHUP}, {fd=9, events=POLLIN|POLLPRI|POLLHUP}, {fd=18, events=POLLIN|POLLPRI|POLLHUP}], 3, 1000) = 1 ([{fd=9, revents=POLLIN}]) <0.000063>
17:53:54.027209 wait4(-1, 0x7ffc085cdd44, WNOHANG, NULL) = 0 <0.000050>
17:53:54.027413 poll([{fd=4, events=POLLIN|POLLPRI|POLLHUP}, {fd=9, events=POLLIN|POLLPRI|POLLHUP}, {fd=18, events=POLLIN|POLLPRI|POLLHUP}], 3, 1000) = 1 ([{fd=9, revents=POLLIN}]) <0.000040>
17:53:54.027504 wait4(-1, 0x7ffc085cdd44, WNOHANG, NULL) = 0 <0.000013>
17:53:54.027585 poll([{fd=4, events=POLLIN|POLLPRI|POLLHUP}, {fd=9, events=POLLIN|POLLPRI|POLLHUP}, {fd=18, events=POLLIN|POLLPRI|POLLHUP}], 3, 1000) = 1 ([{fd=9, revents=POLLIN}]) <0.000011>

The jump of supervisord's CPU usage to 100% usually occurs after supervisorctl restart subprocess. Normally the log file (fd=9) should not be registered with PollPoller at all; I guess it ended up there after the socket descriptor was reused.

justinpryzby commented 1 year ago

Evidence that the FDs are getting confused: I found this in a stderr logfile of one of our processes.

POST /RPC2 HTTP/1.1
Host: localhost
Accept-Encoding: identity
User-Agent: Python-xmlrpc/3.8
Content-Type: text/xml
Accept: text/xml
Content-Length: 122

<?xml version='1.0'?>
<methodCall>
<methodName>supervisor.getAllProcessInfo</methodName>
<params>
</params>
</methodCall>
POST /RPC2 HTTP/1.1
Host: localhost
Accept-Encoding: identity
User-Agent: Python-xmlrpc/3.8
Content-Type: text/xml
Accept: text/xml
Content-Length: 122

<?xml version='1.0'?>
<methodCall>
<methodName>supervisor.getAllProcessInfo</methodName>
<params>
</params>
</methodCall>

The process was running, but not able to write to stdout - in its logging output, I saw: BrokenPipeError: [Errno 32] Broken pipe

mnaberez commented 1 year ago

A pull request has been submitted to fix the high CPU usage: #1581

If you are experiencing this issue and are able to try the patch, please do so and send your feedback.

justinpryzby commented 1 year ago

Thanks! We have seen no issues since I deployed this to our customers in April. But just now I see a process stuck writing to its stderr, which supervisor is failing to read from. This may be a separate, independent error from the one that was fixed...

It seems like it's lost track of its FDs, probably during log rotation.

    2023-06-02 23:47:00,246 CRIT uncaptured python exception, closing channel <POutputDispatcher at 139775577852032 for <Subprocess at 139775577618176 with name xxxxxxxxx in state RUNNING> (stdout)> (<class 'OSError'>:[Errno 9] Bad file descriptor
    [/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/supervisor/supervisord.py|runforever|218]
    [/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/supervisor/dispatchers.py|handle_read_event|281]
    [/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/supervisor/dispatchers.py|record_output|215]
    [/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/supervisor/dispatchers.py|_log|184]
    [/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/supervisor/loggers.py|info|327]
    [/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/supervisor/loggers.py|log|345]
    [/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/supervisor/loggers.py|emit|227]
    [/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/supervisor/loggers.py|doRollover|264])

    lseek(5, 0, SEEK_CUR) = -1 EBADF (Bad file descriptor)
    write(5, "\n>>Sat Jun 3 00:53:57 2023%a\n-E"..., 4079) = -1 EBADF (Bad file descriptor)

    $ sudo ls -l /proc/26111/fd/5
    ls: cannot access /proc/26111/fd/5: No such file or directory

justinpryzby commented 1 year ago

Here's an (other?) error that I found in a stderr logfile for an application running under supervisor.

    Exception ignored in: <function tail_f_producer.__del__ at 0x7f92f5918f70>
    Traceback (most recent call last):
      File "/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/supervisor/http.py", line 649, in __del__
        self._close()
      File "/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/supervisor/http.py", line 675, in _close
        self.file.close()
    OSError: [Errno 9] Bad file descriptor
    Exception ignored in: <function tail_f_producer.__del__ at 0x7f92f5918f70>

This is with the patch from [2a93d6b]. To be clear, I think this is a separate problem from the one that was fixed (and I've got no reason to think that the patch caused this issue). Maybe this issue should be closed and a new one created for other FD issues...

Edit: I also observed this on an instance with v4.2.5 without the patch applied.

justinpryzby commented 1 year ago

I'm observing a supervisord process currently at 100% CPU in poll(), with (I checked) the patch applied.

It has currently used ~27hr of CPU time:

    root 2533 1 5 jun21 ? 1-03:22:30 /usr/bin/python3 /usr/local/bin/supervisord -c /home/telsasoft/server/supervisord/supervisord.conf

poll([{fd=4, events=POLLIN|POLLPRI|POLLHUP}, {fd=33, events=POLLIN|POLLPRI|POLLHUP}, {fd=45, events=POLLIN|POLLPRI|POLLHUP}, {fd=60, events=POLLIN|POLLPRI|POLLHUP}, {fd=63, events=POLLIN|POLLPRI|POLLHUP}, {fd=66, events=POLLIN|POLLPRI|POLLHUP}, {fd=69, events=POLLIN|POLLPRI|POLLHUP}, {fd=71, events=POLLIN|POLLPRI|POLLHUP}, {fd=57, events=POLLIN|POLLPRI|POLLHUP}, {fd=34, events=POLLIN|POLLPRI|POLLHUP}, {fd=37, events=POLLIN|POLLPRI|POLLHUP}, {fd=52, events=POLLIN|POLLPRI|POLLHUP}, {fd=51, events=POLLIN|POLLPRI|POLLHUP}, {fd=16, events=POLLIN|POLLPRI|POLLHUP}, {fd=49, events=POLLIN|POLLPRI|POLLHUP}, {fd=30, events=POLLIN|POLLPRI|POLLHUP}, {fd=40, events=POLLIN|POLLPRI|POLLHUP}, {fd=54, events=POLLIN|POLLPRI|POLLHUP}, {fd=10, events=POLLIN|POLLPRI|POLLHUP}, {fd=11, events=POLLIN|POLLPRI|POLLHUP}, {fd=14, events=POLLIN|POLLPRI|POLLHUP}, {fd=19, events=POLLIN|POLLPRI|POLLHUP}, {fd=21, events=POLLIN|POLLPRI|POLLHUP}, {fd=25, events=POLLIN|POLLPRI|POLLHUP}, {fd=28, events=POLLIN|POLLPRI|POLLHUP}, {fd=8, events=POLLIN|POLLPRI|POLLHUP}, {fd=29, events=POLLIN|POLLPRI|POLLHUP}], 27, 1000) = 1 ([{fd=29, revents=POLLNVAL}]) wait4(-1, 0x7ffdd4f1cf20, WNOHANG, NULL) = 0 poll([{fd=4, events=POLLIN|POLLPRI|POLLHUP}, {fd=33, events=POLLIN|POLLPRI|POLLHUP}, {fd=45, events=POLLIN|POLLPRI|POLLHUP}, {fd=60, events=POLLIN|POLLPRI|POLLHUP}, {fd=63, events=POLLIN|POLLPRI|POLLHUP}, {fd=66, events=POLLIN|POLLPRI|POLLHUP}, {fd=69, events=POLLIN|POLLPRI|POLLHUP}, {fd=71, events=POLLIN|POLLPRI|POLLHUP}, {fd=57, events=POLLIN|POLLPRI|POLLHUP}, {fd=34, events=POLLIN|POLLPRI|POLLHUP}, {fd=37, events=POLLIN|POLLPRI|POLLHUP}, {fd=52, events=POLLIN|POLLPRI|POLLHUP}, {fd=51, events=POLLIN|POLLPRI|POLLHUP}, {fd=16, events=POLLIN|POLLPRI|POLLHUP}, {fd=49, events=POLLIN|POLLPRI|POLLHUP}, {fd=30, events=POLLIN|POLLPRI|POLLHUP}, {fd=40, events=POLLIN|POLLPRI|POLLHUP}, {fd=54, events=POLLIN|POLLPRI|POLLHUP}, {fd=10, events=POLLIN|POLLPRI|POLLHUP}, {fd=11, events=POLLIN|POLLPRI|POLLHUP}, {fd=14, events=POLLIN|POLLPRI|POLLHUP}, {fd=19, events=POLLIN|POLLPRI|POLLHUP}, {fd=21, events=POLLIN|POLLPRI|POLLHUP}, {fd=25, events=POLLIN|POLLPRI|POLLHUP}, {fd=28, events=POLLIN|POLLPRI|POLLHUP}, {fd=8, events=POLLIN|POLLPRI|POLLHUP}, {fd=29, events=POLLIN|POLLPRI|POLLHUP}], 27, 1000) = 1 ([{fd=29, revents=POLLNVAL}]) wait4(-1, 0x7ffdd4f1cf20, WNOHANG, NULL) = 0 poll([{fd=4, events=POLLIN|POLLPRI|POLLHUP}, {fd=33, events=POLLIN|POLLPRI|POLLHUP}, {fd=45, events=POLLIN|POLLPRI|POLLHUP}, {fd=60, events=POLLIN|POLLPRI|POLLHUP}, {fd=63, events=POLLIN|POLLPRI|POLLHUP}, {fd=66, events=POLLIN|POLLPRI|POLLHUP}, {fd=69, events=POLLIN|POLLPRI|POLLHUP}, {fd=71, events=POLLIN|POLLPRI|POLLHUP}, {fd=57, events=POLLIN|POLLPRI|POLLHUP}, {fd=34, events=POLLIN|POLLPRI|POLLHUP}, {fd=37, events=POLLIN|POLLPRI|POLLHUP}, {fd=52, events=POLLIN|POLLPRI|POLLHUP}, {fd=51, events=POLLIN|POLLPRI|POLLHUP}, {fd=16, events=POLLIN|POLLPRI|POLLHUP}, {fd=49, events=POLLIN|POLLPRI|POLLHUP}, {fd=30, events=POLLIN|POLLPRI|POLLHUP}, {fd=40, events=POLLIN|POLLPRI|POLLHUP}, {fd=54, events=POLLIN|POLLPRI|POLLHUP}, {fd=10, events=POLLIN|POLLPRI|POLLHUP}, {fd=11, events=POLLIN|POLLPRI|POLLHUP}, {fd=14, events=POLLIN|POLLPRI|POLLHUP}, {fd=19, events=POLLIN|POLLPRI|POLLHUP}, {fd=21, events=POLLIN|POLLPRI|POLLHUP}, {fd=25, events=POLLIN|POLLPRI|POLLHUP}, {fd=28, events=POLLIN|POLLPRI|POLLHUP}, {fd=8, events=POLLIN|POLLPRI|POLLHUP}, {fd=29, events=POLLIN|POLLPRI|POLLHUP}], 27, 1000) = 1 ([{fd=29, revents=POLLNVAL}])

There's nothing in the supervisor logfile. Let me know if I can collect more info somehow.

Jagadeeshftw commented 1 year ago

Thanks! We saw no issues since I deployed this to our customers in April. But just now I see a process stuck writing to its stderr, which supervisor is failing to read from. This may be a separate independent error than the one that was fixed...

It seems like it's lost track of its FDs, probably during log rotation. 2023-06-02 23:47:00,246 CRIT uncaptured python exception, closing channel <POutputDispatcher at 139775577852032 for <Subprocess at 139775577618176 with name xxxxxxxxx in state RUNNING> (stdout)> (<class 'OSError'>:[Errno 9] Bad file descriptor [/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/supervisor/supervisord.py|runforever|218] [/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/supervisor/dispatchers.py|handle_read_event|281] [/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/supervisor/dispatchers.py|record_output|215] [/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/supervisor/dispatchers.py|_log|184] [/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/supervisor/loggers.py|info|327] [/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/supervisor/loggers.py|log|345] [/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/supervisor/loggers.py|emit|227] [/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/supervisor/loggers.py|doRollover|264])

lseek(5, 0, SEEK_CUR) = -1 EBADF (Bad file descriptor) write(5, "\n>>Sat Jun 3 00:53:57 2023%a\n-E"..., 4079) = -1 EBADF (Bad file descriptor)

$ sudo ls -l /proc/26111/fd/5 ls: cannot access /proc/26111/fd/5: No such file or directory

Did you resolve this Python exception error? I'm also facing the same issue.

    [velocix@ac-0 cachelogic]$ supervisord --version
    4.2.5
    [velocix@ac-0 cachelogic]$ tail -1 /var/log/supervisor/supervisord.log
    2023-09-13 18:00:08,052 CRIT uncaptured python exception, closing channel <POutputDispatcher at 140414136427688 for <Subprocess at 140414136085808 with name vxsnmpcdnselfcheck in state RUNNING> (stderr)> (<class 'OSError'>:[Errno 29] Illegal seek [/usr/local/lib/python3.6/site-packages/supervisor/supervisord.py|runforever|218] [/usr/local/lib/python3.6/site-packages/supervisor/dispatchers.py|handle_read_event|276] [/usr/local/lib/python3.6/site-packages/supervisor/dispatchers.py|record_output|210] [/usr/local/lib/python3.6/site-packages/supervisor/dispatchers.py|_log|179] [/usr/local/lib/python3.6/site-packages/supervisor/loggers.py|info|327] [/usr/local/lib/python3.6/site-packages/supervisor/loggers.py|log|345] [/usr/local/lib/python3.6/site-packages/supervisor/loggers.py|emit|227] [/usr/local/lib/python3.6/site-packages/supervisor/loggers.py|doRollover|264])
    [velocix@ac-0 cachelogic]$

justinpryzby commented 1 year ago

On Wed, Sep 13, 2023 at 11:12:23AM -0700, Jagadeesh B wrote:

Did you resolve this Python exception error? I'm also facing the same issue.

I didn't resolve it, no.

@.* cachelogic]$ supervisord --version 4.2.5

Is that with the patch applied (or git HEAD) or unpatched 4.2.5 ?

Jagadeeshftw commented 1 year ago

@justinpryzby

I was experiencing this issue with Supervisor, and to address it, I attempted to upgrade Supervisor from version 4.2.2 to the latest version (4.2.5) using the command:

pip3 install --upgrade supervisor

However, after upgrading to the latest version, the issue still persists. I checked the issue tracker and found suggestions that upgrading Supervisor would resolve it, but unfortunately it didn't work in my case.

I'm not sure whether this version includes the patch or not.

If you have any insights or suggestions on how to resolve this issue, please feel free to share.


mouday commented 7 months ago

yum install, version is 3.4.0

justinpryzby commented 6 months ago

yum install, version is 3.4.0

What do you mean when you say "yum install..."? Did you mean that you hit the issue with that version, or something else?

hedgedawg commented 3 months ago

A pull request has been submitted to fix the high CPU usage: #1581

If you are experiencing this issue and are able to try the patch, please do so and send your feedback.

Thanks @mnaberez for the recommendation.

I had been seeing this issue intermittently, with CPU usage spiking to 100%, and I was able to replicate it after a few attempts with concurrent supervisorctl restart commands for multiple processes managed by supervisor.

Here are my observations, pre and post patch:

Hopefully this helps, as I understand there is some demand for a new version including this change - ref #1635.

Details of the patch:

index 0a4f3e6..2265db9 100755
--- a/supervisor/supervisord.py
+++ b/supervisor/supervisord.py
@@ -222,6 +222,14 @@ class Supervisor:
                         raise
                     except:
                         combined_map[fd].handle_error()
+                else:
+                    # if the fd is not in combined_map, we should unregister it. otherwise,
+                    # it will be polled every time, which may cause 100% cpu usage
+                    self.options.logger.warn('unexpected read event from fd %r' % fd)
+                    try:
+                        self.options.poller.unregister_readable(fd)
+                    except:
+                        pass

             for fd in w:
                 if fd in combined_map:
@@ -237,6 +245,12 @@ class Supervisor:
                         raise
                     except:
                         combined_map[fd].handle_error()
+                else:
+                    self.options.logger.warn('unexpected write event from fd %r' % fd)
+                    try:
+                        self.options.poller.unregister_writable(fd)
+                    except:
+                        pass

             for group in pgroups:
                 group.transition()
andreigabreanu commented 3 months ago

Hello. It seems there is still an issue with high CPU usage; I'm not sure if it's related to this ticket. We're on version 4.2.5-1 on Debian 12. After a restart, CPU usage is between 0 and 20% (normal). Every few weeks, the supervisord process goes to 100% CPU and stays there.

justinpryzby commented 2 months ago

More evidence of FD confusion:

    TelsaSupervisor> reload
    Really restart the remote supervisord process y/N? y
    error: <class 'http.client.BadStatusLine'>, 2024-07-26 14:34:46 UDP: [10.254.1.52]:55053->[10.28.1.45]:162 [UDP: [10.254.1.52]:55053->[10.28.1.45]:162]: : file: /usr/lib64/python3.6/http/client.py line: 289

Edit: I should've added that this instance does not have the patch applied.