Closed aftersnow closed 1 year ago
Thanks. This looks promising, and I've deployed it in a few places to test.
The warnings you added may be as important as the fix itself: if we had had them 3 years ago, the bug would probably have been a lot more obvious. Are there any other warnings that should be added? For example, what about when an FD that isn't a pipe ends up on the list of FDs to be polled, as I've also seen? I suspect there are more bugs with FDs, which may be rare and hard to hit, but they will be much easier to address if logs are added to warn about inconsistencies.
I didn't understand what you meant by "trigger the supervisor's subprocess to rotate the log". Do you mean connecting to that process and causing it to write a sufficiently large log to stdout? Is it possible to make the bug consistently easier to hit by injecting a "sleep" command?
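To make the "FD that isn't a pipe" warning concrete, here is a minimal sketch of such a check. It is not part of supervisor: `warn_if_not_pipe` is a hypothetical helper, and it assumes that only pipes are expected on the poll list (a real check would probably also have to allow any sockets that are polled legitimately).

```python
import logging
import os
import stat


def warn_if_not_pipe(fd, logger=logging.getLogger(__name__)):
    """Return True if fd is an open pipe; otherwise log a warning and return False.

    Hypothetical helper, assuming only pipes should be on the poll list.
    """
    try:
        mode = os.fstat(fd).st_mode
    except OSError as exc:
        # The descriptor number is not open at all (e.g. EBADF).
        logger.warning("fd %d is on the poll list but is not open: %s", fd, exc)
        return False
    if not stat.S_ISFIFO(mode):
        # The descriptor is open but refers to something other than a pipe.
        logger.warning("fd %d is on the poll list but is not a pipe (mode %o)", fd, mode)
        return False
    return True
```

Calling something like this for every fd before polling would add an `fstat` per fd per loop iteration, so it may be better suited to a debug log level.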
> Thanks. This looks promising, and I've deployed it in a few places to test.
> The warnings you added may be as important as the fix itself: if we had had them 3 years ago, the bug would probably have been a lot more obvious. Are there any other warnings that should be added? For example, what about when an FD that isn't a pipe ends up on the list of FDs to be polled, as I've also seen? I suspect there are more bugs with FDs, which may be rare and hard to hit, but they will be much easier to address if logs are added to warn about inconsistencies.
Yes, more warnings are needed, but what matters most is that we unregister the FD from the polling list right after each event is handled (where needed), instead of relying on _ignore_invalid() to unregister it.
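As a rough illustration of unregistering right after the event is handled, here is a minimal sketch using the standard `select.poll` object. `handle_readable` is a hypothetical name, not a supervisor function, and the sketch assumes a blocking read on a pipe that poll has just reported as readable.

```python
import os
import select


def handle_readable(poller, fd):
    """Read from fd; on EOF, unregister it from the poller before closing it."""
    data = os.read(fd, 4096)
    if data:
        return data
    # EOF: remove the fd from the poll set first, so that a later reuse of
    # the same descriptor number cannot be picked up by a stale registration.
    poller.unregister(fd)
    os.close(fd)
    return None
```

The order matters here: once the fd is closed, its number can be handed out again at any time, so the registration has to be removed before (or together with) the close.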
> I didn't understand what you meant by "trigger the supervisor's subprocess to rotate the log". Do you mean connecting to that process and causing it to write a sufficiently large log to stdout? Is it possible to make the bug consistently easier to hit by injecting a "sleep" command?
Yes, but a faster way to reproduce it is to use a shared variable or a signal:
Thanks, @aftersnow.
I'm glad I will be able to remove the sleep patch I added to work around the continuous loop I was suffering from with supervisor.
I had to add it specifically for our production environment to avoid 100% CPU usage for long periods, as some processes could wait 30 minutes before quitting.
Since the real fix came a couple of years later, this quick sleep patch has saved us some CPU, and money, in the meantime.
So thanks again for allowing me to remove it once and for all!
import time

def poll(self, timeout):
    fds = self._poll_fds(timeout)
    # workaround: avoids 100% CPU when a program is in the STOPPING state
    # and a supervisorctl client keeps its write socket open, waiting
    time.sleep(1)
    readables, writables = [], []
    for fd, eventmask in fds:
        # this is the method you mentioned has a flaw:
        # the fd may be reused before _ignore_invalid() is called
        if self._ignore_invalid(fd, eventmask):
            continue
        if eventmask & self.READ:
            readables.append(fd)
        if eventmask & self.WRITE:
            writables.append(fd)
    return readables, writables
I marked the comment with the sleep patch as outdated to make sure nobody confuses it with this patch.
I've deployed this change to customers and have seen no issues since last month. Thanks to @aftersnow for diagnosing the issue.
Hello, I also deployed this change and can confirm that it fixes the issue.
Do you know when this will be merged?
We found a problem with high CPU usage in supervisor. We believe it has the same cause as #807. The problem is caused by continuous polling of a wrong fd in supervisor's main loop. This busy polling leads to CPU usage close to 100%. (This can be confirmed with the strace tool.)
This issue can be reproduced by:
The root cause is that supervisor relies on _ignore_invalid() in the main loop to unregister closed fds. This approach has a flaw: if the fd number is reused before _ignore_invalid() is called, the fd may stay in poll's fd list indefinitely.
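The reuse scenario can be illustrated outside supervisor with a small standalone sketch. It uses only the standard `os` and `select` modules; `demonstrate_fd_reuse` is a hypothetical name, and the sketch assumes a single-threaded POSIX process, where a new descriptor always gets the lowest free number.

```python
import os
import select


def demonstrate_fd_reuse():
    """Show how a stale poll registration survives when its fd number is reused."""
    poller = select.poll()
    r1, w1 = os.pipe()
    poller.register(r1, select.POLLIN)

    # Close the pipe WITHOUT unregistering r1 from the poller.
    os.close(r1)
    os.close(w1)

    # While the number is unused, poll() reports it as invalid (POLLNVAL).
    # This is the only window in which an "ignore invalid" check can catch it.
    before = dict(poller.poll(0)).get(r1, 0)

    # A new pipe takes the lowest free descriptor numbers, so r2 gets r1's number.
    r2, w2 = os.pipe()
    os.write(w2, b"x")

    # The stale registration now reports the unrelated new pipe as readable.
    after = dict(poller.poll(0)).get(r1, 0)

    os.close(r2)
    os.close(w2)
    return {"reused": r2 == r1, "before": before, "after": after}
```

If the code that owns the old registration never reads from the new pipe, every subsequent poll returns immediately with the same event, which matches the busy loop described in this PR.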
This commit fixes the problem by checking the validity of each fd in the event list in the main loop: if an fd is not in combined_map, it is considered invalid and is removed from the list.
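The idea can be sketched roughly as follows. This is not the code from the commit: `prune_stale_fds` and the `unregister` callback are hypothetical names, and `combined_map` is assumed to be a mapping keyed by the fds that the main loop currently knows about.

```python
def prune_stale_fds(event_fds, combined_map, unregister):
    """Return the fds that are still tracked; unregister the ones that are not.

    event_fds    -- fds reported by the poller in this loop iteration
    combined_map -- mapping of currently known fds to their dispatchers (assumed)
    unregister   -- callable that removes an fd from the poller (assumed)
    """
    live = []
    for fd in event_fds:
        if fd in combined_map:
            live.append(fd)
        else:
            # Nobody owns this fd any more, so stop polling it.
            unregister(fd)
    return live
```

For example, `prune_stale_fds([3, 4, 5], {3: "a", 5: "b"}, removed.append)` keeps 3 and 5 and passes 4 to the callback. Unlike a POLLNVAL check, this does not depend on the fd number still being unused when the event arrives.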