Fix high cpu usage caused by fd leak

aftersnow commented 1 year ago

We found a problem of high CPU usage of the supervisor. We believe it's same reason for #807 . This problem is caused by continuous polling of a wrong fd in the main loop of the supervisor. Busy polling leads to a CPU usage close to 100%. (We can confirm this problem through the strace tool)

This issue can be reproduced by:

Continuously initiate arbitrary requests to supervisor through supervisorctl
After the socket fd is closed, trigger the supervisor's subprocess to rotate the log (or reopen the file)
If the above steps are completed within a single main loop of the supervisor, the problem can be triggered

The reason for the problem is that supervisor relies on using _ignore_invalid() in the main loop to close fds. This method has a flaw that if fd is reused before _ignore_invalid() is called, then the fd may always exist in the fd list of poll .

This commit fixes the problem. By checking the validity of the fd in the event list in the main loop, if the fd is not in the combined_map, it is considered to be an invalid fd and will be removed from the list.

justinpryzby commented 1 year ago

Thanks. This looks promising, and I've deployed it a few places to test.

The warnings you added my be as important as the fix itself - if we had that 3 years ago, the bug probably would've been a lot more obvious. Are there any other warnings that should be added ? For example, what about when an FD that isn't a pipe ends up on the list of FDs to be polled, as I've also seen. I suspect there are more bugs with FDs, which may be rare and hard to hit, but it'll be amply easy to address them if logs are added to warn about inconsistencies.

I didn't understand what did you meant when you said "trigger the supervisor's subprocess to rotate the log" ? Do you mean by connecting to that process and causing it to write an adequately large logs to stdout ? Is it possible to consistently make the bug easier to hit by injecting a "sleep" command ?

aftersnow commented 1 year ago

Thanks. This looks promising, and I've deployed it a few places to test.

The warnings you added my be as important as the fix itself - if we had that 3 years ago, the bug probably would've been a lot more obvious. Are there any other warnings that should be added ? For example, what about when an FD that isn't a pipe ends up on the list of FDs to be polled, as I've also seen. I suspect there are more bugs with FDs, which may be rare and hard to hit, but it'll be amply easy to address them if logs are added to warn about inconsistencies.

Yes, more warnings is needed, but what important is we need to unregister the FD from polling list after each event is handled (if it's need to), instead of unregistering FD by _ignore_invalid().

I didn't understand what did you meant when you said "trigger the supervisor's subprocess to rotate the log" ? Do you mean by connecting to that process and causing it to write an adequately large logs to stdout ? Is it possible to consistently make the bug easier to hit by injecting a "sleep" command ?

Yes, but the faster method is to use a shared variable or signal to reproduce:

When the supervisorctl socket FD is closed, change the variable
When handling POutputDispatcher's read event, the above variable tells the logger to rotate log file, this will cause supervisor to reopen file and get a new FD. The FD must same with the closed socket FD.
Issue reproduced

ollofx commented 1 year ago

Thanks, @aftersnow

I'm glad I will be able to remove the sleep patch I added to the continuous loop I was sufferings from with supervisor,

I had to do it specifically for production environment to avoid 100% CPU for a long time, as some processes could be waiting 30 min before quitting.

Since the real fix came a couple of years later, this quick sleep patch has saved us some CPU, and money meanwhile.

So thanks again for allowing me to remove this once and for all !

    import time
    def poll(self, timeout):
        fds = self._poll_fds(timeout)
        # avoids 100% CPU when program is in STOPPING state
        # and client supervisorctl write socket open waiting
        time.sleep(1)
        readables, writables = [], []
        for fd, eventmask in fds:
            # that's method which you mentioned had a flaw
            # that if fd is reused before _ignore_invalid()
            if self._ignore_invalid(fd, eventmask):
                continue
            if eventmask & self.READ:
                readables.append(fd)
            if eventmask & self.WRITE:
                writables.append(fd)
        return readables, writables

mnaberez commented 1 year ago

I marked the comment with the sleep patch as outdated to make sure nobody confuses it with this patch.

justinpryzby commented 1 year ago

I've deployed this change to customers and saw no issues since last month. Thanks to @aftersnow for diagnosing the issue.

Tomo59 commented 1 year ago

Hello, I deployed also this change and I confirm it fixes the issue.

Do you know when this will be merged ?

Supervisor / supervisor

Fix high cpu usage caused by fd leak #1581