Epoll failure on interrupted system call

blechschmidt / massdns

A high-performance DNS stub resolver for bulk lookups and reconnaissance (subdomain enumeration)

GNU General Public License v3.0


Epoll failure on interrupted system call #134

Open suola opened 2 years ago

suola commented 2 years ago

I ran massdns for several hours to resolve a large number of domains, and it quit prematurely with

DEBUG Epoll failure: Interrupted system call

This comes from https://github.com/blechschmidt/massdns/blob/master/src/main.c#L2013.

I'm not an expert on this, but maybe epoll_wait() failures due to signal interrupts could be ignored?

At least that's the understanding I got from e.g. https://stackoverflow.com/a/6870391

blechschmidt commented 2 years ago

epoll_wait failures due to signal interrupts only cause this message to be printed. They do not terminate massdns. Since there is no change in state when epoll gets interrupted, the loop will just be re-entered and epoll_wait will be called again.
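For readers unfamiliar with the pattern, here is a minimal sketch of treating an interrupted epoll_wait as a non-error and simply calling it again. It is not the massdns code: `epoll_wait_retry` is a hypothetical helper name, and only `epoll_wait`, `errno` and `EINTR` are real Linux/POSIX names.

```c
#include <errno.h>
#include <sys/epoll.h>

/* Hypothetical helper (not part of massdns): call epoll_wait() and retry
 * when the call was interrupted by a signal, i.e. it returned -1 with
 * errno set to EINTR. Any other failure is passed back to the caller. */
static int epoll_wait_retry(int epfd, struct epoll_event *events,
                            int maxevents, int timeout_ms)
{
    for (;;) {
        int n = epoll_wait(epfd, events, maxevents, timeout_ms);
        if (n >= 0) {
            return n;   /* number of ready descriptors, 0 on timeout */
        }
        if (errno != EINTR) {
            return -1;  /* genuine error; errno is left for the caller */
        }
        /* Interrupted by a signal: no epoll state changed, so wait again. */
    }
}
```

Note that each retry in this sketch starts the full timeout over again. A caller that needs a hard deadline would have to recompute the remaining time before waiting again.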

Did you see the error message only once or did it occur repeatedly? Is there maybe any hint in dmesg that the OOM killer killed massdns?

suola commented 2 years ago

I ran massdns with --processes 4, and all 4 processes printed that message once and exited.

The reason I suspect this was a premature exit is that 3 out of 4 processes had already finished (their CPU usage was zero), and one of the output files was 25% smaller than the other three, which were basically equal in size.

I'm not sure whether this is relevant for this issue, but another observation: The reported massdns progress reached 100% quite a long time before the processes exited. This has occurred several times before, but the presumably premature exit with the Epoll warning has occurred only once.

suola commented 2 years ago

I took a look at the logs, and there were no signs of the OOM killer.

However, I noticed that the timestamp of the log message was incorrect (I redirect massdns logs to my common log file, and all the messages were buffered and flushed on exit).
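As a rough illustration of why buffered log output can carry misleading timestamps, and one way to avoid it, here is a sketch under two assumptions: that the log is written through a C stdio stream, and that the timestamp is attached when the text finally reaches the log file. Both helper names are hypothetical and nothing here is taken from massdns or from my logging setup.

```c
#include <stdio.h>
#include <time.h>

/* Hypothetical helper: switch a stdio stream to line buffering so each
 * complete line is written out as soon as it ends, instead of sitting in
 * a buffer until exit. Call it before any other I/O on the stream.
 * Returns 0 on success, as setvbuf() does. */
static int make_line_buffered(FILE *log)
{
    return setvbuf(log, NULL, _IOLBF, 0);
}

/* Hypothetical helper: prefix the message with the time at which it was
 * produced, so the timestamp stays correct even if the line is flushed
 * much later. Returns the fprintf() result, or -1 on failure. */
static int log_with_timestamp(FILE *log, const char *msg)
{
    char stamp[32];
    time_t now = time(NULL);
    struct tm *tm_now = localtime(&now);

    if (tm_now == NULL ||
        strftime(stamp, sizeof stamp, "%Y-%m-%d %H:%M:%S", tm_now) == 0) {
        return -1;
    }
    return fprintf(log, "%s %s\n", stamp, msg);
}
```

Either measure alone would probably have been enough in my case: line buffering makes the message show up when it happens, and stamping at write time keeps the recorded time right regardless of when the flush occurs.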

Therefore, this issue is a user error. The problem described in the previous message is real, though.