fail2ban / fail2ban

Daemon to ban hosts that cause multiple authentication errors
http://www.fail2ban.org

[BR]: ValueError: filedescriptor out of range in select() #3391

Open Jajcus opened 2 years ago

Jajcus commented 2 years ago

Environment:

Custom jail configuration monitoring very active systemd journal. LimitNOFILE=10240 set for fail2ban.service

The issue:

Our system logs heavily to the systemd journal. To keep this manageable we have limited the journal file size, so there are many 'small' (over 200 MB) journal files. fail2ban opens all of them, once for each configured jail that uses the journal. As I understand it, this is not the problem in itself but a limitation of libsystemd (there is no way to reliably open only the current journal file).

Without adjusting LimitNOFILE fail2ban would crash due to too many files open, as fail2ban opens more than 1024 files. Increasing the limit (to 10240) should fix the problem, but instead fail2ban crashes with:

Oct 25 14:19:27 machine fail2ban.asyncserver[493163]: ERROR filedescriptor out of range in select()
                                                   Traceback (most recent call last):
                                                     File "/usr/lib/python3/dist-packages/fail2ban/server/asyncserver.py", line 161, in loop
                                                       poll(timeout)
                                                     File "/usr/lib/python3.7/asyncore.py", line 144, in poll
                                                       r, w, e = select.select(r, w, e, timeout)
                                                   ValueError: filedescriptor out of range in select()
Oct 25 14:19:27 machine fail2ban.asyncserver[493163]: ERROR Too many errors - stop logging connection errors
Oct 25 14:19:27 machine fail2ban.asyncserver[493163]: CRITICAL Too many errors - critical count reached {'accept': 0, 'listen': 1001}

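This is because the fail2ban code forces the asyncore module to use the outdated select() call for watching file descriptors. On Linux, select() cannot handle any file descriptor numbered 1024 (FD_SETSIZE) or higher, so it fails once the process has more than about 1024 files open.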

Browsing the code and commit history suggests that the 'use_poll' setting for asyncore was considered but disabled for modern Python versions for some reason. I think it needs to be reconsidered; select() is outdated.
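The difference between the two calls can be reproduced outside fail2ban. The following is a minimal sketch, assuming Linux and CPython; the function name `compare_select_and_poll` and the target descriptor number 1100 are made up for illustration and are not part of fail2ban. It duplicates a pipe onto a descriptor numbered above 1024 and then hands that descriptor to both `select.select()` and `select.poll()`.

```python
import fcntl
import os
import resource
import select


def compare_select_and_poll(target_fd=1100):
    """Return (select_error, poll_events) for a descriptor numbered >= target_fd.

    Returns None if the hard RLIMIT_NOFILE is too low to run the experiment.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    needed = target_fd + 16  # small margin above the target descriptor number
    if soft != resource.RLIM_INFINITY and soft < needed:
        if hard != resource.RLIM_INFINITY and hard < needed:
            return None
        # Raise only the soft limit; the hard limit is left untouched.
        resource.setrlimit(resource.RLIMIT_NOFILE, (needed, hard))

    read_end, write_end = os.pipe()
    # F_DUPFD returns the lowest free descriptor number >= target_fd,
    # so there is no need to actually open a thousand files.
    high_fd = fcntl.fcntl(read_end, fcntl.F_DUPFD, target_fd)
    try:
        os.write(write_end, b"x")  # make the pipe readable
        try:
            select.select([high_fd], [], [], 0)
            select_error = None
        except ValueError as exc:
            select_error = str(exc)
        poller = select.poll()
        poller.register(high_fd, select.POLLIN)
        poll_events = poller.poll(0)
    finally:
        for fd in (read_end, write_end, high_fd):
            os.close(fd)
    return select_error, poll_events
```

On a system like the one described here, the first element should carry the same "filedescriptor out of range in select()" message as the traceback, while poll() reports the descriptor as readable.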

Steps to reproduce

Have more than 1024 systemd journal files and a fail2ban jail set to use journal. Or over 512 files and two such jails configured.
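The thresholds above follow from simple arithmetic: each journal-backed jail opens every journal file once. A rough sketch of that estimate follows; the helper names and the `baseline_fds` parameter are hypothetical, and the real count also includes sockets, the log file and other descriptors.

```python
FD_SETSIZE = 1024  # select(2) limit on Linux


def estimated_journal_fds(journal_files, journal_jails):
    """Each jail using the journal opens every journal file once."""
    return journal_files * journal_jails


def exceeds_select_limit(journal_files, journal_jails, baseline_fds=0):
    """True if descriptor numbers are likely to reach FD_SETSIZE.

    baseline_fds is an assumed count of descriptors fail2ban holds for
    everything other than journal files.
    """
    total = estimated_journal_fds(journal_files, journal_jails) + baseline_fds
    return total >= FD_SETSIZE
```

For example, `exceeds_select_limit(512, 2)` and `exceeds_select_limit(1024, 1)` both hit the limit, matching the two cases given above.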

Expected behavior

Everything works provided the file limit for fail2ban-server process is set high enough.

Observed behavior

fail2ban crashes with: ValueError: filedescriptor out of range in select()

sebres commented 2 years ago

To make this manageable we have limited journal file size, so ...

So the journal files are rotated on your system, right? I'm not asking about the current issue (I still need to understand or reproduce it before I can follow up on it), but about another issue, #3396, which shows that the systemd backend does not seem really suitable for fail2ban usage (at least together with rotation) at the moment. So how does it work on your side at all? Do you have a newer version of Python's systemd module? (I reproduced it, with no entries read after journal rotation, with 234-3+b4 and 234-2+b1.)

jcumming commented 9 months ago

I just ran across this; I added another nixos-container, and that seems to have caused fail2ban to start crashing.

lsof reports that the systemd backend is opening each container's logfiles -- there were ~1500 file descriptors open.
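The same check can be done without lsof by reading /proc. This is a small sketch, assuming Linux; the function names are made up for illustration, and inspecting another user's process (such as fail2ban-server) normally requires root.

```python
import os


def open_fd_targets(pid="self"):
    """Return the link targets of all descriptors listed in /proc/<pid>/fd."""
    fd_dir = f"/proc/{pid}/fd"
    targets = []
    for name in os.listdir(fd_dir):
        try:
            targets.append(os.readlink(os.path.join(fd_dir, name)))
        except OSError:
            # The descriptor was closed between listdir() and readlink().
            continue
    return targets


def summarize(targets):
    """Count all descriptors and those pointing at *.journal files."""
    journal = [t for t in targets if t.endswith(".journal")]
    return {"total": len(targets), "journal": len(journal)}
```

Calling `summarize(open_fd_targets(pid))` with the PID of the fail2ban server shows how many of its descriptors are journal files.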

I applied this patch that enables the poll2 implementation:

diff --git a/fail2ban/server/asyncserver.py b/fail2ban/server/asyncserver.py
index 0c36d846..79c7cbe3 100644
--- a/fail2ban/server/asyncserver.py
+++ b/fail2ban/server/asyncserver.py
@@ -244,7 +244,7 @@ class AsyncServer(asyncore.dispatcher):
        # @param sock: socket file.
        # @param force: remove the socket file if exists.

-       def start(self, sock, force, timeout=None, use_poll=False):
+       def start(self, sock, force, timeout=None, use_poll=True):
                self.__worker = threading.current_thread()
                self.__sock = sock
                # Remove socket

... and started it. I'll update this issue with results...

sebres commented 9 months ago

I don't understand how that should help if the real error is clearly too many file descriptors opened by the systemd journal monitoring. The poll error in the asyncserver loop is surely an aftereffect.

To solve the initial issue, one should either restrict the journal files/paths used by the systemd backend or increase the nofile limits. Alternatively, one could use rsyslog (in parallel with the systemd journal) and switch the jail(s) backend to auto to monitor log files instead of the journal.

jcumming commented 8 months ago

From the select(2) man page:

DESCRIPTION WARNING: select() can monitor only file descriptors numbers that are less than FD_SETSIZE (1024)—an unreasonably low limit for many modern applications—and this limitation will not change. All modern applications should instead use poll(2) or epoll(7), which do not suffer this limitation.

... continuing to use the select() backend will cap the number of logfiles that can be monitored at 1024.

Some of the other stuff I tried:

The systemd unit has LimitNOFILE = 65536.

My initial attempts to restrict the systemd backend were ineffective; the backend still opened all of the container systemd logfiles too. Running lsof on the fail2ban process showed that it had ~1500 file descriptors open.

I might have made a few mistakes; I'll revisit by adding something like backend = systemd[journalfiles="/var/log/journal/*.something/system.journal"] to the jail configs