Atoptool / atop

System and process monitor for Linux
GNU General Public License v2.0

Atop service may not start on high core count boxes #304

Closed ryanbowen closed 2 months ago

ryanbowen commented 3 months ago

This was a fun one to debug. It seems that atop holds roughly 2 * $CPUS + 30 open file descriptors when run as root. When atop runs as a service on a high core count box, this can push it over systemd's default limit of 1024 open files, which prevents it from starting.

This usually shows up as a failure to open /proc/loadavg, which I'm guessing is the first place where a failed open is fatal:

May 10 12:27:01 host01 systemd[1]: Starting Atop advanced performance monitor...
May 10 12:27:01 host01 systemd[1]: Started Atop advanced performance monitor.
May 10 12:27:01 host01 sh[1533798]: can not open /proc/loadavg
May 10 12:27:01 host01 systemd[1]: atop.service: Main process exited, code=exited, status=53/n/a
May 10 12:27:01 host01 systemd[1]: atop.service: Failed with result 'exit-code'.

For reference, on this host it breaks:

root@host01:~# lscpu | grep '^CPU(s)'
CPU(s):              512
root@host01(toa):~# lsof -p `pgrep -x atop` | wc -l
1055

On this one it's fine:

root@host02:~# lscpu | grep '^CPU(s):'
CPU(s):              64
root@host02(psc|qa):~$ lsof -p `pgrep -x atop --newest` | wc -l
159
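
The rule of thumb from this report (about 2 descriptors per CPU plus about 30) can be checked against a given limit with a small sketch like the one below. The function names are hypothetical, and the formula is only the reporter's estimate, not something taken from the atop source.

```shell
# Reporter's estimate: roughly 2 descriptors per CPU plus about 30.
# Argument: number of CPUs.
estimate_atop_fds() {
    echo $(( 2 * $1 + 30 ))
}

# Succeeds (exit status 0) if the estimate for the given CPU count
# fits within the given open-file limit.
# Arguments: number of CPUs, open-file limit.
fits_nofile_limit() {
    [ "$(estimate_atop_fds "$1")" -le "$2" ]
}

estimate_atop_fds 512   # prints 1054
estimate_atop_fds 64    # prints 158
```

With a limit of 1024, `fits_nofile_limit 512 1024` fails and `fits_nofile_limit 64 1024` succeeds, which matches the two hosts above. Note that `lsof -p ... | wc -l` also counts the header line and rows that are not numbered descriptors (such as cwd, txt and mem), so counting the entries in /proc/<pid>/fd gives a more exact figure.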

sreerajkksd commented 3 months ago

We can update the systemd configuration in https://github.com/Atoptool/atop/blob/master/atop.service to include:

LimitNOFILE=4096
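
On a host running an atop version without a fix, the same setting could be applied locally through a systemd drop-in instead of editing the packaged unit file. The helper below is a hypothetical sketch: its name and the `override.conf` file name are assumptions, and the value 4096 is just the one suggested in this comment.

```shell
# Hypothetical helper: write a LimitNOFILE drop-in into the given directory.
# Arguments: drop-in directory, open-file limit.
# On a real host the directory would typically be
# /etc/systemd/system/atop.service.d
write_nofile_dropin() {
    dropin_dir=$1
    limit=$2
    mkdir -p "$dropin_dir" || return 1
    printf '[Service]\nLimitNOFILE=%s\n' "$limit" > "$dropin_dir/override.conf"
}

# Usage on a real host (as root), left commented out on purpose:
#   write_nofile_dropin /etc/systemd/system/atop.service.d 4096
#   systemctl daemon-reload
#   systemctl restart atop
```

A drop-in like this only affects the service. An atop started from a terminal still runs under the limits of the invoking shell.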

Atoptool commented 2 months ago

Great debugging! Atop now automatically raises its open-file limit to the number needed for the current number of CPUs. This solution is preferred over setting a fixed number in the atop.service file, which would just introduce another (higher) limit. Besides, a setting in the service file would not solve this issue for an interactive run.
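
The thread does not show the actual change in atop, so the following is only a shell analogue of the idea: raise the soft open-file limit to what is needed, without exceeding the hard limit. The function name is hypothetical, and the caller supplies the needed count (for example from the reporter's 2 * CPUs + 30 estimate, which may differ from whatever atop computes internally).

```shell
# Hypothetical helper: raise the soft open-file limit of the current shell
# to at least the given number of descriptors.
# Returns 0 if the limit is sufficient afterwards; returns 1 if the hard
# limit is too low, after raising the soft limit as far as the hard limit.
raise_nofile_soft_limit() {
    needed=$1
    soft=$(ulimit -Sn)
    hard=$(ulimit -Hn)

    # Nothing to do if the soft limit is already high enough.
    if [ "$soft" = unlimited ] || [ "$soft" -ge "$needed" ]; then
        return 0
    fi

    # An unprivileged process cannot raise the soft limit above the hard limit.
    if [ "$hard" != unlimited ] && [ "$hard" -lt "$needed" ]; then
        ulimit -Sn "$hard"
        return 1
    fi

    ulimit -Sn "$needed"
}

# Example call, left commented out on purpose:
#   raise_nofile_soft_limit "$(( 2 * $(getconf _NPROCESSORS_ONLN) + 30 ))"
```

Because the limit is adjusted by the process itself, this approach works the same way under systemd and in an interactive run, which is the advantage described in the comment above.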