Atoptool / atop

System and process monitor for Linux
GNU General Public License v2.0

broken atopacct blocks atop indefinitely #207

Open Zugschlus opened 1 year ago

Zugschlus commented 1 year ago

Hi,

when something goes wrong in atopacct, it keeps holding a system-wide semaphore, which causes subsequent calls to atop to stall indefinitely in:

getuid()                                = 1000
setresuid(-1, 1000, -1)                 = 0
semtimedop(1, [{0, -1, SEM_UNDO}, {1, -1, SEM_UNDO}], 2, NULL

This happens when the Debian package is installed on an s390x system. Unfortunately, I don't have root on that system and therefore cannot see what atopacct does when it happens. The other architectures Debian builds for are fine.

Therefore, this issue has two parts:

  1. atopacct should not block the semaphore on s390x systems
  2. atop itself should time out and terminate with a meaningful error message if it cannot obtain the semaphore

Zugschlus commented 1 year ago

Due to this issue, atop will be removed from Debian testing next week.

gleventhal commented 1 year ago

Have you tried clearing the atopacct state to resolve the issue? Something like: mv /var/run/pacct_shadow.d{,.orig} && systemctl start atopacct

Zugschlus commented 1 year ago

The main problem is that I don't see this behavior on any box I have immediate shell access to. I cannot try anything there short of writing a test case, building that test case into an official package and uploading this package to Debian. I'd really like to avoid that.

The real showstopper is that atop waits indefinitely and silently for the semaphore until the test is aborted with a timeout. As I wrote in the original bug report, we have two problems there that should both be addressed.

Marc

Atoptool commented 1 year ago

Part 2 of the issue has been solved: atop times out after waiting 3 seconds for the semaphore and then continues without process accounting.
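
For readers following along, a bounded wait of this kind can be sketched with semtimedop(). This is only an illustration and not atop's actual code: the helper names create_sems and claim_with_timeout and the semun_arg union are hypothetical, and the two-semaphore layout merely mirrors the strace excerpt above. The only difference from the blocking call in the trace is that a timespec is passed instead of NULL.

```c
#define _GNU_SOURCE             /* semtimedop() is a Linux/glibc extension */
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <time.h>

/* glibc does not define the semctl() argument union; declare our own. */
union semun_arg { int val; struct semid_ds *buf; unsigned short *array; };

/* Hypothetical helper: create a private set of two semaphores,
 * both initialised to `value`. Returns the set id, or -1 on error. */
int create_sems(int value)
{
    int semid = semget(IPC_PRIVATE, 2, IPC_CREAT | 0600);
    if (semid == -1)
        return -1;

    union semun_arg arg = { .val = value };
    for (int i = 0; i < 2; i++) {
        if (semctl(semid, i, SETVAL, arg) == -1) {
            semctl(semid, 0, IPC_RMID);
            return -1;
        }
    }
    return semid;
}

/* Hypothetical helper: try to decrement semaphores 0 and 1 (the same
 * operations as in the strace output), but give up after `seconds`.
 * Returns 0 when claimed, 1 on timeout, -1 on any other error. */
int claim_with_timeout(int semid, time_t seconds)
{
    assert(seconds >= 0);

    struct sembuf ops[2] = {
        { .sem_num = 0, .sem_op = -1, .sem_flg = SEM_UNDO },
        { .sem_num = 1, .sem_op = -1, .sem_flg = SEM_UNDO },
    };
    struct timespec limit = { .tv_sec = seconds, .tv_nsec = 0 };

    while (semtimedop(semid, ops, 2, &limit) == -1) {
        if (errno == EAGAIN)
            return 1;           /* timeout expired: caller can carry on */
        if (errno != EINTR)
            return -1;          /* unexpected failure */
        /* EINTR: interrupted by a signal; this sketch simply retries
         * with the full timeout again. */
    }
    return 0;
}
```

A caller would treat a return value of 1 as "semaphore not available" and continue without process accounting rather than blocking.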

Atoptool commented 1 year ago

I do not understand part 1 of the issue: between claiming the semaphore in atopacctd and releasing it, there are no blocking calls. Even if atopacctd terminated after claiming the semaphore, the SEM_UNDO flag would take care of releasing the semaphore automatically.
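
The SEM_UNDO behavior described here can be checked in isolation with a small experiment. The sketch below is not taken from atopacctd; the function name value_after_holder_exits and the semun_arg union are hypothetical. A child process claims a semaphore and exits without releasing it, and the parent then reads the semaphore value. Note that the kernel applies the undo only when the holder terminates, so a holder that stays alive would still keep the semaphore.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/wait.h>
#include <unistd.h>

/* glibc does not define the semctl() argument union; declare our own. */
union semun_arg { int val; struct semid_ds *buf; unsigned short *array; };

/* Hypothetical experiment: a semaphore starts at 1, a child decrements it
 * using `flags` (0 or SEM_UNDO) and exits without releasing it.
 * Returns the semaphore value seen by the parent afterwards, -1 on error. */
int value_after_holder_exits(short flags)
{
    assert(flags == 0 || flags == SEM_UNDO);    /* only cases covered here */

    int semid = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
    if (semid == -1)
        return -1;

    union semun_arg arg = { .val = 1 };
    if (semctl(semid, 0, SETVAL, arg) == -1) {
        semctl(semid, 0, IPC_RMID);
        return -1;
    }

    pid_t pid = fork();
    if (pid == -1) {
        semctl(semid, 0, IPC_RMID);
        return -1;
    }
    if (pid == 0) {
        /* Child: claim the semaphore, then exit while still holding it. */
        struct sembuf claim = { .sem_num = 0, .sem_op = -1, .sem_flg = flags };
        _exit(semop(semid, &claim, 1) == 0 ? 0 : 1);
    }

    int status = 0;
    if (waitpid(pid, &status, 0) == -1) {
        semctl(semid, 0, IPC_RMID);
        return -1;
    }

    /* With SEM_UNDO the kernel has reverted the child's decrement by now. */
    int value = semctl(semid, 0, GETVAL);
    semctl(semid, 0, IPC_RMID);

    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;
    return value;
}
```

With SEM_UNDO the parent should see the value restored to 1; with flags set to 0 the value stays at 0, which is the situation a later semop() with no timeout would block on.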

Atoptool commented 1 year ago

Is it possible for you to gain root privileges on the test system to issue a system call trace with strace to see where atopacctd blocks?

Zugschlus commented 1 year ago

I currently don't even have shell access to the (only) test box that shows the behavior. I'm trying to find out whether atop 2.8.1 passes the test, as the timing is really tight to get atop back into Debian testing (Debian is planning to freeze). I apologize for not having prioritized this properly.