NordicHPC / sonar

Tool to profile usage of HPC resources by regularly probing processes using ps.
GNU General Public License v3.0

User name reported as "_noinfo_" repeatedly #144

Closed lars-t-hansen closed 7 months ago

lars-t-hansen commented 7 months ago

This is a new problem on ML6. Just before 2 AM on 21 February, a lot of python and perl processes were reported as using the CPU heavily and as being run by user _noinfo_, which means the UID could not be found in the passwd database. To allow problems like this to be diagnosed, we should log the UID as part of the _noinfo_ string.
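A minimal sketch of what logging the UID could look like. The helper name `user_name_or_placeholder`, the `lookup` closure standing in for the passwd query, and the exact `_noinfo_<uid>` format are all assumptions made for illustration, not sonar's actual code:

```rust
/// Hypothetical helper: returns the user name for `uid`, or a placeholder
/// that embeds the numeric UID when the lookup yields nothing.
/// `lookup` stands in for whatever passwd query the tool really uses.
fn user_name_or_placeholder(uid: u32, lookup: impl Fn(u32) -> Option<String>) -> String {
    match lookup(uid) {
        Some(name) => name,
        // Assumed format: keep the "_noinfo_" prefix and append the UID so
        // the failing lookup can be traced from the log afterwards.
        None => format!("_noinfo_{uid}"),
    }
}
```

Keeping the `_noinfo_` prefix means existing consumers that match on that prefix would probably keep working, while the suffix identifies the UID whose lookup failed.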

lars-t-hansen commented 7 months ago

Grubbing through the source and GitHub for the users library that we use for this, I find that it is no longer maintained (https://github.com/ogham/rust-users), and it probably does not handle all possible errors as smoothly as it should. It retries on ERANGE (not enough buffer space) but not on ENOMEM (insufficient memory to allocate the passwd structure), which could be transient, nor on several other cases; see the man page for getpwuid_r. Given the load this machine was under last night, we could simply have run into a transient failure due to resource exhaustion.
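The retry behaviour discussed here could be sketched roughly as follows. This is not the code of the users library or of sonar: `LookupError`, `lookup_with_retry`, the initial buffer size of 1024, and the `lookup` closure (standing in for a wrapper around getpwuid_r) are hypothetical names and values chosen for illustration.

```rust
/// Hypothetical classification of getpwuid_r outcomes.
#[derive(Debug, PartialEq)]
enum LookupError {
    NotFound,       // no passwd entry for the UID
    BufferTooSmall, // corresponds to ERANGE
    Transient,      // e.g. ENOMEM; possibly other resource-exhaustion errors
    Other(i32),     // any other errno value, treated as permanent
}

/// Calls `lookup(uid, buf_size)` up to `max_attempts` times (at least once).
/// The buffer size is doubled after BufferTooSmall, and the call is simply
/// repeated after Transient. Every other error is returned immediately.
fn lookup_with_retry<F>(uid: u32, max_attempts: u32, mut lookup: F) -> Result<String, LookupError>
where
    F: FnMut(u32, usize) -> Result<String, LookupError>,
{
    let mut buf_size: usize = 1024; // assumed starting size
    let mut attempt = 0;
    loop {
        attempt += 1;
        match lookup(uid, buf_size) {
            Ok(name) => return Ok(name),
            Err(e @ (LookupError::BufferTooSmall | LookupError::Transient)) => {
                if attempt >= max_attempts {
                    return Err(e); // give up and report the last retryable error
                }
                if e == LookupError::BufferTooSmall {
                    buf_size *= 2;
                }
            }
            Err(e) => return Err(e), // NotFound and Other are not retried
        }
    }
}
```

A real implementation would probably also want a short pause between attempts after a transient error, since retrying immediately on a machine that is out of memory is unlikely to help.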

There's a maintained fork of this library called uzers (see https://github.com/ogham/rust-users/issues/54) but given our very simple needs and the MIT license of the code we might be better off just lifting what code we need and maintaining that ourselves.

bast commented 7 months ago

I agree that we had better lift out the part that we need, with attribution, for easier maintainability.

lars-t-hansen commented 6 months ago

A couple of possible reasons why we can't map UID to username: