dun / munge

MUNGE (MUNGE Uid 'N' Gid Emporium) is an authentication service for creating and validating user credentials.
GNU Lesser General Public License v3.0

Failed to query password file entry for "user" #124

Closed by nbacking 2 years ago

nbacking commented 2 years ago

I rebooted a cluster head node yesterday. The cluster runs Slurm, which uses MUNGE to authenticate. Last night I started getting error messages when submitting jobs. I am very new to all of this, but after a lot of reading last night I was able to verify that munged appears to be running (output below): systemctl status -l munge says it's running, and munge -n | unmunge works.

But when I submit a job I still get the errors below. Any help would be appreciated.

If munged is up, restart with --num-threads=10
Munge encode failed: Unable to access "/var/run/munge/munge.socket.2": No such file or directory
authentication: invalid authentication credential
batch job submission failed: protocol authentication error

[root@master01 munge]# systemctl status -l munge
● munge.service - MUNGE authentication service
   Loaded: loaded (/usr/lib/systemd/system/munge.service; static; vendor preset: disabled)
   Active: active (running) since Fri 2022-08-26 23:11:05 EDT; 9h ago
     Docs: man:munged(8)
  Process: 41542 ExecStart=/usr/sbin/munged --force (code=exited, status=0/SUCCESS)
 Main PID: 41544 (munged)
    Tasks: 4
   CGroup: /system.slice/munge.service
           └─41544 /usr/sbin/munged --force

Aug 26 23:11:05 master01 systemd[1]: Starting MUNGE authentication service...
Aug 26 23:11:05 master01 systemd[1]: Started MUNGE authentication service.

[root@master01 munge]# munge -n | unmunge
STATUS:           Success (0)
ENCODE_HOST:      master01.cm.cluster (10.141.255.254)
ENCODE_TIME:      2022-08-27 08:50:40 -0400 (1661604640)
DECODE_TIME:      2022-08-27 08:50:40 -0400 (1661604640)
TTL:              300
CIPHER:           aes128 (4)
MAC:              sha256 (5)
ZIP:              none (0)
UID:              root (0)
GID:              root (0)
LENGTH:           0


dun commented 2 years ago

munged appears to be running. From your output above, you are able to encode and decode credentials on host "master01".

The munged socket is created when the daemon starts, and removed when the daemon gracefully terminates. The default location of the socket is listed in the munged --help message for the --socket option (shown in brackets). For example:

$ /usr/sbin/munged --help | grep socket=
  -S, --socket=PATH         Specify local socket [/run/munge/munge.socket.2]

$ /usr/sbin/munged --help | sed -ne '/socket=/ s/.*\[\(.*\)\]/\1/p'
/run/munge/munge.socket.2

The above errors for sbatch and squeue (Failed to access "/var/run/munge/munge.socket.2": No such file or directory) appear to show that munged is not running on the host that invoked sbatch and squeue. Check if munged is running on that host as well. If it is running, you should see the socket /var/run/munge/munge.socket.2.
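A minimal sketch of that check, to be run on the host where sbatch or squeue failed. The socket path is taken from the error message; the MUNGE_SOCKET variable and the check_munge_socket function are hypothetical names used only for this illustration.

```shell
#!/bin/sh
# Sketch: test whether the munged socket exists on this host.
# MUNGE_SOCKET is a hypothetical override; the default path comes from the error message.
SOCKET="${MUNGE_SOCKET:-/var/run/munge/munge.socket.2}"

check_munge_socket() {
    # -S is true only if the path exists and is a socket
    if [ -S "$1" ]; then
        echo "socket present: $1"
        return 0
    else
        echo "socket missing: $1"
        return 1
    fi
}

check_munge_socket "$SOCKET" ||
    echo "munged is probably not running here; check: systemctl status -l munge"
```

If the socket is missing, munged has most likely not been started on that host (or it was built with a different socket path, which the --help output shown earlier would reveal).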

munged needs to be running on all nodes in the cluster, and its key file will need to be securely copied to all nodes as well.
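One way to confirm both conditions at once is to encode a credential locally and decode it on each remote node, since a successful remote decode implies munged is running there with the same key. The sketch below assumes passwordless ssh to the nodes; the check_munge_nodes function and the node names in the usage comment are hypothetical.

```shell
#!/bin/sh
# Sketch: encode a credential on this host and decode it on each remote node.
# Assumes ssh access to every node named on the command line.

check_munge_nodes() {
    failed=0
    for node in "$@"; do
        # "munge -n" encodes a credential with no payload;
        # unmunge on the remote side must share the key to decode it.
        if munge -n | ssh "$node" unmunge >/dev/null 2>&1; then
            echo "$node: ok"
        else
            echo "$node: FAILED"
            failed=1
        fi
    done
    return "$failed"
}

# Example (hypothetical node names):
#   check_munge_nodes node001 node002
```

A node reported as FAILED is either not running munged, has a different key, or is unreachable over ssh; running the remote command by hand shows which.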

Regarding this issue's title (Failed to query password file entry for "user"), the following message can be generated by munged:

Info: Failed to query passwd file for "foo": User not found

This is an informational message that occurs when the /etc/group file contains a group to which user "foo" belongs, but user "foo" is not listed in the /etc/passwd file.
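To find which users would trigger that message, one can list the names that appear as group members but have no passwd entry. This is only a sketch: it reads the flat files directly and ignores other name-service sources such as LDAP or SSSD, and the missing_group_members function name is hypothetical.

```shell
#!/bin/sh
# Sketch: print users listed as group members that are absent from the passwd file.
# Arguments: group file (default /etc/group), passwd file (default /etc/passwd).

missing_group_members() {
    group_file="${1:-/etc/group}"
    passwd_file="${2:-/etc/passwd}"
    awk -F: '
        NR == FNR { known[$1] = 1; next }    # first file: collect passwd usernames
        {
            n = split($4, members, ",")      # fourth field: comma-separated member list
            for (i = 1; i <= n; i++)
                if (members[i] != "" && !(members[i] in known))
                    print members[i]
        }
    ' "$passwd_file" "$group_file" | sort -u
}

missing_group_members
```

Each name printed is a candidate for cleanup, either by removing it from the group's member list or by restoring the missing account.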

nbacking commented 2 years ago

Thanks, that worked. The issue was not on the head node but on a cluster node. It was just a coincidence that I rebooted the cluster at the same time this happened, which was confusing. I found the problem on the graphical node, and as soon as I restarted the service there I was good as gold. Thanks for the help.