munged appears to be running. From your output above, you are able to encode and decode credentials on host "master01".
The munged socket is created when the daemon starts, and removed when the daemon gracefully terminates. The default location of the socket is listed in the munged --help message for the --socket option (shown in brackets). For example:
$ /usr/sbin/munged --help | grep socket=
-S, --socket=PATH Specify local socket [/run/munge/munge.socket.2]
$ /usr/sbin/munged --help | sed -ne '/socket=/ s/.*\[\(.*\)\]/\1/p'
/run/munge/munge.socket.2
The above errors for sbatch and squeue (Failed to access "/var/run/munge/munge.socket.2": No such file or directory) appear to show that munged is not running on the host that invoked sbatch and squeue. Check if munged is running on that host as well. If it is running, you should see the socket /var/run/munge/munge.socket.2.
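If it helps, here is a minimal sketch of that socket check, to be run on the host where sbatch and squeue fail. The function name check_munge_socket is made up for this example, and the path is simply the one from your error message; adjust it if your build reports a different default.

```shell
#!/bin/sh
# Hypothetical helper (not part of MUNGE): report whether PATH exists
# and is a Unix domain socket.
check_munge_socket() {
    sock=$1
    if [ -S "$sock" ]; then
        echo "ok: $sock is a socket"
        return 0
    fi
    # uname -n prints the node name, so the message shows which host was checked.
    echo "missing: $sock (is munged running on $(uname -n)?)"
    return 1
}

# Path taken from the error message in this thread.
check_munge_socket /var/run/munge/munge.socket.2 || true
```

If this reports the socket as missing, munged is most likely not running on that host (or it is using a different socket path than the client expects).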
munged needs to be running on all nodes in the cluster, and its key file will need to be securely copied to all nodes as well.
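One way to check this across the cluster is to encode a credential locally and decode it on each remote node, which only succeeds if munged is running there with the same key. The sketch below assumes passwordless ssh to each node; the function name check_munge_nodes, the REMOTE_SH variable, and the node names in the usage note are hypothetical and only for illustration.

```shell
#!/bin/sh
# Hypothetical helper: encode a credential on this host and try to decode it
# on each node given as an argument.
# REMOTE_SH is an assumption of this sketch; it defaults to ssh.
check_munge_nodes() {
    failed=0
    for node in "$@"; do
        # The pipeline's exit status is that of the remote unmunge.
        if munge -n | ${REMOTE_SH:-ssh} "$node" unmunge >/dev/null 2>&1; then
            echo "$node: ok"
        else
            echo "$node: FAILED"
            failed=1
        fi
    done
    return "$failed"
}
```

For example, check_munge_nodes node01 node02 prints one line per node and returns non-zero if any node could not decode the credential. A failure on a node usually means munged is not running there or its key differs from the local one.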
Regarding this issue's title (Failed to query password file entry for "user"), the following message can be generated by munged:
Info: Failed to query passwd file for "foo": User not found
This is an informational message that occurs when the /etc/group file contains a group to which user "foo" belongs, but user "foo" is not listed in the /etc/passwd file.
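To find which users trigger that message, you can list group members that have no passwd entry. This is a rough sketch that only reads the flat files, so it will not see users that come from LDAP or another name service; the function name orphan_group_members is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: print every user named in a group's member list
# that has no entry in the passwd file.
# Usage: orphan_group_members [GROUP_FILE] [PASSWD_FILE]
orphan_group_members() {
    group_file=${1:-/etc/group}
    passwd_file=${2:-/etc/passwd}
    awk -F: '
        # First file (passwd): remember every user name.
        NR == FNR { known[$1] = 1; next }
        # Second file (group): field 4 is the comma-separated member list.
        $4 != "" {
            n = split($4, members, ",")
            for (i = 1; i <= n; i++)
                if (!(members[i] in known))
                    print members[i] " (group " $1 ")"
        }
    ' "$passwd_file" "$group_file"
}
```

Calling orphan_group_members with no arguments checks /etc/group against /etc/passwd. Each user it prints can either be removed from the group or given a passwd entry.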
Thanks, that worked. The issue was not on the head node but on a cluster node. It was just a coincidence that I rebooted the cluster at the same time this happened, which was confusing. I found the issue on the graphical node, and as soon as I restarted the service I was good as gold. Thanks for the help.
I rebooted a cluster head node yesterday. It runs Slurm, which uses MUNGE to authenticate. Last night I started getting error messages when submitting jobs. I am very new to all of this, but after reading a lot last night I was able to verify that munged looks like it is running (output below): systemctl status -l munge says it's running, and munge -n | unmunge works.
But when I submit a job I still get the following errors. Any help would be appreciated.
If munged is up, restart with --num-threads=10
Munge encode failed: Unable to access "/var/run/munge/munge.socket.2": No such file or directory
authentication: invalid authentication credential
batch job submission failed: protocol authentication error
[root@master01 munge]# systemctl status -l munge
● munge.service - MUNGE authentication service
   Loaded: loaded (/usr/lib/systemd/system/munge.service; static; vendor preset: disabled)
   Active: active (running) since Fri 2022-08-26 23:11:05 EDT; 9h ago
     Docs: man:munged(8)
  Process: 41542 ExecStart=/usr/sbin/munged --force (code=exited, status=0/SUCCESS)
 Main PID: 41544 (munged)
    Tasks: 4
   CGroup: /system.slice/munge.service
           └─41544 /usr/sbin/munged --force

Aug 26 23:11:05 master01 systemd[1]: Starting MUNGE authentication service...
Aug 26 23:11:05 master01 systemd[1]: Started MUNGE authentication service.
[root@master01 munge]# munge -n | unmunge
STATUS:          Success (0)
ENCODE_HOST:     master01.cm.cluster (10.141.255.254)
ENCODE_TIME:     2022-08-27 08:50:40 -0400 (1661604640)
DECODE_TIME:     2022-08-27 08:50:40 -0400 (1661604640)
TTL:             300
CIPHER:          aes128 (4)
MAC:             sha256 (5)
ZIP:             none (0)
UID:             root (0)
GID:             root (0)
LENGTH:          0