besawn closed this issue 3 years ago
Root cause:
The libcsmpam.so module is not being called correctly for its session entry because of the line order in the default /etc/pam.d/sshd configuration file. The problem can be resolved by reversing the order of these two lines in the file:
[root@c650f99p36 dcgm_2081_regression_tests]# cat /etc/pam.d/sshd | tail -n 2
session include postlogin
session required libcsmpam.so
[root@c650f99p36 dcgm_2081_regression_tests]# vi /etc/pam.d/sshd
[root@c650f99p36 dcgm_2081_regression_tests]# cat /etc/pam.d/sshd | tail -n 2
session required libcsmpam.so
session include postlogin
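For repeated setups, the same swap can be scripted. This is only a sketch: it operates on a temp copy rather than the real /etc/pam.d/sshd, assumes the two session lines are exactly the last two lines of the file, and relies on GNU coreutils (`tac`, `head -n -2`). On a real node, back up the file first and verify the result.

```shell
# Sketch: swap the last two lines of a PAM config copy so that the
# libcsmpam.so session line comes before the postlogin include.
cfg=$(mktemp)
printf 'session    include      postlogin\nsession    required     libcsmpam.so\n' > "$cfg"
# Keep everything except the last two lines, then append those two reversed
{ head -n -2 "$cfg"; tail -n 2 "$cfg" | tac; } > "$cfg.new"
mv "$cfg.new" "$cfg"
cat "$cfg"
```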
After the change above:
[root@c650f99p06 c650mnp05]# su - besawn -c "ssh c650f99p36"
Last login: Thu Aug 13 17:04:34 2020 from 10.7.99.6
# Verify that we are now correctly placed in the allocation cgroup
[besawn@c650f99p36 ~]$ cat /proc/self/cgroup | egrep "allocation|csm"
12:cpuset:/allocation_3
9:devices:/allocation_3
3:cpu,cpuacct:/allocation_3
2:memory:/allocation_3
# Run the yes command for 5 seconds on 8 CPUs
[besawn@c650f99p36 ~]$ /usr/bin/seq 8 | /usr/bin/xargs -P0 -n1 timeout 5 yes > /dev/null &
[1] 1409543
# Confirm that the tasks are now on the expected CPUs and in the correct allocation cgroup
[besawn@c650f99p36 ~]$ ps awxf -eo pid,user,psr,cgroup,cmd | egrep "USER|yes" | grep -v "timeout"
PID USER PSR CGROUP CMD
1409545 besawn 70 12:cpuset:/allocation_3,9:d | | \_ yes 1
1409547 besawn 60 12:cpuset:/allocation_3,9:d | | \_ yes 2
1409549 besawn 76 12:cpuset:/allocation_3,9:d | | \_ yes 3
1409551 besawn 36 12:cpuset:/allocation_3,9:d | | \_ yes 4
1409554 besawn 53 12:cpuset:/allocation_3,9:d | | \_ yes 5
1409557 besawn 24 12:cpuset:/allocation_3,9:d | | \_ yes 6
1409558 besawn 22 12:cpuset:/allocation_3,9:d | | \_ yes 7
1409559 besawn 46 12:cpuset:/allocation_3,9:d | \_ yes 8
1409561 besawn 85 12:cpuset:/allocation_3,9:d \_ grep -E --color=auto USER|yes
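As a cross-check, the PSR values above should all fall within the CPU list of the allocation's cpuset (on cgroup v1 that list lives in a file such as /sys/fs/cgroup/cpuset/allocation_3/cpuset.cpus). The sketch below expands the kernel's range syntax into individual CPU ids so the PSR column can be compared against it; the range string here is illustrative, not this node's actual allocation.

```shell
# Sketch: expand a cpuset.cpus range string (e.g. "22-24,36") into
# one CPU id per line, for comparison with the ps PSR column.
expand_cpuset() {
    echo "$1" | tr ',' '\n' | while IFS=- read -r lo hi; do
        seq "$lo" "${hi:-$lo}"   # a bare "36" has no hi, so seq 36 36
    done
}
expand_cpuset "22-24,36"
```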
While debugging, I was able to observe whether libcsmpam.so was being called by the sshd PAM process using the following steps. The typical recommended configuration has two entries for libcsmpam.so in /etc/pam.d/sshd, one for account and one for session:
[root@c650f99p36 ~]# cat /etc/pam.d/sshd
#%PAM-1.0
auth substack password-auth
auth include postlogin
account required pam_sepermit.so
account required pam_nologin.so
account required libcsmpam.so
account include password-auth
password include password-auth
# pam_selinux.so close should be the first session rule
session required pam_selinux.so close
session required pam_loginuid.so
# pam_selinux.so open should only be followed by sessions to be executed in the user context
session required pam_selinux.so open env_params
session required pam_namespace.so
session optional pam_keyinit.so force revoke
session optional pam_motd.so
session include password-auth
session required libcsmpam.so
session include postlogin
Two csm_cgroup_login entries should be observed in csm_compute.log when the csmapi log level has been configured to trace and the user logs in to the compute node:
[root@c650f99p36 ~]# /opt/ibm/csm/sbin/csm_ctrl_cmd --log.csmapi trace
Started client
### RESPONSE FROM THE LOCAL DAEMON ###
Error response could not be determined#######################################
[root@c650f99p36 ~]# tail -f /var/log/ibm/csm/csm_compute.log
[COMPUTE]2020-08-13 17:25:46.078915 csmnet::info | ReliableMsg: Connection verified: 1.8.1; host: tmp/csmi0.1409308.64930; seq#: 0
[COMPUTE]2020-08-13 17:25:46.079049 csmapi::info | CSM_CMD_cgroup_login[1286720182]; Client Sent; PID: 1409308; UID:0; GID:0
[COMPUTE]2020-08-13 17:25:46.079866 csmapi::info | [1286720182]; csm_cgroup_login start
[COMPUTE]2020-08-13 17:25:46.079995 csmapi::trace | Enter NetworkMessageState State ID: 0
[COMPUTE]2020-08-13 17:25:46.080022 csmapi::trace | Enter CGLoginInitState
[COMPUTE]2020-08-13 17:25:46.080136 csmapi::trace | Exit CGLoginInitState
[COMPUTE]2020-08-13 17:25:46.080163 csmapi::debug | TIMING: 1286720182,59,0,1597353946079847150,1597353946080162238,315088
[COMPUTE]2020-08-13 17:25:46.080184 csmapi::info | [1286720182]; csm_cgroup_login end
[COMPUTE]2020-08-13 17:25:46.080303 csmapi::info | CSM_CMD_cgroup_login[1286720182]; Client Recv; PID: 1409308; UID:0; GID:0
[COMPUTE]2020-08-13 17:25:46.080358 csmnet::info | Received disconnect msg from client tmp/csmi0.1409308.64930
[COMPUTE]2020-08-13 17:25:46.080413 csmnet::info | Deleting known client address: tmp/csmi0.1409308.64930
[COMPUTE]2020-08-13 17:25:46.155462 csmnet::info | ReliableMsg: Connection verified: 1.8.1; host: tmp/csmi0.1409308.64932; seq#: 0
[COMPUTE]2020-08-13 17:25:46.155519 csmapi::info | CSM_CMD_cgroup_login[615916784]; Client Sent; PID: 1409308; UID:0; GID:0
[COMPUTE]2020-08-13 17:25:46.155617 csmapi::info | [615916784]; csm_cgroup_login start
[COMPUTE]2020-08-13 17:25:46.155668 csmapi::trace | Enter NetworkMessageState State ID: 0
[COMPUTE]2020-08-13 17:25:46.155703 csmapi::trace | Enter CGLoginInitState
[COMPUTE]2020-08-13 17:25:46.155760 csmapi::trace | CGroup::CGroup Enter;
[COMPUTE]2020-08-13 17:25:46.155802 csmapi::trace | CGroup::CGroup Exit;
[COMPUTE]2020-08-13 17:25:46.155831 csmapi::trace | CGroup::MigratePid Enter; pid: 1409308
[COMPUTE]2020-08-13 17:25:46.158169 csmapi::trace | CGroup::MigratePid Exit;
[COMPUTE]2020-08-13 17:25:46.158203 csmapi::trace | Exit CGLoginInitState
[COMPUTE]2020-08-13 17:25:46.158231 csmapi::debug | TIMING: 615916784,59,0,1597353946155608176,1597353946158229732,2621556
[COMPUTE]2020-08-13 17:25:46.158254 csmapi::info | [615916784]; csm_cgroup_login end
[COMPUTE]2020-08-13 17:25:46.158858 csmapi::info | CSM_CMD_cgroup_login[615916784]; Client Recv; PID: 1409308; UID:0; GID:0
[COMPUTE]2020-08-13 17:25:46.158898 csmnet::info | Received disconnect msg from client tmp/csmi0.1409308.64932
[COMPUTE]2020-08-13 17:25:46.158950 csmnet::info | Deleting known client address: tmp/csmi0.1409308.64932
This block:
[COMPUTE]2020-08-13 17:25:46.079866 csmapi::info | [1286720182]; csm_cgroup_login start
[COMPUTE]2020-08-13 17:25:46.079995 csmapi::trace | Enter NetworkMessageState State ID: 0
[COMPUTE]2020-08-13 17:25:46.080022 csmapi::trace | Enter CGLoginInitState
[COMPUTE]2020-08-13 17:25:46.080136 csmapi::trace | Exit CGLoginInitState
[COMPUTE]2020-08-13 17:25:46.080163 csmapi::debug | TIMING: 1286720182,59,0,1597353946079847150,1597353946080162238,315088
[COMPUTE]2020-08-13 17:25:46.080184 csmapi::info | [1286720182]; csm_cgroup_login end
is invoked by the line:
account required libcsmpam.so
and does not trigger MigratePid to run.
This block:
[COMPUTE]2020-08-13 17:25:46.155617 csmapi::info | [615916784]; csm_cgroup_login start
[COMPUTE]2020-08-13 17:25:46.155668 csmapi::trace | Enter NetworkMessageState State ID: 0
[COMPUTE]2020-08-13 17:25:46.155703 csmapi::trace | Enter CGLoginInitState
[COMPUTE]2020-08-13 17:25:46.155760 csmapi::trace | CGroup::CGroup Enter;
[COMPUTE]2020-08-13 17:25:46.155802 csmapi::trace | CGroup::CGroup Exit;
[COMPUTE]2020-08-13 17:25:46.155831 csmapi::trace | CGroup::MigratePid Enter; pid: 1409308
[COMPUTE]2020-08-13 17:25:46.158169 csmapi::trace | CGroup::MigratePid Exit;
[COMPUTE]2020-08-13 17:25:46.158203 csmapi::trace | Exit CGLoginInitState
[COMPUTE]2020-08-13 17:25:46.158231 csmapi::debug | TIMING: 615916784,59,0,1597353946155608176,1597353946158229732,2621556
[COMPUTE]2020-08-13 17:25:46.158254 csmapi::info | [615916784]; csm_cgroup_login end
is invoked by the line:
session required libcsmpam.so
and triggers MigratePid to run.
When everything is functioning correctly, the MigratePid function places the user's ssh session into the allocation cgroup.
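Conceptually, cgroup v1 migration comes down to writing the target PID into each controller's tasks file (in this case done on the daemon side in response to the PAM module's csm_cgroup_login call). The sketch below simulates this under a temp directory so it is runnable anywhere; on the compute node the real hierarchy lives under /sys/fs/cgroup, and the allocation_3 path is just the example from this node.

```shell
# Sketch: what a cgroup v1 PID migration amounts to at the kernel
# interface level: write the PID into each controller's tasks file.
# Simulated with a temp directory standing in for /sys/fs/cgroup.
cgroot=$(mktemp -d)
pid=$$
for ctrl in cpuset devices cpu,cpuacct memory; do
    mkdir -p "$cgroot/$ctrl/allocation_3"
    echo "$pid" > "$cgroot/$ctrl/allocation_3/tasks"
done
cat "$cgroot/memory/allocation_3/tasks"
```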
I spent some more time debugging to try to answer why the ordering of these two lines impacts the libcsmpam.so cgroup assignment:
# cat /etc/pam.d/sshd | tail -n 2
session include postlogin
session required libcsmpam.so
I noticed that this line was added to /etc/pam.d/postlogin between RHEL 7 and RHEL 8:
# grep umask /etc/pam.d/postlogin
session optional pam_umask.so silent
and I also found this issue that seemed like it could be related:
https://bugzilla.redhat.com/show_bug.cgi?id=1777718
but commenting that line out or changing the UMASK in /etc/login.defs did not change the behavior.
# grep UMASK /etc/login.defs
UMASK 077
I then did some additional experiments with commenting out lines in /etc/pam.d/postlogin, and I was able to get placed into the correct cgroup with this configuration combination:
# cat /etc/pam.d/sshd | tail -n 2
session include postlogin
session required libcsmpam.so
with this:
# cat /etc/pam.d/postlogin | grep session
session optional pam_umask.so silent
session [success=1 default=ignore] pam_succeed_if.so service !~ gdm* service !~ su* quiet
# session [default=1] pam_lastlog.so nowtmp showfailed
session optional pam_lastlog.so silent noupdate showfailed
or this:
# cat /etc/pam.d/postlogin | grep session
session optional pam_umask.so silent
session [success=1 default=ignore] pam_succeed_if.so service !~ gdm* service !~ su* quiet
session [default=1] pam_lastlog.so nowtmp showfailed
# session optional pam_lastlog.so silent noupdate showfailed
but I get put into the wrong cgroup if both of those lines are included:
# cat /etc/pam.d/postlogin | grep session
session optional pam_umask.so silent
session [success=1 default=ignore] pam_succeed_if.so service !~ gdm* service !~ su* quiet
session [default=1] pam_lastlog.so nowtmp showfailed
session optional pam_lastlog.so silent noupdate showfailed
So there is still some mystery related to pam_lastlog.so being called by postlogin, but I haven't been able to determine exactly what the root cause is yet.
This was fixed via PR #955.
Original issue description: With CSM version 1.8.1 installed on RHEL 8.1, after requesting an allocation with isolated cores, ssh sessions to the compute node fail to place the user's tasks into the allocation cgroup.