IBM / CAST

CAST enhances the system management of cluster-wide resources. It consists of two open source tools: Cluster System Management (CSM) and Burst Buffer.
Eclipse Public License 1.0

CSM RHEL 8: user tasks are not assigned to the correct set of CPUs when requesting an allocation with core isolation #946

Closed: besawn closed this issue 3 years ago

besawn commented 4 years ago

With CSM version 1.8.1 installed on RHEL 8.1, after requesting an allocation with isolated cores, ssh sessions to the compute node fail to place the user's tasks into the allocation cgroup.

Example:

# Create an allocation with isolated cores
[root@c650f99p06 c650mnp05]# /opt/ibm/csm/bin/csm_allocation_create -n c650f99p36 -j 111 -u besawn --isolated_cores 2
---
allocation_id: 3
num_nodes: 1
- compute_nodes: c650f99p36
user_name: besawn
user_id: 8260035
state: running
type: user-managed
job_submit_time: 2020-08-13 16:52:11
...

# Confirm that csm_system and allocation cgroups exist on the compute node with different sets of CPUs configured, as expected
[root@c650f99p06 c650mnp05]# xdsh c650f99p36 'cat /sys/fs/cgroup/cpuset/csm_system/cpuset.cpus'
c650f99p36: 0-7,88-95

[root@c650f99p06 c650mnp05]# xdsh c650f99p36 'cat /sys/fs/cgroup/cpuset/allocation*/cpuset.cpus'
c650f99p36: 8-87,96-175
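
As an extra sanity check (my addition, not part of the original report), the two cpusets should partition the node's online CPUs with no overlap:

# Expected: On-line CPU(s) list: 0-175, i.e. exactly the union of the
# csm_system (0-7,88-95) and allocation (8-87,96-175) cpusets
[root@c650f99p06 c650mnp05]# xdsh c650f99p36 'lscpu | grep "On-line"'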

# Run the yes command for 5 seconds on 8 CPUs
[besawn@c650f99p36 ~]$ /usr/bin/seq 8 | /usr/bin/xargs -P0 -n1 timeout 5 yes > /dev/null &
[1] 1408045

# Check to see the CPUs (PSR) and CGROUP assignments for the yes tasks
[besawn@c650f99p36 ~]$ ps awxf -eo pid,user,psr,cgroup,cmd | egrep "USER|yes" | grep -v "timeout"
    PID USER     PSR CGROUP                      CMD
1408050 besawn    88 12:cpuset:/csm_system,9:dev              |   |   \_ yes 1
1408051 besawn     4 12:cpuset:/csm_system,9:dev              |   |   \_ yes 2
1408053 besawn    94 12:cpuset:/csm_system,9:dev              |   |   \_ yes 3
1408054 besawn    90 12:cpuset:/csm_system,9:dev              |   |   \_ yes 4
1408056 besawn     1 12:cpuset:/csm_system,9:dev              |   |   \_ yes 5
1408059 besawn    92 12:cpuset:/csm_system,9:dev              |   |   \_ yes 6
1408060 besawn    95 12:cpuset:/csm_system,9:dev              |   |   \_ yes 7
1408061 besawn    91 12:cpuset:/csm_system,9:dev              |       \_ yes 8

-> Processes are running on the wrong CPUs and in the wrong cgroup: every PSR value above falls within the csm_system cpuset (0-7,88-95) rather than the allocation cpuset.

# Check to see if we are correctly placed into the allocation cgroup:
[besawn@c650f99p36 ~]$ cat /proc/self/cgroup | egrep "allocation|csm"
12:cpuset:/csm_system
3:cpu,cpuacct:/csm_system

-> We are incorrectly running in the csm_system cgroup when we should be in the allocation cgroup.
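
The same misplacement is visible in the effective CPU affinity the kernel reports for the shell (an extra check of mine, not from the original debugging):

# For a shell stuck in the csm_system cgroup, this should report the
# csm_system cpuset (0-7,88-95) rather than the allocation cpuset
[besawn@c650f99p36 ~]$ grep Cpus_allowed_list /proc/self/status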
besawn commented 4 years ago

Root cause: the libcsmpam.so session entry is not called correctly given the module ordering in the default /etc/pam.d/sshd configuration file. The problem can be resolved by reversing the order of the last two lines in the file:

[root@c650f99p36 dcgm_2081_regression_tests]# cat /etc/pam.d/sshd | tail -n 2
session    include      postlogin
session    required     libcsmpam.so

[root@c650f99p36 dcgm_2081_regression_tests]# vi /etc/pam.d/sshd

[root@c650f99p36 dcgm_2081_regression_tests]# cat /etc/pam.d/sshd | tail -n 2
session    required     libcsmpam.so
session    include      postlogin

After the change above:

[root@c650f99p06 c650mnp05]# su - besawn -c "ssh c650f99p36"
Last login: Thu Aug 13 17:04:34 2020 from 10.7.99.6

# Verify that we are now correctly placed in the allocation cgroup
[besawn@c650f99p36 ~]$ cat /proc/self/cgroup | egrep "allocation|csm"
12:cpuset:/allocation_3
9:devices:/allocation_3
3:cpu,cpuacct:/allocation_3
2:memory:/allocation_3

# Run the yes command for 5 seconds on 8 CPUs
[besawn@c650f99p36 ~]$ /usr/bin/seq 8 | /usr/bin/xargs -P0 -n1 timeout 5 yes > /dev/null &
[1] 1409543

# Confirm that the tasks are now on the expected CPUs and in the correct allocation cgroup
[besawn@c650f99p36 ~]$ ps awxf -eo pid,user,psr,cgroup,cmd | egrep "USER|yes" | grep -v "timeout"
    PID USER     PSR CGROUP                      CMD
1409545 besawn    70 12:cpuset:/allocation_3,9:d              |   |   \_ yes 1
1409547 besawn    60 12:cpuset:/allocation_3,9:d              |   |   \_ yes 2
1409549 besawn    76 12:cpuset:/allocation_3,9:d              |   |   \_ yes 3
1409551 besawn    36 12:cpuset:/allocation_3,9:d              |   |   \_ yes 4
1409554 besawn    53 12:cpuset:/allocation_3,9:d              |   |   \_ yes 5
1409557 besawn    24 12:cpuset:/allocation_3,9:d              |   |   \_ yes 6
1409558 besawn    22 12:cpuset:/allocation_3,9:d              |   |   \_ yes 7
1409559 besawn    46 12:cpuset:/allocation_3,9:d              |       \_ yes 8
1409561 besawn    85 12:cpuset:/allocation_3,9:d              \_ grep -E --color=auto USER|yes
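
For anyone needing a workaround before a fix ships, here is a minimal sketch of applying the same reorder non-interactively; it assumes the two entries are the last two lines of /etc/pam.d/sshd, exactly as in the configuration shown above:

# Back up the file, strip the last two lines, then re-append them in
# swapped order (GNU head/tail/tac, as shipped on RHEL 8)
cp /etc/pam.d/sshd /etc/pam.d/sshd.bak
head -n -2 /etc/pam.d/sshd.bak > /etc/pam.d/sshd
tail -n 2 /etc/pam.d/sshd.bak | tac >> /etc/pam.d/sshd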
besawn commented 4 years ago

While debugging, I was able to observe whether libcsmpam.so was being called by the sshd PAM stack using the following steps. Start from the typical recommended configuration, which has two entries for libcsmpam.so in /etc/pam.d/sshd, one for account and one for session:

[root@c650f99p36 ~]# cat /etc/pam.d/sshd
#%PAM-1.0
auth       substack     password-auth
auth       include      postlogin
account    required     pam_sepermit.so
account    required     pam_nologin.so
account    required     libcsmpam.so
account    include      password-auth
password   include      password-auth
# pam_selinux.so close should be the first session rule
session    required     pam_selinux.so close
session    required     pam_loginuid.so
# pam_selinux.so open should only be followed by sessions to be executed in the user context
session    required     pam_selinux.so open env_params
session    required     pam_namespace.so
session    optional     pam_keyinit.so force revoke
session    optional     pam_motd.so
session    include      password-auth
session    required     libcsmpam.so
session    include      postlogin

When the csmapi log level has been set to trace, two csm_cgroup_login entries, one from the account phase and one from the session phase, should be observed in csm_compute.log when the user logs in to the compute node:

[root@c650f99p36 ~]# /opt/ibm/csm/sbin/csm_ctrl_cmd --log.csmapi trace
Started client 

### RESPONSE FROM THE LOCAL DAEMON ###
Error response could not be determined#######################################

[root@c650f99p36 ~]# tail -f /var/log/ibm/csm/csm_compute.log
[COMPUTE]2020-08-13 17:25:46.078915     csmnet::info     | ReliableMsg: Connection verified: 1.8.1; host: tmp/csmi0.1409308.64930; seq#: 0
[COMPUTE]2020-08-13 17:25:46.079049     csmapi::info     | CSM_CMD_cgroup_login[1286720182]; Client Sent; PID: 1409308; UID:0; GID:0
[COMPUTE]2020-08-13 17:25:46.079866     csmapi::info     | [1286720182]; csm_cgroup_login start
[COMPUTE]2020-08-13 17:25:46.079995     csmapi::trace    | Enter NetworkMessageState State ID: 0
[COMPUTE]2020-08-13 17:25:46.080022     csmapi::trace    | Enter CGLoginInitState
[COMPUTE]2020-08-13 17:25:46.080136     csmapi::trace    | Exit CGLoginInitState
[COMPUTE]2020-08-13 17:25:46.080163     csmapi::debug    | TIMING: 1286720182,59,0,1597353946079847150,1597353946080162238,315088
[COMPUTE]2020-08-13 17:25:46.080184     csmapi::info     | [1286720182]; csm_cgroup_login end
[COMPUTE]2020-08-13 17:25:46.080303     csmapi::info     | CSM_CMD_cgroup_login[1286720182]; Client Recv; PID: 1409308; UID:0; GID:0
[COMPUTE]2020-08-13 17:25:46.080358     csmnet::info     | Received disconnect msg from client tmp/csmi0.1409308.64930
[COMPUTE]2020-08-13 17:25:46.080413     csmnet::info     | Deleting known client address: tmp/csmi0.1409308.64930
[COMPUTE]2020-08-13 17:25:46.155462     csmnet::info     | ReliableMsg: Connection verified: 1.8.1; host: tmp/csmi0.1409308.64932; seq#: 0
[COMPUTE]2020-08-13 17:25:46.155519     csmapi::info     | CSM_CMD_cgroup_login[615916784]; Client Sent; PID: 1409308; UID:0; GID:0
[COMPUTE]2020-08-13 17:25:46.155617     csmapi::info     | [615916784]; csm_cgroup_login start
[COMPUTE]2020-08-13 17:25:46.155668     csmapi::trace    | Enter NetworkMessageState State ID: 0
[COMPUTE]2020-08-13 17:25:46.155703     csmapi::trace    | Enter CGLoginInitState
[COMPUTE]2020-08-13 17:25:46.155760     csmapi::trace    | CGroup::CGroup Enter;
[COMPUTE]2020-08-13 17:25:46.155802     csmapi::trace    | CGroup::CGroup Exit;
[COMPUTE]2020-08-13 17:25:46.155831     csmapi::trace    | CGroup::MigratePid Enter; pid: 1409308
[COMPUTE]2020-08-13 17:25:46.158169     csmapi::trace    | CGroup::MigratePid Exit;
[COMPUTE]2020-08-13 17:25:46.158203     csmapi::trace    | Exit CGLoginInitState
[COMPUTE]2020-08-13 17:25:46.158231     csmapi::debug    | TIMING: 615916784,59,0,1597353946155608176,1597353946158229732,2621556
[COMPUTE]2020-08-13 17:25:46.158254     csmapi::info     | [615916784]; csm_cgroup_login end
[COMPUTE]2020-08-13 17:25:46.158858     csmapi::info     | CSM_CMD_cgroup_login[615916784]; Client Recv; PID: 1409308; UID:0; GID:0
[COMPUTE]2020-08-13 17:25:46.158898     csmnet::info     | Received disconnect msg from client tmp/csmi0.1409308.64932
[COMPUTE]2020-08-13 17:25:46.158950     csmnet::info     | Deleting known client address: tmp/csmi0.1409308.64932
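
Filtering the trace makes the difference between the two invocations easier to see (my addition; assumes the same log path as above):

# Only the invocation that actually migrates the login logs the
# CGroup::MigratePid Enter/Exit pair
[root@c650f99p36 ~]# grep -E "csm_cgroup_login (start|end)|MigratePid" /var/log/ibm/csm/csm_compute.log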

This block:

[COMPUTE]2020-08-13 17:25:46.079866     csmapi::info     | [1286720182]; csm_cgroup_login start
[COMPUTE]2020-08-13 17:25:46.079995     csmapi::trace    | Enter NetworkMessageState State ID: 0
[COMPUTE]2020-08-13 17:25:46.080022     csmapi::trace    | Enter CGLoginInitState
[COMPUTE]2020-08-13 17:25:46.080136     csmapi::trace    | Exit CGLoginInitState
[COMPUTE]2020-08-13 17:25:46.080163     csmapi::debug    | TIMING: 1286720182,59,0,1597353946079847150,1597353946080162238,315088
[COMPUTE]2020-08-13 17:25:46.080184     csmapi::info     | [1286720182]; csm_cgroup_login end

is invoked by the line "account    required     libcsmpam.so" and does not trigger MigratePid to run.

This block:

[COMPUTE]2020-08-13 17:25:46.155617     csmapi::info     | [615916784]; csm_cgroup_login start
[COMPUTE]2020-08-13 17:25:46.155668     csmapi::trace    | Enter NetworkMessageState State ID: 0
[COMPUTE]2020-08-13 17:25:46.155703     csmapi::trace    | Enter CGLoginInitState
[COMPUTE]2020-08-13 17:25:46.155760     csmapi::trace    | CGroup::CGroup Enter;
[COMPUTE]2020-08-13 17:25:46.155802     csmapi::trace    | CGroup::CGroup Exit;
[COMPUTE]2020-08-13 17:25:46.155831     csmapi::trace    | CGroup::MigratePid Enter; pid: 1409308
[COMPUTE]2020-08-13 17:25:46.158169     csmapi::trace    | CGroup::MigratePid Exit;
[COMPUTE]2020-08-13 17:25:46.158203     csmapi::trace    | Exit CGLoginInitState
[COMPUTE]2020-08-13 17:25:46.158231     csmapi::debug    | TIMING: 615916784,59,0,1597353946155608176,1597353946158229732,2621556
[COMPUTE]2020-08-13 17:25:46.158254     csmapi::info     | [615916784]; csm_cgroup_login end

is invoked by the line "session    required     libcsmpam.so" and triggers MigratePid to run.

When everything is functioning correctly, it is the MigratePid function that places the user's ssh session into the allocation cgroup.
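
Conceptually (a sketch of mine, not the actual CSM implementation), migrating a task under cgroup v1 amounts to writing its PID into the target group's tasks file for each controller, which is what the effect of MigratePid looks like from the outside:

# Illustration only: SESSION_PID stands in for the sshd session pid
# (1409308 in the trace above); writing it to each controller's tasks
# file moves the task into allocation_3
SESSION_PID=1409308
for ctrl in cpuset devices cpu,cpuacct memory; do
    echo $SESSION_PID > /sys/fs/cgroup/$ctrl/allocation_3/tasks
done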

besawn commented 4 years ago

I spent some more time debugging this to try to answer why the ordering of these two lines impacts the libcsmpam.so cgroup assignment:

# cat /etc/pam.d/sshd | tail -n 2
session    include      postlogin
session    required     libcsmpam.so

I noticed that this line was added to /etc/pam.d/postlogin between RHEL 7 and RHEL 8:

# grep umask /etc/pam.d/postlogin
session optional                   pam_umask.so silent

I also found an issue that seemed like it could be related: https://bugzilla.redhat.com/show_bug.cgi?id=1777718. However, commenting that line out or changing the UMASK in /etc/login.defs did not change the behavior.

# grep UMASK  /etc/login.defs 
UMASK           077

I then ran some additional experiments commenting out lines in /etc/pam.d/postlogin, and I was able to get placed into the correct cgroup with this configuration combination:

# cat /etc/pam.d/sshd | tail -n 2
session    include      postlogin
session    required     libcsmpam.so

with this:

# cat /etc/pam.d/postlogin | grep session
session optional                   pam_umask.so silent
session [success=1 default=ignore] pam_succeed_if.so service !~ gdm* service !~ su* quiet
# session [default=1]                pam_lastlog.so nowtmp showfailed
session optional                   pam_lastlog.so silent noupdate showfailed

or this:

# cat /etc/pam.d/postlogin | grep session
session optional                   pam_umask.so silent
session [success=1 default=ignore] pam_succeed_if.so service !~ gdm* service !~ su* quiet
session [default=1]                pam_lastlog.so nowtmp showfailed
# session optional                   pam_lastlog.so silent noupdate showfailed

but I get put into the wrong cgroup if both of those lines are included:

# cat /etc/pam.d/postlogin | grep session
session optional                   pam_umask.so silent
session [success=1 default=ignore] pam_succeed_if.so service !~ gdm* service !~ su* quiet
session [default=1]                pam_lastlog.so nowtmp showfailed
session optional                   pam_lastlog.so silent noupdate showfailed

So there is still some mystery around how pam_lastlog.so is called by postlogin, but I haven't been able to determine the exact root cause yet.
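
For reference (my annotation, not part of the original file), the bracketed controls here are PAM jump directives: success=1 skips the next one module when the module succeeds, default=ignore leaves other return values alone, and default=1 skips the next one module on any return value. And because include splices the included lines directly into the calling stack (unlike substack, which confines jumps), those jumps can interact with the lines that follow in /etc/pam.d/sshd itself:

# Annotated copy of the four postlogin session lines:
session optional                   pam_umask.so silent
# on success, jump over the next 1 module:
session [success=1 default=ignore] pam_succeed_if.so service !~ gdm* service !~ su* quiet
# on any return value, jump over the next 1 module:
session [default=1]                pam_lastlog.so nowtmp showfailed
session optional                   pam_lastlog.so silent noupdate showfailed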

besawn commented 3 years ago

This was fixed via PR #955.