NERSC / shifter

Shifter - Linux Containers for HPC

use of populateEtcDynamically=1 generates failures to start the sshd #272

Open dmjacobsen opened 4 years ago

dmjacobsen commented 4 years ago

I am evaluating using:

allowLibcPwdCalls=1
populateEtcDynamically=1

in udiRoot.conf in order to get away from the centralized password file which is now starting to cause problems of another sort.

When populateEtcDynamically is enabled, a passwd file and a group file are generated, and they include only the current user and that user's primary group. The lack of auxiliary groups is problematic and should also be fixed. The lack of an "sshd" user in /etc/passwd causes the integrated sshd to fail with:

[2019-12-28T01:43:59.257] error: setupRoot stdout: Generating public/private dsa key pair.
[2019-12-28T01:43:59.485] error: setupRoot stderr: Privilege separation user sshd does not exist^M
[2019-12-28T01:43:59.485] error: setupRoot stdout: Your identification has been saved in /var/udiMount/opt/udiImage/etc/ssh_host_dsa_key.
[2019-12-28T01:43:59.485] error: setupRoot stderr: FAILED to start sshd
[2019-12-28T01:43:59.485] error: setupRoot stdout: Your public key has been saved in /var/udiMount/opt/udiImage/etc/ssh_host_dsa_key.pub.
[2019-12-28T01:43:59.485] error: waiting on setupRoot
[2019-12-28T01:43:59.485] error: FAILED to run setupRoot
[2019-12-28T01:43:59.485] error: after setupRoot, exit code: 1

(from a slurmd log)
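To make the failure mode concrete, here is a rough sketch of a pre-flight check that reports which required system users are missing from passwd-format text. It is not part of shifter; the function names and the sample entries are made up for illustration, and the only assumption taken from the log is that sshd needs a user named "sshd" for privilege separation.

```python
def passwd_users(passwd_text):
    """Return the set of user names found in passwd-format text."""
    users = set()
    for line in passwd_text.splitlines():
        line = line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        # The user name is the first colon-separated field.
        users.add(line.split(":", 1)[0])
    return users


def missing_users(passwd_text, required=("sshd",)):
    """Return the required user names that do not appear in passwd_text."""
    present = passwd_users(passwd_text)
    return [name for name in required if name not in present]


if __name__ == "__main__":
    # Hypothetical dynamically generated file: only the current user is present.
    generated = "alice:x:1000:1000::/home/alice:/bin/bash\n"
    print(missing_users(generated))
```

A dynamically generated file containing only the current user would report "sshd" as missing, which matches the "Privilege separation user sshd does not exist" error above.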

It might be good if populateEtcDynamically could augment an existing skeleton passwd/group file with the current user and all (up to maxGroupCount) groups for that user.
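A minimal sketch of what that augmentation could look like, written in Python for brevity rather than as shifter code. The function names, the skeleton contents, and the sample user are all hypothetical; the only names taken from this issue are populateEtcDynamically and maxGroupCount, and the cap is simply passed in as a parameter here.

```python
import grp
import os
import pwd


def augment_passwd(skeleton_text, user_line):
    """Append user_line to the skeleton unless that user name is already present."""
    lines = [l for l in skeleton_text.splitlines() if l.strip()]
    name = user_line.split(":", 1)[0]
    if not any(l.split(":", 1)[0] == name for l in lines):
        lines.append(user_line)
    return "\n".join(lines) + "\n"


def augment_group(skeleton_text, user, user_groups, max_group_count):
    """Keep the skeleton entries and add up to max_group_count of the user's groups.

    user_groups is a list of (group_name, gid) pairs, primary group first.
    Simplification: a group that already exists in the skeleton is left
    untouched, so the user is not added to its member list.
    """
    lines = [l for l in skeleton_text.splitlines() if l.strip()]
    known = {l.split(":", 1)[0] for l in lines}
    for name, gid in user_groups[:max_group_count]:
        if name in known:
            continue
        lines.append(f"{name}:x:{gid}:{user}")
        known.add(name)
    return "\n".join(lines) + "\n"


def current_user_groups():
    """Look up the calling user's groups through libc (primary group first)."""
    entry = pwd.getpwuid(os.getuid())
    gids = os.getgrouplist(entry.pw_name, entry.pw_gid)
    # Put the primary gid first, then the auxiliary gids in lookup order.
    ordered = [entry.pw_gid] + [g for g in gids if g != entry.pw_gid]
    pairs = []
    for gid in ordered:
        try:
            pairs.append((grp.getgrgid(gid).gr_name, gid))
        except KeyError:
            # A gid with no resolvable name is skipped in this sketch.
            continue
    return entry.pw_name, pairs
```

With a skeleton that already carries system accounts such as sshd, the generated files would keep those entries and gain only the current user and that user's groups, which addresses both the missing sshd user and the missing auxiliary groups.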

scanon commented 4 years ago

Can you elaborate on what you mean by "which is now starting to cause problems of another sort."?

dmjacobsen commented 4 years ago

Sure, there have been two issues with using a passwd and group file in etcFiles:

1) In the NERSC deployment a cron job has generated the files, which meant new users might not appear until the cron job reran. That cron job has also broken from time to time for one reason or another, and it has been a source of additional maintenance.

2) To easily generate the passwd/group files via the cron job, the NERSC deployment has had to enable sssd enumeration. With a large number of users, we have found that enumeration causes major sssd performance problems for some lookups, and this has impacted slurmctld performance rather badly. It is therefore preferable to disable sssd enumeration. To disable it, we have to either move generation of the passwd/group files for shifter to another node in the system or move to this configuration.
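The difference between the two approaches in the list above can be sketched with Python's standard pwd module. This is only an illustration, not the NERSC cron job: the function names are made up, and the remark about sssd reflects my understanding that directory users show up in a full walk only when enumeration is enabled.

```python
import pwd


def passwd_line(entry):
    """Format a passwd entry as a passwd-file line with the password masked."""
    return ":".join([
        entry.pw_name, "x", str(entry.pw_uid), str(entry.pw_gid),
        entry.pw_gecos, entry.pw_dir, entry.pw_shell,
    ])


def lines_by_enumeration():
    """Walk every entry the name service will enumerate.

    With sssd this is the path that needs enumeration turned on, and it
    touches every user whether or not a job needs them.
    """
    return [passwd_line(e) for e in pwd.getpwall()]


def lines_by_lookup(names, lookup=pwd.getpwnam):
    """Resolve only the named users, one targeted lookup each."""
    lines = []
    for name in names:
        try:
            lines.append(passwd_line(lookup(name)))
        except KeyError:
            # Unknown users are skipped rather than treated as fatal.
            continue
    return lines
```

The targeted form needs only per-user lookups, so it keeps working when enumeration is disabled.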

In light of slurm's recent nss_slurm, it is now possible to look up users scalably with allowLibcPwdCalls, which was not the case during the original development of shifter. Ironically, it is also the use of nss_slurm that is potentially driving some of the sssd enumeration performance impacts on the slurm controller.