As you know, in the context of an HPC POC on AWS for a French company, UCit (mainly myself) has slightly modified and used 1Click-HPC to run the POC's HPC environment.
I ran into a strange authentication issue while trying to start a DCV session on a g4dn instance with CentOS 7.9.2009 + DCV 2022.0 r12760 + EnginFrame (EF) 2021.0-r1592 + Slurm 21.08.8-2 on AWS. The issue is specific to the users stored in the AD attached to the cluster.
I have finally found a fix for the issue, but I think it's important to discuss it with you to understand the underlying behaviour here...
The symptoms are the following:
launching a DCV Session as system user "centos" using a standard Linux Desktop Service in EF works fine
launching a DCV session as a user created in the AD using the exact same standard Linux Desktop Service in EF fails because of an authentication issue.
The error message found in slurm-$JobID.out is the following:
[2022/06/09 14:40:15] INFO Starting DCV session...
[2022/06/09 14:40:15] INFO DCV version supports --gl-display parameter
[2022/06/09 14:40:15] INFO Creating DCV session "dcv create-session --type=virtual tmp7339021573904918669 "
Could not create session. Could not get the system bus: Exhausted all available authentication mechanisms (tried: EXTERNAL, DBUS_COOKIE_SHA1, ANONYMOUS) (available: EXTERNAL, DBUS_COOKIE_SHA1, ANONYMOUS)
[2022/06/09 14:40:15] ERROR Failed to launch DCV session (exit code: 1)
[2022/06/09 14:40:15] FATAL Error: DCV failed to create session
[2022/06/09 14:40:15] FATAL Exiting with code 1
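To check quickly whether other job output files show the same failure, something like the following sketch could be used. It only searches for the D-Bus error text quoted above; the function name has_dbus_auth_failure is hypothetical and not part of EF, DCV or Slurm.

```shell
# Hypothetical helper: succeeds (exit code 0) if the given job output file
# contains the D-Bus authentication failure quoted above.
has_dbus_auth_failure() {
    # -F: match the text literally, -q: no output, only the exit code
    grep -F -q 'Could not get the system bus: Exhausted all available authentication mechanisms' "$1"
}
```

For example, a loop such as: for f in slurm-*.out; do has_dbus_auth_failure "$f" && echo "$f"; done would list the affected job files in the current directory.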
After a lot of tests, described below, I found a workaround, which consists in adding the following line at the very beginning of the file $EF_ROOT/plugins/interactive/lib/remote/linux.jobscript.functions:
id "${USER}"
This seems to trigger a kind of "first connection" for the user trying to start a session on the target system, so that the user is known at system level...
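If this workaround is kept, a slightly more defensive variant might look like the sketch below. It assumes, as our tests suggest, that a successful id lookup is what makes the difference. The function name warm_user_lookup, the retry count and the delay are hypothetical and not part of EnginFrame.

```shell
# Hypothetical helper (not part of EnginFrame): resolve a user with `id`,
# retrying a few times, and report the outcome so it shows up in the job output.
warm_user_lookup() {
    _wul_user="$1"
    _wul_attempts="${2:-3}"   # assumed number of attempts
    _wul_delay="${3:-1}"      # assumed pause in seconds between attempts
    _wul_i=1
    while [ "$_wul_i" -le "$_wul_attempts" ]; do
        if id "$_wul_user" >/dev/null 2>&1; then
            echo "user lookup OK for ${_wul_user} (attempt ${_wul_i})"
            return 0
        fi
        # Pause only when another attempt will follow
        if [ "$_wul_i" -lt "$_wul_attempts" ]; then
            sleep "$_wul_delay"
        fi
        _wul_i=$((_wul_i + 1))
    done
    echo "user lookup FAILED for ${_wul_user} after ${_wul_attempts} attempts" >&2
    return 1
}
```

It would be called in place of the single id line, for example as: warm_user_lookup "${USER}". The only gain over the plain id call is that a failed lookup becomes visible in slurm-$JobID.out.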
With the help of Benjamin Depardon, I reproduced the issue by running the following command on the command line of the cluster's Head Node:
srun -N 1 -p dcv-gpu --exclusive -C "[g4dn.xlarge*1]" dcv create-session my_session
We then tried all of the following options:
restarting gdm only after dcvserver was restarted at the end of the installation process => NOT working
restarting dbus + dbus-org.+ gdm after dcvserver was restarted at the end of the installation process => NOT working
changing /etc/pam.d/dcv to the following contents:
#%PAM-1.0
# Default NICE DCV PAM configuration.
# This file is auto-generated, user changes will be destroyed at
# installation/update time.
# To make changes, create a file named dcv.custom in this
# directory and set the 'pam-service-name' parameter in the
# [security] section of dcv.conf to 'dcv.custom'
#auth include password-auth
#account include password-auth
auth include password-auth
account required pam_access.so
account required pam_unix.so
account sufficient pam_localuser.so
account sufficient pam_usertype.so issystem
account [default=bad success=ok user_unknown=ignore] pam_sss.so
account required pam_permit.so
=> NOT working
running on the remote system the commands:
$> getent passwd | grep username
or
$> getent passwd -s sss | grep username
or
$> sssctl cache-upgrade
=> NOT working
adding the following command at the very beginning of Slurm's prolog.sh script:
$> id "${SLURM_JOB_USER}"
=> NOT working
running the following command on the DCV node before the session was created:
$> id username
or
$> sssctl user-checks username
=> SUCCESSFUL
connecting to the DCV node with SSH as the user username (or as any other user and then switching with the command: $> su - username) before the session was created
=> SUCCESSFUL
Our conclusion is that the user must already be known by the system (and stored in some kind of cache) on the DCV node for the authentication process to allow the execution of the tasks required by the Slurm job.
Our questions are:
is it a known issue?
can you explain further how the internal authentication methods of DCV work and why, in our case, DCV denied the AD user the authorization to create a session?
is there a "better" way to solve it than to hack the EF code the way we did to allow any user in AD to launch a DCV session?
Please don't hesitate to ask for any additional information and to let us know what you think.
Ciao Nicola!
Best regards, Vincent.