aws-samples / 1click-hpc

Deploy your HPC Cluster on AWS in 20min. with just 1-Click.
MIT No Attribution

DCV authentication issue with AD users at the creation of Linux DCV sessions in a 1Click-HPC cluster #17

Open vbosquier opened 2 years ago

vbosquier commented 2 years ago

Ciao Nicola!

As you know, in the context of an HPC POC on AWS for a French company, UCit (mainly myself) has slightly modified and used 1Click-HPC to run the POC's HPC environment. I have faced a strange authentication issue while trying to start a DCV session on a g4dn instance with CentOS 7.9.2009 + DCV 2022.0 r12760 + EnginFrame (EF) 2021.0-r1592 + Slurm 21.08.8-2 on AWS. The issue is specific to the users stored in the AD attached to the cluster.

I have finally found a fix for the issue, but I think it's important to discuss it with you to understand the underlying behaviour here...

The symptom is the following error message, found in slurm-$JobID.out:

[2022/06/09 14:40:15]  INFO  Starting DCV session...
[2022/06/09 14:40:15]  INFO  DCV version supports --gl-display parameter
[2022/06/09 14:40:15]  INFO  Creating DCV session "dcv create-session --type=virtual tmp7339021573904918669 "
Could not create session. Could not get the system bus: Exhausted all available authentication mechanisms (tried: EXTERNAL, DBUS_COOKIE_SHA1, ANONYMOUS) (available: EXTERNAL, DBUS_COOKIE_SHA1, ANONYMOUS)
[2022/06/09 14:40:15] ERROR  Failed to launch DCV session (exit code: 1)
[2022/06/09 14:40:15] FATAL  Error: DCV failed to create session
[2022/06/09 14:40:15] FATAL  Exiting with code 1

After a lot of tests, described below, I found a fix: add the following line at the very beginning of the file $EF_ROOT/plugins/interactive/lib/remote/linux.jobscript.functions:

id "${USER}"

This seems to act as a kind of "first connection" for the user trying to start a session on the target system, so that the user becomes known at the system level...
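For discussion, here is a slightly more defensive variant of the same idea, as a minimal sketch only. The function name `ensure_user_resolved` and the retry count are hypothetical and are not part of EnginFrame or 1Click-HPC. The retry loop rests on an unverified assumption that the first lookup of an AD user may need a moment before it succeeds. The one-line `id "${USER}"` above is the only thing we have actually tested.

```shell
#!/bin/sh
# Hypothetical helper (not part of EnginFrame or 1Click-HPC).
# Forces a lookup of the user through the system's name service
# before the job script goes on to create the DCV session.
# Usage: ensure_user_resolved [user] [max_attempts]
ensure_user_resolved() {
    eur_user="${1:-${USER:-$(id -un)}}"   # default to the current user
    eur_tries="${2:-5}"                    # assumed retry count, not tuned
    eur_i=1
    while [ "${eur_i}" -le "${eur_tries}" ]; do
        # Same lookup as the one-line fix: id "${USER}"
        if id "${eur_user}" >/dev/null 2>&1; then
            return 0
        fi
        # Pause before retrying, except after the last attempt
        if [ "${eur_i}" -lt "${eur_tries}" ]; then
            sleep 1
        fi
        eur_i=$((eur_i + 1))
    done
    echo "user '${eur_user}' could not be resolved after ${eur_tries} attempt(s)" >&2
    return 1
}
```

In the job script functions file, this would be called as `ensure_user_resolved "${USER}" || exit 1` before the session is created, so that a lookup failure is reported explicitly instead of surfacing later as a D-Bus authentication error.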

With the help of Benjamin Depardon, I reproduced the issue by running the following command on the Head Node of the cluster:

srun -N 1 -p dcv-gpu --exclusive -C "[g4dn.xlarge*1]" dcv create-session my_session

We then tried all of the following options:

=> SUCCESSFUL

Our conclusion is that the user must already be known to the system (and stored in some kind of cache) for the authentication process to allow the tasks required by the Slurm job to run.

Our questions are:

Please don't hesitate to ask for any additional information, and let us know what you think.

Best regards, Vincent.