aws / aws-parallelcluster

AWS ParallelCluster is an AWS supported Open Source cluster management tool to deploy and manage HPC clusters in the AWS cloud.
https://github.com/aws/aws-parallelcluster
Apache License 2.0

How I setup parallel cluster with Azure Active Directory Domain Services (AD DS) #3581

Closed gwolski closed 2 years ago

gwolski commented 2 years ago

I've been trying to integrate parallel cluster v2.11.3 (and earlier versions) with Azure Active Directory Domain Services, to which I'm connected by VPN. I've gleaned bits and pieces from ideas on the internet and from the issues in this repository. I am no Azure AD DS expert…

I thought I'd document what I've learned for feedback and to have an entry for others to find and add to.... Thank you for tips from @enrico-usai and @demartinofra that I've been able to piece together.

My naming convention and AD setup work well for all my EC2 interactive machines -- I've had no problems.

With parallel cluster I'm having problems because the generated hostnames are too long. Microsoft makes it very clear that hostnames should be at most 15 characters and that the NetBIOS name has to match the hostname, yet NetBIOS names are limited to 15 characters while hostnames are limited to 63 characters. I can't say I have grokked this contradiction. But here is where they are clear that Linux VM names are limited to 15 characters:

https://docs.microsoft.com/en-us/azure/active-directory-domain-services/join-centos-linux-vm

As an aside, I have hostnames that are longer than 15 characters on some of my machines, but they don't conflict with other hostnames because the first 15 characters are already unique, and since I don't use NetBIOS features, I'm probably OK. I only use AD DS for DNS and for the identity of my users.

I have two implementations that both seem to work with parallel cluster that I am documenting here. Comments/feedback welcome.

My Pcluster Setup

I run multiple clusters and I make sure the queues are prefixed with a unique identifier. I will be using my 'dev' cluster as an example here.

Here is a sample from my config file:

[queue devc2m4-spot]
compute_type = spot
compute_resource_settings = compute_c2m4-spot
disable_hyperthreading = false
placement_group = DYNAMIC

[compute_resource compute_c2m4-spot]
instance_type = c5.large
min_count = 0
initial_count = 0
max_count = 5

Here is an example hostname that then gets created by pcluster:

devc2m4-spot-dy-c5large-1

Unfortunately, this name differs from that of another node in the same queue only in the 25th character, the instance count, i.e. '1'; the first 15 characters are identical. Oddly enough, I was able to register about 3 hosts (1, 2, & 8) with this name format before additional AD registrations started failing with all sorts of weird errors.

Here are two options that seem to work. I'll be the first to admit I'm new to managing with AD, and I'm learning about this by reading along with trial and error.
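A minimal sketch of why these names collide, assuming the AD computer (NetBIOS) name is taken from the first 15 characters of the hostname. netbios_name is a hypothetical helper for illustration, not part of ParallelCluster or realmd:

```shell
#!/usr/bin/env bash
# Hypothetical helper: keep only the first 15 characters of a hostname,
# which is the assumed NetBIOS-style truncation.
netbios_name() {
    printf '%.15s\n' "$1"
}

# Two nodes from the same queue end up with the same truncated name.
for host in devc2m4-spot-dy-c5large-1 devc2m4-spot-dy-c5large-2; do
    printf '%s (%d chars) -> %s\n' "$host" "${#host}" "$(netbios_name "$host")"
done
```

Both nodes truncate to devc2m4-spot-dy, so only one of them can hold that computer name at a time.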

Option 1

Here I have set:

disable_cluster_dns = false

to have pcluster create a DNS server.

I'm going to create a unique name and register it with AD so my authentication will work, but Route53 will still do the DNS work…

In a pre-install script I parse the actual hostname to build an AD hostname of 15 characters or fewer, which I then pass to realm join. The above hostname becomes an "AD hostname" of:

ADCOMPUTERNAME=dc2m4s-dy-1

I then register this hostname with AD using realm join, and add an entry with this name to my /etc/sssd/sssd.conf file so that authentication works. I do not rely on AD for DNS; I still leave it to parallelcluster to create the Route53 cluster.pcluster domain. So while the machine is joined to my AD domain, a DNS lookup for the longer pcluster-generated name fails there, the lookup then happens in the .pcluster zone, and DNS still works.

I also add this short AD name, as an FQDN, to /etc/hosts. I'm not 100% sure this is needed:
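The post does not spell out the parsing rule, so the following is only one possible sketch that reproduces the example above. It assumes hostnames shaped like <prefix><shape>-<purchase>-<dy|st>-<instancetype>-<index>, and short_ad_name and its second argument (the cluster prefix, here "dev") are hypothetical names:

```shell
#!/usr/bin/env bash
# Hypothetical helper: derive a short AD computer name from a pcluster hostname.
# Assumed layout: <prefix><shape>-<purchase>-<dy|st>-<instancetype>-<index>
short_ad_name() {
    local host="$1" prefix="$2" first pat name
    first=$(printf '%.1s' "$prefix")
    pat="^${prefix}([a-z0-9]+)-([a-z])[a-z]*-(dy|st)-[a-z0-9]+-([0-9]+)\$"
    # Keep: first letter of the prefix, the shape, first letter of the
    # purchase type, the dy/st marker and the node index.
    name=$(printf '%s\n' "$host" | sed -n -E "s/${pat}/${first}\1\2-\3-\4/p")
    if [ -z "$name" ]; then
        echo "unrecognized hostname: $host" >&2
        return 1
    fi
    if [ "${#name}" -gt 15 ]; then
        echo "derived name is longer than 15 characters: $name" >&2
        return 1
    fi
    printf '%s\n' "$name"
}

ADCOMPUTERNAME=$(short_ad_name devc2m4-spot-dy-c5large-1 dev)
echo "$ADCOMPUTERNAME"   # prints dc2m4s-dy-1
```

Failing loudly when the name does not match or is too long seems safer than silently joining with a truncated name.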

  sed -i -e "s/$LOCAL_IP/$LOCAL_IP $SHORTNAME.${ad_domain} /" /etc/hosts

($LOCAL_IP is the IP address that is already in the /etc/hosts file, I just add this new short name right after the IP address).
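A hedged variant of the same edit, written so it can be re-run safely: it escapes the dots in the IP address and skips the change if the alias is already present. add_hosts_alias is a hypothetical helper, example.com stands in for the AD domain, and sed -i assumes GNU sed as found on the cluster nodes. The demo runs against a temporary file rather than /etc/hosts:

```shell
#!/usr/bin/env bash
# Hypothetical helper: insert an FQDN alias right after the IP in a hosts file.
add_hosts_alias() {
    local file="$1" ip="$2" fqdn="$3" ip_re
    # Nothing to do if the alias is already there.
    grep -qF -- "$fqdn" "$file" && return 0
    # Escape the dots so they match literally instead of any character.
    ip_re=$(printf '%s' "$ip" | sed 's/\./\\./g')
    sed -i -e "s/^${ip_re}[[:space:]]/${ip} ${fqdn} /" "$file"
}

# Demo on a scratch copy; the addresses and names are placeholders.
hosts_demo=$(mktemp)
printf '127.0.0.1 localhost\n10.0.1.23 ip-10-0-1-23\n' > "$hosts_demo"
add_hosts_alias "$hosts_demo" 10.0.1.23 dc2m4s-dy-1.example.com
cat "$hosts_demo"
rm -f "$hosts_demo"
```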

Here is how I join the AD domain:

realm -v join --membership-software=adcli --computer-name=$ADCOMPUTERNAME ${ad_domain} -U ${ad_username}

This is the entry I add to /etc/sssd/sssd.conf:

  sed -i -e "$ a ad_hostname = $ADCOMPUTERNAME.${ad_domain}" /etc/sssd/sssd.conf

${ad_domain} and ${ad_username} are replaced with my domain and the AD user that has authority to register, respectively.
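Appending with sed adds another ad_hostname line each time the script runs. A small sketch that updates an existing entry instead, under the same assumption as the original command that the [domain/...] section is the last one in the file. set_ad_hostname is a hypothetical helper, example.com is a placeholder domain, and sed -i assumes GNU sed:

```shell
#!/usr/bin/env bash
# Hypothetical helper: set ad_hostname in an sssd.conf-style file exactly once.
set_ad_hostname() {
    local file="$1" fqdn="$2"
    if grep -q '^ad_hostname[[:space:]]*=' "$file"; then
        # Replace the value of the existing entry.
        sed -i -e "s/^ad_hostname[[:space:]]*=.*/ad_hostname = ${fqdn}/" "$file"
    else
        # Append to the end of the file, i.e. to the last section.
        printf 'ad_hostname = %s\n' "$fqdn" >> "$file"
    fi
}

# Demo on a scratch file rather than /etc/sssd/sssd.conf.
sssd_demo=$(mktemp)
printf '[sssd]\ndomains = example.com\n\n[domain/example.com]\nid_provider = ad\n' > "$sssd_demo"
set_ad_hostname "$sssd_demo" dc2m4s-dy-1.example.com
cat "$sssd_demo"
rm -f "$sssd_demo"
```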

The benefit of this option is that I can still grab the hostname from the output of squeue and ssh to it if I need to. Essentially, AD is only used for IdP services, as the DNS is really handled by Route53.

Option 2

This idea is based on this comment: https://github.com/aws/aws-parallelcluster/issues/2577#issuecomment-810401832

Here I've implemented this all in a post-install script.

My parallel cluster config file contains:

disable_cluster_dns = true
extra_json = {"cluster": {"use_private_hostname": "true"}}

I just let the hostname be the ip-<privateip> name. Now I have a unique name of less than 15 characters, and both the DNS and Identity Provider features work just fine. However, if I need to get onto a compute node, I have to run another command to map the node name in the squeue output to the actual hostname. That's a bit tedious, but it works:

$ scontrol show nodes $hostname | grep NodeHostName | cut -f3 -d'=' | cut -f1 -d' '
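The same lookup can be wrapped in a small function. This is a sketch only: node_hostname_from_scontrol is a hypothetical name, and the sample text below is an illustrative imitation of scontrol output, not captured from a real cluster:

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `scontrol show nodes <node>` output on stdin
# and print the value of the NodeHostName field.
node_hostname_from_scontrol() {
    sed -n 's/.*NodeHostName=\([^[:space:]]*\).*/\1/p' | head -n 1
}

# Illustrative input with made-up values.
sample='NodeName=devc2m4-spot-dy-c5large-1 Arch=x86_64 CoresPerSocket=1
   NodeAddr=ip-10-0-1-23 NodeHostName=ip-10-0-1-23 Version=20.02.4'
printf '%s\n' "$sample" | node_hostname_from_scontrol
```

On a cluster it would be used as: scontrol show nodes "$hostname" | node_hostname_from_scontrol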

Next Steps, where I think I’m going.

I'm thinking of implementing option 1, but using the IP-based name from option 2 as the unique identifier for registering with AD. That gives me guaranteed uniqueness, and I don't have to worry about parsing the name. So my AD will have a computer name that matches the IP address, while DNS for the longer name is really handled by Route 53. This allows sssd to still authenticate against AD, and Route 53 does the DNS for the cluster.

Note that the Master node always has the ip-<privateip> name, which is registered in AD and works just fine, i.e. I get to my Master with DNS resolution coming from AD.
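A rough sketch of that idea, assuming the name is built from the node's private IPv4 address. ad_name_from_ip is a hypothetical helper. The hex fallback is my own assumption rather than something from this thread: ip-<a>-<b>-<c>-<d> can reach 18 characters for some addresses, so the sketch switches to an 8-digit hex form when the dashed form would exceed 15:

```shell
#!/usr/bin/env bash
# Hypothetical helper: build an AD computer name (15 characters or fewer)
# from a private IPv4 address.
ad_name_from_ip() {
    local ip="$1" name old_ifs
    name="ip-$(printf '%s' "$ip" | tr '.' '-')"
    if [ "${#name}" -gt 15 ]; then
        # Assumed fallback: encode the four octets as hex (always 11 characters).
        old_ifs=$IFS
        IFS=.
        set -- $ip
        IFS=$old_ifs
        name=$(printf 'ip-%02x%02x%02x%02x' "$1" "$2" "$3" "$4")
    fi
    printf '%s\n' "$name"
}

ad_name_from_ip 10.0.1.23        # prints ip-10-0-1-23
ad_name_from_ip 10.100.200.123   # prints ip-0a64c87b
```

The result could then be passed to realm join through --computer-name in the same way as $ADCOMPUTERNAME in option 1.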

Feedback welcome.

enrico-usai commented 2 years ago

Hi @gwolski thanks for sharing this detailed approach.

Let me add that starting from ParallelCluster 3.1.1 clusters can be configured to use an AD domain managed via one of the AWS Directory Service options like Simple AD or AWS Managed Microsoft AD (MSAD). To quickly get started with AWS managed AD you can follow our tutorial.


It seems to me that there are no open questions here, so I'm going to put it in auto-resolve. Thanks again for sharing.

Enrico

github-actions[bot] commented 2 years ago

This issue has been automatically closed because there has been no response to our request for more information from the original author. With only the information that is currently in the issue, we don't have enough information to take action. Please reach out if you have or find the answers we need so that we can investigate further.