aws / aws-parallelcluster

AWS ParallelCluster is an AWS supported Open Source cluster management tool to deploy and manage HPC clusters in the AWS cloud.
https://github.com/aws/aws-parallelcluster
Apache License 2.0

Feature request: use private IPs throughout / remove dependence on DHCP options set #1201

Open joeydumont opened 5 years ago

joeydumont commented 5 years ago

Environment:

Related Issue

#1192

Bug description and how to reproduce: I am trying to set up a parallel cluster in a sub-account inside an AWS Landing Zone managed organization. Our LZ uses an AWS Managed AD. When an account is created via the vending machine, a default DHCP options set that points to the AD DNS is assigned to the VPC created in the account. If this DHCP options set is not changed to one with domain-name = <region_name>.compute.internal and domain-name-servers = AmazonProvidedDNS, pcluster create fails at the sanity check of the auto-scaling group (a sketch of the manual swap follows the creation log below):

2019-07-16 10:50:27,188 - DEBUG - pcluster.pcluster - pcluster CLI starting
2019-07-16 10:50:27,194 - DEBUG - pcluster.pcluster - Namespace(cluster_name='dumontj-test-cluster', cluster_template=None, command='create', config_file='ParallelClusterConfig', extra_parameters=None, func=<function create at 0x7f9734abd2f0>, norollback=False, nowait=False, region=None, tags=None, template_url=None)
2019-07-16 10:50:27,194 - INFO - pcluster.pcluster - Beginning cluster creation for cluster: dumontj-test-cluster
2019-07-16 10:50:27,194 - DEBUG - pcluster.pcluster - Building cluster config based on args Namespace(cluster_name='dumontj-test-cluster', cluster_template=None, command='create', config_file='ParallelClusterConfig', extra_parameters=None, func=<function create at 0x7f9734abd2f0>, norollback=False, nowait=False, region=None, tags=None, template_url=None)
2019-07-16 10:50:28,970 - INFO - pcluster.pcluster - Creating stack named: parallelcluster-dumontj-test-cluster
2019-07-16 10:50:29,345 - DEBUG - pcluster.pcluster - StackId: arn:aws:cloudformation:ca-central-1:155721880543:stack/parallelcluster-dumontj-test-cluster/0dabde60-a7d9-11e9-a9d5-06fb6b41a85e
2019-07-16 11:28:57,919 - DEBUG - pcluster.pcluster - Status: parallelcluster-dumontj-test-cluster - ROLLBACK_IN_PROGRESS
2019-07-16 11:28:57,920 - CRITICAL - pcluster.pcluster -
Cluster creation failed.  Failed events:
2019-07-16 11:28:58,098 - INFO - pcluster.pcluster -   - AWS::AutoScaling::AutoScalingGroup ComputeFleet Received 0 SUCCESS signal(s) out of 1.  Unable to satisfy 100% MinSuccessfulInstancesPercent requirement
2019-07-16 11:28:58,099 - INFO - pcluster.pcluster -
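
For reference, the manual swap I do before running pcluster create looks roughly like the following boto3 sketch (the region and VPC ID are placeholders for my account's values):

import boto3

ec2 = boto3.client("ec2", region_name="ca-central-1")

# Build an options set equivalent to the VPC default:
# <region>.compute.internal plus the Amazon-provided resolver.
build_options_id = ec2.create_dhcp_options(
    DhcpConfigurations=[
        {"Key": "domain-name", "Values": ["ca-central-1.compute.internal"]},
        {"Key": "domain-name-servers", "Values": ["AmazonProvidedDNS"]},
    ]
)["DhcpOptions"]["DhcpOptionsId"]

# Point the cluster VPC at it (vpc-xxxxxxxx is a placeholder).
ec2.associate_dhcp_options(DhcpOptionsId=build_options_id, VpcId="vpc-xxxxxxxx")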

However, if you log in to the master server and try to schedule jobs without changing the DHCP options set back to the one that points to the AD DNS, you get errors like:

[ec2-user@ip-10-0-12-47 ~]$ srun -N 2 -n2 hostname
srun: Required node not available (down, drained or reserved)
srun: job 2 queued and waiting for resources
srun: job 2 has been allocated resources
srun: error: fwd_tree_thread: can't find address for host ip-10-0-12-32, check slurm.conf
srun: error: fwd_tree_thread: can't find address for host ip-10-0-12-25, check slurm.conf
srun: error: Task launch for 2.0 failed on node ip-10-0-12-25: Can't find an address, check slurm.conf
srun: error: Task launch for 2.0 failed on node ip-10-0-12-32: Can't find an address, check slurm.conf
srun: error: Application launch failed: Can't find an address, check slurm.conf
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete
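
The srun failures above appear to come down to hostname resolution on the master node. A quick check (the hostname is one of the compute nodes from the messages above; whether it resolves depends on which DHCP options set is currently attached to the VPC):

import socket

try:
    print(socket.gethostbyname("ip-10-0-12-32"))
except socket.gaierror as err:
    # If the resolver handed out by the active DHCP options set cannot
    # answer for this name, Slurm cannot find an address for the node either.
    print("resolution failed:", err)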

I propose that a new configuration item be added to the [vpc] section of the configuration, something like dhcp_options_set_build, that would take a DHCP options set ID. pcluster create could temporarily assign that DHCP options set to the VPC under consideration and restore the original DHCP options set afterwards.

This is not a super elegant solution, but it would work.
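
Roughly, the flow inside pcluster create would look like this boto3 sketch (dhcp_options_set_build is the proposed new key, so the function and parameter names here are only illustrative):

import boto3

def create_with_temporary_dhcp_options(vpc_id, build_options_id, region, create_fn):
    """Swap in the build DHCP options set, run cluster creation,
    then restore whatever the VPC had before."""
    ec2 = boto3.client("ec2", region_name=region)

    # Remember the options set currently attached to the VPC.
    original_options_id = ec2.describe_vpcs(VpcIds=[vpc_id])["Vpcs"][0]["DhcpOptionsId"]

    ec2.associate_dhcp_options(DhcpOptionsId=build_options_id, VpcId=vpc_id)
    try:
        create_fn()  # the existing stack-creation logic
    finally:
        # Always put the original options set back, even if creation fails.
        ec2.associate_dhcp_options(DhcpOptionsId=original_options_id, VpcId=vpc_id)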

joeydumont commented 5 years ago

Or the script itself could create the DHCP options set, assign it to the VPC, do its thing, re-assign the original DHCP options set, then delete the temporary set.
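
That variant would only add a create step up front (same configuration as in the snippet in the issue description) and a cleanup call at the end, roughly:

import boto3

ec2 = boto3.client("ec2", region_name="ca-central-1")  # region placeholder

# After the original options set has been re-associated,
# delete the temporary one that was created for the build.
ec2.delete_dhcp_options(DhcpOptionsId="dopt-xxxxxxxx")  # placeholder ID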

demartinofra commented 5 years ago

One idea might be to use private IPs everywhere, but we need to verify whether the current scheduler configurations and all supported features are compatible with that.

joeydumont commented 5 years ago

Indeed. I'm currently trying to test the deployment of multiple cluster configurations to see what works and what doesn't. I'm a little confused, because my most recent deployments did not even get to the second error message above. The compute instances are spun up after I run srun on the master node, but the job never gets to the compute nodes. I'm not sure what I did for my first deployment that made it work...

joeydumont commented 5 years ago

You mentioned that you needed to verify that using private IPs could work. Could you tell me where to modify the codebase so that I could test that on my end? Even better, do you have a fork with that enabled?

demartinofra commented 5 years ago

I tried to switch the current logic to private IPs: the NFS mounts work as expected but unfortunately the schedulers don't. Some additional work is required to reconfigure the schedulers so that they can properly function with private IPs (a rough sketch of what that would involve is below).
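
For Slurm, the direction would be to address nodes by NodeAddr instead of relying on DNS for the ip-x-x-x-x hostnames. A rough sketch of how the node entries could be generated from the instances' private IPs (this is not what the cookbook does today; the ASG tag filter is just one possible way to find the compute fleet):

import boto3

def slurm_node_lines(region, asg_name):
    """Emit slurm.conf node entries that address compute nodes by
    private IP (NodeAddr) instead of relying on hostname resolution."""
    ec2 = boto3.client("ec2", region_name=region)
    lines = []
    paginator = ec2.get_paginator("describe_instances")
    pages = paginator.paginate(
        Filters=[
            {"Name": "tag:aws:autoscaling:groupName", "Values": [asg_name]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    for page in pages:
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                ip = instance["PrivateIpAddress"]
                name = "ip-" + ip.replace(".", "-")
                lines.append("NodeName={} NodeAddr={} State=UNKNOWN".format(name, ip))
    return lines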

joeydumont commented 5 years ago

Ok, keep me posted!

chinmaychandak commented 3 years ago

Hi, I am facing an issue as described in #1935 inside AWS EKS (I am trying to use EFS as a persistent volume in a VPC that has a custom DHCP options set), which seems to be somewhat related to this one.

Does anyone happen to know if there has been any update on these issues? If not, any suggestions on how I can work around this (#1935)? Any help would be really appreciated.

Please also let me know if I should open a Feature Request on any other EKS-specific repository that might help increase visibility.