I can see that the other mounts use the fully qualified hostname to mount:
2018-04-20 14:06:51,800 P1783 [INFO]
2018-04-20 14:06:51,801 P1783 [INFO] - mount ip-10-103-33-243.ec2.internal:/shared to /shared
2018-04-20 14:06:51,801 P1783 [INFO] * mount[/shared] action enable[2018-04-20T14:06:45+00:00] INFO: Processing mount[/shared] action enable (cfncluster::_compute_base_config line 30)
2018-04-20 14:06:51,801 P1783 [INFO] [2018-04-20T14:06:45+00:00] INFO: mount[/shared] enabled
2018-04-20 14:06:51,801 P1783 [INFO]
2018-04-20 14:06:51,801 P1783 [INFO] - enable ip-10-103-33-243.ec2.internal:/shared
2018-04-20 14:06:51,801 P1783 [INFO] * mount[/home] action mount[2018-04-20T14:06:45+00:00] INFO: Processing mount[/home] action mount (cfncluster::_compute_base_config line 38)
2018-04-20 14:06:51,801 P1783 [INFO] [2018-04-20T14:06:45+00:00] INFO: mount[/home] mounted
2018-04-20 14:06:51,801 P1783 [INFO]
2018-04-20 14:06:51,801 P1783 [INFO] - mount ip-10-103-33-243.ec2.internal:/home to /home
2018-04-20 14:06:51,801 P1783 [INFO] * mount[/home] action enable[2018-04-20T14:06:45+00:00] INFO: Processing mount[/home] action enable (cfncluster::_compute_base_config line 38)
2018-04-20 14:06:51,801 P1783 [INFO] [2018-04-20T14:06:45+00:00] INFO: mount[/home] enabled
2018-04-20 14:06:51,801 P1783 [INFO]
2018-04-20 14:06:51,801 P1783 [INFO] - enable ip-10-103-33-243.ec2.internal:/home
But when mounting the /opt/sge drive, it does not:
2018-04-20 14:06:51,834 P1783 [INFO] ---- Begin output of mount -t nfs -o hard,intr,noatime,vers=3,_netdev ip-10-103-33-243:/opt/sge /opt/sge ----
2018-04-20 14:06:51,834 P1783 [INFO] STDOUT:
2018-04-20 14:06:51,834 P1783 [INFO] STDERR: mount.nfs: Failed to resolve server ip-10-103-33-243: Name or service not known
It looks like there are a couple of issues.
First, we should be using the FQDN for the batch scheduler mount. I'll mark this as a bug and we'll fix that.
Second, some change in the AMI is causing the short hostname to not resolve. It would be interesting to see the contents of /etc/resolv.conf on the compute nodes. I'm not sure why building an AMI based off one of the CfnCluster base AMIs would change anything for you, but it sounds like that's the next step.
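A quick way to check, using the hostnames from the log above (a sketch; getent should be available on the AMI):

cat /etc/resolv.conf
getent hosts ip-10-103-33-243                # short name: only resolves with the right search domain
getent hosts ip-10-103-33-243.ec2.internal   # FQDN: should always resolve via the VPC resolver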
Finally, do you need to use a custom AMI? Another option, which is likely to be way less finicky, is to use an EBS snapshot for the shared volume. This has the disadvantage of not being able to install packages via RPM, but the advantage of not having to get all the details of AMI startup just right (as you've discovered, there are many little details that have to line up). The custom_ami path isn't nearly as well tested as using our AMIs, so you're a bit more on your own when you go down the custom_ami path.
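For reference, a sketch of what the snapshot option looks like in the cfncluster config (section names as I recall them from the docs; the snapshot ID is a placeholder):

[cluster default]
ebs_settings = shared

[ebs shared]
shared_dir = /shared
ebs_snapshot_id = snap-0123456789abcdef0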
@bwbarrett Thanks for getting back to me.
Awesome. Look forward to the fix.
resolv.conf:
; generated by /usr/sbin/dhclient-script
search mydomain.com
nameserver 169.254.169.253
nameserver 10.1.10.10
nameserver 10.1.8.12
nameserver 8.8.8.8
Looks like cfncluster expects the DHCP options set to use the defaults. I temporarily removed the DHCP options set from my networks and voilà, it works. I can add search ec2.internal to my DHCP options set instead, I guess.
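For anyone else hitting this, roughly what that fix looks like with the AWS CLI (a sketch; the IDs are placeholders, and outside us-east-1 the internal domain is <region>.compute.internal rather than ec2.internal; on Linux the domain-name value can hold several space-separated domains, e.g. "ec2.internal mydomain.com"):

aws ec2 create-dhcp-options --dhcp-configurations \
    "Key=domain-name,Values=ec2.internal" \
    "Key=domain-name-servers,Values=AmazonProvidedDNS"
aws ec2 associate-dhcp-options --dhcp-options-id dopt-0123456789abcdef0 --vpc-id vpc-0123456789abcdef0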
I do need to use a custom AMI, as there are a number of packages I want pre-installed to reduce the spin-up time of the compute nodes. It already takes 10 minutes or so, and I don't want to add to that.
I ran into this issue today with a client pointing back to their on-premises AD servers for DNS. In the interim, I'm going to use a pre-install script that appends to /etc/hosts based on a Parameter Store value.
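A minimal sketch of that script, assuming the parameter holds a ready-made hosts line in "<ip> <fqdn> <short>" form (the parameter name is made up):

#!/bin/bash
# Pull a hosts line for the master from SSM Parameter Store and append it
# to /etc/hosts so the master resolves without relying on DNS.
. /etc/parallelcluster/cfnconfig
entry=$(aws --region "${cfn_region}" ssm get-parameter \
          --name "/parallelcluster/master-hosts-entry" \
          --query 'Parameter.Value' --output text) || exit 1
grep -qF "${entry}" /etc/hosts || echo "${entry}" >> /etc/hosts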
The behaviour here seems a bit braindead (in my humble opinion 😄). The CloudFormation template puts the private DNS name into the dna.json file, and then the various Chef cookbooks treat it differently: the Slurm and SGE recipes truncate to the 'short' hostname when determining the nfs_master address, while other mount resources do not. Would it not make more sense to use the private IP address in CFN and then use that in the cookbooks, so DNS is no longer an issue?
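To illustrate the inconsistency (hostname is an example):

fqdn="ip-10-103-33-243.ec2.internal"
short="${fqdn%%.*}"    # roughly what the Slurm/SGE recipes end up using for nfs_master
echo "${short}"        # -> ip-10-103-33-243, which only resolves when the DHCP
                       #    search domain cooperates; the FQDN always resolves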
I have a closely related problem. When I create the pcluster with Slurm as the scheduler, the compute fleet fails to mount /home, and thus Slurm fails to start on the compute nodes. Here is part of the file /var/log/cfn-init.log:
2019-05-05 20:04:30,188 [DEBUG] CloudFormation client initialized with endpoint https://cloudformation.us-east-2.amazonaws.com
2019-05-05 20:04:30,188 [DEBUG] Describing resource ComputeServerLaunchTemplate in stack parallelcluster-default
2019-05-05 20:04:30,402 [INFO] -----------------------Starting build-----------------------
2019-05-05 20:04:30,403 [DEBUG] Not setting a reboot trigger as scheduling support is not available
2019-05-05 20:04:30,403 [INFO] Running configSets: default
2019-05-05 20:04:30,404 [INFO] Running configSet default
2019-05-05 20:04:30,405 [INFO] Running config deployConfigFiles
2019-05-05 20:04:30,405 [DEBUG] No packages specified
2019-05-05 20:04:30,406 [DEBUG] No groups specified
2019-05-05 20:04:30,406 [DEBUG] No users specified
2019-05-05 20:04:30,406 [DEBUG] No sources specified
2019-05-05 20:04:30,406 [DEBUG] Writing content to /etc/chef/client.rb
2019-05-05 20:04:30,406 [DEBUG] Setting mode for /etc/chef/client.rb to 000644
2019-05-05 20:04:30,406 [DEBUG] Setting owner 0 and group 0 for /etc/chef/client.rb
2019-05-05 20:04:30,406 [DEBUG] Writing content to /tmp/dna.json
2019-05-05 20:04:30,406 [DEBUG] Content will be serialized as a JSON structure
2019-05-05 20:04:30,407 [DEBUG] Setting mode for /tmp/dna.json to 000644
2019-05-05 20:04:30,407 [DEBUG] Setting owner 0 and group 0 for /tmp/dna.json
2019-05-05 20:04:30,407 [DEBUG] Writing content to /tmp/extra.json
2019-05-05 20:04:30,407 [DEBUG] Setting mode for /tmp/extra.json to 000644
2019-05-05 20:04:30,407 [DEBUG] Setting owner 0 and group 0 for /tmp/extra.json
2019-05-05 20:04:30,407 [DEBUG] Running command jq
2019-05-05 20:04:30,407 [DEBUG] No test for command jq
2019-05-05 20:04:30,460 [INFO] Command jq succeeded
2019-05-05 20:04:30,460 [DEBUG] Command jq output:
2019-05-05 20:04:30,460 [DEBUG] Running command mkdir
2019-05-05 20:04:30,460 [DEBUG] No test for command mkdir
2019-05-05 20:04:30,463 [INFO] Command mkdir succeeded
2019-05-05 20:04:30,463 [DEBUG] Command mkdir output:
2019-05-05 20:04:30,463 [DEBUG] Running command touch
2019-05-05 20:04:30,463 [DEBUG] No test for command touch
2019-05-05 20:04:30,466 [INFO] Command touch succeeded
2019-05-05 20:04:30,466 [DEBUG] Command touch output:
2019-05-05 20:04:30,466 [DEBUG] No services specified
2019-05-05 20:04:30,467 [INFO] Running config chefPrepEnv
2019-05-05 20:04:30,467 [DEBUG] No packages specified
2019-05-05 20:04:30,467 [DEBUG] No groups specified
2019-05-05 20:04:30,467 [DEBUG] No users specified
2019-05-05 20:04:30,467 [DEBUG] No sources specified
2019-05-05 20:04:30,468 [DEBUG] No files specified
2019-05-05 20:04:30,468 [DEBUG] Running command chef
2019-05-05 20:04:30,468 [DEBUG] No test for command chef
2019-05-05 20:04:41,053 [INFO] Command chef succeeded
2019-05-05 20:04:41,053 [DEBUG] Command chef output: Starting Chef Client, version 14.2.0
[2019-05-05T20:04:37+00:00] WARN: Run List override has been provided.
[2019-05-05T20:04:37+00:00] WARN: Run List override has been provided.
[2019-05-05T20:04:37+00:00] WARN: Original Run List: [recipe[aws-parallelcluster::slurm_config]]
[2019-05-05T20:04:37+00:00] WARN: Original Run List: [recipe[aws-parallelcluster::slurm_config]]
[2019-05-05T20:04:37+00:00] WARN: Overridden Run List: [recipe[aws-parallelcluster::_prep_env]]
[2019-05-05T20:04:37+00:00] WARN: Overridden Run List: [recipe[aws-parallelcluster::_prep_env]]
resolving cookbooks for run list: ["aws-parallelcluster::_prep_env"]
Synchronizing Cookbooks:
- aws-parallelcluster (2.3.1)
- build-essential (8.1.1)
- poise-python (1.7.0)
- tar (2.1.1)
- selinux (2.1.1)
- nfs (2.5.1)
- sysctl (1.0.5)
- yum (5.1.0)
- yum-epel (3.1.0)
- openssh (2.6.3)
- apt (7.0.0)
- hostname (0.4.2)
- line (1.0.6)
- seven_zip (3.1.0)
- mingw (2.1.0)
- poise (2.8.2)
- poise-languages (2.1.2)
- ohai (5.2.5)
- iptables (4.5.0)
- hostsfile (3.0.1)
- windows (5.3.0)
- poise-archive (1.5.0)
Installing Cookbook Gems:
Compiling Cookbooks...
Converging 7 resources
Recipe: aws-parallelcluster::_prep_env
* directory[/etc/parallelcluster] action create (up to date)
* directory[/opt/parallelcluster] action create (up to date)
* directory[/opt/parallelcluster/scripts] action create (up to date)
* template[/etc/parallelcluster/cfnconfig] action create
- create new file /etc/parallelcluster/cfnconfig
- update content in file /etc/parallelcluster/cfnconfig from none to 5c3fb6
--- /etc/parallelcluster/cfnconfig 2019-05-05 20:04:41.006049000 +0000
+++ /etc/parallelcluster/.chef-cfnconfig20190505-2560-12args7 2019-05-05 20:04:41.006049000 +0000
@@ -1 +1,18 @@
+stack_name=parallelcluster-default
+cfn_preinstall=NONE
+cfn_preinstall_args=NONE
+cfn_postinstall=NONE
+cfn_postinstall_args="NONE"
+cfn_region=us-east-2
+cfn_scheduler=slurm
+cfn_scheduler_slots=vcpus
+cfn_instance_slots=2
+cfn_encrypted_ephemeral=false
+cfn_ephemeral_dir=/scratch
+cfn_shared_dir=/shared
+cfn_proxy=NONE
+cfn_node_type=ComputeFleet
+cfn_cluster_user=ec2-user
+cfn_sqs_queue=https://sqs.us-east-2.amazonaws.com/605256951436/parallelcluster-default-SQS-1WJF9N3KPQNH
+cfn_master=ip-172-31-32-74.us-east-2.compute.internal
- change mode from '' to '0644'
* link[/opt/parallelcluster/cfnconfig] action create
- create symlink at /opt/parallelcluster/cfnconfig to /etc/parallelcluster/cfnconfig
* cookbook_file[fetch_and_run] action create
- create new file /opt/parallelcluster/scripts/fetch_and_run
- update content in file /opt/parallelcluster/scripts/fetch_and_run from none to 901931
--- /opt/parallelcluster/scripts/fetch_and_run 2019-05-05 20:04:41.026049000 +0000
+++ /opt/parallelcluster/scripts/.chef-fetch_and_run20190505-2560-en043t 2019-05-05 20:04:41.026049000 +0000
@@ -1 +1,66 @@
+#!/bin/bash
+
+. /etc/parallelcluster/cfnconfig
+
+# Error exit function
+function error_exit () {
+ script=`basename $0`
+ echo "parallelcluster: $script - $1"
+ logger -t parallelcluster "$script - $1"
+ exit 1
+}
+
+function download_run (){
+ url=$1
+ scheme=$(echo "${url}"| cut -d: -f1)
+ tmpfile=$(mktemp)
+ trap "/bin/rm $tmpfile" RETURN
+ if [ "${scheme}" == "s3" ]; then
+ aws --region ${cfn_region} s3 cp ${url} - > $tmpfile || return 1
+ else
+ wget -qO- ${url} > $tmpfile || return 1
+ fi
+ chmod +x $tmpfile || return 1
+ $tmpfile $@ || error_exit "Failed to run boot_as_master $ACTION, $file failed with non 0 return code: $?"
+}
+
+function run_preinstall () {
+ if [ "${cfn_preinstall}" != "NONE" ]; then
+ file="${cfn_preinstall}"
+ if [ "${cfn_preinstall_args}" != "NONE" ]; then
+ download_run ${cfn_preinstall} ${cfn_preinstall_args}
+ else
+ download_run ${cfn_preinstall}
+ fi
+ fi || error_exit "Failed to run boot_as_master preinstall"
+}
+
+function run_postinstall () {
+ RC=0
+ if [ "${cfn_postinstall}" != "NONE" ]; then
+ file="${cfn_postinstall}"
+ if [ "${cfn_postinstall_args}" != "NONE" ]; then
+ download_run ${cfn_postinstall} ${cfn_postinstall_args}
+ else
+ download_run ${cfn_postinstall}
+ fi
+ fi || error_exit "Failed to run boot_as_master postinstall"
+}
+
+ACTION=${1#?}
+
+case $ACTION in
+ preinstall)
+ run_preinstall
+ ;;
+
+ postinstall)
+ run_postinstall
+ ;;
+
+ *)
+ echo "Unknown action. Exit gracefully"
+ exit 0
+
+esac
- change mode from '' to '0755'
- change owner from '' to 'root'
- change group from '' to 'root'
* cookbook_file[compute_ready] action create
- create new file /opt/parallelcluster/scripts/compute_ready
- update content in file /opt/parallelcluster/scripts/compute_ready from none to 78d5c3
--- /opt/parallelcluster/scripts/compute_ready 2019-05-05 20:04:41.030049000 +0000
+++ /opt/parallelcluster/scripts/.chef-compute_ready20190505-2560-1pybjh8 2019-05-05 20:04:41.030049000 +0000
@@ -1 +1,11 @@
+#!/bin/bash
+
+. /etc/parallelcluster/cfnconfig
+
+# Notify compute is ready
+instance_id_url="http://169.254.169.254/latest/meta-data/instance-id"
+instance_id=$(curl --retry 3 --retry-delay 0 --silent --fail ${instance_id_url})
+local_hostname_url="http://169.254.169.254/latest/meta-data/local-hostname"
+local_hostname=$(curl --retry 3 --retry-delay 0 --silent --fail ${local_hostname_url})
+aws --region ${cfn_region} sqs send-message --queue-url ${cfn_sqs_queue} --message-body '{"Type" : "Notification", "Message" : "{\"StatusCode\":\"Complete\",\"Description\":\"Succesfully launched '${instance_id}'\",\"Event\":\"parallelcluster:COMPUTE_READY\",\"EC2InstanceId\":\"'${instance_id}'\",\"Slots\":\"'${cfn_instance_slots}'\",\"LocalHostname\":\"'${local_hostname}'\"}"}'
- change mode from '' to '0755'
- change owner from '' to 'root'
- change group from '' to 'root'
[2019-05-05T20:04:41+00:00] WARN: Skipping final node save because override_runlist was given
[2019-05-05T20:04:41+00:00] WARN: Skipping final node save because override_runlist was given
Running handlers:
Running handlers complete
Chef Client finished, 4/7 resources updated in 07 seconds
2019-05-05 20:04:41,054 [DEBUG] No services specified
2019-05-05 20:04:41,055 [INFO] Running config shellRunPreInstall
2019-05-05 20:04:41,055 [DEBUG] No packages specified
2019-05-05 20:04:41,055 [DEBUG] No groups specified
2019-05-05 20:04:41,055 [DEBUG] No users specified
2019-05-05 20:04:41,055 [DEBUG] No sources specified
2019-05-05 20:04:41,055 [DEBUG] No files specified
2019-05-05 20:04:41,055 [DEBUG] Running command runpreinstall
2019-05-05 20:04:41,055 [DEBUG] No test for command runpreinstall
2019-05-05 20:04:41,059 [INFO] Command runpreinstall succeeded
2019-05-05 20:04:41,059 [DEBUG] Command runpreinstall output:
2019-05-05 20:04:41,059 [DEBUG] No services specified
2019-05-05 20:04:41,060 [INFO] Running config chefConfig
2019-05-05 20:04:41,060 [DEBUG] No packages specified
2019-05-05 20:04:41,060 [DEBUG] No groups specified
2019-05-05 20:04:41,060 [DEBUG] No users specified
2019-05-05 20:04:41,061 [DEBUG] No sources specified
2019-05-05 20:04:41,061 [DEBUG] No files specified
2019-05-05 20:04:41,061 [DEBUG] Running command chef
2019-05-05 20:04:41,061 [DEBUG] No test for command chef
2019-05-05 20:07:24,790 [ERROR] Command chef (chef-client --local-mode --config /etc/chef/client.rb --log_level auto --force-formatter --no-color --chef-zero-port 8889 --json-attributes /etc/chef/dna.json) failed
2019-05-05 20:07:24,790 [DEBUG] Command chef output: Starting Chef Client, version 14.2.0
resolving cookbooks for run list: ["aws-parallelcluster::slurm_config"]
Synchronizing Cookbooks:
- build-essential (8.1.1)
- poise-python (1.7.0)
- aws-parallelcluster (2.3.1)
- tar (2.1.1)
- selinux (2.1.1)
- nfs (2.5.1)
- sysctl (1.0.5)
- yum (5.1.0)
- yum-epel (3.1.0)
- openssh (2.6.3)
- apt (7.0.0)
- hostname (0.4.2)
- line (1.0.6)
- seven_zip (3.1.0)
- mingw (2.1.0)
- poise (2.8.2)
- poise-languages (2.1.2)
- ohai (5.2.5)
- iptables (4.5.0)
- hostsfile (3.0.1)
- windows (5.3.0)
- poise-archive (1.5.0)
Installing Cookbook Gems:
Compiling Cookbooks...
Converging 120 resources
Recipe: yum::default
* yum_globalconfig[/etc/yum.conf] action create
* template[/etc/yum.conf] action create (up to date)
(up to date)
Recipe: aws-parallelcluster::base_install
* execute[yum-config-manager_skip_if_unavail] action run
- execute yum-config-manager --setopt=*.skip_if_unavailable=1 --save
* build_essential[] action install
* yum_package[autoconf, bison, flex, gcc, gcc-c++, gettext, kernel-devel, make, m4, ncurses-devel, patch] action install (up to date)
(up to date)
Recipe: aws-parallelcluster::_setup_python
* bash[pin pip to version 18.0] action run
- execute "bash" "/tmp/chef-script20190505-2887-127wers"
* python_runtime[2] action install
* poise_languages_system[python27] action install
(up to date)
* yum_package[python27] action nothing (skipped due to action :nothing)
(up to date)
* python_runtime_pip[2] action install (up to date)
* python_package[setuptools] action install (up to date)
* python_package[wheel] action install (up to date)
* python_package[virtualenv] action install (up to date)
(up to date)
Recipe: aws-parallelcluster::base_install
* yum_package[vim] action install (up to date)
* yum_package[ksh] action install (up to date)
* yum_package[tcsh] action install (up to date)
* yum_package[zsh] action install (up to date)
* yum_package[openssl-devel] action install (up to date)
* yum_package[ncurses-devel] action install (up to date)
* yum_package[pam-devel] action install (up to date)
* yum_package[net-tools] action install (up to date)
* yum_package[openmotif-devel] action install (up to date)
* yum_package[libXmu-devel] action install (up to date)
* yum_package[hwloc-devel] action install (up to date)
* yum_package[db4-devel] action install (up to date)
* yum_package[tcl-devel] action install (up to date)
* yum_package[automake] action install (up to date)
* yum_package[autoconf] action install (up to date)
* yum_package[pyparted] action install (up to date)
* yum_package[libtool] action install (up to date)
* yum_package[httpd] action install (up to date)
* yum_package[boost-devel] action install (up to date)
* yum_package[redhat-lsb] action install (up to date)
* yum_package[mlocate] action install (up to date)
* yum_package[mpich-devel] action install (up to date)
* yum_package[openmpi-devel] action install (up to date)
* yum_package[R] action install (up to date)
* yum_package[atlas-devel] action install (up to date)
* yum_package[fftw-devel] action install (up to date)
* yum_package[libffi-devel] action install (up to date)
* yum_package[openssl-devel] action install (up to date)
* yum_package[dkms] action install (up to date)
* yum_package[mysql-devel] action install (up to date)
* yum_package[libedit-devel] action install (up to date)
* yum_package[postgresql-devel] action install (up to date)
* yum_package[postgresql-server] action install (up to date)
* yum_package[sendmail] action install (up to date)
* yum_package[cmake] action install (up to date)
* yum_package[byacc] action install (up to date)
* yum_package[libglvnd-devel] action install (up to date)
* yum_package[mdadm] action install (up to date)
Recipe: openssh::default
* yum_package[openssh-clients, openssh-server] action install (up to date)
* template[/etc/ssh/ssh_config] action create (up to date)
* template[sshd_ca_keys_file] action create (up to date)
* template[sshd_revoked_keys_file] action create (up to date)
* template[/etc/ssh/sshd_config] action create
- update content in file /etc/ssh/sshd_config from e74212 to 31db89
--- /etc/ssh/sshd_config 2019-05-05 20:04:19.716000000 +0000
+++ /etc/ssh/.chef-sshd_config20190505-2887-1vwl5xz 2019-05-05 20:04:49.954049000 +0000
@@ -16,4 +16,5 @@
TrustedUserCAKeys /etc/ssh/ca_keys
UsePAM yes
X11Forwarding yes
+
* execute[sshd-config-check] action run
- execute /usr/sbin/sshd -t
* execute[sshd-config-check] action nothing (skipped due to action :nothing)
* service[ssh] action enable (up to date)
* service[ssh] action start (up to date)
Recipe: aws-parallelcluster::base_install
* selinux_state[SELinux Disabled] action disabled
* template[disabled selinux config] action create (up to date)
(up to date)
* directory[/etc/parallelcluster] action create (up to date)
* directory[/opt/parallelcluster] action create (up to date)
* directory[/opt/parallelcluster/sources] action create (up to date)
* directory[/opt/parallelcluster/scripts] action create (up to date)
* directory[/opt/parallelcluster/licenses] action create (up to date)
* cookbook_file[AWS-ParallelCluster-License-README.txt] action create (up to date)
* python_package[awscli] action install (up to date)
* python_package[boto3] action install (up to date)
Recipe: nfs::_common
* yum_package[nfs-utils] action install (up to date)
* yum_package[rpcbind] action install (up to date)
* directory[/etc/sysconfig] action create (skipped due to only_if)
* template[/etc/sysconfig/nfs] action create
- update content in file /etc/sysconfig/nfs from 47c274 to a44fb7
--- /etc/sysconfig/nfs 2019-04-02 21:24:00.298366000 +0000
+++ /etc/sysconfig/.chef-nfs20190505-2887-jjsdjs 2019-05-05 20:04:52.274049000 +0000
@@ -1,2 +1,2 @@
-# Generated by Chef for ip-172-30-2-227.ec2.internal# Local modifications will be overwritten.
+# Generated by Chef for ip-172-31-47-65.us-east-2.compute.internal# Local modifications will be overwritten.
* service[portmap] action restart
- restart service service[portmap]
* service[lock] action restart
- restart service service[lock]
* service[portmap] action start (up to date)
* service[portmap] action enable (up to date)
* service[lock] action start (up to date)
* service[lock] action enable (up to date)
Recipe: aws-parallelcluster::base_install
* service[rpcbind] action start (skipped due to only_if)
* service[rpcbind] action enable (skipped due to only_if)
Recipe: nfs::server
* service[nfs] action start (up to date)
* service[nfs] action enable (up to date)
Recipe: nfs::_idmap
* template[/etc/idmapd.conf] action create
- update content in file /etc/idmapd.conf from d52ed3 to f2e8ab
--- /etc/idmapd.conf 2019-04-02 21:24:03.522366000 +0000
+++ /etc/.chef-idmapd20190505-2887-qbk1cs.conf 2019-05-05 20:04:54.978049000 +0000
@@ -5,7 +5,7 @@
# The following should be set to the local NFSv4 domain name
# The default is the host's DNS domain name.
-Domain = ec2.internal
+Domain = us-east-2.compute.internal
# The following is a comma-separated list of Kerberos realm
# names that should be considered to be equivalent to the
* service[idmap] action restart
- restart service service[idmap]
* service[idmap] action start (up to date)
* service[idmap] action enable (up to date)
Recipe: aws-parallelcluster::base_install
* cookbook_file[configure-pat.sh] action create (up to date)
* cookbook_file[setup-ephemeral-drives.sh] action create (up to date)
Recipe: aws-parallelcluster::_ec2_udev_rules
* cookbook_file[ec2-volid.rules] action create (up to date)
* cookbook_file[parallelcluster-ebsnvme-id] action create (up to date)
* cookbook_file[ec2_dev_2_volid.py] action create (up to date)
* cookbook_file[ec2blkdev-init] action create (up to date)
* cookbook_file[attachVolume.py] action create (up to date)
* service[ec2blkdev] action enable (up to date)
* service[ec2blkdev] action start
- start service service[ec2blkdev]
Recipe: aws-parallelcluster::base_install
* remote_file[/usr/bin/ec2-metadata] action create (up to date)
* python_package[aws-parallelcluster-node] action install (up to date)
* python_package[supervisor] action install (up to date)
* cookbook_file[supervisord.conf] action create (up to date)
* cookbook_file[supervisord-init] action create (up to date)
* cookbook_file[ami_cleanup.sh] action create (up to date)
Recipe: aws-parallelcluster::_lustre_install
* yum_package[lustre-client] action install (up to date)
Recipe: hostname::default
* file[/etc/hostname] action create
- create new file /etc/hostname
- update content in file /etc/hostname from none to 05dbf4
--- /etc/hostname 2019-05-05 20:05:12.790049000 +0000
+++ /etc/.chef-hostname20190505-2887-1k3yfcy 2019-05-05 20:05:12.790049000 +0000
@@ -1 +1,2 @@
+ip-172-31-47-65
- change mode from '' to '0644'
* ohai[reload_hostname] action reload
- re-run ohai and merge results into node attributes
* execute[hostname ip-172-31-47-65] action run (skipped due to only_if)
* hostsfile_entry[localhost] action append
Recipe: <Dynamically Defined Resource>
* file[/etc/hosts] action create
- update content in file /etc/hosts from 0915c8 to ecf0d1
--- /etc/hosts 2018-11-16 23:07:41.111235386 +0000
+++ /etc/.chef-hosts20190505-2887-17hwa2p 2019-05-05 20:05:12.866049000 +0000
@@ -1,3 +1,12 @@
-127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
-::1 localhost6 localhost6.localdomain6
+#
+# This file is managed by Chef, using the hostsfile cookbook.
+# Editing this file by hand is highly discouraged!
+#
+# Comments containing an @ sign should not be modified or else
+# hostsfile will be unable to guarantee relative priority in
+# future Chef runs!
+#
+
+127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
+::1 localhost6 localhost6.localdomain6
- Append hostsfile_entry[localhost]
Recipe: hostname::default
* hostsfile_entry[set hostname] action create
Recipe: <Dynamically Defined Resource>
* file[/etc/hosts] action create
- update content in file /etc/hosts from ecf0d1 to 9b1d3e
--- /etc/hosts 2019-05-05 20:05:12.866049000 +0000
+++ /etc/.chef-hosts20190505-2887-1ee6kol 2019-05-05 20:05:12.870049000 +0000
@@ -9,4 +9,5 @@
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost6 localhost6.localdomain6
+172.31.47.65 ip-172-31-47-65.us-east-2.compute.internal ip-172-31-47-65
- Create hostsfile_entry[set hostname]
Recipe: hostname::default
* ohai[reload_hostname] action reload
- re-run ohai and merge results into node attributes
* ohai[reload_hostname] action nothing (skipped due to action :nothing)
Recipe: aws-parallelcluster::base_config
* execute[setup ephemeral] action run
- execute /usr/local/sbin/setup-ephemeral-drives.sh
Recipe: aws-parallelcluster::_compute_base_config
* directory[/shared] action create
- create new directory /shared
- change mode from '' to '01777'
- change owner from '' to 'root'
- change group from '' to 'root'
* mount[/home] action mount
================================================================================
Error executing action `mount` on resource 'mount[/home]'
================================================================================
Mixlib::ShellOut::ShellCommandFailed
------------------------------------
Expected process to exit with [0], but received '32'
---- Begin output of mount -t nfs -o hard,intr,noatime,vers=3,_netdev ip-172-31-32-74.us-east-2.compute.internal:/home /home ----
STDOUT:
STDERR: mount.nfs: Connection timed out
---- End output of mount -t nfs -o hard,intr,noatime,vers=3,_netdev ip-172-31-32-74.us-east-2.compute.internal:/home /home ----
Ran mount -t nfs -o hard,intr,noatime,vers=3,_netdev ip-172-31-32-74.us-east-2.compute.internal:/home /home returned 32
Resource Declaration:
---------------------
# In /etc/chef/local-mode-cache/cache/cookbooks/aws-parallelcluster/recipes/_compute_base_config.rb
57: mount '/home' do
58: device "#{nfs_master}:/home"
59: fstype 'nfs'
60: options 'hard,intr,noatime,vers=3,_netdev'
61: action %i[mount enable]
62: end
63:
Compiled Resource:
------------------
# Declared in /etc/chef/local-mode-cache/cache/cookbooks/aws-parallelcluster/recipes/_compute_base_config.rb:57:in `from_file'
mount("/home") do
action [:mount, :enable]
default_guard_interpreter :default
declared_type :mount
cookbook_name "aws-parallelcluster"
recipe_name "_compute_base_config"
device "ip-172-31-32-74.us-east-2.compute.internal:/home"
fstype "nfs"
options ["hard", "intr", "noatime", "vers=3", "_netdev"]
mount_point "/home"
supports {:remount=>false}
end
System Info:
------------
chef_version=14.2.0
platform=amazon
platform_version=2018.03
ruby=ruby 2.5.1p57 (2018-03-29 revision 63029) [x86_64-linux]
program_name=/usr/bin/chef-client
executable=/opt/chef/bin/chef-client
Recipe: openssh::default
* service[ssh] action restart
- restart service service[ssh]
Recipe: nfs::server
* service[nfs] action restart
- restart service service[nfs]
Running handlers:
[2019-05-05T20:07:24+00:00] ERROR: Running exception handlers
[2019-05-05T20:07:24+00:00] ERROR: Running exception handlers
Running handlers complete
[2019-05-05T20:07:24+00:00] ERROR: Exception handlers complete
[2019-05-05T20:07:24+00:00] ERROR: Exception handlers complete
Chef Client failed. 21 resources updated in 02 minutes 42 seconds
[2019-05-05T20:07:24+00:00] FATAL: Stacktrace dumped to /etc/chef/local-mode-cache/cache/chef-stacktrace.out
[2019-05-05T20:07:24+00:00] FATAL: Stacktrace dumped to /etc/chef/local-mode-cache/cache/chef-stacktrace.out
[2019-05-05T20:07:24+00:00] FATAL: Please provide the contents of the stacktrace.out file if you file a bug report
[2019-05-05T20:07:24+00:00] FATAL: Please provide the contents of the stacktrace.out file if you file a bug report
[2019-05-05T20:07:24+00:00] FATAL: Mixlib::ShellOut::ShellCommandFailed: mount[/home] (aws-parallelcluster::_compute_base_config line 57) had an error: Mixlib::ShellOut::ShellCommandFailed: Expected process to exit with [0], but received '32'
---- Begin output of mount -t nfs -o hard,intr,noatime,vers=3,_netdev ip-172-31-32-74.us-east-2.compute.internal:/home /home ----
STDOUT:
STDERR: mount.nfs: Connection timed out
---- End output of mount -t nfs -o hard,intr,noatime,vers=3,_netdev ip-172-31-32-74.us-east-2.compute.internal:/home /home ----
Ran mount -t nfs -o hard,intr,noatime,vers=3,_netdev ip-172-31-32-74.us-east-2.compute.internal:/home /home returned 32
[2019-05-05T20:07:24+00:00] FATAL: Mixlib::ShellOut::ShellCommandFailed: mount[/home] (aws-parallelcluster::_compute_base_config line 57) had an error: Mixlib::ShellOut::ShellCommandFailed: Expected process to exit with [0], but received '32'
---- Begin output of mount -t nfs -o hard,intr,noatime,vers=3,_netdev ip-172-31-32-74.us-east-2.compute.internal:/home /home ----
STDOUT:
STDERR: mount.nfs: Connection timed out
---- End output of mount -t nfs -o hard,intr,noatime,vers=3,_netdev ip-172-31-32-74.us-east-2.compute.internal:/home /home ----
Ran mount -t nfs -o hard,intr,noatime,vers=3,_netdev ip-172-31-32-74.us-east-2.compute.internal:/home /home returned 32
2019-05-05 20:07:24,790 [ERROR] Error encountered during build of chefConfig: Command chef failed
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/cfnbootstrap/construction.py", line 542, in run_config
CloudFormationCarpenter(config, self._auth_config).build(worklog)
File "/usr/lib/python2.7/dist-packages/cfnbootstrap/construction.py", line 260, in build
changes['commands'] = CommandTool().apply(self._config.commands)
File "/usr/lib/python2.7/dist-packages/cfnbootstrap/command_tool.py", line 117, in apply
raise ToolError(u"Command %s failed" % name)
ToolError: Command chef failed
2019-05-05 20:07:24,841 [ERROR] -----------------------BUILD FAILED!------------------------
2019-05-05 20:07:24,841 [ERROR] Unhandled exception during build: Command chef failed
Traceback (most recent call last):
File "/opt/aws/bin/cfn-init", line 171, in <module>
worklog.build(metadata, configSets)
File "/usr/lib/python2.7/dist-packages/cfnbootstrap/construction.py", line 129, in build
Contractor(metadata).build(configSets, self)
File "/usr/lib/python2.7/dist-packages/cfnbootstrap/construction.py", line 530, in build
self.run_config(config, worklog)
File "/usr/lib/python2.7/dist-packages/cfnbootstrap/construction.py", line 542, in run_config
CloudFormationCarpenter(config, self._auth_config).build(worklog)
File "/usr/lib/python2.7/dist-packages/cfnbootstrap/construction.py", line 260, in build
changes['commands'] = CommandTool().apply(self._config.commands)
File "/usr/lib/python2.7/dist-packages/cfnbootstrap/command_tool.py", line 117, in apply
raise ToolError(u"Command %s failed" % name)
ToolError: Command chef failed
2019-05-05 20:07:25,058 [DEBUG] CloudFormation client initialized with endpoint https://cloudformation.us-east-2.amazonaws.com
2019-05-05 20:07:25,058 [DEBUG] Signaling resource ComputeFleet in stack parallelcluster-default with unique ID i-0cf9713a0bf8c2643 and status FAILURE
If I try to run the command manually I get:
$ sudo mount -v -t nfs -o hard,intr,noatime,vers=3,_netdev ip-172-31-32-74.us-east-2.compute.internal:/home /home
mount.nfs: timeout set for Mon May 6 13:19:51 2019
mount.nfs: trying text-based options 'hard,intr,vers=3,addr=172.31.32.74'
mount.nfs: prog 100003, trying vers=3, prot=6
mount.nfs: portmap query retrying: RPC: Timed out
mount.nfs: prog 100003, trying vers=3, prot=17
mount.nfs: portmap query failed: RPC: Timed out
Any ideas on how to solve this?
I've found the cause of the problem: it was my security group settings, which did not allow NFS/TCP access.
@cgorgulla could you explain what settings you used for your security groups? I am having the same issue, and am getting the same log as you did.
@jcpasion I don't remember exactly what I changed at the time. To find out whether the SGs are also the cause of your problem, you could temporarily enable all inbound and outbound traffic for testing purposes. If that works, you can then narrow the inbound/outbound rules again.
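A quick way to test from a compute node before opening everything up (hostname taken from the logs above; NFSv3 needs the portmapper on port 111 and nfsd on 2049 reachable, over both TCP and UDP):

rpcinfo -p ip-172-31-32-74.us-east-2.compute.internal    # portmap query, port 111
nc -zv ip-172-31-32-74.us-east-2.compute.internal 2049   # nfsd over TCP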
Note that the above is not the same issue as the one I originally opened. Here the mounts have the FQDN in the mount command, which the original issue did not.
The fact that this issue is still open is somewhat disappointing as the fix for it seems fairly straightforward.
Hi @andrew80k, the issue you reported has been fixed in version 2.4.0. From the CHANGELOG: "Always use full master FQDN when mounting NFS on compute nodes. This solves some issues occurring with some networking setups and custom DNS configurations."
Here is the pull request: https://github.com/aws/aws-parallelcluster-cookbook/pull/308
I'm going to resolve this issue. Feel free to reopen if necessary.
In my case, it's failing to create the dna.json file. I suspect that is the reason the nfs_master address is not determined, as it ends with the 'mount /home' error on the 'nfs_master' device. I see the following error in the cfn-init.log file of my compute node:
2020-05-21 14:04:31,950 [DEBUG] Running command chef
2020-05-21 14:04:31,950 [DEBUG] No test for command chef
2020-05-21 14:07:55,315 [ERROR] Command chef (chef-client --local-mode --config /etc/chef/client.rb --log_level auto --force-formatter --no-color --chef-zero-port 8889 --json-attributes /etc/chef/dna.json) failed
In the next step, however, it continues to install the cookbook gems.
I realised my issue is exactly the same as the one mentioned here, but opening the NFS ports in my security group did not help.
I'm using a custom AMI created with the EC2 Image Builder service. Any idea what could be going wrong here?
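A couple of sanity checks on the failing compute node might narrow it down (paths taken from the logs above):

ls -l /tmp/dna.json /etc/chef/dna.json            # was dna.json written at all?
jq . /etc/chef/dna.json                           # is it valid JSON, with cfn_master set?
grep cfn_master /etc/parallelcluster/cfnconfig    # did the master address make it into cfnconfig?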
Using a base public AMI:
cfncluster-1.4.2-centos7-hvm-201801112350 (ami-86795cfc)
to create a custom AMI for Nonmem and NLME installation.
I don't really do anything to it other than add some things in the /opt directory. But when the compute nodes spin up, I get the error below and they don't finish starting up.
Yet the drive is mounted, so I'm not sure what to make of it. It happened on both compute nodes.
Has anyone seen this? Or know what to do about it?