aws / aws-parallelcluster

AWS ParallelCluster is an AWS supported Open Source cluster management tool to deploy and manage HPC clusters in the AWS cloud.
Apache License 2.0

Cannot SSH to worker nodes from the master node in an HPC cluster built with cfncluster #284

Closed miker2746 closed 6 years ago

miker2746 commented 6 years ago


I have started learning how to use cfncluster to set up an HPC cluster in AWS. After editing the cfncluster config file, I used it to build the cluster, and it was set up successfully. But after connecting to the master node, I found a few problems with my cluster.

  1. I can't SSH to the worker nodes from my master node.
  2. The 'shared_dir' of the master node wasn't shared with the worker nodes.

Could someone please tell me how to solve these two problems? Thank you very much.


rajachan commented 6 years ago

Michael - Given the two symptoms, I wonder if the cfncluster-cookbooks failed to run successfully. Can you share your cfncluster config file, /var/log/cloud-init.log, and /var/log/cfn-init.log files?

miker2746 commented 6 years ago

Hi Rajachan,

  1. The following is the /var/log/cloud-init.log file from my master node.
2018-01-08 07:47:05,599 -[DEBUG]: Cloud-init v. 17.1 running 'init-local' at Mon, 08 Jan 2018 07:47:05 +0000. Up 27.54 seconds.
2018-01-08 07:47:05,599 -[DEBUG]: No kernel command line url found.
2018-01-08 07:47:05,599 -[DEBUG]: Closing stdin.
2018-01-08 07:47:05,601 -[DEBUG]: Writing to /var/log/cloud-init.log - ab: [644] 0 bytes
2018-01-08 07:47:05,602 -[DEBUG]: Changing the ownership of /var/log/cloud-init.log to 104:4
2018-01-08 07:47:05,602 -[DEBUG]: Attempting to remove /var/lib/cloud/instance/boot-finished
2018-01-08 07:47:05,602 -[DEBUG]: Attempting to remove /var/lib/cloud/data/no-net
2018-01-08 07:47:05,603 -[DEBUG]: start: init-local/check-cache: attempting to read from cache [check]
2018-01-08 07:47:05,603 -[DEBUG]: Reading from /var/lib/cloud/instance/obj.pkl (quiet=False)
2018-01-08 07:47:05,603 -[DEBUG]: no cache found
2018-01-08 07:47:05,603 -[DEBUG]: finish: init-local/check-cache: SUCCESS: no cache found
2018-01-08 07:47:05,603 -[DEBUG]: Attempting to remove /var/lib/cloud/instance
2018-01-08 07:47:05,606 -[DEBUG]: Using distro class <class 'cloudinit.distros.ubuntu.Distro'>
  2. I can't open the /var/log/cfn-init.log file. Every time I open it, the instance freezes and I have to reconnect.

  3. The config file is as follows. I only set key_name and vpc_settings; the remaining settings were left at their defaults.

    [cluster default]
    # Name of an existing EC2 KeyPair to enable SSH access to the instances.
    key_name = cfncluster-keypair1
    # Override path to cloudformation in S3
    # (defaults to<aws_region_name>/templates/cfncluster-<version>.cfn.json)
    #template_url =
    # Cluster Server EC2 instance type
    # (defaults to t2.micro for default template)
    #compute_instance_type = t2.micro
    # Master Server EC2 instance type
    # (defaults to t2.micro for default template)
    #master_instance_type = t2.micro
    # Initial number of EC2 instances to launch as compute nodes in the cluster.
    # (defaults to 2 for default template)
    #initial_queue_size = 0
    # Maximum number of EC2 instances that can be launched in the cluster.
    # (defaults to 10 for the default template)
    #max_queue_size = 1
    # Boolean flag to set autoscaling group to maintain initial size and scale back
    # (defaults to false for the default template)
    #maintain_initial_size = false
    # Cluster scheduler
    # (defaults to sge for the default template)
    #scheduler = sge
    # Type of cluster to launch i.e. ondemand or spot
    # (defaults to ondemand for the default template)
    #cluster_type = ondemand
    # Spot price for the ComputeFleet
    #spot_price = 0.00
    # ID of a Custom AMI, to use instead of published AMI's
    #custom_ami = ami-9802b1e1
    #custom_ami = ami-ff8d1886
    #custom_ami = ami-62fa6e1b
    #custom_ami = ami-898b1ff0

    # Specify S3 resource which cfncluster nodes will be granted read-only access
    # (defaults to NONE for the default template)
    s3_read_resource = NONE
    # Specify S3 resource which cfncluster nodes will be granted read-write access
    # (defaults to NONE for the default template)
    s3_read_write_resource = NONE
    # URL to a preinstall script. This is executed before any of the boot_as_* scripts are run
    # (defaults to NONE for the default template)
    pre_install = NONE
    # Arguments to be passed to preinstall script
    # (defaults to NONE for the default template)
    pre_install_args = NONE
    # URL to a postinstall script. This is executed after any of the boot_as_* scripts are run
    # (defaults to NONE for the default template)
    post_install = NONE
    # Arguments to be passed to postinstall script
    # (defaults to NONE for the default template)
    post_install_args = NONE
    # HTTP(S) proxy server, typically http://x.x.x.x:8080
    # (defaults to NONE for the default template)
    proxy_server = NONE
    # Cluster placement group. This placement group must already exist.
    # (defaults to NONE for the default template)
    placement_group = cfncluster-pg-1
    # Cluster placement logic. This enables the whole cluster or only compute to use the placement group
    # (defaults to cluster in the default template)
    placement = cluster
    # Path/mountpoint for ephemeral drives
    # (defaults to /scratch in the default template)
    ephemeral_dir = /scratch
    # Path/mountpoint for shared EBS volume
    # (defaults to /shared in the default template)
    shared_dir = /shared
    # Encrypted ephemeral drives. In-memory keys, non-recoverable.
    # (defaults to false in default template)
    encrypted_ephemeral = false
    # MasterServer root volume size in GB. (AMI must support growroot)
    # (defaults to 10 in default template)
    master_root_volume_size = 10
    # ComputeFleet root volume size in GB. (AMI must support growroot)
    # (defaults to 10 in default template)
    compute_root_volume_size = 10
    # OS type used in the cluster
    # (defaults to alinux in the default template)
    base_os = ubuntu
    # CloudWatch Logs region
    # (defaults to NONE in the default template)
    cwl_region = NONE
    # CloudWatch Logs Log Group name
    # (defaults to NONE in the default template)
    cwl_log_group = NONE
    # Existing EC2 IAM role to be associated with the EC2 instances
    # (defaults to NONE in the default template)
    ec2_iam_role = NONE
    # Extra Json to be merged with the dna.json used by Chef
    # (defaults to {} in the default template)
    extra_json = {}
    # Additional CloudFormation template to launch with the cluster
    additional_cfn_template = NONE
    # Settings section relating to VPC to be used
    vpc_settings = mycluster1-vpc
    # Settings section relating to EBS volume
    ebs_settings = fds-test-volume-2
    # Settings section relating to scaling
    scaling_settings = custom

and my vpc_setting is as follows.

    [vpc mycluster1-vpc]
    master_subnet_id = subnet-4ba6f52c
    vpc_id = vpc-e5446782
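Each `*_settings` key in the cluster section names another section of the same file, the way `vpc_settings = mycluster1-vpc` points at `[vpc mycluster1-vpc]`. A minimal sketch of checking those references with Python's standard `configparser` follows. The `ebs` and `scaling` section prefixes are assumptions made by analogy with the `vpc` case, and `SAMPLE_CONFIG`, `REFERENCE_KEYS`, and `missing_sections` are hypothetical names used only for illustration.

```python
import configparser

# Hypothetical, trimmed-down copy of the config above: only the keys that
# point at other sections are kept.
SAMPLE_CONFIG = """
[cluster default]
key_name = cfncluster-keypair1
vpc_settings = mycluster1-vpc
ebs_settings = fds-test-volume-2
scaling_settings = custom

[vpc mycluster1-vpc]
master_subnet_id = subnet-4ba6f52c
vpc_id = vpc-e5446782
"""

# Assumed mapping from each "*_settings" key to the prefix of the section
# it refers to (only the vpc case is confirmed by the thread).
REFERENCE_KEYS = {
    "vpc_settings": "vpc",
    "ebs_settings": "ebs",
    "scaling_settings": "scaling",
}


def missing_sections(config_text, cluster_section="cluster default"):
    """Return referenced section names that are absent from the config."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    missing = []
    for key, prefix in REFERENCE_KEYS.items():
        value = parser.get(cluster_section, key, fallback=None)
        # Unset keys and the literal NONE do not reference anything.
        if value is None or value.upper() == "NONE":
            continue
        section = f"{prefix} {value}"
        if not parser.has_section(section):
            missing.append(section)
    return missing


if __name__ == "__main__":
    for name in missing_sections(SAMPLE_CONFIG):
        print(f"missing section: [{name}]")
```

Only the `[vpc ...]` section was posted in this thread, so the trimmed sample reports the other two references as missing. That says nothing about the real file, which may well define them.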

miker2746 commented 6 years ago

Hi, I updated cfncluster and found out what I was doing wrong. I should use the private IP addresses to SSH to the worker nodes, or add them to the /etc/hosts file with custom host names.

Thank you for answering my question.

Best regards, Michael

rajachan commented 6 years ago

Michael - You should not need to configure anything manually to SSH from the master into the compute nodes. The Chef cookbook already does the heavy lifting for you. It is hard to say what exactly happened without looking at the cfn-init log. I don't see a correlation between opening the log file and your instance freezing up; it might have been something transient. See if you can at least get the last 100 lines or so using tail (tail -n 100 /var/log/cfn-init.log); that will be really useful in understanding the problem.
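If opening the whole file in an editor is a problem, the same idea as `tail -n 100` can be scripted so that only the end of the log is kept in memory. This is a minimal sketch, not part of cfncluster: `tail_lines` and `suspicious` are hypothetical helper names, and the marker strings are a guess at what a failure might look like in the log.

```python
from collections import deque


def tail_lines(path, count=100):
    """Return the last `count` lines of a text file.

    The file is streamed line by line, and the bounded deque keeps only
    the most recent `count` lines.
    """
    with open(path, "r", errors="replace") as handle:
        return [line.rstrip("\n") for line in deque(handle, maxlen=count)]


def suspicious(lines, markers=("ERROR", "FATAL", "Traceback", "failed")):
    """Keep only the lines that contain one of the marker strings."""
    return [line for line in lines if any(marker in line for marker in markers)]


if __name__ == "__main__":
    log_path = "/var/log/cfn-init.log"
    try:
        recent = tail_lines(log_path)
    except OSError as error:
        print(f"could not read {log_path}: {error}")
    else:
        for line in suspicious(recent):
            print(line)
```

An empty result only means none of the assumed markers appeared in the last 100 lines. It does not prove that the run succeeded.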

miker2746 commented 6 years ago

Hi rajachan,

I launched a new cluster, but NFS still failed to set up. Here is the /var/log/cfn-init.log file from the new cluster.

`2018-01-09 23:08:18,966 [DEBUG] CloudFormation client initialized with endpoint 2018-01-09 23:08:18,966 [DEBUG] Describing resource MasterServer in stack cfncluster-mycluster2 2018-01-09 23:08:19,081 [INFO] -----------------------Starting build----------------------- 2018-01-09 23:08:19,082 [DEBUG] Not setting a reboot trigger as scheduling support is not available 2018-01-09 23:08:19,083 [INFO] Running configSets: default 2018-01-09 23:08:19,084 [INFO] Running configSet default 2018-01-09 23:08:19,085 [INFO] Running config deployConfigFiles 2018-01-09 23:08:19,086 [DEBUG] No packages specified 2018-01-09 23:08:19,086 [DEBUG] No groups specified 2018-01-09 23:08:19,086 [DEBUG] No users specified 2018-01-09 23:08:19,086 [DEBUG] No sources specified 2018-01-09 23:08:19,086 [DEBUG] Writing content to /etc/chef/client.rb 2018-01-09 23:08:19,086 [DEBUG] Setting mode for /etc/chef/client.rb to 000644 2018-01-09 23:08:19,087 [DEBUG] Setting owner 0 and group 0 for /etc/chef/client.rb 2018-01-09 23:08:19,087 [DEBUG] Writing content to /tmp/dna.json 2018-01-09 23:08:19,087 [DEBUG] Content will be serialized as a JSON structure 2018-01-09 23:08:19,087 [DEBUG] Setting mode for /tmp/dna.json to 000644 2018-01-09 23:08:19,087 [DEBUG] Setting owner 0 and group 0 for /tmp/dna.json 2018-01-09 23:08:19,087 [DEBUG] Writing content to /tmp/extra.json 2018-01-09 23:08:19,087 [DEBUG] Setting mode for /tmp/extra.json to 000644 2018-01-09 23:08:19,088 [DEBUG] Setting owner 0 and group 0 for /tmp/extra.json 2018-01-09 23:08:19,088 [DEBUG] Running command jq 2018-01-09 23:08:19,088 [DEBUG] No test for command jq 2018-01-09 23:08:19,096 [INFO] Command jq succeeded 2018-01-09 23:08:19,096 [DEBUG] Command jq output: 2018-01-09 23:08:19,096 [DEBUG] Running command mkdir 2018-01-09 23:08:19,097 [DEBUG] No test for command mkdir 2018-01-09 23:08:19,099 [INFO] Command mkdir succeeded 2018-01-09 23:08:19,099 [DEBUG] Command mkdir output: 2018-01-09 23:08:19,100 [DEBUG] Running command touch 
2018-01-09 23:08:19,100 [DEBUG] No test for command touch 2018-01-09 23:08:19,102 [INFO] Command touch succeeded 2018-01-09 23:08:19,102 [DEBUG] Command touch output: 2018-01-09 23:08:19,102 [DEBUG] No services specified 2018-01-09 23:08:19,104 [INFO] Running config getCookbooks 2018-01-09 23:08:19,105 [DEBUG] No packages specified 2018-01-09 23:08:19,105 [DEBUG] No groups specified 2018-01-09 23:08:19,105 [DEBUG] No users specified 2018-01-09 23:08:19,105 [DEBUG] No sources specified 2018-01-09 23:08:19,105 [DEBUG] No files specified 2018-01-09 23:08:19,105 [DEBUG] Running command berk 2018-01-09 23:08:19,105 [DEBUG] No test for command berk 2018-01-09 23:08:52,045 [INFO] Command berk succeeded 2018-01-09 23:08:52,045 [DEBUG] Command berk output: Resolving cookbook dependencies... Fetching 'cfncluster' from source at . Fetching cookbook index from Installing apt (6.1.4) from ([opscode] Installing build-essential (8.0.4) from ([opscode] Using cfncluster (1.4.0) from source at . Installing compat_resource (12.19.0) from ([opscode] Installing hostname (0.4.2) from ([opscode] Installing hostsfile (3.0.1) from ([opscode] Installing iptables (4.3.1) from ([opscode] Installing line (0.6.3) from ([opscode] Installing mingw (2.0.1) from ([opscode] Installing ohai (5.2.0) from ([opscode] Installing openssh (2.4.1) from ([opscode] Installing poise (2.8.1) from ([opscode] Installing poise-archive (1.5.0) from ([opscode] Installing poise-languages (2.1.1) from ([opscode] Installing poise-python (1.6.0) from ([opscode] Installing seven_zip (2.0.2) from ([opscode] Installing sysctl (0.10.2) from ([opscode] Installing tar (2.0.0) from ([opscode] Installing windows (3.4.3) from ([opscode] Installing yum (5.0.1) from ([opscode] Installing yum-epel (2.1.2) from ([opscode] Vendoring apt (6.1.4) to /etc/chef/cookbooks/apt Vendoring build-essential (8.0.4) to /etc/chef/cookbooks/build-essential Vendoring cfncluster (1.4.0) to /etc/chef/cookbooks/cfncluster Vendoring compat_resource 
(12.19.0) to /etc/chef/cookbooks/compat_resource Vendoring hostname (0.4.2) to /etc/chef/cookbooks/hostname Vendoring hostsfile (3.0.1) to /etc/chef/cookbooks/hostsfile Vendoring iptables (4.3.1) to /etc/chef/cookbooks/iptables Vendoring line (0.6.3) to /etc/chef/cookbooks/line Vendoring mingw (2.0.1) to /etc/chef/cookbooks/mingw Vendoring ohai (5.2.0) to /etc/chef/cookbooks/ohai Vendoring openssh (2.4.1) to /etc/chef/cookbooks/openssh Vendoring poise (2.8.1) to /etc/chef/cookbooks/poise Vendoring poise-archive (1.5.0) to /etc/chef/cookbooks/poise-archive Vendoring poise-languages (2.1.1) to /etc/chef/cookbooks/poise-languages Vendoring poise-python (1.6.0) to /etc/chef/cookbooks/poise-python Vendoring seven_zip (2.0.2) to /etc/chef/cookbooks/seven_zip Vendoring sysctl (0.10.2) to /etc/chef/cookbooks/sysctl Vendoring tar (2.0.0) to /etc/chef/cookbooks/tar Vendoring windows (3.4.3) to /etc/chef/cookbooks/windows Vendoring yum (5.0.1) to /etc/chef/cookbooks/yum Vendoring yum-epel (2.1.2) to /etc/chef/cookbooks/yum-epel

2018-01-09 23:08:52,046 [DEBUG] No services specified 2018-01-09 23:08:52,048 [INFO] Running config chefPrepEnv 2018-01-09 23:08:52,048 [DEBUG] No packages specified 2018-01-09 23:08:52,048 [DEBUG] No groups specified 2018-01-09 23:08:52,048 [DEBUG] No users specified 2018-01-09 23:08:52,048 [DEBUG] No sources specified 2018-01-09 23:08:52,048 [DEBUG] No files specified 2018-01-09 23:08:52,048 [DEBUG] Running command chef 2018-01-09 23:08:52,048 [DEBUG] No test for command chef 2018-01-09 23:08:58,022 [INFO] Command chef succeeded 2018-01-09 23:08:58,022 [DEBUG] Command chef output: [2018-01-09T23:08:53+00:00] INFO: Forking chef instance to converge... Starting Chef Client, version 12.19.36 [2018-01-09T23:08:53+00:00] INFO: Chef 12.19.36 [2018-01-09T23:08:53+00:00] INFO: Platform: x86_64-linux [2018-01-09T23:08:53+00:00] INFO: Chef-client pid: 2046 [2018-01-09T23:08:55+00:00] INFO: HTTP Request Returned 404 Not Found: Object not found: chefzero://localhost:8889/nodes/ [2018-01-09T23:08:55+00:00] INFO: Setting the run_list to recipe[cfncluster::sge_config] from CLI options [2018-01-09T23:08:55+00:00] WARN: Run List override has been provided. [2018-01-09T23:08:55+00:00] WARN: Original Run List: [recipe[cfncluster::sge_config]] [2018-01-09T23:08:55+00:00] WARN: Overridden Run List: [recipe[cfncluster::_prep_env]] [2018-01-09T23:08:55+00:00] INFO: Run List is [recipe[cfncluster::_prep_env]] [2018-01-09T23:08:55+00:00] INFO: Run List expands to [cfncluster::_prep_env] [2018-01-09T23:08:55+00:00] INFO: Starting Chef Run for [2018-01-09T23:08:55+00:00] INFO: Running start handlers [2018-01-09T23:08:55+00:00] INFO: Start handlers complete. 
[2018-01-09T23:08:55+00:00] INFO: HTTP Request Returned 404 Not Found: Object not found: resolving cookbooks for run list: ["cfncluster::_prep_env"] [2018-01-09T23:08:56+00:00] INFO: Loading cookbooks [cfncluster@1.4.0, build-essential@8.0.4, poise-python@1.6.0, tar@2.0.0, selinux@2.0.3, nfs@2.4.1, sysctl@0.10.2, yum@5.0.1, yum-epel@2.1.2, openssh@2.4.1, apt@6.1.4, hostname@0.4.2, line@0.6.3, seven_zip@2.0.2, mingw@2.0.1, poise@2.8.1, poise-languages@2.1.1, ohai@5.2.0, compat_resource@12.19.0, iptables@4.3.1, hostsfile@3.0.1, windows@3.4.3, poise-archive@1.5.0] [2018-01-09T23:08:56+00:00] INFO: Skipping removal of obsoleted cookbooks from the cache Synchronizing Cookbooks: [2018-01-09T23:08:56+00:00] INFO: Storing updated cookbooks/cfncluster/recipes/_compute_base_config.rb in the cache. [2018-01-09T23:08:56+00:00] INFO: Storing updated cookbooks/cfncluster/recipes/_compute_custom_config.rb in the cache. [2018-01-09T23:08:56+00:00] INFO: Storing updated cookbooks/cfncluster/recipes/_compute_sge_config.rb in the cache. [2018-01-09T23:08:56+00:00] INFO: Storing updated cookbooks/cfncluster/recipes/_compute_slurm_config.rb in the cache. [2018-01-09T23:08:56+00:00] INFO: Storing updated cookbooks/cfncluster/recipes/_compute_torque_config.rb in the cache. [2018-01-09T23:08:56+00:00] INFO: Storing updated cookbooks/cfncluster/recipes/_ganglia_install.rb in the cache. [2018-01-09T23:08:56+00:00] INFO: Storing updated cookbooks/cfncluster/recipes/_master_base_config.rb in the cache. [2018-01-09T23:08:56+00:00] INFO: Storing updated cookbooks/cfncluster/recipes/_master_custom_config.rb in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/recipes/_master_slurm_config.rb in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/recipes/_master_torque_config.rb in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/recipes/_nvidia_install.rb in the cache. 
[2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/recipes/_setup_python.rb in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/recipes/_undo_base_config.rb in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/recipes/_undo_master_base_config.rb in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/recipes/_update_packages.rb in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/recipes/base_config.rb in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/recipes/_ec2_udev_rules.rb in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/recipes/base_install.rb in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/recipes/_master_sge_config.rb in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/recipes/custom_install.rb in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/recipes/default.rb in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/recipes/image_prep.rb in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/recipes/sge_config.rb in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/recipes/sge_install.rb in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/recipes/_prep_env.rb in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/recipes/slurm_install.rb in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/recipes/torque_config.rb in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/libraries/helpers.rb in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/recipes/custom_config.rb in the cache. 
[2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/files/amazon/supervisord-init in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/files/centos-7/ganglia-webfrontend.conf in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/files/default/ in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/recipes/slurm_config.rb in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/files/default/ in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/recipes/munge_install.rb in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/files/default/cfncluster-ebsnvme-id in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/recipes/torque_install.rb in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/files/default/compute_ready in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/files/default/CfnCluster-License-README.txt in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/files/default/ec2-volid.rules in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/files/default/ in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/files/default/blacklist-nouveau.conf in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/files/default/ in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/files/default/sge_inst.conf in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/files/default/slurm-init in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/files/default/munge-init in the cache. 
[2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/files/default/slurmctld.service in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/files/default/supervisord-init in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/files/default/ganglia-webfrontend.conf in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/files/default/supervisord.conf in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/files/default/torque.setup in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/files/default/slurmd.service in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/attributes/default.rb in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/files/ubuntu-14.04/ec2blkdev-init in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/files/ubuntu-14.04/slurm-init in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/files/default/fetch_and_run in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/files/default/ in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/files/ubuntu-16.04/supervisord-init in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/files/default/ in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/templates/default/99-cfncluster-user-tty.erb in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/templates/default/cfncluster_supervisord.conf.erb in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/files/default/ec2blkdev-init in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/files/ubuntu-16.04/ec2blkdev-init in the cache. 
[2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/templates/default/gmond.conf.erb in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/files/default/jq-1.4 in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/files/default/slurm.csh in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/templates/default/publish_pending.sge.erb in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/templates/default/nodewatcher.cfg.erb in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/files/default/ in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/templates/default/publish_pending.torque.erb in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/templates/default/munge.key.erb in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/templates/default/slurm.conf.erb in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/templates/amazon/cfncluster_supervisord.conf.erb in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/templates/default/gmetad.conf.erb in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/templates/default/torque.conf.erb in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/templates/default/publish_pending.pbspro.erb in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/templates/default/torque.config.erb in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/files/ubuntu-14.04/supervisord-init in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/templates/default/sqswatcher.cfg.erb in the cache. 
[2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/templates/default/torque.setup.erb in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/templates/ubuntu/gmond.conf.erb in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/packer_update_centos_base.json in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/templates/default/lsb.hosts.erb in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/templates/ubuntu/cfncluster_supervisord.conf.erb in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/packer_variables.json in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/templates/default/publish_pending.slurm.erb in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/templates/default/torque.server_name.erb in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/LICENSE.txt in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/templates/default/cfnconfig.erb in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/ in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/ in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/ in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/Gemfile in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/centos6.elrepo.repo in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/packer_centos7.json in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/chefignore in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/metadata.json in the cache. 
[2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/packer_ubuntu1604.json in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/ in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/packer_ubuntu1404.json in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/ in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/.rubocop.yml in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/packer_centos6.json in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/NOTICE.txt in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/.kitchen.yml in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/packer_alinux.json in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/Rakefile in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/build-essential/resources/build_essential.rb in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/build-essential/resources/xcode_command_line_tools.rb in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/ in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/build-essential/recipes/default.rb in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/build-essential/ in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/build-essential/ in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/cfncluster/ in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/build-essential/metadata.json in the cache.

[2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/build-essential/ in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/poise-python/recipes/default.rb in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/poise-python/libraries/default.rb in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/poise-python/attributes/default.rb in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/poise-python/files/halite_gem/poise_python/cheftie.rb in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/poise-python/files/halite_gem/poise_python/error.rb in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/build-essential/ in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/build-essential/.foodcritic in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/poise-python/files/halite_gem/poise_python/python_command_mixin.rb in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/poise-python/files/halite_gem/poise_python/python_providers/dummy.rb in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/poise-python/files/halite_gem/poise_python/python_providers/msi.rb in the cache. [2018-01-09T23:08:57+00:00] INFO: Storing updated cookbooks/poise-python/files/halite_gem/poise_python/python_providers/portable_pypy.rb in the cache.

Running handlers: [2018-01-09T23:08:57+00:00] INFO: Running report handlers Running handlers complete [2018-01-09T23:08:57+00:00] INFO: Report handlers complete Chef Client finished, 4/7 resources updated in 04 seconds

2018-01-09 23:08:58,024 [DEBUG] No services specified 2018-01-09 23:08:58,025 [INFO] Running config shellRunPreInstall 2018-01-09 23:08:58,026 [DEBUG] No packages specified 2018-01-09 23:08:58,026 [DEBUG] No groups specified 2018-01-09 23:08:58,026 [DEBUG] No users specified 2018-01-09 23:08:58,026 [DEBUG] No sources specified 2018-01-09 23:08:58,026 [DEBUG] No files specified 2018-01-09 23:08:58,026 [DEBUG] Running command runpreinstall 2018-01-09 23:08:58,026 [DEBUG] No test for command runpreinstall 2018-01-09 23:08:58,047 [INFO] Command runpreinstall succeeded 2018-01-09 23:08:58,047 [DEBUG] Command runpreinstall output: 2018-01-09 23:08:58,047 [DEBUG] No services specified 2018-01-09 23:08:58,048 [INFO] Running config chefConfig 2018-01-09 23:08:58,049 [DEBUG] No packages specified 2018-01-09 23:08:58,049 [DEBUG] No groups specified 2018-01-09 23:08:58,049 [DEBUG] No users specified 2018-01-09 23:08:58,049 [DEBUG] No sources specified 2018-01-09 23:08:58,049 [DEBUG] No files specified 2018-01-09 23:08:58,049 [DEBUG] Running command chef 2018-01-09 23:08:58,049 [DEBUG] No test for command chef 2018-01-09 23:09:37,678 [INFO] Command chef succeeded 2018-01-09 23:09:37,679 [DEBUG] Command chef output: [2018-01-09T23:08:59+00:00] INFO: Forking chef instance to converge... Starting Chef Client, version 12.19.36 [2018-01-09T23:08:59+00:00] INFO: Chef 12.19.36 [2018-01-09T23:08:59+00:00] INFO: Platform: x86_64-linux [2018-01-09T23:08:59+00:00] INFO: Chef-client pid: 2372 [2018-01-09T23:09:00+00:00] INFO: Setting the run_list to recipe[cfncluster::sge_config] from CLI options [2018-01-09T23:09:00+00:00] INFO: Run List is [recipe[cfncluster::sge_config]] [2018-01-09T23:09:00+00:00] INFO: Run List expands to [cfncluster::sge_config] [2018-01-09T23:09:00+00:00] INFO: Starting Chef Run for [2018-01-09T23:09:00+00:00] INFO: Running start handlers [2018-01-09T23:09:00+00:00] INFO: Start handlers complete. 
[2018-01-09T23:09:00+00:00] INFO: HTTP Request Returned 404 Not Found: Object not found: resolving cookbooks for run list: ["cfncluster::sge_config"] [2018-01-09T23:09:01+00:00] INFO: Loading cookbooks [cfncluster@1.4.0, build-essential@8.0.4, poise-python@1.6.0, tar@2.0.0, selinux@2.0.3, nfs@2.4.1, sysctl@0.10.2, yum@5.0.1, yum-epel@2.1.2, openssh@2.4.1, apt@6.1.4, hostname@0.4.2, line@0.6.3, seven_zip@2.0.2, mingw@2.0.1, poise@2.8.1, poise-languages@2.1.1, ohai@5.2.0, compat_resource@12.19.0, iptables@4.3.1, hostsfile@3.0.1, windows@3.4.3, poise-archive@1.5.0] Synchronizing Cookbooks:

[2018-01-09T23:09:26+00:00] INFO: append_if_no_line[export /home/ebs] sending run action to execute[exportfs] (immediate)

[2018-01-09T23:09:26+00:00] INFO: execute[exportfs] ran successfully

[2018-01-09T23:09:26+00:00] INFO: append_if_no_line[export /home] sending run action to execute[exportfs] (immediate)

[2018-01-09T23:09:26+00:00] INFO: execute[exportfs] ran successfully

[2018-01-09T23:09:30+00:00] INFO: append_if_no_line[export /opt/sge] sending run action to execute[exportfs] (immediate)

[2018-01-09T23:09:30+00:00] INFO: execute[exportfs] ran successfully

Running handlers:
[2018-01-09T23:09:37+00:00] INFO: Running report handlers
Running handlers complete
[2018-01-09T23:09:37+00:00] INFO: Report handlers complete

Deprecated features used!
Cloning resource attributes for directory[/home/ebs] from prior resource
Previous directory[/home/ebs]: /etc/chef/local-mode-cache/cache/cookbooks/cfncluster/recipes/_master_base_config.rb:54:in `from_file'
Current directory[/home/ebs]: /etc/chef/local-mode-cache/cache/cookbooks/cfncluster/recipes/_master_base_config.rb:72:in `from_file' at 1 location:

Chef Client finished, 62/190 resources updated in 38 seconds

2018-01-09 23:09:37,681 [DEBUG] No services specified
2018-01-09 23:09:37,682 [INFO] Running config shellRunPostInstall
2018-01-09 23:09:37,682 [DEBUG] No packages specified
2018-01-09 23:09:37,682 [DEBUG] No groups specified
2018-01-09 23:09:37,682 [DEBUG] No users specified
2018-01-09 23:09:37,682 [DEBUG] No sources specified
2018-01-09 23:09:37,683 [DEBUG] No files specified
2018-01-09 23:09:37,683 [DEBUG] Running command runpostinstall
2018-01-09 23:09:37,683 [DEBUG] No test for command runpostinstall
2018-01-09 23:09:37,688 [INFO] Command runpostinstall succeeded
2018-01-09 23:09:37,688 [DEBUG] Command runpostinstall output:
2018-01-09 23:09:37,688 [DEBUG] No services specified
2018-01-09 23:09:37,689 [INFO] Running config shellForkClusterReadyInstall
2018-01-09 23:09:37,690 [DEBUG] No packages specified
2018-01-09 23:09:37,690 [DEBUG] No groups specified
2018-01-09 23:09:37,690 [DEBUG] No users specified
2018-01-09 23:09:37,690 [DEBUG] No sources specified
2018-01-09 23:09:37,690 [DEBUG] No files specified
2018-01-09 23:09:37,690 [DEBUG] Running command clusterreadyinstall
2018-01-09 23:09:37,690 [DEBUG] No test for command clusterreadyinstall
2018-01-09 23:09:37,695 [INFO] Command clusterreadyinstall succeeded
2018-01-09 23:09:37,695 [DEBUG] Command clusterreadyinstall output: Unknown action. Exit gracefully

2018-01-09 23:09:37,696 [DEBUG] No services specified
2018-01-09 23:09:37,696 [INFO] ConfigSets completed
2018-01-09 23:09:37,696 [DEBUG] Not clearing reboot trigger as scheduling support is not available
2018-01-09 23:09:37,696 [INFO] -----------------------Build complete-----------------------
2018-01-09 23:09:37,872 [DEBUG] CloudFormation client initialized with endpoint
2018-01-09 23:09:37,872 [DEBUG] Signaling resource MasterServer in stack cfncluster-mycluster2 with unique ID i-0383682101f9a0ed3 and status SUCCESS`

miker2746 commented 6 years ago

Also, every time I set shared_dir = /home in the config file, I can't ssh to the master node from my computer. I get this failure:

totoro@TOTORO:~$ ssh -i ~/aws/key-pair/cfncluster-keypair1.pem ubuntu@
The authenticity of host ' (' can't be established.
ECDSA key fingerprint is SHA256:ZRZJSLAX39zWddllC9mqW+gN5sDXfQD66eWlTCGswKM.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '' (ECDSA) to the list of known hosts.
Permission denied (publickey,gssapi-keyex,gssapi-with-mic).

The config file looks like this:

[cluster testcluster1]
# Name of an existing EC2 KeyPair to enable SSH access to the instances.
key_name = cfncluster-keypair1
# Override path to cloudformation in S3
# (defaults to<aws_region_name>/templates/cfncluster-<version>.cfn.json)
#template_url =
# Cluster Server EC2 instance type
# (defaults to t2.micro for default template)
#compute_instance_type = t2.micro
# Master Server EC2 instance type
# (defaults to t2.micro for default template)
#master_instance_type = t2.micro
# Initial number of EC2 instances to launch as compute nodes in the cluster.
# (defaults to 2 for default template)
initial_queue_size = 1
# Maximum number of EC2 instances that can be launched in the cluster.
# (defaults to 10 for the default template)
max_queue_size = 2
# Boolean flag to set autoscaling group to maintain initial size and scale back
# (defaults to false for the default template)
#maintain_initial_size = false
# Cluster scheduler
# (defaults to sge for the default template)
#scheduler = sge
# Type of cluster to launch i.e. ondemand or spot
# (defaults to ondemand for the default template)
#cluster_type = ondemand
# Spot price for the ComputeFleet
#spot_price = 0.00

# ID of a Custom AMI, to use instead of published AMI's
# must find the available AMI
# AMI Name: cfncluster-1.3.0-ubuntu-1604-lts-hvm-201608251414
#custom_ami = ami-406e1f33
#custom_ami = ami-ff8d1886
#custom_ami = ami-96b025ef
#custom_ami = ami-62fa6e1b

# cfncluster fds-image, no NFS
custom_ami = ami-898b1ff0

# cfncluster default ubuntu1604 image in eu-west-1
#custom_ami = ami-9802b1e1

# Specify S3 resource which cfncluster nodes will be granted read-only access
# (defaults to NONE for the default template)
#s3_read_resource = arn:aws:s3:::cfncluster1-s3
# Specify S3 resource which cfncluster nodes will be granted read-write access
# (defaults to NONE for the default template)
#s3_read_write_resource = arn:aws:s3:::cfncluster1-s3
# URL to a preinstall script. This is executed before any of the boot_as_* scripts are run
# (defaults to NONE for the default template)
#pre_install = NONE
# Arguments to be passed to preinstall script
# (defaults to NONE for the default template)
#pre_install_args = NONE
# URL to a postinstall script. This is executed after any of the boot_as_* scripts are run
# (defaults to NONE for the default template)
#post_install = NONE
# Arguments to be passed to postinstall script
# (defaults to NONE for the default template)
#post_install_args = NONE
# HTTP(S) proxy server, typically http://x.x.x.x:8080
# (defaults to NONE for the default template)
#proxy_server = NONE
# Cluster placement group. This placement group must already exist.
# (defaults to NONE for the default template)
#placement_group = NONE
# Cluster placement logic. This enables the whole cluster or only compute to use the placement group
# (defaults to cluster in the default template)
#placement = cluster
# Path/mountpoint for ephemeral drives
# (defaults to /scratch in the default template)
#ephemeral_dir = /scratch

# Path/mountpoint for shared EBS volume
# (defaults to /shared in the default template)

#### If I set this to /home, the home directories of all nodes in the compute fleet
#### are shared over NFS (plain NFS, not AWS EFS).
#shared_dir = /home/ubuntu/ebs
shared_dir = /home

# Encrypted ephemeral drives. In-memory keys, non-recoverable.
# (defaults to false in default template)
#encrypted_ephemeral = false
# MasterServer root volume size in GB. (AMI must support growroot)
# (defaults to 10 in default template)
#master_root_volume_size = 10
# ComputeFleet root volume size in GB. (AMI must support growroot)
# (defaults to 10 in default template)
#compute_root_volume_size = 10

# OS type used in the cluster
# (defaults to alinux in the default template)
#base_os = Ubuntu

# CloudWatch Logs region
# (defaults to NONE in the default template)
#cwl_region = NONE
# CloudWatch Logs Log Group name
# (defaults to NONE in the default template)
#cwl_log_group = NONE
# Existing EC2 IAM role to be associated with the EC2 instances
# (defaults to NONE in the default template)
#ec2_iam_role = NONE
# Extra Json to be merged with the dna.json used by Chef
# (defaults to {} in the default template)
#extra_json = {}
# Additional CloudFormation template to launch with the cluster
#additional_cfn_template = NONE
# Settings section relating to VPC to be used
#vpc_settings = cfncluster-vpc-test1
#vpc_settings = mycluster1-vpc

#test vpc_settings
vpc_settings = mycluster2-vpc

# Settings section relating to EBS volume
#ebs_settings = fds-test-volume-2
# Settings section relating to scaling
#scaling_settings = custom

I have no idea what was wrong..

Hope you could help me out.

best regards, Michael

rajachan commented 6 years ago

Michael - Sorry I did not respond to this sooner. This is related to the issue in By mounting the shared directory on /home via the shared_dir config, you are effectively hiding the contents of the master node's default /home, which lives on the Master Server's primary EBS volume. That includes the default user's ~/.ssh/authorized_keys file holding the public key of your keypair, so the key you intended to use to SSH into the master is no longer available for SSH authentication and you get "Permission denied (publickey,...)". I am going to close this issue, given it is related to, and to avoid having two separate threads to discuss this.
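
As a rough illustration of the point above, here is a minimal sketch of a pre-launch check that flags a cfncluster config whose shared_dir would be mounted directly on /home. This is not part of cfncluster; the function name `shadowed_home_sections` and the rule that only an exact /home match is flagged are assumptions made for this example.

```python
import configparser
import posixpath


def shadowed_home_sections(config_text, home="/home"):
    """Return the [cluster ...] sections whose shared_dir equals `home`.

    Hypothetical helper for illustration only: it flags an exact match on
    /home (after path normalization), nothing more.
    """
    parser = configparser.ConfigParser()
    parser.read_string(config_text)

    risky = []
    for section in parser.sections():
        # Only cluster sections carry shared_dir in the config shown above.
        if not section.startswith("cluster "):
            continue
        shared_dir = parser.get(section, "shared_dir", fallback=None)
        if shared_dir is None:
            continue
        # Normalize so that "/home/" and "/home" compare equal.
        if posixpath.normpath(shared_dir) == posixpath.normpath(home):
            risky.append(section)
    return risky


# A trimmed-down version of the config posted in this thread.
sample = """
[cluster testcluster1]
key_name = cfncluster-keypair1
#shared_dir = /home/ubuntu/ebs
shared_dir = /home
"""

print(shadowed_home_sections(sample))  # ['cluster testcluster1']
```

Pointing shared_dir at a dedicated path instead (the config comments give /shared as the default) leaves the existing /home, and the authorized_keys file inside it, visible.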