aws / aws-parallelcluster

AWS ParallelCluster is an AWS supported Open Source cluster management tool to deploy and manage HPC clusters in the AWS cloud.
https://github.com/aws/aws-parallelcluster
Apache License 2.0

Failing to create cluster when using GPU instance with ubuntu 18 and Slurm #1479

Closed DABH closed 4 years ago

DABH commented 4 years ago

Environment:

Bug description and how to reproduce:

The master instance initializes properly, but the compute instances get stuck in the Initializing state in the EC2 console, and hence the overall cluster setup fails. I let ParallelCluster create the VPC, security groups, etc. on its own, so they should all be default/valid (I did some manual verification, e.g. to make sure the VPC settings matched what the docs say they should be).

$ pcluster create -c /home/foo/.parallelcluster/config cluster1
Beginning cluster creation for cluster: cluster1
Creating stack named: parallelcluster-cluster1
Status: parallelcluster-cluster1 - ROLLBACK_IN_PROGRESS                                 
Cluster creation failed.  Failed events:
  - AWS::AutoScaling::AutoScalingGroup ComputeFleet Received 2 FAILURE signal(s) out of 2.  Unable to satisfy 100% MinSuccessfulInstancesPercent requirement

Additional context:

Config file:

[aws]
aws_region_name = us-west-2

[global]
cluster_template = default
update_check = true
sanity_check = true

[aliases]
ssh = ssh {CFN_USER}@{MASTER_IP} {ARGS}

[cluster default]
key_name = gpu_keypair
base_os = ubuntu1804
scheduler = slurm
master_instance_type = g3.4xlarge
compute_instance_type = g3.4xlarge
initial_queue_size = 2
max_queue_size = 2
maintain_initial_size = true
vpc_settings = default

[vpc default]
vpc_id = vpc-02de6aff5174ff11e
master_subnet_id = subnet-0cc14c110a85fdd4e

master /var/log/cfn-init.log: attached
master /var/log/cloud-init.log: attached
master /var/log/cloud-init-output.log: attached
master /var/log/jobwatcher: attached
master /var/log/sqswatcher: attached
cannot connect to compute nodes, no logs attached

Attachments: cloud-init-output.log, cloud-init.log, cfn-init.log, jobwatcher.log, sqswatcher.log

Thanks in advance for your help!!

ddeidda commented 4 years ago

Hi David,

Have you checked your account limits? https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-resource-limits.html
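
If it helps, the relevant quota can also be checked from the CLI instead of the console. A minimal sketch, assuming the AWS CLI is configured for us-west-2 and the Service Quotas API is available to you:

$ aws service-quotas list-service-quotas --service-code ec2 --region us-west-2 \
    --query "Quotas[?contains(QuotaName, 'On-Demand G')].[QuotaName,Value]" --output table

This should show the On-Demand quota that applies to g3 instances.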

karlvirgil commented 4 years ago

The same thing is happening to me. Compute nodes continuously fail health checks in us-east-1b through us-east-1f, but everything works fine in us-east-1a. See this issue: #1383

ddeidda commented 4 years ago

@DABH, from your sqswatcher.log it looks like the compute nodes were created but failed to contact the master node for some reason. Could you check the events log in the CloudFormation Console and share the Status Reason you will find for the Compute Fleet failure event?

https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/troubleshooting.html
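
If digging through the console is awkward, the same events can also be pulled with the AWS CLI. A rough equivalent, assuming the stack name from your output above:

$ aws cloudformation describe-stack-events --stack-name parallelcluster-cluster1 \
    --query "StackEvents[?ResourceStatus=='CREATE_FAILED'].[LogicalResourceId,ResourceStatusReason]" \
    --output table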

DABH commented 4 years ago

@ddeidda I verified with AWS support that my service limits are fine, so that is not the issue here. I’ll take a look at the CF console and report back what I find...

sean-smith commented 4 years ago

@DABH The issue referenced in https://github.com/aws/aws-parallelcluster/issues/1383 sounds very similar to what you're experiencing, but I need the contents of /var/log/cfn-init.log from the compute nodes to tell for sure.

To get this file you'll need to prevent the compute nodes from terminating (see https://github.com/aws/aws-parallelcluster/issues/1383#issuecomment-552572307) and then SSH in.
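
If the linked comment is unclear, the underlying idea is to suspend the Auto Scaling processes that terminate and replace unhealthy instances on the ComputeFleet Auto Scaling group, so a failed node stays around long enough to inspect. Roughly (the ASG name is a placeholder; look up the real one for your stack in the EC2 Auto Scaling console):

$ aws autoscaling suspend-processes \
    --auto-scaling-group-name <ComputeFleet-ASG-name> \
    --scaling-processes Terminate HealthCheck ReplaceUnhealthy

Remember to run aws autoscaling resume-processes (or just delete the cluster) once you're done debugging.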

DABH commented 4 years ago

@ddeidda , the CF console doesn't seem to report anything new, unfortunately:

2019-11-24 09:45:25 UTC-0800 | parallelcluster-cluster1 | ROLLBACK_IN_PROGRESS | The following resource(s) failed to create: [ComputeFleet]. . Rollback requested by user.
2019-11-24 09:45:24 UTC-0800 | ComputeFleet | CREATE_FAILED | Received 2 FAILURE signal(s) out of 2. Unable to satisfy 100% MinSuccessfulInstancesPercent requirement
2019-11-24 09:45:23 UTC-0800 | ComputeFleet | CREATE_IN_PROGRESS | Received FAILURE signal with UniqueId i-072a5d0a98dfe9a16
2019-11-24 09:45:23 UTC-0800 | ComputeFleet | CREATE_IN_PROGRESS | Received FAILURE signal with UniqueId i-04d360965000a77c8

@sean-smith , I was able to prevent the compute nodes from terminating using your linked suggestion. However, I couldn't ssh in; ssh just timed out when trying to connect to the nodes (not an issue on my end, some issue with the instances). Any other ideas for how I might get into the nodes or otherwise debug this issue? Thanks again for the help.

adigorla commented 4 years ago

I'm having a similar issue. The master node creation works fine; however, the compute nodes never seem to launch. Cluster creation was working fine until last week; the problem only seems to have started after AWS pushed some upgrade to the system. My current limit for the on-demand standard instance types is 4800 vCPUs, so I doubt that is the issue. Here is the sqswatcher log for the master server. All other logs seem to have no significant errors.

sqswatcher.log

demartinofra commented 4 years ago

As a general clarification: this is a common error that occurs when the compute nodes fail their bootstrap phase. Although the reported error is the same, the root cause can be very different, so I would encourage opening separate issues for different cluster configurations.

Now in order to understand why the compute nodes are failing the start-up phase we need to retrieve some logs from one of these nodes. Specifically here is the list of logs we need from a compute node:

To retrieve those you can either suspend instance replacement in EC2, as suggested here, and then try to SSH into the compute node from the master instance, or SSH into the master node and check if there are any archives stored under the /home/logs/compute directory. The logs stored under /home/logs/compute come from compute nodes that were terminated because of some issue.
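
For example, from the machine where you run the pcluster CLI, something along these lines (the key path is a placeholder for whatever key pair you configured):

$ pcluster ssh cluster1 -i ~/.ssh/gpu_keypair.pem
$ ls -l /home/logs/compute/

Any archives in that directory hold the logs of compute nodes that were already terminated.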

As a general recommendation: when facing such issues try to first remove any post-install script or custom ami and check if the cluster creates successfully in that case.

@DABH can you check if any logs are stored under the /home/logs/compute directory? You are not using any custom AMI or post-install script, right?

@Adigorla Can you try to retrieve the logs from a compute instance and open a separate issue so that we can figure out what's causing your cluster to break?

adigorla commented 4 years ago

UPDATE: So it turns out a quick and dirty fix seems to be setting update_check = false and allowing all inbound traffic, from any IP, in your security group. I'm unsure why I needed to change these settings now, because my original settings/template were working fine the week before last.

DABH commented 4 years ago

@demartinofra Unfortunately nothing is written to that directory or any dir like /home/{some user}/logs. And, right, I am not using a custom AMI and am not using any post-install script, just trying to get the simplest possible configuration working :)

@Adigorla Where exactly did you set update_check = false? I'd like to try out your fix...

adigorla commented 4 years ago

@DABH you would need to update these settings in the .parallelcluster/config file: update_check = false should be added/modified in the [global] section of the config. But I think opening up the SG is the more important part of the fix. I looked through your log files and they don't seem similar to my error, but it's still worth a try I guess. LMK if it works.
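
Concretely, the change is just in the [global] section of .parallelcluster/config; everything else stays as it was:

[global]
cluster_template = default
update_check = false
sanity_check = true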

DABH commented 4 years ago

@Adigorla I tried your suggestions but unfortunately I get the same results, so we must be having different problems. Thanks for the ideas though :)

demartinofra commented 4 years ago

This might be the same problem described here: https://github.com/aws/aws-parallelcluster/issues/1383#issuecomment-557662788

Can you please verify how many CIDR ranges are associated with the VPC used in the cluster?

DABH commented 4 years ago

Hi @demartinofra , the IPv4 CIDR range for the VPC used in the cluster is 10.0.0.0/16 (that is the only CIDR block associated with the VPC). A /16 should be more than enough to hold a master + 2 compute nodes, right?

demartinofra commented 4 years ago

Definitely enough :) I'm sorry if this is getting a little bit frustrating. I'll try to go over your setup and logs again to see if I can spot the problem.

demartinofra commented 4 years ago

I was able to reproduce the issue. Here is the error coming from /var/log/cfn-init.log collected from a compute node:

  * execute[run_nvidiasmi] action run

    ================================================================================
    Error executing action `run` on resource 'execute[run_nvidiasmi]'
    ================================================================================

    Errno::ENOENT
    -------------
    No such file or directory - nvidia-smi

    Resource Declaration:
    ---------------------
    # In /etc/chef/local-mode-cache/cache/cookbooks/aws-parallelcluster/recipes/_compute_slurm_config.rb

     29:   execute "run_nvidiasmi" do
     30:     command 'nvidia-smi'
     31:   end
     32: end

    Compiled Resource:
    ------------------
    # Declared in /etc/chef/local-mode-cache/cache/cookbooks/aws-parallelcluster/recipes/_compute_slurm_config.rb:29:in `from_file'

    execute("run_nvidiasmi") do
      action [:run]
      default_guard_interpreter :execute
      command "nvidia-smi"
      backup 5
      declared_type :execute
      cookbook_name "aws-parallelcluster"
      recipe_name "_compute_slurm_config"
      domain nil
      user nil
    end

    System Info:
    ------------
    chef_version=14.2.0
    platform=ubuntu
    platform_version=18.04
    ruby=ruby 2.5.1p57 (2018-03-29 revision 63029) [x86_64-linux]
    program_name=/usr/bin/chef-client
    executable=/opt/chef/bin/chef-client

Recipe: nfs::server
  * service[nfs-kernel-server] action restart
    - restart service service[nfs-kernel-server]

Running handlers:
[2019-11-29T11:32:58+00:00] ERROR: Running exception handlers
[2019-11-29T11:32:58+00:00] ERROR: Running exception handlers
Running handlers complete
[2019-11-29T11:32:58+00:00] ERROR: Exception handlers complete
[2019-11-29T11:32:58+00:00] ERROR: Exception handlers complete
Chef Client failed. 42 resources updated in 31 seconds
[2019-11-29T11:32:58+00:00] FATAL: Stacktrace dumped to /etc/chef/local-mode-cache/cache/chef-stacktrace.out
[2019-11-29T11:32:58+00:00] FATAL: Stacktrace dumped to /etc/chef/local-mode-cache/cache/chef-stacktrace.out
[2019-11-29T11:32:58+00:00] FATAL: Please provide the contents of the stacktrace.out file if you file a bug report
[2019-11-29T11:32:58+00:00] FATAL: Please provide the contents of the stacktrace.out file if you file a bug report
[2019-11-29T11:32:58+00:00] FATAL: Errno::ENOENT: execute[run_nvidiasmi] (aws-parallelcluster::_compute_slurm_config line 29) had an error: Errno::ENOENT: No such file or directory - nvidia-smi
[2019-11-29T11:32:58+00:00] FATAL: Errno::ENOENT: execute[run_nvidiasmi] (aws-parallelcluster::_compute_slurm_config line 29) had an error: Errno::ENOENT: No such file or directory - nvidia-smi

Marking this issue as a bug. Investigating the root cause.

demartinofra commented 4 years ago

For now the issue seems to be confined to Ubuntu 18.04. At least I was able to create a cluster with CentOS 7.

[UPDATE] Because of a bug on our side, the NVIDIA drivers are not installed on Ubuntu 18.04. So far I see the following workarounds:

demartinofra commented 4 years ago

We put together and tested a script to patch the issue you are facing and unblock your cluster creation. The script is the following:

#!/bin/bash

set -e

# download the NVIDIA Tesla driver installer
wget https://us.download.nvidia.com/tesla/418.87/NVIDIA-Linux-x86_64-418.87.01.run -O /tmp/nvidia.run
chmod +x /tmp/nvidia.run
# run the installer silently, building the kernel module via DKMS
/tmp/nvidia.run --silent --dkms --install-libglvnd
rm -f /tmp/nvidia.run

It takes around 50 seconds to execute the script and install the missing drivers. There are 2 alternative ways you can apply this fix:

  1. Upload the script to an S3 bucket and use it as a cluster pre_install script as documented here (a rough example of the upload and config change follows below). This means that every node of the cluster is going to take about 1 additional minute to bootstrap.
  2. Alternatively, if you don't want to spend any extra time at node start-up, you can create a custom AMI as documented here and execute the script to install the drivers as part of the instance customization step. Then use the custom_ami option with your cluster.
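
For option 1, a rough sketch, assuming the script above is saved as nvidia_install.sh and my-bucket is a placeholder for your own bucket:

$ aws s3 cp nvidia_install.sh s3://my-bucket/nvidia_install.sh

and then in the [cluster default] section of .parallelcluster/config:

pre_install = s3://my-bucket/nvidia_install.sh
s3_read_resource = arn:aws:s3:::my-bucket/*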

In the meantime, we are going to work on a permanent fix with priority and provide a new official patch release as soon as possible.

Please let us know if you need additional guidance to apply the fix and again apologies for the inconvenience.

DABH commented 4 years ago

Thanks so much @demartinofra for figuring out the root cause of this and for working on fixes. The workarounds are great for now, but I'll keep an eye out for a permanent fix as well. Feel free to close this issue as you see fit - thanks again!

demartinofra commented 4 years ago

Glad to hear that!

I'll keep this open so that you get notified as soon as the fix is released.

demartinofra commented 4 years ago

Fixed in https://github.com/aws/aws-parallelcluster/releases/tag/v2.5.1

francisreyes-tfs commented 2 years ago

This has reappeared in v3.0.0, for Amazon Linux 2:

[2021-12-21T15:39:16+00:00] FATAL: Stacktrace dumped to /etc/chef/local-mode-cache/cache/cinc-stacktrace.out
[2021-12-21T15:39:16+00:00] FATAL: Please provide the contents of the stacktrace.out file if you file a bug report
[2021-12-21T15:39:16+00:00] FATAL: Errno::ENOENT: execute[run_nvidiasmi] (aws-parallelcluster::compute_slurm_config line 40) had an error: Errno::ENOENT: No such file or directory - nvidia-smi