Hi David,
Have you checked your account limits? https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-resource-limits.html
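If checking from the CLI is easier than the console, something like the following should work (a sketch; quota code L-1216C47A is the EC2 "Running On-Demand Standard instances" vCPU quota, and the region is an assumption):
# Print the current vCPU quota for Running On-Demand Standard (A, C, D, H, I, M, R, T, Z) instances
aws service-quotas get-service-quota \
    --service-code ec2 \
    --quota-code L-1216C47A \
    --region us-east-1 \
    --query 'Quota.Value'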
The same thing is happening to me. Compute nodes continuously fail health checks for us-east-1b through us-east-1f, but everything works fine for us-east-1a. See this issue: #1383
@DABH, from your sqswatcher.log it looks like the compute nodes were created but failed to contact the master node for some reason. Could you check the events log in the CloudFormation Console and share the Status Reason for the Compute Fleet failure event?
https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/troubleshooting.html
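For reference, a hedged CLI alternative to clicking through the console (the stack name below follows the default parallelcluster-<cluster name> naming and matches the events pasted later in this thread; adjust if yours differs):
# Show only the CREATE_FAILED events together with their status reasons
aws cloudformation describe-stack-events \
    --stack-name parallelcluster-cluster1 \
    --query 'StackEvents[?ResourceStatus==`CREATE_FAILED`].[LogicalResourceId,ResourceStatusReason]' \
    --output table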
@ddeidda I verified with AWS support that my service limits are fine, so that is not the issue here. I’ll take a look at the CF console and report back what I find...
@DABH The issue referenced in https://github.com/aws/aws-parallelcluster/issues/1383 sounds very similar to what you're experiencing; I need the contents of /var/log/cfn-init.log from the compute nodes to tell for sure.
To get this file you'll need to prevent the compute nodes from terminating (see https://github.com/aws/aws-parallelcluster/issues/1383#issuecomment-552572307) and then SSH in.
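A sketch of what that suspension looks like from the CLI, assuming you first look up the name of the ComputeFleet Auto Scaling group (the ASG name below is a placeholder):
# Find the compute fleet's Auto Scaling group, then stop it from replacing or
# terminating the unhealthy nodes so they stay up long enough to inspect
aws autoscaling describe-auto-scaling-groups \
    --query 'AutoScalingGroups[].AutoScalingGroupName'
aws autoscaling suspend-processes \
    --auto-scaling-group-name <compute-fleet-asg-name> \
    --scaling-processes ReplaceUnhealthy HealthCheck Terminate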
@ddeidda, the CF console doesn't seem to report anything new, unfortunately:
2019-11-24 09:45:25 UTC-0800 | parallelcluster-cluster1 | ROLLBACK_IN_PROGRESS | The following resource(s) failed to create: [ComputeFleet]. . Rollback requested by user.
2019-11-24 09:45:24 UTC-0800 | ComputeFleet | CREATE_FAILED | Received 2 FAILURE signal(s) out of 2. Unable to satisfy 100% MinSuccessfulInstancesPercent requirement
2019-11-24 09:45:23 UTC-0800 | ComputeFleet | CREATE_IN_PROGRESS | Received FAILURE signal with UniqueId i-072a5d0a98dfe9a16
2019-11-24 09:45:23 UTC-0800 | ComputeFleet | CREATE_IN_PROGRESS | Received FAILURE signal with UniqueId i-04d360965000a77c8
@sean-smith , I was able to prevent the compute nodes from terminating using your linked suggestion. However, I couldn't ssh in; ssh just timed out when trying to connect to the nodes (not an issue on my end, some issue with the instances). Any other ideas for how I might get into the nodes or otherwise debug this issue? Thanks again for the help.
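One hedged debugging option when SSH times out is to dump the instance's boot console output, which often shows boot or network problems (the instance ID below is one of the failed nodes from the CloudFormation events above):
# Dump the boot console output of a failed compute instance
aws ec2 get-console-output \
    --instance-id i-072a5d0a98dfe9a16 \
    --output text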
I'm having a similar issue. The master node creation works fine, but the compute nodes don't ever seem to launch. Cluster creation was working fine until last week; the problem seems to have started only after AWS pushed some upgrade to the system. My current limit for the on-demand standard instance type is stated at 4800 vCPUs, so I doubt this is the issue. Here is the sqswatcher log for the master server; all other logs seem to have no significant errors.
As a general clarification: this is a common error that occurs when the compute nodes fail their bootstrap phase. Although the reported error is the same, the root cause can be very different, so I would encourage opening separate issues for different cluster configurations.
Now, in order to understand why the compute nodes are failing the start-up phase, we need to retrieve some logs from one of these nodes; in particular we need /var/log/cfn-init.log from a compute node.
To retrieve it you can either suspend instance replacement in EC2 as suggested here and then SSH into the compute node from the master instance, or SSH into the master node and check whether there are any archives stored under the /home/logs/compute directory (a sketch of inspecting these follows). The logs stored under /home/logs/compute come from compute nodes that were terminated because of some issue.
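A sketch of pulling those archives apart on the master node, assuming they are tarballs (the exact archive name is a placeholder and will differ):
# List any archived compute-node logs, then unpack one for inspection
ls -l /home/logs/compute/
mkdir -p /tmp/compute-logs
tar -xzf /home/logs/compute/<archive>.tar.gz -C /tmp/compute-logs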
As a general recommendation: when facing such issues, first try to remove any post-install script or custom AMI and check whether the cluster creates successfully in that case.
@DABH can you check if any logs are stored under the /home/logs/compute directory? You are not using any custom AMI nor post-install script, right?
@Adigorla Can you try to retrieve the logs from a compute instance and open a separate issue so that we can figure out what's causing your cluster to break?
UPDATE: So it turns out a quick-and-dirty fix seems to be setting update_check = false and allowing all inbound traffic, from any IP, in your Security Group. I'm unsure why I needed to change these settings now, because my original settings/template were working fine the week before last.
@demartinofra Unfortunately nothing is written to that directory or any dir like /home/{some user}/logs. And, right, I am not using a custom AMI and am not using any post-install script, just trying to get the simplest possible configuration working :)
@Adigorla Where exactly did you set update_check = false? I'd like to try out your fix...
@DABH you would need to update these settings in the .parallelcluster/config file. update_check = false should be added/modified in the [global] section of the config (a sketch of the section follows). But I think opening up the SG is the more important part of the fix. I looked through your log files and they don't seem similar to my error, but it's still worth a try I guess. LMK if it works.
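For reference, a minimal sketch of what that section might look like in .parallelcluster/config (the cluster_template value is just an illustrative default):
[global]
cluster_template = default
update_check = false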
@Adigorla I tried your suggestions but unfortunately I get the same results, so we must be having different problems. Thanks for the ideas though :)
This might be the same problem described here: https://github.com/aws/aws-parallelcluster/issues/1383#issuecomment-557662788
Can you please verify how many CIDR ranges are associated with the VPC used in the cluster?
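A hedged way to check this from the CLI (the VPC ID is a placeholder):
# List every CIDR block associated with the cluster's VPC
aws ec2 describe-vpcs \
    --vpc-ids vpc-xxxxxxxx \
    --query 'Vpcs[].CidrBlockAssociationSet[].CidrBlock'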
Hi @demartinofra, the IPv4 CIDR range for the VPC used in the cluster is 10.0.0.0/16 (that is the only CIDR block associated with the VPC). A /16 should be more than enough to hold a master + 2 compute nodes, right?
Definitely enough :) I'm sorry if this is getting a little bit frustrating. I'll try to go over your setup and logs again to see if I can spot the problem.
I was able to reproduce the issue. Here is the error coming from /var/log/cfn-init.log collected from a compute node:
* execute[run_nvidiasmi] action run
================================================================================
Error executing action `run` on resource 'execute[run_nvidiasmi]'
================================================================================
Errno::ENOENT
-------------
No such file or directory - nvidia-smi
Resource Declaration:
---------------------
# In /etc/chef/local-mode-cache/cache/cookbooks/aws-parallelcluster/recipes/_compute_slurm_config.rb
29: execute "run_nvidiasmi" do
30: command 'nvidia-smi'
31: end
32: end
Compiled Resource:
------------------
# Declared in /etc/chef/local-mode-cache/cache/cookbooks/aws-parallelcluster/recipes/_compute_slurm_config.rb:29:in `from_file'
execute("run_nvidiasmi") do
action [:run]
default_guard_interpreter :execute
command "nvidia-smi"
backup 5
declared_type :execute
cookbook_name "aws-parallelcluster"
recipe_name "_compute_slurm_config"
domain nil
user nil
end
System Info:
------------
chef_version=14.2.0
platform=ubuntu
platform_version=18.04
ruby=ruby 2.5.1p57 (2018-03-29 revision 63029) [x86_64-linux]
program_name=/usr/bin/chef-client
executable=/opt/chef/bin/chef-client
Recipe: nfs::server
* service[nfs-kernel-server] action restart
- restart service service[nfs-kernel-server]
Running handlers:
[2019-11-29T11:32:58+00:00] ERROR: Running exception handlers
[2019-11-29T11:32:58+00:00] ERROR: Running exception handlers
Running handlers complete
[2019-11-29T11:32:58+00:00] ERROR: Exception handlers complete
[2019-11-29T11:32:58+00:00] ERROR: Exception handlers complete
Chef Client failed. 42 resources updated in 31 seconds
[2019-11-29T11:32:58+00:00] FATAL: Stacktrace dumped to /etc/chef/local-mode-cache/cache/chef-stacktrace.out
[2019-11-29T11:32:58+00:00] FATAL: Stacktrace dumped to /etc/chef/local-mode-cache/cache/chef-stacktrace.out
[2019-11-29T11:32:58+00:00] FATAL: Please provide the contents of the stacktrace.out file if you file a bug report
[2019-11-29T11:32:58+00:00] FATAL: Please provide the contents of the stacktrace.out file if you file a bug report
[2019-11-29T11:32:58+00:00] FATAL: Errno::ENOENT: execute[run_nvidiasmi] (aws-parallelcluster::_compute_slurm_config line 29) had an error: Errno::ENOENT: No such file or directory - nvidia-smi
[2019-11-29T11:32:58+00:00] FATAL: Errno::ENOENT: execute[run_nvidiasmi] (aws-parallelcluster::_compute_slurm_config line 29) had an error: Errno::ENOENT: No such file or directory - nvidia-smi
Marking this issue as a bug. Investigating the root cause.
For now the issue seems to be confined to Ubuntu 18.04; at least I was able to create a cluster with CentOS 7 (a config sketch for that follows).
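If switching OS is an acceptable stopgap, that is a one-line change in the cluster section of the config; a sketch, assuming the rest of the section stays as-is:
[cluster default]
# Temporarily build the cluster on CentOS 7, where the NVIDIA drivers install correctly
base_os = centos7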
[UPDATE] Because of a bug on our side, the NVIDIA drivers are not installed on Ubuntu 18.04. As a workaround, we put together and tested a script to patch the issue you are facing and unblock your cluster creation. The script is the following:
#!/bin/bash
set -e
# Download the NVIDIA Tesla driver installer that is missing from the Ubuntu 18.04 AMI
wget https://us.download.nvidia.com/tesla/418.87/NVIDIA-Linux-x86_64-418.87.01.run -O /tmp/nvidia.run
chmod +x /tmp/nvidia.run
# Install silently, registering the module with DKMS and installing libglvnd
/tmp/nvidia.run --silent --dkms --install-libglvnd
rm -f /tmp/nvidia.run
It takes around 50 seconds to execute the script and install the missing drivers. One way to apply this fix is to run it as a pre_install script, as documented here; this means that every node of the cluster is going to take about 1 additional minute to bootstrap (a configuration sketch follows). In the meanwhile we are going to work with priority on a permanent fix and provide a new official patch release as soon as possible.
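A sketch of wiring the patch script in as a pre_install hook; the S3 location is a placeholder, and the script has to be uploaded there and be readable by the cluster's instance role:
[cluster default]
base_os = ubuntu1804
# Hypothetical S3 path to the NVIDIA patch script shown above
pre_install = s3://my-bucket/patch-nvidia.sh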
Please let us know if you need additional guidance to apply the fix and again apologies for the inconvenience.
Thanks so much @demartinofra for figuring out the root cause of this and for working on fixes. The workarounds are great for now but will keep an eye out for a permanent fix as well. Feel free to close this issue as you see fit - thanks again!
Glad to hear that!
I'll keep this open so that you get notified as soon as the fix is released.
This has reappeared in v3.0.0, for Amazon Linux 2:
[2021-12-21T15:39:16+00:00] FATAL: Stacktrace dumped to /etc/chef/local-mode-cache/cache/cinc-stacktrace.out
[2021-12-21T15:39:16+00:00] FATAL: Please provide the contents of the stacktrace.out file if you file a bug report
[2021-12-21T15:39:16+00:00] FATAL: Errno::ENOENT: execute[run_nvidiasmi] (aws-parallelcluster::compute_slurm_config line 40) had an error: Errno::ENOENT: No such file or directory - nvidia-smi
Environment:
Bug description and how to reproduce:
Master instance initializes properly, but compute instances get stuck in the Initializing state in the EC2 console, and hence the overall cluster setup fails. I let ParallelCluster initialize VPCs, security groups, etc. all on its own, so they should all be default/valid (I did some manual verification, e.g. to make sure the VPC settings matched what the docs said they should be).
Additional context:
Config file:
master /var/log/cfn-init.log: attached
master /var/log/cloud-init.log: attached
master /var/log/cloud-init-output.log: attached
master /var/log/jobwatcher: attached
master /var/log/sqswatcher: attached
compute nodes: cannot connect, no logs attached
Thanks in advance for your help!!