aws / aws-parallelcluster

AWS ParallelCluster is an AWS supported Open Source cluster management tool to deploy and manage HPC clusters in the AWS cloud.
https://github.com/aws/aws-parallelcluster
Apache License 2.0

Unable to bootstrap pcluster-3.10.1 on Rocky Linux 9.4 #6371

Closed · rmarable-flaretx closed this issue 4 days ago

rmarable-flaretx commented 3 months ago

We are unable to bootstrap a custom Rocky Linux 9.4 AMI using ParallelCluster 3.10.1.

Here is the cfn-init log stream:

    {
      "message": "2024-07-29 14:07:13,212 [ERROR] Error encountered during build of chefConfig: Command chef failed",
      "timestamp": "2024-07-29T14:07:13.212Z"
    },
    {
      "message": "Traceback (most recent call last):\n  File \"/opt/parallelcluster/pyenv/versions/3.9.19/envs/cfn_bootstrap_virtualenv/lib/python3.9/site-packages/cfnbootstrap/construction.py\", line 579, in run_config\n    CloudFormationCarpenter(config, self._auth_config, self.strict_mode).build(worklog)\n  File \"/opt/parallelcluster/pyenv/versions/3.9.19/envs/cfn_bootstrap_virtualenv/lib/python3.9/site-packages/cfnbootstrap/construction.py\", line 277, in build\n    changes['commands'] = CommandTool().apply(\n  File \"/opt/parallelcluster/pyenv/versions/3.9.19/envs/cfn_bootstrap_virtualenv/lib/python3.9/site-packages/cfnbootstrap/command_tool.py\", line 127, in apply\n    raise ToolError(u\"Command %s failed\" % name)",
      "timestamp": "2024-07-29T14:07:13.212Z"
    },
    {
      "message": "cfnbootstrap.construction_errors.ToolError: Command chef failed",
      "timestamp": "2024-07-29T14:07:13.212Z"
    },
    {
      "message": "2024-07-29 14:07:13,296 [ERROR] -----------------------BUILD FAILED!------------------------",
      "timestamp": "2024-07-29T14:07:13.296Z"
    },
    {
      "message": "2024-07-29 14:07:13,296 [ERROR] Unhandled exception during build: Command chef failed",
      "timestamp": "2024-07-29T14:07:13.296Z"
    },
    {
      "message": "Traceback (most recent call last):\n  File \"/opt/parallelcluster/pyenv/versions/3.9.19/envs/cfn_bootstrap_virtualenv/bin/cfn-init\", line 181, in <module>\n    worklog.build(metadata, configSets, strict_mode)\n  File \"/opt/parallelcluster/pyenv/versions/3.9.19/envs/cfn_bootstrap_virtualenv/lib/python3.9/site-packages/cfnbootstrap/construction.py\", line 137, in build\n    Contractor(metadata, strict_mode).build(configSets, self)\n  File \"/opt/parallelcluster/pyenv/versions/3.9.19/envs/cfn_bootstrap_virtualenv/lib/python3.9/site-packages/cfnbootstrap/construction.py\", line 567, in build\n    self.run_config(config, worklog)\n  File \"/opt/parallelcluster/pyenv/versions/3.9.19/envs/cfn_bootstrap_virtualenv/lib/python3.9/site-packages/cfnbootstrap/construction.py\", line 579, in run_config\n    CloudFormationCarpenter(config, self._auth_config, self.strict_mode).build(worklog)\n  File \"/opt/parallelcluster/pyenv/versions/3.9.19/envs/cfn_bootstrap_virtualenv/lib/python3.9/site-packages/cfnbootstrap/construction.py\", line 277, in build\n    changes['commands'] = CommandTool().apply(\n  File \"/opt/parallelcluster/pyenv/versions/3.9.19/envs/cfn_bootstrap_virtualenv/lib/python3.9/site-packages/cfnbootstrap/command_tool.py\", line 127, in apply\n    raise ToolError(u\"Command %s failed\" % name)",
      "timestamp": "2024-07-29T14:07:13.296Z"
    },
    {
      "message": "cfnbootstrap.construction_errors.ToolError: Command chef failed",
      "timestamp": "2024-07-29T14:07:13.296Z"
    }

From the system-messages log stream:

    {
      "message": "Jul 29 14:07:23 ip-10-2-34-41 cloud-init[1084]: + /opt/parallelcluster/pyenv/versions/3.9.19/envs/cfn_bootstrap_virtualenv/bin/cfn-signal --exit-code=1 '--reason=Failed to run chef recipe aws-parallelcluster-slurm::config_munge_key line 27. Please check /var/log/chef-client.log in the head node, or check the chef-client.log in CloudWatch logs. Please refer to https://docs.aws.amazon.com/parallelcluster/latest/ug/troubleshooting-v3.html for more details.' 'https://cloudformation-waitcondition-us-east-2.s3.us-east-2.amazonaws.com/arn%3Aaws%3Acloudformation%3Aus-east-2%3A227394971585%3Astack/darius/3a0f8320-4db1-11ef-a95c-0a041a247431/3a117ef0-4db1-11ef-a95c-0a041a247431/HeadNodeWaitConditionHandle20240729134822?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Date=20240729T134828Z&X-Amz-SignedHeaders=host&X-Amz-Expires=86399&X-Amz-Credential=AKIAVRFIPK6PEIG2DZWK%2F20240729%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-Signature=a7a1c96d932fa315e993bee2c2909d6ed8bbe74aa1377a0d97b064a6961a15fc' --region us-east-2 --url https://cloudformation.us-east-2.amazonaws.com",
      "timestamp": "2024-07-29T14:07:23.000Z"
    },

From the chef-client log:

    {
      "message": "    \n    ================================================================================\n    Error executing action `restart` on resource 'service[munge]'\n    ================================================================================\n    \n    Mixlib::ShellOut::ShellCommandFailed\n    ------------------------------------\n    Expected process to exit with [0], but received '1'\n    ---- Begin output of [\"/bin/systemctl\", \"--system\", \"restart\", \"munge\"] ----\n    STDOUT: \n    STDERR: Job for munge.service failed because the control process exited with error code.\n    See \"systemctl status munge.service\" and \"journalctl -xeu munge.service\" for details.\n    ---- End output of [\"/bin/systemctl\", \"--system\", \"restart\", \"munge\"] ----\n    Ran [\"/bin/systemctl\", \"--system\", \"restart\", \"munge\"] returned 1\n    \n    Resource Declaration:\n    ---------------------\n    # In /etc/chef/local-mode-cache/cache/cookbooks/aws-parallelcluster-slurm/resources/munge_key_manager.rb\n    \n     27:   declare_resource(:service, \"munge\") do\n     28:     supports restart: true\n     29:     action :restart\n     30:     retries 5\n     31:     retry_delay 10\n     32:   end unless on_docker?\n     33: end\n     34: \n    \n    Compiled Resource:\n    ------------------\n    # Declared in /etc/chef/local-mode-cache/cache/cookbooks/aws-parallelcluster-slurm/resources/munge_key_manager.rb:27:in `restart_munge_service'\n    \n    service(\"munge\") do\n      action [:restart]\n      default_guard_interpreter :default\n      declared_type :service\n      cookbook_name \"aws-parallelcluster-slurm\"\n      recipe_name \"config_munge_key\"\n      supports {:restart=>true}\n      retries 5\n      retry_delay 10\n    end\n    \n    System Info:\n    ------------\n    chef_version=18.4.12\n    platform=rocky\n    platform_version=9.4\n    ruby=ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [x86_64-linux]\n    program_name=/bin/cinc-client\n    executable=/opt/cinc/bin/cinc-client\n    ",
      "timestamp": "2024-07-29T14:07:13.246Z"
    },
    {
      "message": "[2024-07-29T14:07:13+00:00] INFO: Running queued delayed notifications before re-raising exception\n",
      "timestamp": "2024-07-29T14:07:13.246Z"
    },
    {
      "message": "Running handlers:",
      "timestamp": "2024-07-29T14:07:13.246Z"
    },
    {
      "message": "[2024-07-29T14:07:13+00:00] ERROR: Running exception handlers\n  - WriteChefError::WriteHeadNodeChefError",
      "timestamp": "2024-07-29T14:07:13.246Z"
    },
    {
      "message": "Running handlers complete",
      "timestamp": "2024-07-29T14:07:13.246Z"
    },
    {
      "message": "[2024-07-29T14:07:13+00:00] ERROR: Exception handlers complete",
      "timestamp": "2024-07-29T14:07:13.246Z"
    },
    {
      "message": "Infra Phase failed. 64 resources updated in 01 minutes 09 seconds",
      "timestamp": "2024-07-29T14:07:13.246Z"
    },
    {
      "message": "[2024-07-29T14:07:13+00:00] FATAL: Stacktrace dumped to /etc/chef/local-mode-cache/cache/cinc-stacktrace.out",
      "timestamp": "2024-07-29T14:07:13.247Z"
    },
    {
      "message": "[2024-07-29T14:07:13+00:00] FATAL: ---------------------------------------------------------------------------------------",
      "timestamp": "2024-07-29T14:07:13.247Z"
    },
    {
      "message": "[2024-07-29T14:07:13+00:00] FATAL: PLEASE PROVIDE THE CONTENTS OF THE stacktrace.out FILE (above) IF YOU FILE A BUG REPORT",
      "timestamp": "2024-07-29T14:07:13.247Z"
    },
    {
      "message": "[2024-07-29T14:07:13+00:00] FATAL: ---------------------------------------------------------------------------------------",
      "timestamp": "2024-07-29T14:07:13.247Z"
    },
    {
      "message": "[2024-07-29T14:07:13+00:00] FATAL: Mixlib::ShellOut::ShellCommandFailed: service[munge] (aws-parallelcluster-slurm::config_munge_key line 27) had an error: Mixlib::ShellOut::ShellCommandFailed: Expected process to exit with [0], but received '1'",
      "timestamp": "2024-07-29T14:07:13.247Z"
    },
    {
      "message": "---- Begin output of [\"/bin/systemctl\", \"--system\", \"restart\", \"munge\"] ----",
      "timestamp": "2024-07-29T14:07:13.247Z"
    },
    {
      "message": "STDOUT: ",
      "timestamp": "2024-07-29T14:07:13.247Z"
    },
    {
      "message": "STDERR: Job for munge.service failed because the control process exited with error code.",
      "timestamp": "2024-07-29T14:07:13.247Z"
    },
    {
      "message": "See \"systemctl status munge.service\" and \"journalctl -xeu munge.service\" for details.",
      "timestamp": "2024-07-29T14:07:13.247Z"
    },
    {
      "message": "---- End output of [\"/bin/systemctl\", \"--system\", \"restart\", \"munge\"] ----",
      "timestamp": "2024-07-29T14:07:13.247Z"
    },
    {
      "message": "Ran [\"/bin/systemctl\", \"--system\", \"restart\", \"munge\"] returned 1",
      "timestamp": "2024-07-29T14:07:17.561Z"
    }

We can't get into the head node, so unfortunately we are unable to provide the log files referenced above.

For now, we are dropping back to Rocky Linux 8.

Any guidance you can provide would be appreciated.

hanwen-pcluste commented 2 months ago

Sorry for the late reply.

This error seems to be related to https://github.com/aws/aws-parallelcluster/issues/6378

rmarable-flaretx commented 2 months ago

The munge key issue referred to in https://github.com/aws/aws-parallelcluster/issues/6378 has been fixed, but Rocky Linux 9 clusters are still failing.

    Recipe: aws-parallelcluster-slurm::config_munge_key
      * munge_key_manager[manage_munge_key] action setup_munge_key[2024-08-27T14:25:47+00:00] INFO: Processing munge_key_manager[manage_munge_key] action setup_munge_key (aws-parallelcluster-slurm::config_munge_key line 73)
     (up to date)
      * execute[fetch_and_decode_munge_key] action run[2024-08-27T14:25:47+00:00] INFO: Processing execute[fetch_and_decode_munge_key] action run (aws-parallelcluster-slurm::config_munge_key line 66)

    [execute] Fetching munge key from AWS Secrets Manager: arn:aws:secretsmanager:us-east-2:[redacted]:secret:munge-key-blah-blah-blah
              Created symlink /etc/systemd/system/multi-user.target.wants/munge.service → /usr/lib/systemd/system/munge.service.
              Restarting munge service
              Job for munge.service failed because the control process exited with error code.
              See "systemctl status munge.service" and "journalctl -xeu munge.service" for details.

    ================================================================================
    Error executing action `run` on resource 'execute[fetch_and_decode_munge_key]'
    ================================================================================

    Mixlib::ShellOut::ShellCommandFailed
    ------------------------------------
    Expected process to exit with [0], but received '1'
    ---- Begin output of //opt/parallelcluster/scripts/slurm/update_munge_key.sh -d ----
    STDOUT: Fetching munge key from AWS Secrets Manager: arn:aws:secretsmanager:us-east-2:[redacted]:secret:munge-key-blah-blah-blah
    Restarting munge service
    STDERR: Created symlink /etc/systemd/system/multi-user.target.wants/munge.service → /usr/lib/systemd/system/munge.service.
    Job for munge.service failed because the control process exited with error code.
    See "systemctl status munge.service" and "journalctl -xeu munge.service" for details.
    ---- End output of //opt/parallelcluster/scripts/slurm/update_munge_key.sh -d ----
    Ran //opt/parallelcluster/scripts/slurm/update_munge_key.sh -d returned 1

    Resource Declaration:
    ---------------------
    # In /etc/chef/local-mode-cache/cache/cookbooks/aws-parallelcluster-slurm/resources/munge_key_manager.rb

     66:   declare_resource(:execute, 'fetch_and_decode_munge_key') do
     67:     user 'root'
     68:     group 'root'
     69:     command "/#{node['cluster']['scripts_dir']}/slurm/update_munge_key.sh -d"
     70:   end
     71: end

    Compiled Resource:
    ------------------
    # Declared in /etc/chef/local-mode-cache/cache/cookbooks/aws-parallelcluster-slurm/resources/munge_key_manager.rb:66:in `fetch_and_decode_munge_key'

    execute("fetch_and_decode_munge_key") do
      action [:run]
      default_guard_interpreter :execute
      command "//opt/parallelcluster/scripts/slurm/update_munge_key.sh -d"
      declared_type :execute
      cookbook_name "aws-parallelcluster-slurm"
      recipe_name "config_munge_key"
      user "root"
      group "root"
    end

    System Info:
    ------------
    chef_version=18.4.12
    platform=rocky
    platform_version=9.4
    ruby=ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [x86_64-linux]
    program_name=/bin/cinc-client
    executable=/opt/cinc/bin/cinc-client

More logs:

    [2024-08-27T14:25:49+00:00] ERROR: Running exception handlers
      - WriteChefError::WriteHeadNodeChefError

And more:

    [2024-08-27T14:25:49+00:00] FATAL: Mixlib::ShellOut::ShellCommandFailed: execute[fetch_and_decode_munge_key] (aws-parallelcluster-slurm::config_munge_key line 66) had an error: Mixlib::ShellOut::ShellCommandFailed: Expected process to exit with [0], but received '1'

So to reiterate, this works with Rocky 8 but NOT with Rocky 9.

JamesDavidson13 commented 3 weeks ago

Hi @rmarable-flaretx,

I was able to resolve the issue on Rocky 9.5 with ParallelCluster 3.11 by adjusting the permissions of the /etc directory.

I created a bash script containing the following command:

    sudo chmod 0755 /etc

Then, I updated the pcluster config-file.yml to include this script in both the HeadNode and SlurmQueues sections under CustomActions:

    CustomActions:
      OnNodeStart:
        Script:

For reference, here is the documentation: https://github.com/aws/aws-parallelcluster/wiki/(3.9.0%E2%80%90current)-Cluster-creation-fails-on-Rocky-9.4
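
A minimal sketch of what those sections might look like in the cluster configuration; the queue name and the S3 path to the chmod script are placeholders, not values taken from this thread:

    HeadNode:
      CustomActions:
        OnNodeStart:
          # Placeholder path to a script containing: sudo chmod 0755 /etc
          Script: s3://my-bucket/fix-etc-permissions.sh
    Scheduling:
      Scheduler: slurm
      SlurmQueues:
        - Name: queue1
          CustomActions:
            OnNodeStart:
              Script: s3://my-bucket/fix-etc-permissions.sh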

rmarable-flaretx commented 5 days ago

hi @JamesDavidson13 - thanks for the feedback!

Changing the permissions on /etc using an OnNodeStart custom action did the trick.

rmarable-flaretx commented 4 days ago

Applying the suggested fixes outlined on https://github.com/aws/aws-parallelcluster/wiki/(3.9.0%E2%80%90current)-Cluster-creation-fails-on-Rocky-9.4 resolved this issue.