Closed — rmarable-flaretx closed this issue 4 days ago
Sorry for the late reply,
This error seems to be related to https://github.com/aws/aws-parallelcluster/issues/6378
The munge key issue referred to there has been fixed, but Rocky Linux 9 clusters are still failing.
Recipe: aws-parallelcluster-slurm::config_munge_key
* munge_key_manager[manage_munge_key] action setup_munge_key
[2024-08-27T14:25:47+00:00] INFO: Processing munge_key_manager[manage_munge_key] action setup_munge_key (aws-parallelcluster-slurm::config_munge_key line 73)
(up to date)
* execute[fetch_and_decode_munge_key] action run
[2024-08-27T14:25:47+00:00] INFO: Processing execute[fetch_and_decode_munge_key] action run (aws-parallelcluster-slurm::config_munge_key line 66)
[execute] Fetching munge key from AWS Secrets Manager: arn:aws:secretsmanager:us-east-2:[redacted]:secret:munge-key-blah-blah-blah
Created symlink /etc/systemd/system/multi-user.target.wants/munge.service → /usr/lib/systemd/system/munge.service.
Restarting munge service
Job for munge.service failed because the control process exited with error code.
See "systemctl status munge.service" and "journalctl -xeu munge.service" for details.
================================================================================
Error executing action `run` on resource 'execute[fetch_and_decode_munge_key]'
================================================================================
Mixlib::ShellOut::ShellCommandFailed
------------------------------------
Expected process to exit with [0], but received '1'
---- Begin output of //opt/parallelcluster/scripts/slurm/update_munge_key.sh -d ----
STDOUT: Fetching munge key from AWS Secrets Manager: arn:aws:secretsmanager:us-east-2:[redacted]:secret:munge-key-blah-blah-blah
Restarting munge service
STDERR: Created symlink /etc/systemd/system/multi-user.target.wants/munge.service → /usr/lib/systemd/system/munge.service.
Job for munge.service failed because the control process exited with error code.
See "systemctl status munge.service" and "journalctl -xeu munge.service" for details.
---- End output of //opt/parallelcluster/scripts/slurm/update_munge_key.sh -d ----
Ran //opt/parallelcluster/scripts/slurm/update_munge_key.sh -d returned 1
Resource Declaration:
---------------------
# In /etc/chef/local-mode-cache/cache/cookbooks/aws-parallelcluster-slurm/resources/munge_key_manager.rb
66: declare_resource(:execute, 'fetch_and_decode_munge_key') do
67: user 'root'
68: group 'root'
69: command "/#{node['cluster']['scripts_dir']}/slurm/update_munge_key.sh -d"
70: end
71: end
Compiled Resource:
------------------
# Declared in /etc/chef/local-mode-cache/cache/cookbooks/aws-parallelcluster-slurm/resources/munge_key_manager.rb:66:in `fetch_and_decode_munge_key'
execute("fetch_and_decode_munge_key") do
action [:run]
default_guard_interpreter :execute
command "//opt/parallelcluster/scripts/slurm/update_munge_key.sh -d"
declared_type :execute
cookbook_name "aws-parallelcluster-slurm"
recipe_name "config_munge_key"
user "root"
group "root"
end
System Info:
------------
chef_version=18.4.12
platform=rocky
platform_version=9.4
ruby=ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [x86_64-linux]
program_name=/bin/cinc-client
executable=/opt/cinc/bin/cinc-client
More logs:
[2024-08-27T14:25:49+00:00] ERROR: Running exception handlers
- WriteChefError::WriteHeadNodeChefError
And more:
[2024-08-27T14:25:49+00:00] FATAL: Mixlib::ShellOut::ShellCommandFailed: execute[fetch_and_decode_munge_key] (aws-parallelcluster-slurm::config_munge_key line 66) had an error: Mixlib::ShellOut::ShellCommandFailed: Expected process to exit with [0], but received '1'
So to reiterate, this works with Rocky 8 but NOT with Rocky 9.
Hi @rmarable-flaretx,
I was able to resolve the issue on Rocky 9.5 with ParallelCluster 3.11 by adjusting the permissions of the /etc directory.
I created a bash script containing the following command:
sudo chmod 0755 /etc
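A minimal sketch of such a script (the `fix_perms` function name and the target-directory parameter are additions for illustration; in the custom action it simply runs against /etc):

```shell
#!/bin/bash
# Sketch of the permissions fix. Affected Rocky 9 AMIs ship /etc with
# restrictive permissions, so the unprivileged munge user cannot
# traverse /etc to read /etc/munge/munge.key, and munged fails to start.
set -euo pipefail

fix_perms() {
  # 0755 gives world read+execute, letting non-root services traverse
  # the directory again.
  chmod 0755 "$1"
}

# In the OnNodeStart custom action this would be: fix_perms /etc
```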
Then, I updated the pcluster config-file.yml to run this script in both the HeadNode and SlurmQueues sections, via CustomActions → OnNodeStart → Script.
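As a sketch, the relevant config-file.yml fragments look roughly like this (the bucket name, script path, and queue name below are placeholders, not from the thread):

```yaml
# Sketch only: s3://my-bucket/fix-etc-perms.sh and queue1 are placeholders.
HeadNode:
  CustomActions:
    OnNodeStart:
      Script: s3://my-bucket/fix-etc-perms.sh
Scheduling:
  SlurmQueues:
    - Name: queue1
      CustomActions:
        OnNodeStart:
          Script: s3://my-bucket/fix-etc-perms.sh
```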
For reference, here is the documentation: https://github.com/aws/aws-parallelcluster/wiki/(3.9.0%E2%80%90current)-Cluster-creation-fails-on-Rocky-9.4
Hi @JamesDavidson13 — thanks for the feedback!
Changing the permissions on /etc using an OnNodeStart custom action did the trick. Applying the fixes outlined at https://github.com/aws/aws-parallelcluster/wiki/(3.9.0%E2%80%90current)-Cluster-creation-fails-on-Rocky-9.4 resolved this issue.
We are unable to bootstrap a custom Rocky Linux 9.4 AMI using ParallelCluster 3.10.1.
Here is the cfn-init log stream:

From the system-messages log stream:

From the chef-client log:

We can't get into the head node, so unfortunately we are unable to provide the log files referenced above.
For now, we are dropping back to Rocky Linux 8.
Any guidance you can provide would be appreciated.