aws-samples / aws-eda-slurm-cluster

AWS Slurm Cluster for EDA Workloads
MIT No Attribution

[BUG] After cluster update include paths not fixed for submitter node use #161

Closed cartalla closed 9 months ago

cartalla commented 10 months ago

Describe the bug

After an update to change instance types, I get the following error from Slurm commands:

squeue: error: s_p_parse_file: cannot stat file /opt/slurm/etc/pcluster/slurm_parallelcluster_od-2-gb_partition.conf: No such file or directory, retrying in 1sec up to 60sec
squeue: error: "Include" failed in file /opt/slurm/edapc-3-7-0-c7-x86-1/etc/slurm_parallelcluster.conf line 16

The path was supposed to be updated to /opt/slurm/ClusterName/etc so that it works on both the controller and submitter instances.
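
For illustration, the intended rewrite of the include paths can be sketched as below. This is a minimal sketch and not the repository's actual fix: the function name `rewrite_include_paths` is hypothetical, and it assumes the only lines that need changing are `include` lines that begin with /opt/slurm/etc/.

```python
import re

# Matches Slurm "include" lines that still point at the controller-only path.
# Slurm keywords are case-insensitive, so the match is case-insensitive too.
_INCLUDE_RE = re.compile(r"^(\s*include\s+)/opt/slurm/etc/", re.IGNORECASE | re.MULTILINE)


def rewrite_include_paths(conf_text, cluster_name):
    """Return conf_text with include paths moved under /opt/slurm/<cluster_name>/etc/.

    Hypothetical helper: lines that already use the cluster-specific path do
    not match the pattern, so applying the function twice is harmless.
    """
    # A function replacement avoids any backslash handling in cluster_name.
    return _INCLUDE_RE.sub(
        lambda m: "{}/opt/slurm/{}/etc/".format(m.group(1), cluster_name),
        conf_text,
    )
```

Applied to the text of slurm_parallelcluster.conf with the cluster name edapc-3-7-0-c7-x86-1, the failing include from the error above would become /opt/slurm/edapc-3-7-0-c7-x86-1/etc/pcluster/slurm_parallelcluster_od-2-gb_partition.conf.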

Expected behavior

Slurm commands work.

cartalla commented 10 months ago

I can see that on_head_node_updated.sh ran by looking in /var/log/cfn-init-cmd.log. The following Ansible play failed:

2023-09-27 10:43:15,899 P17476 [INFO]   fatal: [local]: FAILED! => changed=true 
2023-09-27 10:43:15,899 P17476 [INFO]     cmd: |-
2023-09-27 10:43:15,899 P17476 [INFO]       set -ex
2023-09-27 10:43:15,899 P17476 [INFO]     
2023-09-27 10:43:15,899 P17476 [INFO]       /opt/slurm/config/bin//create_users_groups.py -i /opt/slurm/config/users_groups.json
2023-09-27 10:43:15,899 P17476 [INFO]     delta: '0:00:00.038857'
2023-09-27 10:43:15,899 P17476 [INFO]     end: '2023-09-27 10:43:15.566711'
2023-09-27 10:43:15,899 P17476 [INFO]     msg: non-zero return code
2023-09-27 10:43:15,899 P17476 [INFO]     rc: 1
2023-09-27 10:43:15,899 P17476 [INFO]     start: '2023-09-27 10:43:15.527854'
2023-09-27 10:43:15,899 P17476 [INFO]     stderr: |-
2023-09-27 10:43:15,899 P17476 [INFO]       + /opt/slurm/config/bin//create_users_groups.py -i /opt/slurm/config/users_groups.json
2023-09-27 10:43:15,899 P17476 [INFO]       Traceback (most recent call last):
2023-09-27 10:43:15,899 P17476 [INFO]         File "/opt/slurm/config/bin//create_users_groups.py", line 109, in <module>
2023-09-27 10:43:15,899 P17476 [INFO]           main(args.filename)
2023-09-27 10:43:15,899 P17476 [INFO]         File "/opt/slurm/config/bin//create_users_groups.py", line 49, in main
2023-09-27 10:43:15,899 P17476 [INFO]           subprocess.check_output(['groupadd', '-g', gid, group_name], stderr=subprocess.STDOUT)
2023-09-27 10:43:15,900 P17476 [INFO]         File "/usr/lib64/python3.6/subprocess.py", line 356, in check_output
2023-09-27 10:43:15,900 P17476 [INFO]           **kwargs).stdout
2023-09-27 10:43:15,900 P17476 [INFO]         File "/usr/lib64/python3.6/subprocess.py", line 423, in run
2023-09-27 10:43:15,900 P17476 [INFO]           with Popen(*popenargs, **kwargs) as process:
2023-09-27 10:43:15,900 P17476 [INFO]         File "/usr/lib64/python3.6/subprocess.py", line 729, in __init__
2023-09-27 10:43:15,900 P17476 [INFO]           restore_signals, start_new_session)
2023-09-27 10:43:15,900 P17476 [INFO]         File "/usr/lib64/python3.6/subprocess.py", line 1364, in _execute_child
2023-09-27 10:43:15,900 P17476 [INFO]           raise child_exception_type(errno_num, err_msg, err_filename)
2023-09-27 10:43:15,900 P17476 [INFO]       FileNotFoundError: [Errno 2] No such file or directory: 'groupadd': 'groupadd'
2023-09-27 10:43:15,900 P17476 [INFO]     stderr_lines: <omitted>
2023-09-27 10:43:15,900 P17476 [INFO]     stdout: ''
2023-09-27 10:43:15,900 P17476 [INFO]     stdout_lines: <omitted>
2023-09-27 10:43:15,900 P17476 [INFO]   
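
The traceback shows that `subprocess` could not find `groupadd` at all, which is different from `groupadd` running and failing. One plausible cause, which is an assumption and not confirmed in this thread, is that the PATH seen by the script during the update does not include the sbin directories. A minimal sketch of a more defensive lookup, using a hypothetical helper `find_executable` that is not part of create_users_groups.py:

```python
import os
import shutil

# Directories where admin tools such as groupadd usually live on Linux.
# Assumption: these are the right fallbacks for the head node's OS.
SBIN_DIRS = ("/usr/sbin", "/sbin", "/usr/local/sbin")


def find_executable(name, extra_dirs=SBIN_DIRS):
    """Return the absolute path of name, searching PATH and then extra_dirs."""
    # Drop empty PATH entries so the current directory is never searched.
    path_dirs = [d for d in os.environ.get("PATH", "").split(os.pathsep) if d]
    search_path = os.pathsep.join(path_dirs + list(extra_dirs))
    resolved = shutil.which(name, path=search_path)
    if resolved is None:
        raise FileNotFoundError("{} not found in {}".format(name, search_path))
    return resolved
```

With this, the failing call could be written as `subprocess.check_output([find_executable('groupadd'), '-g', gid, group_name], stderr=subprocess.STDOUT)`. A missing binary then produces an error that names the directories that were searched.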

cartalla commented 10 months ago

After resolving this error, I got the following:

2023-09-27 11:36:11,276 P26183 [INFO]   fatal: [local]: FAILED! => changed=true 
2023-09-27 11:36:11,277 P26183 [INFO]     cmd: ifconfig eth0 txqueuelen 4096
2023-09-27 11:36:11,277 P26183 [INFO]     delta: '0:00:00.002916'
2023-09-27 11:36:11,277 P26183 [INFO]     end: '2023-09-27 11:36:10.936460'
2023-09-27 11:36:11,277 P26183 [INFO]     msg: non-zero return code
2023-09-27 11:36:11,277 P26183 [INFO]     rc: 127
2023-09-27 11:36:11,277 P26183 [INFO]     start: '2023-09-27 11:36:10.933544'
2023-09-27 11:36:11,277 P26183 [INFO]     stderr: '/bin/sh: ifconfig: command not found'
2023-09-27 11:36:11,277 P26183 [INFO]     stderr_lines: <omitted>
2023-09-27 11:36:11,277 P26183 [INFO]     stdout: ''
2023-09-27 11:36:11,277 P26183 [INFO]     stdout_lines: <omitted>
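
Return code 127 means the shell could not find `ifconfig`: either net-tools is not installed or its directory is not on PATH. Which of the two applies here is not stated in this thread. The iproute2 equivalent of the failing command is `ip link set dev eth0 txqueuelen 4096`. A minimal sketch that prefers `ip` and falls back to `ifconfig`; the helper name `txqueuelen_command` is hypothetical, and the real task is an Ansible play, not Python:

```python
import shutil


def txqueuelen_command(interface, length, which=shutil.which):
    """Build the argv that sets the transmit queue length on interface.

    The which parameter is injectable so the choice can be exercised
    without depending on what is installed on the current machine.
    """
    if which("ip"):
        # iproute2 syntax: ip link set dev <interface> txqueuelen <length>
        return ["ip", "link", "set", "dev", interface, "txqueuelen", str(length)]
    if which("ifconfig"):
        # Legacy net-tools syntax, as used by the failing task.
        return ["ifconfig", interface, "txqueuelen", str(length)]
    raise FileNotFoundError("neither ip nor ifconfig was found on PATH")
```

The returned list can be passed to `subprocess.check_call` by a process with sufficient privileges, for example `subprocess.check_call(txqueuelen_command('eth0', 4096))`.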