aws / aws-parallelcluster

AWS ParallelCluster is an AWS supported Open Source cluster management tool to deploy and manage HPC clusters in the AWS cloud.
https://github.com/aws/aws-parallelcluster
Apache License 2.0

clustermgtd fails when I configure a custom partition for the Master node #2373

Open vbosquier opened 3 years ago

vbosquier commented 3 years ago

Hi ParallelCluster Dev Team!

With previous versions of ParallelCluster (up to 2.8.1), we used to configure an additional partition in Slurm for our Master node.

We are currently moving to PC 2.10.1 with its support for multiple compute queues.

After we add the custom NodeName and Partition in a dedicated file that we include at the end of slurm.conf, we get the following errors:

"2021-01-20 15:13:42,304 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving nodes info from the scheduler 2021-01-20 15:13:43,372 - [slurm_plugin.clustermgtd:_get_node_info_from_partition] - ERROR - Failed when getting partition/node states from scheduler with exception 2021-01-20 15:13:43,373 - [slurm_plugin.clustermgtd:manage_cluster] - ERROR - Unable to get partition/node info from slurm, no other action can be performed. Sleeping..."

After investigation, the check fails because of the Master node. As soon as we remove the Master node and custom partition from the configuration and restart the slurm daemons, the heartbeat works fine again...
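
For reference, the dedicated file we include is along these lines (the node name, CPU count, memory and partition name below are simplified placeholders):

  NodeName=master-node CPUs=4 RealMemory=7000 State=UNKNOWN
  PartitionName=master Nodes=master-node Default=NO MaxTime=INFINITE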

Can you help us find the appropriate way to have the Master Node in a custom partition AND maintain successful clustermgtd checks on the compute partitions?

Best regards, Vincent.

demartinofra commented 3 years ago

Hi Vincent,

unfortunately there is no way to do it with the current version of clustermgtd. We would need to expose an option in the clustermgtd config file in order to exclude some nodes from the daemon's management operations.

If you want to go ahead and patch clustermgtd so that you can manually add additional nodes to Slurm, the required code change should be minimal. From a very quick look, it seems that removing the head node from the active_nodes and inactive_nodes lists initialized here should be enough: https://github.com/aws/aws-parallelcluster-node/blob/develop/src/slurm_plugin/clustermgtd.py#L397.

chambm commented 3 years ago

As long as I set the master's node name to the partition-[st/dy]-instancetype-ordinal format, I was able to add my master node to slurm.conf. In my case, that's master-dy-t3large-1 (st didn't work for me, even though the node isn't actually dynamic). I also had to add this name as a valid alias in /etc/hosts and enable slurmd on the master, and after that it worked as expected. Tested on 2.10.3.

Here are my postinstall actions relevant to this:


  # generate a clustermgtd-compatible node name
  master_type=$(curl http://169.254.169.254/latest/meta-data/instance-type)
  master_name=$(echo "master-dy-$master_type-1" | tr -d '.')
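  # advertise only ~75% of physical memory to Slurm, leaving headroom for the OS and head node daemons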
  master_memory=$(awk '/MemTotal/ { printf "%d \n", $2/1024 * 0.75 }' /proc/meminfo)

  # do not use suspend/resume scripts for master node
  echo "SuspendExcNodes=$master_name" >> /opt/slurm/etc/slurm.conf

  # add definition for master node
  echo "NodeName=$master_name CPUs=2 RealMemory=$master_memory State=UNKNOWN Feature=local,$master_type" >> /opt/slurm/etc/slurm.conf

  # add partition for master node
  echo "PartitionName=master Nodes=$master_name MaxTime=INFINITE DefMemPerCPU=2048 Default=NO" >> /opt/slurm/etc/slurm.conf

  # append the node name as an alias on the last line of /etc/hosts
  sed -i "$ s/$/ $master_name/" /etc/hosts

  # enable slurmd service
  cp /etc/chef/cookbooks/aws-parallelcluster/files/default/slurmd.service /etc/systemd/system
  service slurmd start
  service slurmctld restart
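
After restarting the daemons, a quick check on the head node should show the new partition and node (using the name generated above), for example:

  sinfo -p master
  scontrol show node master-dy-t3large-1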

It would be good though to have a simple option in the pcluster config to include the master node as its own partition.

rexcsn commented 3 years ago

Hi @chambm ,

Thank you for suggesting the workaround. However, I would like to point out that this modification is NOT safe.

clustermgtd will attempt to replace any node configured in Slurm that is in a problematic state, so if the head node somehow becomes DOWN in Slurm, or another problematic scenario happens, the head node instance will likely be terminated, and this would break the cluster.

We would suggest not configuring the head node in Slurm at this time. If you do apply a workaround, please make sure that the head node is excluded from clustermgtd actions, so it cannot be terminated by the daemon. We will continue to evaluate this feature request.
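
As a minimal sanity check for any such workaround, keep an eye on the state Slurm reports for that node, for example (node name as in the example above):

  scontrol show node master-dy-t3large-1 | grep -iE 'State|Reason'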

Thank you!

chambm commented 3 years ago

Hmm. Several times while I was working on getting the head node to work in Slurm, it was getting set to DOWN by clustermgtd, but it was never terminated. I definitely would have remembered that. :) But I haven't had it set to DOWN since I got it working. Did you test this workaround and have it terminate the head node?

rexcsn commented 3 years ago

Hi @chambm ,

Sorry, I was mistaken: the current logic in clustermgtd only retrieves compute instances, so the head node instance cannot be terminated by clustermgtd. However, because of this, if the NodeAddr of the head node configured in Slurm ever points to the real head node instance, clustermgtd will set the node to DOWN, because it cannot retrieve the head node instance and thinks that there is no actual instance backing the head node configured in Slurm.
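
Note that when NodeAddr is not set explicitly, Slurm falls back to the NodeName, so you can see what address the workaround's /etc/hosts alias actually resolves to with something like (node name as in the example above):

  getent hosts master-dy-t3large-1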

There are still some issues with the workaround that I would like to call out:

We would like to point out again that configuring the head node in Slurm is not a supported path, and a workaround will most likely require some custom changes to the node package logic.

Thank you!