aws-samples / aws-eda-slurm-cluster

AWS Slurm Cluster for EDA Workloads
MIT No Attribution

[BUG] Can only configure 3 clusters on a submitter host #204

Closed: cartalla closed this issue 1 month ago

cartalla commented 4 months ago

Describe the bug

If you try to configure a submitter host as a login node for a 4th cluster, the stack fails with the following error:

Resource handler returned message: "The maximum number of rules per security group has been reached. (Service: Ec2, Status Code: 400, Request ID: 3ab7f4c4-cdea-4ada-97fe-f650b116f7f1)" (RequestToken: ff9bc132-d76c-50cd-9f8c-41db329f01d1, HandlerErrorCode: ServiceLimitExceeded)

The stack adds rules to the security groups configured in slurm/SubmitterSecurityGroupIds to allow connections to the head node's NFS exports and Slurm controller ports. The head node security group has ingress rules allowing connections from submitters. I suspect that usually there would be only 1 security group for user remote desktops (DCV, VNC, etc.), so I don't anticipate hitting the limit on the head node's ingress rules.

However, multiple security group rules are added to the submitter's security group for each cluster. Right now that limit is exceeded with the 4th configured cluster.
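To make the arithmetic concrete, here is a minimal sketch of how the per-cluster rules use up the quota. The numbers are hypothetical: the quota of 60 assumes the AWS default for rules per security group has not been raised, and the 16 rules per cluster and 4 pre-existing rules are illustrative values, not counts taken from the stack.

```python
def max_clusters(rule_quota: int, rules_per_cluster: int, existing_rules: int = 0) -> int:
    """Return how many clusters fit before the security group runs out of rule slots."""
    if rules_per_cluster <= 0:
        raise ValueError("rules_per_cluster must be positive")
    # Slots left after rules that were already on the submitter security group.
    available = max(rule_quota - existing_rules, 0)
    return available // rules_per_cluster


# Hypothetical values: quota of 60 rules, 16 rules added per cluster,
# 4 rules already present on the submitter security group.
print(max_clusters(rule_quota=60, rules_per_cluster=16, existing_rules=4))  # prints 3
```

With these assumed values, the 4th cluster would need 16 more rules when only 8 slots remain, which matches the failure described above.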

Each cluster also creates a submitter security group that could be attached to the submitter instances, but I think there is a limit of 5 security groups per instance network interface, so using those security groups would still hit a limit after a small number of clusters.

I think the best way to create the required security group rules without hitting limits may be to configure a single submitter security group that is allowed access to the cluster. The same security group could be used for multiple clusters, and each cluster would only be accessible to submitter hosts that belong to the configured security group. However, the submitter security group's rules would need a destination security group in the cluster, and that security group would also have to be passed to the cluster.
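One way to picture the single-security-group idea is the sketch below, which builds the ingress rules a cluster's head node security group would need in order to accept traffic from one shared submitter security group. Everything here is an assumption for illustration: the function name, the security group ID, and the ports (2049 is the standard NFS port and 6817 is Slurm's default slurmctld port, but a real cluster may use others). It is not how the stack currently creates its rules.

```python
from typing import Dict, List, Optional

# Assumed ports: standard NFS port and Slurm's default slurmctld port.
ASSUMED_PORTS = {"nfs": 2049, "slurmctld": 6817}


def head_node_ingress_permissions(
    submitter_sg_id: str, ports: Optional[Dict[str, int]] = None
) -> List[dict]:
    """Build ingress rules that admit traffic from one shared submitter security group.

    The rules reference the submitter security group as the source, so they
    live on each cluster's head node security group rather than on the
    submitter security group itself.
    """
    ports = ASSUMED_PORTS if ports is None else ports
    return [
        {
            "IpProtocol": "tcp",
            "FromPort": port,
            "ToPort": port,
            "UserIdGroupPairs": [
                {"GroupId": submitter_sg_id, "Description": f"{name} from submitters"}
            ],
        }
        for name, port in sorted(ports.items())
    ]


# Hypothetical security group ID, used only to show the shape of the result.
for rule in head_node_ingress_permissions("sg-0123456789abcdef0"):
    print(rule["FromPort"], rule["UserIdGroupPairs"][0]["GroupId"])
```

The returned list has the shape of the `IpPermissions` parameter that boto3's EC2 `authorize_security_group_ingress` call accepts. Because each cluster carries its own copy of these rules, the shared submitter security group gains no ingress rules per cluster. If the submitter security group restricts egress, it would still need egress rules to a destination security group, which is the open question raised above.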

Need to think about this a bit to figure out a solution that is relatively easy to implement and use.