aws / aws-parallelcluster

AWS ParallelCluster is an AWS supported Open Source cluster management tool to deploy and manage HPC clusters in the AWS cloud.
https://github.com/aws/aws-parallelcluster
Apache License 2.0

OverSubscribe in ParallelCluster 3.8.0 does not work #6082

Open michaelmayer2 opened 8 months ago

michaelmayer2 commented 8 months ago

When trying to use the OverSubscribe parameter via CustomSlurmSettings at the Queue level, the pcluster command produces a validation error:

  "configurationValidationErrors": [
    {
      "level": "ERROR",
      "type": "CustomSlurmSettingsValidator",
      "message": "Using the following custom Slurm settings at Queue level is not allowed: OverSubscribe"
    },

If the same setting is placed at the ComputeResources level, the validator accepts it, but this leads to a broken SLURM configuration, because the OverSubscribe parameter is only allowed at the Queue/Partition level.

In the pcluster source code, OverSubscribe is associated with the Queue level but is put on the DENY_LIST: https://github.com/aws/aws-parallelcluster/blob/develop/cli/src/pcluster/validators/slurm_settings_validator.py#L50
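To illustrate the behaviour described above, here is a minimal sketch of a per-level deny-list check. It is not the actual pcluster implementation: the `DENY_LIST` contents and the `denied_settings` helper are hypothetical, and the only entry taken from this issue is that `oversubscribe` is denied at the Queue level.

```python
# Hypothetical sketch of a per-level deny-list check. The structure, names and
# list contents are illustrative assumptions, not the actual pcluster source.
DENY_LIST = {
    "Queue": {"oversubscribe"},   # reported in this issue as denied at Queue level
    "ComputeResource": set(),     # assumption: nothing here blocks OverSubscribe
}


def denied_settings(custom_settings, level):
    """Return the custom Slurm setting names that are denied at the given level."""
    denied = DENY_LIST[level]
    # Slurm parameter names are case-insensitive, so compare in lower case.
    return sorted(name for name in custom_settings if name.lower() in denied)


settings = {"OverSubscribe": "FORCE"}
print(denied_settings(settings, "Queue"))            # rejected at Queue level
print(denied_settings(settings, "ComputeResource"))  # accepted at ComputeResources level
```

With a layout like this, the same setting is rejected where Slurm would accept it and accepted where Slurm would not, which matches the symptom reported here.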

If oversubscribe is removed from the DENY_LIST, pcluster creates a valid SLURM configuration, e.g. using this partial YAML code:

....
  SlurmQueues: 
    - Name: interactive 
      ComputeResources:
        - Name: interactive 
          InstanceType: t2.xlarge 
          MaxCount: 20
          MinCount: 1
          Efa:
            Enabled: FALSE
      CustomSlurmSettings:
        OverSubscribe: FORCE
...

hgreebe commented 8 months ago

The reason we have OverSubscribe in the deny list is that it is mutually exclusive with the JobExclusiveAllocation parameter.

If you really want to use OverSubscribe, make sure you don't have JobExclusiveAllocation and suppress the validator when creating the cluster.
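The precondition above can be checked before suppressing the validator. The following is only a rough sketch: `oversubscribe_conflicts` is a hypothetical helper, and it assumes each queue is a plain dict parsed from the cluster YAML, with `JobExclusiveAllocation` and `CustomSlurmSettings` as queue-level keys.

```python
def oversubscribe_conflicts(queue):
    """Return True if a queue sets both JobExclusiveAllocation and OverSubscribe.

    `queue` is assumed to be one entry of SlurmQueues, already parsed into a dict.
    """
    exclusive = bool(queue.get("JobExclusiveAllocation", False))
    custom = queue.get("CustomSlurmSettings") or {}
    # Slurm parameter names are case-insensitive, so match in lower case.
    has_oversubscribe = any(key.lower() == "oversubscribe" for key in custom)
    return exclusive and has_oversubscribe


queue = {
    "Name": "interactive",
    "CustomSlurmSettings": {"OverSubscribe": "FORCE"},
}
print(oversubscribe_conflicts(queue))  # False: JobExclusiveAllocation is not set
```

If the function returns True for any queue, one of the two settings should be dropped from that queue before creating the cluster.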

michaelmayer2 commented 7 months ago

Fair enough - happy to use the --validation-failure-level WARNING workaround for the time being.

It would still be great to support OverSubscribe at the Queue/Partition level, though - as it stands, setting OverSubscribe at the ComputeResources level produces a non-functional SLURM cluster, even with the validators at their defaults.