Open michaelmayer2 opened 8 months ago
The reason we have `OverSubscribe` in the deny list is that it is mutually exclusive with the `JobExclusiveAllocation` parameter.
If you really want to use `OverSubscribe`, make sure you don't have `JobExclusiveAllocation` and suppress the validator when creating the cluster.
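
For readers unfamiliar with the validator, here is a minimal sketch of how a per-level deny-list check of this kind might behave. It is not the actual `pcluster` code: the `DENY_LIST` layout and the `denied_settings` helper are hypothetical, and the only detail taken from this thread is that `oversubscribe` is denied at the `Queue` level. The case-insensitive match is also an assumption.

```python
# Hypothetical, simplified deny list (NOT the real pcluster data structure).
# Only the "oversubscribe" entry at Queue level comes from this thread.
DENY_LIST = {
    "Queue": {"oversubscribe"},
    "ComputeResources": set(),
}


def denied_settings(custom_settings, level):
    """Return the keys in custom_settings that are denied at the given level.

    Keys are compared case-insensitively (an assumption of this sketch).
    """
    denied = DENY_LIST.get(level, set())
    return sorted(key for key in custom_settings if key.lower() in denied)


settings = {"OverSubscribe": "FORCE:2"}
print(denied_settings(settings, "Queue"))             # ['OverSubscribe']
print(denied_settings(settings, "ComputeResources"))  # []
```

With a check shaped like this, the same setting is rejected on a queue but passes unnoticed on a compute resource, because the deny list is keyed by level.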
Fair enough - happy to use the `--validation-failure-level WARNING` workaround for the time being.
It still would be great to add `OverSubscribe` on the `Queue`/`Partition` level though - adding `OverSubscribe` there will provide a non-functional SLURM cluster, even with validators as default.
When trying to use the `OverSubscribe` parameter via `CustomSlurmSettings` at the `Queue` level, the `pcluster` command produces a validation error.

If setting the same at the `ComputeResources` level, the validator is ok with it, but evidently this then leads to a broken SLURM configuration, given that the `OVERSUBSCRIBE` parameter is only allowed at the `Queue`/`Partition` level.

In the `pcluster` source code we are associating `OverSubscribe` with the `Queue` level but put it on the `DENY_LIST`: https://github.com/aws/aws-parallelcluster/blob/develop/cli/src/pcluster/validators/slurm_settings_validator.py#L50

If removing `oversubscribe` from the `DENY_LIST`, `pcluster` will create a valid SLURM configuration, e.g. using this partial YAML code