aws-samples / aws-eda-slurm-cluster

AWS Slurm Cluster for EDA Workloads
MIT No Attribution

Improve documentation for ClusterConfig section #272

Closed gwolski closed 1 week ago

gwolski commented 1 week ago

I'm trying to increase the timeout of the ScaledownIdletime.

I added the following ClusterConfig/SlurmSettings/ScaledownIdletime to my cluster config file:

ParallelClusterConfig:
  Version: 3.11.1
  Architecture: x86_64
  Image:
    Os: rocky8
    CustomAmi: ami-0d68c6538XXXXXXX  # pcluster-3-11-1-Rocky-8-x86-64-ami-0d002XXXXXXXXXX 2024-10-24T03-12-55.412Z
  DisableSimultaneousMultithreading: true
  ClusterConfig:
    SlurmSettings:
      ScaledownIdletime: 20

I've discovered this is the wrong syntax. Your documentation only says to make ClusterConfig a dict. I looked at config_schema.py and there is not much to go on there either. I've tried multiple variations, including:

ClusterConfig:
  Scheduling: 
    SlurmSettings:
      ScaledownIdletime: 20

I just can't figure it out. This latter example at least makes the Python code throw an error.

I have been able to get the simple case of tags to work:

ClusterConfig:
  Tags:
    - Key: Project
      Value: amazing

Can you please add some examples (specifically my need) to your documentation and this issue?

How do I add a section in the config file to change the ScaledownIdletime?

cartalla commented 1 week ago

Your code looks correct. I'm testing it right now.

ClusterConfig:
  Scheduling: 
    SlurmSettings:
      ScaledownIdletime: 20

cartalla commented 1 week ago

I made the change in my configuration. I downloaded the config file that gets generated and confirmed that the setting shows up in the ParallelCluster config file. I updated my cluster, and both the config and ParallelCluster updated successfully. I checked slurm_parallelcluster.conf and confirmed that SuspendTime is set to 1200 seconds, which is 20 minutes, so I think it is working. I was initially a little confused because there is no ScaledownIdletime parameter in slurm.conf; the corresponding Slurm parameter is SuspendTime.
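For anyone who wants to script the same check, here is a minimal Python sketch. It assumes only what is stated above: ScaledownIdletime is given in minutes and shows up as SuspendTime in seconds in slurm_parallelcluster.conf. The function names and the sample text are hypothetical, and the parser only handles a plain numeric `SuspendTime=<seconds>` line.

```python
import re
from typing import Optional


def suspend_time_seconds(conf_text: str) -> Optional[int]:
    """Return the numeric SuspendTime (seconds) found in Slurm config text, or None."""
    match = re.search(
        r"^\s*SuspendTime\s*=\s*(\d+)",
        conf_text,
        flags=re.MULTILINE | re.IGNORECASE,
    )
    return int(match.group(1)) if match else None


def matches_scaledown_idletime(conf_text: str, idletime_minutes: int) -> bool:
    """True if SuspendTime equals the ScaledownIdletime converted from minutes to seconds."""
    return suspend_time_seconds(conf_text) == idletime_minutes * 60


# Hypothetical excerpt; only the SuspendTime value comes from this thread.
sample_conf = "SuspendTime=1200\n"
print(matches_scaledown_idletime(sample_conf, 20))
```

In practice you would read the real file contents into `conf_text` instead of using the sample string.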

gwolski commented 1 week ago

PLBKAC. Solved.

Here is the error I was getting, it happens right after the AMI builds section is output (I've copied and pasted a bit of that here to give you context):

"Rocky": {
    "8": {
        "arm64": {},
        "x86_64": {}
    },
    "9": {
        "arm64": {},
        "x86_64": {}
    }
}

}
Traceback (most recent call last):
  File "/proj/work/gwolski/aws-eda-slurm-cluster-3.11.1/source/app.py", line 31, in <module>
    CdkSlurmStack(app, app.node.try_get_context('stack_name'), env=cdk_env,
  File "/users/gwolski/.local/lib/python3.11/site-packages/jsii/_runtime.py", line 118, in __call__
    inst = super(JSIIMeta, cast(JSIIMeta, cls)).__call__(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/proj/work/gwolski/aws-eda-slurm-cluster-3.11.1/source/cdk/cdk_slurm_stack.py", line 143, in __init__
    self.create_parallel_cluster_config()
  File "/proj/work/gwolski/aws-eda-slurm-cluster-3.11.1/source/cdk/cdk_slurm_stack.py", line 2557, in create_parallel_cluster_config
    self.parallel_cluster_config['Scheduling']['Scheduler'] = 'slurm'
TypeError: list indices must be integers or slices, not str

Subprocess exited with error 1

I tried so many variants that I must have copied and pasted the wrong code into my issue here. Here is the offending code that caused the above error, which I should have realized is wrong.

```
ClusterConfig:
  Scheduling: 
    - SlurmSettings:
        ScaledownIdletime: 20
```

Note the '-' in front of the SlurmSettings.  Argh.   Damn (tired) user.  Never file a ticket when you are tired. Thank you.

I have now used the appropriate code, as you have shown, and I see the correct entry in the YAML file when downloaded with PCUI and also the value SuspendTime=1200 in the slurm_parallelcluster.conf file.  All good.
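For future readers, a small Python sketch of why the stray '-' produces that TypeError. The two dicts are written by hand to mirror what a YAML loader would return for each variant, and `set_scheduler` is a hypothetical stand-in for the assignment shown in the traceback, not the project's actual code.

```python
# Correct form: Scheduling is a mapping, so it loads as a dict.
correct = {"Scheduling": {"SlurmSettings": {"ScaledownIdletime": 20}}}

# With "- SlurmSettings:", Scheduling becomes a one-element list of mappings.
with_dash = {"Scheduling": [{"SlurmSettings": {"ScaledownIdletime": 20}}]}


def set_scheduler(cluster_config: dict) -> dict:
    """Hypothetical stand-in for the assignment seen in the traceback."""
    cluster_config["Scheduling"]["Scheduler"] = "slurm"
    return cluster_config


# Works: a dict accepts a string key.
print(set_scheduler(correct)["Scheduling"]["Scheduler"])

# Fails: a list cannot be indexed with the string 'Scheduler'.
try:
    set_scheduler(with_dash)
except TypeError as err:
    print(f"TypeError: {err}")
```

So if you see "list indices must be integers or slices, not str" at that line, a section under ClusterConfig that should be a mapping has probably been written as a YAML list.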