aws-samples / aws-eda-slurm-cluster

AWS Slurm Cluster for EDA Workloads
MIT No Attribution

[BUG] slurm_zfs.yml doesn't work #203

Closed kaisenl closed 3 months ago

kaisenl commented 5 months ago

Describe the bug

I'm trying to run install.sh with slurm_zfs.yml, and it fails with KeyError: 'ParallelClusterConfig' while the config schema is being checked.

To Reproduce

[ec2-user@ip aws-eda-slurm-cluster]$ ./install.sh --config-file=slurm_zfs.yml
~/aws-eda-slurm-cluster ~/aws-eda-slurm-cluster
Using python 3.7.16
Using nodejs version 16.20.2
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!                                                                            !!
!!  Node 16 has reached end-of-life on 2023-09-11 and is not supported.       !!
!!  Please upgrade to a supported node version as soon as possible.           !!
!!                                                                            !!
!!  This software is currently running on node v16.20.2.                      !!
!!  As of the current release of this software, supported node releases are:  !!
!!  - ^20.0.0 (Planned end-of-life: 2026-04-30)                               !!
!!  - ^18.0.0 (Planned end-of-life: 2025-04-30)                               !!
!!                                                                            !!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Using CDK 2.91.0
make: `.requirements_installed' is up to date.
~/aws-eda-slurm-cluster
INFO:
Working directory: /home/ec2-user/aws-eda-slurm-cluster/source
INFO:
====== Validating AWS environment ======

INFO: Using config: /home/ec2-user/aws-eda-slurm-cluster/source/resources/config/slurm_zfs.yml
/home/ec2-user/aws-eda-slurm-cluster/source/.venv/lib64/python3.7/site-packages/boto3/compat.py:82: PythonDeprecationWarning: Boto3 will no longer support Python 3.7 starting December 13, 2023. To continue receiving service updates, bug fixes, and security updates please upgrade to Python 3.8 or later. More information can be found here: https://aws.amazon.com/blogs/developer/python-support-policy-updates-for-aws-sdks-and-tools/
  warnings.warn(warning, PythonDeprecationWarning)
Traceback (most recent call last):
  File "source/installer.py", line 31, in <module>
    app.main()
  File "/home/ec2-user/aws-eda-slurm-cluster/source/slurm_installer/installer.py", line 114, in main
    self.config = self.get_config(args.config_file)
  File "/home/ec2-user/aws-eda-slurm-cluster/source/slurm_installer/installer.py", line 449, in get_config
    validated_config = check_schema(config_parameters)
  File "/home/ec2-user/aws-eda-slurm-cluster/source/cdk/config_schema.py", line 550, in check_schema
    config_schema = get_config_schema(config_in)
  File "/home/ec2-user/aws-eda-slurm-cluster/source/cdk/config_schema.py", line 444, in get_config_schema
    Optional('SlurmRestApiVersion', default=get_slurm_rest_api_version(config)): str,
  File "/home/ec2-user/aws-eda-slurm-cluster/source/cdk/config_schema.py", line 146, in get_slurm_rest_api_version
    slurm_version = get_PC_SLURM_VERSION(config)
  File "/home/ec2-user/aws-eda-slurm-cluster/source/cdk/config_schema.py", line 142, in get_PC_SLURM_VERSION
    parallel_cluster_version = get_parallel_cluster_version(config)
  File "/home/ec2-user/aws-eda-slurm-cluster/source/cdk/config_schema.py", line 127, in get_parallel_cluster_version
    return config['slurm']['ParallelClusterConfig'].get('Version', str(DEFAULT_PARALLEL_CLUSTER_VERSION))
KeyError: 'ParallelClusterConfig'

Expected behavior

The installer accepts the config file and continues without raising an error.

Repository Version

9d3c16faaad48d93f12ad131c367a31dbb4ddac2

Additional context

Running on an Amazon Linux 2 instance.

cartalla commented 5 months ago

Please use default_config.yml as your starting point. I'll clean up the old config file that worked with v1.
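
For anyone hitting the same traceback: the failing line indexes config['slurm']['ParallelClusterConfig'] directly, so a config that has a slurm section but no ParallelClusterConfig section raises KeyError while the schema is still being built. Below is a minimal sketch of that failure and of a friendlier check. The first function reuses the line shown in the traceback; the default version value, the `get_parallel_cluster_version_checked` helper, and the sample dictionaries are hypothetical illustrations, not code or data from the repository.

```python
# Placeholder value for illustration only; the real default lives in
# source/cdk/config_schema.py and is not shown in this issue.
DEFAULT_PARALLEL_CLUSTER_VERSION = "0.0.0"


def get_parallel_cluster_version(config):
    # Same lookup as the line in the traceback: raises KeyError when the
    # 'ParallelClusterConfig' section is absent.
    return config['slurm']['ParallelClusterConfig'].get(
        'Version', str(DEFAULT_PARALLEL_CLUSTER_VERSION))


def get_parallel_cluster_version_checked(config):
    # Hypothetical variant: report the missing section with a readable message
    # instead of a bare KeyError.
    slurm = config.get('slurm', {})
    if 'ParallelClusterConfig' not in slurm:
        raise ValueError(
            "Config has no slurm/ParallelClusterConfig section; "
            "it may be a v1-style config. Start from default_config.yml.")
    return slurm['ParallelClusterConfig'].get(
        'Version', str(DEFAULT_PARALLEL_CLUSTER_VERSION))


# Stand-ins for parsed YAML: an old-style config without the section,
# and a current-style config that has it (the version string is made up).
legacy_config = {'slurm': {}}
current_config = {'slurm': {'ParallelClusterConfig': {'Version': '1.2.3'}}}

try:
    get_parallel_cluster_version(legacy_config)
except KeyError as err:
    print(f"KeyError: {err}")

try:
    get_parallel_cluster_version_checked(legacy_config)
except ValueError as err:
    print(f"ValueError: {err}")

print(get_parallel_cluster_version_checked(current_config))
```

Running a quick check like this against a parsed config file before calling install.sh is one way to tell an old config apart from one based on default_config.yml.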

kaisenl commented 5 months ago

Thanks! I've already got the minimal default_config.yml working, so I'll wait for the updated configs before filling in the rest of our resources, such as FSx and RDS.