aws-samples / aws-eda-slurm-cluster

AWS Slurm Cluster for EDA Workloads
MIT No Attribution

203 bug slurm zfsyml doesnt work #214

Closed cartalla closed 3 months ago

cartalla commented 4 months ago

Update config files and fix errors found in testing new configs

Add --RESEnvironmentName to the installer

Ease initial integration with Research and Engineering Studio (RES).

Automatically add the correct submitter security groups and configure the /home directory.

Automatically choose the subnets if not specified based on RES subnets.

Resolves #207

============================

Update template config files

Added more comments to clarify that these are examples that should be copied and customized by users.

Added comments for typical configuration options.

Deleted obsolete configs that were from v1.

Resolves #203

=============================

Set default head node instance type based on architecture.

Resolves #206
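
A minimal sketch of how an architecture-based default could be selected. The function name, the mapping, and the instance types in it are illustrative assumptions, not the values this PR uses.

```python
# Hypothetical defaults for illustration only; the PR's actual values are not shown here.
DEFAULT_HEAD_NODE_INSTANCE_TYPES = {
    "x86_64": "c6a.large",
    "arm64": "c6g.large",
}


def default_head_node_instance_type(architecture, configured=None):
    """Return the configured instance type, or a per-architecture default."""
    if configured:
        # An explicit setting in the config always wins over the default.
        return configured
    try:
        return DEFAULT_HEAD_NODE_INSTANCE_TYPES[architecture]
    except KeyError:
        raise ValueError(f"Unsupported architecture: {architecture!r}") from None
```

Keeping the defaults in a single mapping means that supporting another architecture only requires one new entry.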

==============================

Clean up ansible-lint errors and warnings. The Arm architecture cluster was failing because of an incorrect condition in the Ansible playbook, which lint flags.

==============================

Use the VDI controller instead of the cluster manager for user and group info

Cluster manager stopped being domain joined for some reason.

==============================

Paginate describe_instances when creating the head node A record.

Otherwise, the lookup may not find the cluster head node instance.
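
A minimal sketch of the pagination pattern, assuming a boto3 EC2 client is passed in. The helper name `find_instances` is hypothetical and is not taken from the repository. A single `describe_instances` call can return a partial result set, so the paginator is used to follow every page.

```python
def find_instances(ec2_client, filters):
    """Return every instance matching the filters, following all result pages."""
    paginator = ec2_client.get_paginator("describe_instances")
    instances = []
    for page in paginator.paginate(Filters=filters):
        # Each page holds reservations, and each reservation holds instances.
        for reservation in page.get("Reservations", []):
            instances.extend(reservation.get("Instances", []))
    return instances
```

It could be called as `find_instances(boto3.client("ec2"), [{"Name": "instance-state-name", "Values": ["running"]}])`. The filter here is only an example; the filter the lambda actually uses is not shown in this description.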

==============================

Add default MungeKeySecret.

This should be the default or you can't access multiple clusters from the same server.

==============================

Increase timeout for ssm command that configures submitters

The command needs the extra time to compile Slurm.

==============================

Force Slurm to be rebuilt for submitters of all OS distributions, even if they match the OS of the cluster.

Otherwise, submitters get errors because Slurm can't find PluginDir in the same location as when it was compiled.

==============================

Paginate describe_instances in UpdateHeadNode lambda

==============================

Add check for min memory of 4 GB for slurm controller
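
A minimal sketch of such a check, assuming the instance memory is available in MiB (the EC2 DescribeInstanceTypes API reports it as `MemoryInfo.SizeInMiB`). The function and constant names are hypothetical.

```python
MIN_CONTROLLER_MEMORY_GIB = 4  # minimum stated in this PR


def check_controller_memory(instance_type, memory_mib):
    """Raise ValueError if the instance has less memory than the controller needs."""
    memory_gib = memory_mib / 1024
    if memory_gib < MIN_CONTROLLER_MEMORY_GIB:
        raise ValueError(
            f"{instance_type} has {memory_gib:g} GiB of memory; the Slurm controller "
            f"requires at least {MIN_CONTROLLER_MEMORY_GIB} GiB."
        )
    return memory_gib
```

Failing at configuration time gives an error that names the instance type, rather than a harder-to-diagnose failure after deployment.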

==============================

Sync EC2InstanceTypeInfo.py with hpc-cost-simulator.

==============================

Update documentation.

Remove Regions from InstanceConfig. This was left over from the legacy cluster; ParallelCluster doesn't support multiple regions.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.