aws-samples / aws-eda-slurm-cluster

AWS Slurm Cluster for EDA Workloads
MIT No Attribution

Documentation corrections required on deploy-parallel-cluster documentation page #222

Closed gwolski closed 1 month ago

gwolski commented 2 months ago

On the page

https://aws-samples.github.io/aws-eda-slurm-cluster/deploy-parallel-cluster/

Three issues that I ran into:

1) The Create users_groups.json section has a duplicate of the table used later in "Configure submission hosts to use the cluster". It doesn't belong here.

2) The Description for the Config Stack Output states that Command01SubmitterMountHeadNode "adds it to /etc/fstab". It does not. The command just mounts the file system:

head_ip=head_node.<cluster_name>.pcluster && sudo mkdir -p /opt/slurm/ && sudo mount $head_ip:/opt/slurm /opt/slurm/

(I've replaced my clusterName with <cluster_name>.)
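A quick way to see that this command only creates a transient mount is to check the mount table and /etc/fstab right after running it (illustrative commands on the submission host, not steps from the project docs):

# the active NFS mount created by the command shows up here
mount | grep /opt/slurm
# expected to print nothing at this point, since the command does not persist the mount
grep /opt/slurm /etc/fstab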

3) After I have run the ansible playbook, I tried to load the module as specified in "Run Your First Job". This did not work:

$ module load <cluster_name>
ERROR: Unable to locate a modulefile for '<cluster_name>'

I had to logout and log back in to get my environment set correctly to allow the module to be loaded and do its magic.
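One way to confirm it is just a stale shell environment rather than a missing modulefile (standard Environment Modules/Lmod commands, not steps from the project docs) is to check what the current shell can actually see before logging out:

# list the modulefiles visible to this shell
module avail
# show the directories the module command is searching
echo $MODULEPATH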

gwolski commented 2 months ago

For item 2 above, I have found that running the ansible playbook adds the mount to my /etc/fstab. So you just have the comment in the wrong Description section...
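For reference, a persistent NFS mount of the head node's /opt/slurm export corresponds to an /etc/fstab line along these lines (an illustrative entry; the exact options the playbook writes may differ):

head_node.<cluster_name>.pcluster:/opt/slurm  /opt/slurm  nfs  defaults  0  0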

cartalla commented 1 month ago

The tables aren't duplicates. Only the first item is the same. However, the name of the command is misleading. So I renamed it from Command01_SubmitterMountHeadNode to Command01_MountHeadNodeNfs.

I updated the description for the 1st and 2nd commands in both tables. The /etc/fstab update occurs in the 2nd step when the ansible playbook is run.

When the new modulefile is created, you need to start a new shell to refresh your environment or source your shell config again.
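For example, either of the following picks up the new modulefile without a full logout (illustrative commands, not from the project docs; the exact profile file depends on how the playbook configures the environment):

# replace the current shell with a fresh login shell
exec bash -l
# or re-source the login profile in place
source ~/.bash_profile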