access-ci-org / Jetstream_Cluster

Scripts and Ansible Playbooks for building an HPC-style resource in Jetstream
MIT License

Fix scripts for Jetstream2 #17

Closed · zacharygraber closed this 4 months ago

zacharygraber commented 4 months ago

At present, the rocky-linux branch fails setup on Jetstream2 (e.g. through Exosphere) because it relies on old versions of the OpenStack API. This PR introduces a few changes that allow it to function again:

  1. Bumps openstacksdk and python-openstackclient up to the latest versions
  2. Fixes the Ansible return handling that the version bump breaks. There is no convenient way to get the IPv4 address of the created instance without hardcoding the network name (or a hack like taking the first network in the list), and access_ipv4 comes back null, so we use the hostname instead (see the sketch after this list).
  3. Changes the name of the image build instance to be more consistent with the other compute nodes' naming scheme (${cluster-name}-compute-...).
  4. Removes what looks like an accidental hard-code of a cluster prefix from ssh.cfg
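
For context, a rough manual equivalent of changes 1 and 2 (the instance name and example output below are illustrative placeholders, not values taken from the playbook):

    # 1. Bring the OpenStack clients up to current releases
    pip install --upgrade openstacksdk python-openstackclient

    # 2. With the newer SDK, access_ipv4 on the created server comes back null, and the
    #    addresses field is keyed by network name, so parsing it means hardcoding that name:
    openstack server show my-cluster-compute-base-instance -f json -c addresses
    # {"addresses": {"my-network": ["10.0.0.12"]}}
    # The playbook therefore addresses the new node by hostname instead.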

To Test

  1. Enable experimental features in Exosphere

  2. Create a new Jetstream2 instance from the Featured-RockyLinux8 image (through Exosphere)

  3. Set "Create your own SLURM cluster with this instance as the head node" to Yes

  4. In the Boot Script, replace {create-cluster-command} with:

    su - rocky -c "git clone --branch rocky-linux --single-branch --depth 1 https://github.com/zacharygraber/Jetstream_Cluster.git; cd Jetstream_Cluster; ./cluster_create_local.sh -d 2>&1 | tee local_create.log"
  5. Once setup finishes, verify that you can run Slurm jobs (a sketch of what a minimal test job looks like follows these steps):

    sudo su - rocky
    cd Jetstream_Cluster
    sbatch ./slurm_test.job
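
A minimal sketch of the kind of job slurm_test.job runs (the actual script in the repo may differ; the job name and node count here are assumptions):

    #!/bin/bash
    #SBATCH --job-name=cluster-smoke-test   # illustrative name
    #SBATCH --nodes=2                       # requesting 2 nodes forces compute nodes to spin up
    #SBATCH --output=slurm_test_%j.out

    # Report which nodes Slurm allocated and confirm they can run commands
    srun hostname

While the elastic compute nodes boot, squeue should show the job pending; once it completes, the output file should list the compute hostnames.
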
zacharygraber commented 4 months ago

@c-mart: Diff looks good, and I believe you that it works!

@c-mart Appreciate the review/approval! Just to be clear: I don't have permission to merge into this repo.

julianpistorius commented 4 months ago

Should I be worried about this in the local_create.log file @zacharygraber?

TASK [add local users to compute node] *****************************************                                                                 
changed: [nicely-still-whippet-compute-base-instance] => {"changed": true, "rc": 0, "stderr": "Shared connection to nicely-still-whippet-compute-base-instance closed.\r\n", "stderr_lines": ["Shared connection to nicely-still-whippet-compute-base-instance closed."], "stdout": "Lmod has detected the\r\nfollowing error: The\r\nfollowing module(s) are\r\nunknown: \"xalt\"\r\n\r\nPlease check the spelling or\r\nversion number. Also try\r\n\"module spider ...\"\r\nIt is also possible your\r\ncache file is out-of-date; it\r\nmay help to try:\r\n  $ module --ignore_cache\r\nload \"xalt\"\r\n\r\nAlso make sure that all\r\nmodulefiles written in TCL\r\nstart with the string\r\n#%Module\r\n\r\n\r\n\r\n", "stdout_lines": ["Lmod has
detected the", "following error: The", "following module(s) are", "unknown: \"xalt\"", "", "Please check the spelling or", "version number. Also try", "\"module spider ...\"", "It is also possible your", "cache file is out-of-date; it", "may help to try:", "  $ module --ignore_cache", "load
\"xalt\"", "", "Also make sure that all", "modulefiles written in TCL", "start with the string", "#%Module", "", "", ""]}  

Update: Slurm test job still works... Hopefully it doesn't break something subtle.

julianpistorius commented 4 months ago

I don't have write access to this repo, so I can't merge it either.

@DImuthuUpe? @c-mart?

DImuthuUpe commented 4 months ago

Thanks @julianpistorius. I made you an admin

zacharygraber commented 4 months ago

@julianpistorius: Should I be worried about this in the local_create.log file? [Lmod "unknown: xalt" error, quoted in full above.] Update: Slurm test job still works... Hopefully it doesn't break something subtle.

@julianpistorius Xalt is the tracking software we installed to get usage stats for the software share. These scripts install a bunch of components like OpenHPC, which I believe overrides our modulepath at /software (where Xalt lives), so Lmod can't find the Xalt module; it tries to load it on login by default, since Xalt only tracks usage when the module is loaded.

It's nothing really to worry about.
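
If anyone wants to double-check that diagnosis on the head node, something along these lines should do it (the /software/modulefiles path below is an assumption, not taken from these scripts):

    # Show where Lmod is currently looking for modules; if the shared software tree
    # is missing here, that explains the "unknown: xalt" error
    echo $MODULEPATH

    # Per the hint in the Lmod error output, retry with the cache bypassed
    module --ignore_cache load xalt

    # Hypothetical path: re-add the shared software tree if the installer dropped it
    module use /software/modulefiles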

julianpistorius commented 4 months ago

@zacharygraber: It's nothing really to worry about.

Not sure if it's related, and I didn't hit it while testing with your instructions, but there seems to be a module-related problem: #18