Closed scott-nag closed 1 month ago
Hi @scott-nag, thank you for reporting! To be fixed by https://github.com/GoogleCloudPlatform/slurm-gcp/pull/194
This is working perfectly now, thank you for the quick fix!
https://github.com/GoogleCloudPlatform/slurm-gcp/pull/194 is included in the latest release.
Describe the bug
Module scripts located in
`community/modules/scheduler/schedmd-slurm-gcp-v6-controller/modules/slurm_files/scripts/`
(develop branch)

I am creating a v6 cluster using pre-existing-network-storage with the `server_ip` set to `$controller` in the blueprint. However, the startup script fails to mount the storage and times out. I believe the `server_ip` in the storage mounts should contain the hostname instead of `$controller`, similar to how the second mount shows `cluster9f3-controller` successfully here?

Steps to reproduce
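The relevant part of the blueprint looks roughly like this (a minimal sketch, not the original blueprint; the module ID, group name, and mount paths are assumptions for illustration):

```yaml
# Hypothetical blueprint fragment reproducing the issue: a
# pre-existing-network-storage module whose server_ip uses the
# $controller placeholder.
- group: primary
  modules:
  - id: homefs
    source: modules/file-system/pre-existing-network-storage
    settings:
      server_ip: $controller   # placeholder that fails to resolve at mount time
      remote_mount: /home
      local_mount: /home
      fs_type: nfs
```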
Expected behavior
Storage should be mounted successfully and no timeout should occur
Actual behavior
The mounts time out, as shown in the logs above
Version (`gcluster --version`)
gcluster version - not built from official release
Built from 'develop' branch. Commit info: v1.37.1-167-g1d7dc338-dirty
Terraform version: 1.9.3 (tested with Terraform 1.4 too)
Blueprint
If applicable, attach or paste the blueprint YAML used to produce the bug.
Output and logs
N/A - blueprint is successfully deployed
Execution environment
Other info
I have added a quick fix to the `resolve_network_storage` function located in `setup_network_storage.py` (the `for mount in mounts.values()` loop), as I noticed similar logic relating to `$controller` in `util.py`. This successfully gets the startup scripts to run and brings the login and controller nodes online.

Unfortunately, Slurm still isn't configured correctly as shown below, so `$controller` is possibly not being replaced elsewhere in the module too. Happy to provide more info if required.
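The substitution I added looks roughly like this (a sketch of the idea, not the exact patch; the function name is a hypothetical stand-in and the real module resolves the controller hostname via its own helpers, e.g. in `util.py`):

```python
# Sketch of the workaround: walk the resolved mounts and replace the
# $controller placeholder in server_ip with the actual controller
# hostname, mirroring the loop in resolve_network_storage.
def substitute_controller(mounts, controller_hostname):
    """Replace the $controller placeholder in each mount's server_ip."""
    for mount in mounts.values():
        if mount.get("server_ip") == "$controller":
            # Substitute the real hostname so the NFS mount
            # (e.g. cluster9f3-controller:/home) can actually resolve.
            mount["server_ip"] = controller_hostname
    return mounts


# Example: the first mount uses the placeholder, the second is already
# resolved (as observed in the logs from the report).
mounts = {
    "/home": {"server_ip": "$controller", "remote_mount": "/home"},
    "/sw": {"server_ip": "cluster9f3-controller", "remote_mount": "/sw"},
}
resolved = substitute_controller(mounts, "cluster9f3-controller")
```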