Closed: fengggli closed this issue 4 years ago
Hm, that's frustrating. The nodes shouldn't stay in "CF" state for more than a minute or two. Is there any output in `/var/log/slurm/slurm_elastic.log`?
Are the compute nodes created in OpenStack? (`source openrc.sh && openstack server list`)
The `scontrol show hostname` command is actually just a CLI tool for expanding hostname blobs; the compute nodes are registered in slurmctld, as shown by `sinfo`, but something has gone wrong during creation. Yes, the instances should be launched automatically when you submit jobs to Slurm, rather than from the web portal, and the image snapshot is used as the base image for those instances. :)
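For reference, `scontrol show hostname` just turns a hostlist blob into one hostname per line. A rough stand-in for what it prints, mimicked in plain shell so it doesn't need Slurm installed (node names taken from this thread):

```shell
# Illustration only: roughly what `scontrol show hostname tg837458-compute-[0-1]`
# would print, produced with a plain shell loop instead of Slurm.
prefix="tg837458-compute-"
for i in 0 1; do
  echo "${prefix}${i}"
done
# tg837458-compute-0
# tg837458-compute-1
```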
Thanks for following up; hopefully we can get this nailed down quickly!
Hi Eric,
```
[centos@cyberwater-slurm-jetstream ~]$ cat /var/log/slurm/slurm_elastic.log
Node resume invoked: /usr/local/sbin/slurm_resume.sh tg837458-compute-0
creating tg837458-compute-0
No password entered, or found via --os-password or OS_PASSWORD
Mon Mar 2 14:11:01 UTC 2020 Node suspend invoked: ./slurm_suspend.sh
Node resume invoked: ./slurm_resume.sh
Node resume invoked: /usr/local/sbin/slurm_resume.sh tg837458-compute-1
creating tg837458-compute-1
No password entered, or found via --os-password or OS_PASSWORD
```
Aha! So it's broken because the openrc is asking for a password during automated node creation. I just pushed a change that should fix how the openrc file used by Slurm is created, but if you're attached to the current cluster, you can change the file `/etc/slurm/openrc.sh` to contain:

```
export OS_USER_DOMAIN_NAME=tacc
export OS_PROJECT_NAME=${OS_PROJECT_NAME}
export OS_USERNAME=${OS_PROJECT_NAME}
export OS_PASSWORD=${OS_PASSWORD}
export OS_AUTH_URL=${OS_AUTH_URL}
export OS_IDENTITY_API_VERSION=3
```
but with all of the variables (enclosed in `${}`) replaced with the actual values for your allocation. Then make sure the openrc is only usable by the slurm user via:

```
sudo chown slurm:slurm /etc/slurm/openrc.sh
sudo chmod 600 /etc/slurm/openrc.sh
```
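If you want to sanity-check the result, here is a quick sketch (using a temp file as a stand-in, since touching the real `/etc/slurm/openrc.sh` needs root):

```shell
# Sketch: confirm an openrc-style file ends up owner-only (mode 600).
# Uses a throwaway temp file instead of /etc/slurm/openrc.sh.
f=$(mktemp)
printf 'export OS_USERNAME=example\n' > "$f"   # placeholder contents
chmod 600 "$f"
# GNU stat prints the octal mode with -c %a; BSD stat uses -f %Lp.
stat -c %a "$f" 2>/dev/null || stat -f %Lp "$f"
rm -f "$f"
```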
Thanks for the fix, but I think this may be a typo: https://github.com/XSEDE/CRI_Jetstream_Cluster/blob/a914cc084fdd665a56ebb308aafff3846621caf2/install.sh#L153

It should be:

```
export OS_USERNAME=${OS_USERNAME}
```
In the case where I am attached to the current cluster, what actions can I take for the modified `/etc/slurm/openrc.sh` to take effect (to avoid deleting and recreating the head node)? I tried the following, but it didn't work:

```
[centos@js-slurm-cluster CRI_Jetstream_Cluster]$ sudo ./slurm_suspend.sh
scontrol: error: host list is empty
[centos@js-slurm-cluster CRI_Jetstream_Cluster]$ sudo ./slurm_resume.sh
scontrol: error: host list is empty
```
Thanks! Feng
@ECoulter Hi Eric, my Slurm cluster runs correctly now, but I still have a few questions:
After slurm_test.job finished, I checked the output file, and it complains that the `module` command is not found:
```
[centos@cyberwater-slurm-cluster CRI_Jetstream_Cluster]$ cat nodes_2.out
/tmp/slurmd/job00002/slurm_script: line 5: module: command not found
/tmp/slurmd/job00002/slurm_script: line 6: module: command not found
/tmp/slurmd/job00002/slurm_script: line 8: mpirun: command not found
```
But shouldn't that already be provided by lmod-ohpc? https://github.com/XSEDE/CRI_Jetstream_Cluster/blob/a914cc084fdd665a56ebb308aafff3846621caf2/compute_build_base_img.yml#L58-L66
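If the problem were to persist, one common cause (an assumption here, not confirmed in this thread) is that the batch script runs in a shell that never sourced the Lmod init script; OpenHPC's lmod package installs one under `/etc/profile.d/`, so a job script can source it defensively before calling `module`. A sketch of such a preamble, with illustrative module names:

```shell
#!/bin/bash
# Sketch of a defensive job-script preamble. The lmod.sh path is the
# usual OpenHPC location, and the module names below are illustrative.
if ! command -v module >/dev/null 2>&1 && [ -f /etc/profile.d/lmod.sh ]; then
    source /etc/profile.d/lmod.sh
fi
module load gnu7 openmpi3   # illustrative module names
mpirun ./a.out              # illustrative MPI binary
```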
Thanks for your help again!
@fengggli Hi Feng!
Cheers, Eric C.
@ECoulter Hi Eric, thanks for explaining 2; it's clear to me now. For 3, it happened immediately after the build; today I logged in again and `module load` works correctly. Everything looks good now. Best, Feng
Great - Feel free to reach out if you have other questions! Cheers, Eric.
I followed the README to set up a Slurm cluster using my Jetstream XSEDE allocation. When install.sh finished, I checked the cluster status with `sinfo`, and it showed two compute nodes in the "idle" state.
Then I submitted a test job (the sample job in the source tree),
but it stays in the "CF" state forever.
The node list is empty.
So I guess I need to add compute nodes to the cluster manually somehow? (I saw some snapshot images were created during the installation; am I just supposed to launch the instances from the web portal, and will they then be managed by the head node automatically?)
Thanks! Feng