access-ci-org / Jetstream_Cluster

Scripts and Ansible Playbooks for building an HPC-style resource in Jetstream
MIT License

job stays in CF status with the default setting #4

Closed fengggli closed 4 years ago

fengggli commented 4 years ago

I followed the README to set up a Slurm cluster using my Jetstream XSEDE allocation. When install.sh finished, I checked the cluster status with "sinfo", and it showed two compute nodes in the "idle" state.

Then I submitted a test job (the sample job in the source tree):

```
sbatch slurm_test.job
```

but it stays in the "CF" state forever:

```
[centos@cyberwater-slurm-jetstream CRI_Jetstream_Cluster]$ squeue -u $USER
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                 2     cloud slurm_te   centos CF       9:28      1 tg837458-compute-0
[centos@cyberwater-slurm-jetstream CRI_Jetstream_Cluster]$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
cloud*       up   infinite      1 alloc# tg837458-compute-0
cloud*       up   infinite      1  idle~ tg837458-compute-1
```

The node list is empty:

```
[centos@cyberwater-slurm-jetstream CRI_Jetstream_Cluster]$ scontrol show hostname
scontrol: error: host list is empty
```

So I guess I need to add compute nodes to the cluster manually somehow? (I saw some snapshot images were created during the installation. Am I supposed to launch the instances from the web portal, and will they then be managed by the headnode automatically?)

Thanks! Feng

ECoulter commented 4 years ago

Hm, that's frustrating. The nodes shouldn't stay in the 'CF' state for more than a minute or two. Is there any output in /var/log/slurm/slurm_elastic.log?

Are the compute nodes created in Openstack? ("source openrc.sh && openstack server list")
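
A quick way to check both from the headnode (a hedged sketch; the log path and openrc.sh location follow what this repo's install sets up):

```
# Watch the elastic-node log while the job sits in CF
sudo tail -n 50 /var/log/slurm/slurm_elastic.log

# Check whether the compute instance was actually created in Openstack
source openrc.sh && openstack server list
```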

The 'scontrol show hostname' command is actually just a CLI tool for expanding hostlist expressions - the compute nodes are registered in slurmctld, as shown by 'sinfo', but something has gone wrong during creation. Yes, the instances should be launched automatically when you submit jobs to Slurm (no need to launch them from the web portal), and the image snapshot is used as the base image for those instances. :)
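
For example, 'scontrol show hostnames' just expands a hostlist expression into one name per line (illustrative only; the node names below follow the pattern in the sinfo output above):

```
scontrol show hostnames tg837458-compute-[0-1]
# tg837458-compute-0
# tg837458-compute-1
```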

Thanks for following up; hopefully we can get this nailed down quickly!

fengggli commented 4 years ago

Hi Eric,

ECoulter commented 4 years ago

Aha! So, it's broken because the openrc is asking for a password during automated node creation. I just pushed a change that should fix how the openrc file used by slurm is created, but if you're attached to the current cluster, you can change the file /etc/slurm/openrc.sh to contain:


```
export OS_USER_DOMAIN_NAME=tacc
export OS_PROJECT_NAME=${OS_PROJECT_NAME}
export OS_USERNAME=${OS_PROJECT_NAME}
export OS_PASSWORD=${OS_PASSWORD}
export OS_AUTH_URL=${OS_AUTH_URL}
export OS_IDENTITY_API_VERSION=3
```

but with all of the variables (enclosed in ${}) replaced with the actual values for your allocation, and then make sure the openrc is only usable by the slurm user via:

```
sudo chmod 600 /etc/slurm/openrc.sh
sudo chown slurm:slurm /etc/slurm/openrc.sh
```
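
One way to confirm the fix (a hedged sketch, assuming the openstack CLI is installed on the headnode): source the file as the slurm user and check that a token can be issued without any password prompt.

```
# Should print a token table with no interactive prompt
sudo -u slurm bash -c 'source /etc/slurm/openrc.sh && openstack token issue'
```
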
fengggli commented 4 years ago

Thanks for the fix,

but I think this might be a typo: https://github.com/XSEDE/CRI_Jetstream_Cluster/blob/a914cc084fdd665a56ebb308aafff3846621caf2/install.sh#L153

It should be:

export OS_USERNAME=${OS_USERNAME}

In the case where I am "attached to the current cluster", what actions can I take for the modified /etc/slurm/openrc.sh to take effect (to avoid deleting and recreating the head node)? I tried the following, but it didn't work:

```
[centos@js-slurm-cluster CRI_Jetstream_Cluster]$ sudo ./slurm_suspend.sh
scontrol: error: host list is empty
[centos@js-slurm-cluster CRI_Jetstream_Cluster]$ sudo ./slurm_resume.sh
scontrol: error: host list is empty
```

Thanks! Feng

fengggli commented 4 years ago

@ECoulter Hi Eric, my Slurm cluster runs correctly now, but I still have a few questions:

  1. Based on the log (/var/log/slurm/slurm_elastic.log), when I submit slurm_test.job, one compute node instance is created on the fly, and after the job is done that instance is deleted. Am I understanding this correctly?
  2. So if I need to add a yum package, I don't need to suspend the cluster at all: I just need to modify compute_build_base_img.yml, and when the next job shows up in the queue, a new instance will be created using the updated yml file, right?
  3. After slurm_test.job finished, I checked the output file, and it complains that the module command is not found:

    [centos@cyberwater-slurm-cluster CRI_Jetstream_Cluster]$ cat nodes_2.out 
    /tmp/slurmd/job00002/slurm_script: line 5: module: command not found
    /tmp/slurmd/job00002/slurm_script: line 6: module: command not found
    /tmp/slurmd/job00002/slurm_script: line 8: mpirun: command not found

    but that should already be provided by lmod-ohpc, right? https://github.com/XSEDE/CRI_Jetstream_Cluster/blob/a914cc084fdd665a56ebb308aafff3846621caf2/compute_build_base_img.yml#L58-L66

Thanks for your help again!

ECoulter commented 4 years ago

@fengggli Hi Feng!

  1. Yes, that's exactly right!
  2. Partially true - you actually have to rebuild the compute image yourself if you've added something, by running the ansible playbook (which creates an instance, modifies it, and saves it as a new image); "sudo ansible-playbook -v compute_build_base_img.yml" will do the trick (see the sketch after this list). Sudo is still needed because it uses the slurm-key (the ssh key slurm uses to access compute nodes) to reach the temporary instance that is created. Once the new image is created, the next compute node to be created will use it. The images are labelled by date, so you can switch back to the old image if something breaks, but images created on the same day will overwrite each other for now (this is done in compute_take_snapshot.sh).
  3. Was this immediately after the build? I've seen this happen after installing lmod-ohpc on the headnode without logging out and back in: the module command isn't defined in the submitting shell environment, so it isn't picked up on the compute node either. The module command is actually a shell function that gets defined when /etc/profile is sourced at login, so it's a little different from most commands. If it happens again after logging out and back in (and submitting a new job), then something is broken - I'd double-check that lmod-ohpc is installed and working on the headnode first.
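
A minimal sketch of the rebuild flow from point 2 (the image-listing check and its grep pattern are assumptions; the exact image name is set by compute_take_snapshot.sh):

```
# On the headnode, from the CRI_Jetstream_Cluster checkout:
# rebuild the compute image after editing compute_build_base_img.yml
sudo ansible-playbook -v compute_build_base_img.yml

# Optional sanity check: confirm a new date-stamped compute image exists
source openrc.sh
openstack image list | grep -i compute
```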

Cheers, Eric C.

fengggli commented 4 years ago

@ECoulter Hi Eric, thanks for explaining 2 - it's clear to me now. For 3, it happened immediately after the build; today I logged in again and "module load" works correctly. Everything looks good now. Best, Feng

ECoulter commented 4 years ago

Great - Feel free to reach out if you have other questions! Cheers, Eric.