Nodes in drain state - Githubissues

clusterinthecloud / terraform

Terraform config for Cluster in the Cloud

https://cluster-in-the-cloud.readthedocs.io

MIT License


Nodes in drain state #11

Closed mikeoconnor0308 closed 5 years ago

mikeoconnor0308 commented 5 years ago

Upon creation of a cluster, the nodes have gone into the drain state, and jobs are forever pending:

[opc@mgmt ~]$ sacct
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
3                  test    compute                     2    PENDING      0:0

[opc@mgmt ~]$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
compute*     up   infinite      3  drain compute[001-003]

What causes the nodes to enter this state, and how can it be remedied?

christopheredsall commented 5 years ago

Hi Mike,

Can you run this sinfo command and paste the output?

sinfo --list-reasons --Format=reason:40,nodelist

mikeoconnor0308 commented 5 years ago

ah, low memory?

[opc@mgmt ~]$ sinfo --list-reasons --Format=reason:40,nodelist
REASON                                  NODELIST            
Low RealMemory                          compute[001-003]

I had to add VM.GPU2.1 (the shape of the compute nodes I'm using) to shapes.yaml, which I configured as follows:

# VM GPU
VM.GPU2.1:
  memory: 104000
  cores_per_socket: 12
  threads_per_core: 2

Is this related?

christopheredsall commented 5 years ago

Seems to be the case. I've just booted a VM.GPU2.1 and the free memory on that one is

[ce16990@compute003 ~]$ free -m
              total        used        free      shared  buff/cache   available
Mem:          72210         399       71148          57         662       70957
Swap:          8191           0        8191

So, 72000 is probably a better figure than 104000. If you want to send us a pull request for that we can merge it.
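As a rough illustration of how a figure could be derived from `free -m` output rather than guessed, here is a minimal sketch. The helper name `suggest_memory_mb` and the default 2% safety margin are assumptions made for this example; they are not part of the project. It takes the "total" column of the `Mem:` row and shaves off a margin, on the understanding that Slurm drains a node with "Low RealMemory" when the node reports less memory than the configured value.

```shell
# Hypothetical helper: suggest a memory value (MB) from `free -m` output on stdin.
# $1 is an optional safety margin in percent (default 2, an arbitrary assumption).
suggest_memory_mb() {
    awk -v margin="${1:-2}" '
        # The "Mem:" row holds: total used free shared buff/cache available
        $1 == "Mem:" { print int($2 * (100 - margin) / 100) }
    '
}

# Usage on a compute node:
#   free -m | suggest_memory_mb      # default 2% margin
#   free -m | suggest_memory_mb 5    # more headroom
```

With the total of 72210 shown above and the default margin, this prints 70765. Running `slurmd -C` on a compute node should also print the RealMemory value that slurmd itself detects, which is a useful cross-check.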

To fix up your current cluster there are a couple of options.

1. The easiest option would be to destroy and recreate it.
2. Alter the cluster in place:

   edit /mnt/shared/apps/slurm/slurm.conf to reduce the RealMemory value
   sudo systemctl restart slurmctld
   sudo scontrol update NodeName=compute[001-003] State=Resume
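The in-place route can be sketched as a small shell helper plus the two commands already given. This is only a sketch: the function name `rewrite_real_memory` is made up for illustration, and it assumes the node definitions in slurm.conf carry the value in the standard `RealMemory=<number>` form. The exact layout of the file this project generates is not shown in this thread, so check the result before copying it into place.

```shell
# Hypothetical helper: replace one RealMemory value with another.
# $1 = old value, $2 = new value; reads slurm.conf text on stdin, writes to stdout.
rewrite_real_memory() {
    # Match the old value only when followed by whitespace or end of line,
    # so that e.g. 104000 does not also match inside 1040000.
    sed "s/RealMemory=$1\([[:space:]]\)/RealMemory=$2\1/g; s/RealMemory=$1\$/RealMemory=$2/"
}

# Usage on the management node (path and node names taken from the thread;
# 71000 is just an example value below the measured memory):
#   conf=/mnt/shared/apps/slurm/slurm.conf
#   rewrite_real_memory 104000 71000 < "$conf" > /tmp/slurm.conf.new
#   diff "$conf" /tmp/slurm.conf.new        # review the change first
#   sudo cp /tmp/slurm.conf.new "$conf"
#   sudo systemctl restart slurmctld
#   sudo scontrol update NodeName=compute[001-003] State=Resume
```

Writing to a temporary file and diffing before copying keeps the edit reviewable, which matters because the same file may define other node shapes.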

mikeoconnor0308 commented 5 years ago

Ah, thank you. I wasn't sure how the memory figure was calculated, so I made an educated (and evidently wrong!) guess.

christopheredsall commented 5 years ago

Yes, it is a bit trial and error! Also, I misread the output above: 72210 is the total, so something below 71100 might be a better figure.