mikeoconnor0308 closed this issue 5 years ago
Hi Mike,
Can you run this sinfo command and paste the output?
sinfo --list-reasons --Format=reason:40,nodelist
ah, low memory?
[opc@mgmt ~]$ sinfo --list-reasons --Format=reason:40,nodelist
REASON                                  NODELIST
Low RealMemory                          compute[001-003]
I had to add VM.GPU2.1 (the compute node shape I'm using) to shapes.yaml, which I configured as follows:
# VM GPU
VM.GPU2.1:
  memory: 104000
  cores_per_socket: 12
  threads_per_core: 2
Is this related?
Seems to be the case. I've just booted a VM.GPU2.1, and the free memory on that one is:
[ce16990@compute003 ~]$ free -m
              total        used        free      shared  buff/cache   available
Mem:          72210         399       71148          57         662       70957
Swap:          8191           0        8191
So, 72000 is probably a better figure than 104000. If you want to send us a pull request for that, we can merge it.
To fix up your current cluster there are a couple of options. One is to edit
/mnt/shared/apps/slurm/slurm.conf
to reduce the RealMemory value, then restart the controller and resume the drained nodes:
sudo systemctl restart slurmctld
sudo scontrol update NodeName=compute[001-003] State=Resume
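For reference, a minimal sketch of what the node definition in slurm.conf might end up looking like after that edit; the core and thread counts below are illustrative and should match whatever your existing line already has, the only change that matters is lowering RealMemory to a figure at or below what free -m reports:
# Hypothetical node line after the edit; only RealMemory changes,
# keep the other values as they are in your existing file.
NodeName=compute[001-003] CoresPerSocket=12 ThreadsPerCore=2 RealMemory=72000 State=UNKNOWN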
Ah, thank you. I wasn't sure how the memory figure was calculated, so I made an educated (and evidently wrong!) guess.
Yes, it is a bit trial and error! Also, I was misreading the above: 72210 is the total, so < 71100 might be a better figure.
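For the pull request, a sketch of the corrected shapes.yaml entry could look like the following; 71000 is just an assumed round number safely under the < 71100 figure above, so adjust it if your instances report something different:
# VM GPU
VM.GPU2.1:
  memory: 71000          # assumed value, kept below the ~72210 MB total that free -m reports
  cores_per_socket: 12
  threads_per_core: 2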
Upon creation of a cluster, the nodes have gone into the drain state and jobs are forever pending. What causes the nodes to enter this state, and how can it be remedied?