chaoyanghe closed this issue 4 years ago
I reduced my job to 4 nodes and it works now. But I can't scale up to 7 nodes because all the other nodes show "idle#" all the time.

By the way, I also found that reading data from the "/mnt/shared/" directory in my job is very slow. Is there any other method to speed this up, or must I upload the data to each node's local directory?
Here is the log from reading data on the Slurm node. It normally doesn't take this long on my local machine.

```
0 - 2019-04-29 18:56:11,157:model_utils.py:48:INFO: read_data. f = all_data_9_niid_0_keep_0_train_9.json
0 - 2019-04-29 18:56:13,025:model_utils.py:48:INFO: read_data. f = all_data_5_niid_0_keep_0_train_9.json
```
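A common workaround for slow reads from a shared filesystem (not specific to this cluster) is to stage the data onto node-local scratch at the start of the job and read it from there. A minimal sketch, where the source and destination paths are only examples:

```shell
#!/bin/sh
# Stage training data from the shared filesystem to node-local scratch so
# that repeated reads don't go over the network for every file.
# Both paths are illustrative defaults; pass your own as arguments.
stage_data() {
    src=${1:-/mnt/shared/data}
    dest=${2:-/tmp/$USER/data}
    mkdir -p "$dest"
    # -p preserves timestamps, -r copies the directory tree
    cp -rp "$src/." "$dest/"
    echo "staged to $dest"
}
```

You would call `stage_data` once at the top of your batch script (on each node, e.g. via `srun`) and then point the job at the local copy.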
Christopher - Chaoyang is not able to scale beyond 7 nodes, as the other nodes show status = IDLE#. Is there a Slurm parameter / configuration setting that is preventing the scale-up beyond 7 nodes?
We suspect there may be a rate-limiting issue in the OCI cloud when starting up a number of nodes at once. Also, the time from when Slurm requests the nodes until they are all up and running slurmd can be long. If this exceeds the ResumeTimeout in /mnt/shared/etc/slurm/slurm.conf (600 seconds by default), Slurm will decide that any nodes not yet started are faulty and will mark them as "down".
We have some options on the CitC side to alleviate this, but you could try increasing ResumeTimeout in /mnt/shared/etc/slurm/slurm.conf from 600 seconds to, say, 1200, and then running

```
sudo systemctl restart slurmctld
```
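For reference, the relevant line in slurm.conf would look something like this (the 1200-second value is only the suggestion above, not a recommendation for every cluster):

```ini
# /mnt/shared/etc/slurm/slurm.conf
# Allow up to 20 minutes for cloud nodes to boot and for slurmd to start
ResumeTimeout=1200
```

Afterwards you can confirm the running value with `scontrol show config | grep ResumeTimeout`.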
@jtsaismith - We haven't been able to find documentation on the rate limits; is it available? Would it be possible, in principle, for these to be increased?
For your current cluster, you'll need to tell Slurm that the nodes that failed to start are actually OK by running

```
sudo scontrol update NodeName=vm-standard2-24-ad1-[0001-0010],vm-standard2-24-ad2-[0001-0004,0007,0009-0010] State=Resume
```
@christopheredsall - thanks for the suggestions. When you say "rate limiting", could you define "rate"? Is it the rate at which new nodes (instances) are started? As far as I'm aware, there is no limit.
Is there a way to start and keep ALL nodes in the cluster running at all times? In other words, this cluster can have up to 23 nodes running simultaneously. Can they all be started and running all the time?
In my setting, I need to keep all the compute nodes running. How do I configure this?
```
PARTITION AVAIL  TIMELIMIT  NODES  STATE  NODELIST
compute   up     infinite   18     idle#  vm-standard2-24-ad1-[0001-0010],vm-standard2-24-ad2-[0001-0004,0007-0010]
compute   up     infinite   5      idle   vm-standard2-24-ad2-[0005-0006],vm-standard2-24-ad3-[0001-0003]
```
I tried "sudo scontrol update", but the output still looks like the above: 18 nodes are always "idle#". Could you help me keep all nodes running?
@christopheredsall Hi, I have a deadline coming up. Could you help check these two issues: 1) 18 nodes are always showing "idle#" and cannot be allocated. How do I keep all 23 nodes running all the time? My experiment needs them to run simultaneously. 2) The "Unable to resolve "mgmt": Unknown host" error prevents me from running any job.
Answering the simple question first: how to ensure the nodes stay on. At the moment Slurm will turn off a node that hasn't run a job for SuspendTime seconds, currently set to 900 in /mnt/shared/etc/slurm/slurm.conf. For your case you could set it to a much larger number and then run

```
sudo systemctl restart slurmctld
```
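In slurm.conf terms, that could look like the following (the value here is illustrative; per the slurm.conf documentation, a SuspendTime of -1 disables automatic suspension entirely):

```ini
# /mnt/shared/etc/slurm/slurm.conf
# Keep idle nodes up for a day before powering them down
# (use SuspendTime=-1 to never suspend idle nodes)
SuspendTime=86400
```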
The `Unable to resolve "mgmt": Unknown host` error is #23 and I'll answer it in the other issue. It is probably why the currently `idle#` nodes are not able to run jobs.
@christopheredsall Thanks. Here is the current status:

```
PARTITION AVAIL  TIMELIMIT  NODES  STATE  NODELIST
compute   up     infinite   18     idle#  vm-standard2-24-ad1-[0001-0010],vm-standard2-24-ad2-[0001-0004,0007-0010]
compute   up     infinite   5      idle   vm-standard2-24-ad2-[0005-0006],vm-standard2-24-ad3-[0001-0003]
```
And when I submit my job, the output log shows:

```
srun: error: Unable to resolve "mgmt": Unknown host
srun: error: Unable to establish control machine address
srun: error: Unable to confirm allocation for job 92: No error
srun: Check SLURM_JOB_ID environment variable. Expired or invalid job 92
slurmstepd: error: Unable to resolve "mgmt": Unknown host
```
This has been fixed in the latest versions of CitC, both in the reporting of the node state (`list_nodes`) and in the DNS issue causing the problem in the last comment.
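A quick way to confirm the DNS side is healthy is to check that the controller's hostname resolves from a compute node. The hostname `mgmt` comes from the error messages above; substitute whatever your slurm.conf names as the controller. A minimal sketch:

```shell
#!/bin/sh
# Check that a hostname resolves through the system resolver (NSS),
# which is the same lookup path srun/slurmstepd use.
# Returns non-zero if resolution fails.
check_resolves() {
    if getent hosts "$1" > /dev/null; then
        echo "$1 resolves"
    else
        echo "$1 does NOT resolve" >&2
        return 1
    fi
}
```

Usage from a compute node would be, for example, `check_resolves mgmt`.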
```
[root@mgmt FederatedLearning]# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE  NODELIST
compute   up     infinite   17     idle#  vm-standard2-24-ad1-[0001-0010],vm-standard2-24-ad2-[0001-0004,0007,0009-0010]
compute   up     infinite   6      idle   vm-standard2-24-ad2-[0005-0006,0008],vm-standard2-24-ad3-[0001-0003]
[root@mgmt FederatedLearning]# squeue
JOBID  PARTITION  NAME  USER  ST  TIME  NODES  NODELIST(REASON)
50     compute    fl    root  PD  0:00  9      (Resources)
```
```
[opc@mgmt FederatedLearning]$ ping 10.1.0.5
PING 10.1.0.5 (10.1.0.5) 56(84) bytes of data.
^C
--- 10.1.0.5 ping statistics ---
4 packets transmitted, 0 received, 100% packet loss, time 3068ms
```