clusterinthecloud / terraform

Terraform config for Cluster in the Cloud
https://cluster-in-the-cloud.readthedocs.io
MIT License

My Slurm job is always in the "PD" state for many hours and most of the compute nodes are idle# #19

Closed chaoyanghe closed 4 years ago

chaoyanghe commented 5 years ago

[root@mgmt FederatedLearning]# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE  NODELIST
compute      up   infinite     17  idle#  vm-standard2-24-ad1-[0001-0010],vm-standard2-24-ad2-[0001-0004,0007,0009-0010]
compute      up   infinite      6  idle   vm-standard2-24-ad2-[0005-0006,0008],vm-standard2-24-ad3-[0001-0003]

[root@mgmt FederatedLearning]# squeue
JOBID PARTITION  NAME  USER  ST  TIME  NODES  NODELIST(REASON)
   50   compute    fl  root  PD  0:00      9  (Resources)

[opc@mgmt FederatedLearning]$ ping 10.1.0.5
PING 10.1.0.5 (10.1.0.5) 56(84) bytes of data.
^C
--- 10.1.0.5 ping statistics ---
4 packets transmitted, 0 received, 100% packet loss, time 3068ms

chaoyanghe commented 5 years ago

I reduced my job to 4 nodes and it works now. But I can't scale up to 7 nodes, since all the other nodes show "idle#" all the time.

By the way, I also found that reading data from the "/mnt/shared/" directory in my job is very slow. Is there any other way to speed this up, or must I upload the data to each node's local directory?

Here is the log from reading data on the Slurm node. On my local computer it normally does not take this long.

0 - 2019-04-29 18:56:11,157:model_utils.py:48:INFO: read_data. f = all_data_9_niid_0_keep_0_train_9.json
0 - 2019-04-29 18:56:13,025:model_utils.py:48:INFO: read_data. f = all_data_5_niid_0_keep_0_train_9.json
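One common workaround, not suggested in the thread itself, is to stage the input files onto node-local storage at the start of the job and read the local copies. Below is a minimal sketch only, assuming an sbatch script, a node-local scratch directory under /tmp, and a hypothetical train.py entry point; none of these paths or names come from the issue.

#!/bin/bash
#SBATCH --job-name=fl
#SBATCH --nodes=4
# Sketch only: copy the JSON data from the shared filesystem to node-local scratch
# once per node, then point the program at the local copy. Paths are illustrative.
LOCAL_DATA=/tmp/$SLURM_JOB_ID
srun --ntasks-per-node=1 mkdir -p "$LOCAL_DATA"
srun --ntasks-per-node=1 cp -r /mnt/shared/FederatedLearning/data "$LOCAL_DATA/"
srun python train.py --data-dir "$LOCAL_DATA/data"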

jtsaismith commented 5 years ago

Christopher - Chaoyang is not able to scale beyond 7 nodes, as the other nodes show status = IDLE#. Is there a Slurm parameter or configuration setting that is preventing the scale-up beyond 7 nodes?

christopheredsall commented 5 years ago

We suspect there may be a rate-limiting issue in the OCI cloud when starting up a number of nodes at once. Also, the time from when Slurm wants to start the nodes until they are all up to the point where slurmd is running can be long. If this exceeds the ResumeTimeout in /mnt/shared/etc/slurm/slurm.conf (set to 600 seconds by default), Slurm will decide that any nodes not yet started are faulty and will mark them as "down".

We have some options on the CitC side to alleviate this, but you could try increasing ResumeTimeout in /mnt/shared/etc/slurm/slurm.conf from 600 seconds to, say, 1200 and running

sudo systemctl restart slurmctld
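A minimal sketch of that edit, assuming ResumeTimeout appears as "ResumeTimeout=600" on its own line in slurm.conf (you could equally edit the file by hand):

# Sketch only: raise ResumeTimeout to 1200 seconds, then restart the controller
sudo sed -i 's/^ResumeTimeout=600/ResumeTimeout=1200/' /mnt/shared/etc/slurm/slurm.conf
sudo systemctl restart slurmctld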

@jtsaismith - We haven't been able to find documentation on the rate limits. Is any available? Would it be possible, in principle, for these to be increased?

christopheredsall commented 5 years ago

For your current cluster, you'll need to tell Slurm that the nodes that failed to start are actually OK by running

sudo scontrol update NodeName=vm-standard2-24-ad1-[0001-0010],vm-standard2-24-ad2-[0001-0004,0007,0009-0010] State=Resume
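If the resume takes effect, sinfo should report those nodes as idle rather than idle# or down; a quick check using standard sinfo flags (nothing CitC-specific):

# List per-node state; expect "idle" for the resumed nodes
sinfo --Node --long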

jtsaismith commented 5 years ago

@christopheredsall - thanks for the suggestions. When you say "rate limiting", could you define "rate"? Is it the rate at which new nodes (instances) are started? As far as I'm aware, there is no limit.

Is there a way to start and keep ALL nodes in the cluster running at all times? In other words, this cluster can have up to 23 nodes running simultaneously. Can they all be started and running all the time?

chaoyanghe commented 5 years ago

In my setting, I need to keep all the compute nodes running. How do I set this up?

chaoyanghe commented 5 years ago

PARTITION AVAIL  TIMELIMIT  NODES  STATE  NODELIST
compute      up   infinite     18  idle#  vm-standard2-24-ad1-[0001-0010],vm-standard2-24-ad2-[0001-0004,0007-0010]
compute      up   infinite      5  idle   vm-standard2-24-ad2-[0005-0006],vm-standard2-24-ad3-[0001-0003]

I tried "sudo scontrol update", but sinfo still shows the output above: 18 nodes are always "idle#". Could you help me keep all nodes running?

chaoyanghe commented 5 years ago

@christopheredsall Hi, I have a deadline coming up. Could you help me check these two issues:
1) 18 nodes are always showing "idle#" and cannot be allocated. How do I set up all 23 nodes to run all the time? My experiment needs them to run simultaneously.
2) The "Unable to resolve "mgmt": Unknown host" error means I can't run any job.

christopheredsall commented 5 years ago

Answering the simple question first: how to ensure the nodes stay on. At the moment Slurm will turn off a node that hasn't had a job run on it for SuspendTime seconds. This is currently set to 900 seconds in /mnt/shared/etc/slurm/slurm.conf. For your case you could set it to a much larger number and then run

sudo systemctl restart slurmctld
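A minimal sketch of that edit, assuming SuspendTime appears as "SuspendTime=900" on its own line in slurm.conf (Slurm treats a value of -1 as disabling the automatic power-down entirely):

# Sketch only: stop Slurm from powering idle nodes down, then restart the controller
sudo sed -i 's/^SuspendTime=900/SuspendTime=-1/' /mnt/shared/etc/slurm/slurm.conf
sudo systemctl restart slurmctld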

christopheredsall commented 5 years ago

The Unable to resolve "mgmt": Unknown host error is #23 and I'll answer that in the other issue. This is probably the cause of the currently idle# nodes not being able to run jobs.

chaoyanghe commented 5 years ago

@christopheredsall Thanks. Here is the current status:

PARTITION AVAIL  TIMELIMIT  NODES  STATE  NODELIST
compute      up   infinite     18  idle#  vm-standard2-24-ad1-[0001-0010],vm-standard2-24-ad2-[0001-0004,0007-0010]
compute      up   infinite      5  idle   vm-standard2-24-ad2-[0005-0006],vm-standard2-24-ad3-[0001-0003]

And when I submit my job, the output log shows this:

srun: error: Unable to resolve "mgmt": Unknown host
srun: error: Unable to establish control machine address
srun: error: Unable to confirm allocation for job 92: No error
srun: Check SLURM_JOB_ID environment variable. Expired or invalid job 92
slurmstepd: error: Unable to resolve "mgmt": Unknown host
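The actual fix for this came through #23, but for anyone hitting the same error, a quick way to check whether a node can resolve the controller host at all (standard tools, nothing CitC-specific) is:

# Run on an affected compute node: an empty result means "mgmt" does not resolve,
# so the problem is DNS or /etc/hosts rather than Slurm itself
getent hosts mgmt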

milliams commented 4 years ago

This has been fixed in the latest versions of CitC, both in the reporting of the node state (list_nodes) and in the DNS issue causing the problem in the last comment.