OSC / osc-systemstatus

Investigate if Slurm has commands that produce similar output for the Moab commands #56

Closed ericfranz closed 4 years ago

ericfranz commented 6 years ago

Skim the documentation at https://slurm.schedmd.com/ and see if there are commands that provide the same information as MoabShowqClient (https://github.com/AweSim-OSC/osc-systemstatus/blob/488b0656c6dd569d046cc681481641e7c1fade68/lib/moab_showq_client.rb). If so, we could probably spin up a simple Slurm version of the system status app.

treydock commented 6 years ago

@ericfranz What information are you trying to get out of slurm? What does MoabShowqClient provide that you would like to get out of SLURM?

treydock commented 6 years ago

If my read of the code is correct, you want general queue statistics as well as node info. The node info can come from sinfo -N. A few examples:

Allocated/Idle nodes:

$ sinfo -o '%A'
NODES(A/I)
110/207

Nodes Allocated/Idle/Other/Total:

$ sinfo -o '%F'
NODES(A/I/O/T)
97/220/13/330

Mainly look at sinfo. The information about job statistics could come from parsing out squeue or possibly partition info from sinfo if you iterate over all partitions.
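
As a rough sketch of how an app might ingest the `%F` output shown above, the Ruby below parses the A/I/O/T line into named counts. The names `NodeCounts` and `parse_node_counts` are illustrative assumptions, not code from osc-systemstatus, and the sample text is simply the example output quoted in this comment.

```ruby
# Parse the output of `sinfo -o '%F'` (a header plus one A/I/O/T line).
# NodeCounts and parse_node_counts are hypothetical names for illustration.
NodeCounts = Struct.new(:allocated, :idle, :other, :total)

def parse_node_counts(sinfo_output)
  # Find the first line that looks like four slash-separated integers,
  # which skips the "NODES(A/I/O/T)" header.
  line = sinfo_output.each_line.map(&:strip).find do |l|
    l.match?(%r{\A\d+/\d+/\d+/\d+\z})
  end
  raise ArgumentError, "no A/I/O/T line found" if line.nil?

  NodeCounts.new(*line.split("/").map(&:to_i))
end

# Sample text copied from the example above; a real app would capture it
# from the command instead, e.g. with Open3.capture2("sinfo", "-o", "%F").
sample = "NODES(A/I/O/T)\n97/220/13/330\n"
counts = parse_node_counts(sample)
puts "#{counts.allocated} of #{counts.total} nodes allocated"
```

Matching on the shape of the line rather than its position means the same method should work whether or not the header is suppressed with `-h`.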

ericfranz commented 6 years ago

See #43 for a discussion on the value of this feature.

KinanAlAttar commented 4 years ago

The table below gives the equivalent Slurm cmd for each Moab/Torque cmd that we use. (In systemstatus, as far as I know, we only use pbsnodes and showq.) I'll add more cmd equivalences for the sake of completeness.

| Moab/Torque | Slurm |
| --- | --- |
| showq | squeue |
| pbsnodes | scontrol show node |
| qstat | squeue -j <job_id> or scontrol show job <job_id> |
| qhold | scontrol hold <job_id> |
| qrls | scontrol release <job_id> |
| qsub | sbatch/srun/salloc |
| xpbs | sview |
| qalter | scontrol update |
| qdel | scancel |
| showstart | squeue -o "%S" or squeue --start |

For qpeek, Slurm writes to a job's out/err files while the job is running, so there's no need for an equivalent cmd.

treydock commented 4 years ago

If you want to mimic node status from pbsnodes use sinfo. I use sinfo in Prometheus to collect GPU usage: https://github.com/treydock/prometheus-slurm-exporter/blob/osc/gpus.go#L114-L115

You might be able to get everything from sinfo if you are looking at node data. Don't use scontrol: you cannot control its output format, so it's not good for scripts to ingest. Commands like sinfo let you control the formatting.
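
To make the point about controllable formatting concrete, here is a minimal Ruby sketch that parses per-node output from something like `sinfo -N -h -o '%N|%T|%C'` (node name, state, and CPUs as allocated/idle/other/total). The method name `parse_nodes`, the `|` delimiter, and the sample node names and counts are all assumptions made up for illustration; they are not taken from osc-systemstatus or from a real cluster.

```ruby
# Parse `sinfo -N -h -o '%N|%T|%C'` output: one node per line, fields
# separated by "|" (node name, state, CPUs as allocated/idle/other/total).
# parse_nodes and the sample data below are hypothetical.
def parse_nodes(sinfo_output)
  rows = sinfo_output.each_line.map(&:strip).reject(&:empty?).map do |line|
    name, state, cpus = line.split("|")
    alloc, idle, other, total = cpus.split("/").map(&:to_i)
    { name: name, state: state, cpus_allocated: alloc, cpus_idle: idle,
      cpus_other: other, cpus_total: total }
  end
  # With -N a node can be listed once per partition it belongs to,
  # so keep only the first row for each node name.
  rows.uniq { |node| node[:name] }
end

# Invented sample output, standing in for the real command's stdout.
sample = <<~OUT
  node001|allocated|28/0/0/28
  node002|mixed|8/20/0/28
  node003|idle|0/28/0/28
OUT

nodes = parse_nodes(sample)
busy = nodes.count { |n| n[:cpus_allocated] > 0 }
puts "#{busy} of #{nodes.size} nodes have allocated CPUs"
```

Picking an explicit delimiter in the format string keeps the parsing to a single `split`, which is the kind of script-friendly output scontrol does not offer.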

For tracking information about running jobs, you only need squeue with format flags that expose the TRES fields. For example, tres-alloc would contain something like gres/gpu=2 for a job with GPUs, or gres/gpu:v100=2 if the user asks for a specific type of GPU.
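
A small sketch of how the two GPU forms mentioned above could be handled when counting GPUs per job. The helper name `gpus_in_tres` is hypothetical, and the surrounding `cpu=...,mem=...` entries in the sample strings are assumed for illustration; only the `gres/gpu=2` and `gres/gpu:v100=2` forms come from this thread.

```ruby
# Count the GPUs in a TRES string taken from squeue's tres-alloc field.
# gpus_in_tres is a hypothetical helper, not part of any Slurm tooling.
def gpus_in_tres(tres)
  generic = nil # value of a plain "gres/gpu=N" entry, if present
  typed = 0     # sum of typed entries such as "gres/gpu:v100=N"

  tres.split(",").each do |entry|
    key, value = entry.strip.split("=", 2)
    if key == "gres/gpu"
      generic = value.to_i
    elsif key.to_s.start_with?("gres/gpu:")
      typed += value.to_i
    end
  end

  # Assumption: if both forms appear in one string, the generic entry is
  # the total, so prefer it to avoid double counting.
  generic || typed
end

puts gpus_in_tres("cpu=4,mem=16G,node=1,gres/gpu=2")      # generic request
puts gpus_in_tres("cpu=4,mem=16G,node=1,gres/gpu:v100=2") # typed request
puts gpus_in_tres("cpu=1,mem=4G,node=1")                  # no GPUs
```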

ericfranz commented 4 years ago

https://github.com/OSC/osc-systemstatus/pull/86