argonne-lcf / user-guides

ALCF Systems User Documentation
https://docs.alcf.anl.gov/

Aurora job allocation doc: qsub examples to select nodes within slot/chassis/cabinet etc? examples of useful pbs qstat options #527

Open kaushikvelusamy opened 2 weeks ago

kaushikvelusamy commented 2 weeks ago

Providing my notes on some items to cover

1. Allocating Nodes in Specific Racks or Cabinets

Selecting Nodes in Specific Cabinets or Chassis:

Requesting a Single Node per Specified Cabinet:

Determining the Number of Available Nodes:

2. Advanced Node Selection Syntax

Example of Selecting a Specific Number of Cabinets:

3. Limitations and Considerations

4. Additional Useful Commands

Checking Nodes in Chassis:

pbsnodes -avSj | awk '{ if ($2 == "free") print $1 "\t" $2 }'
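For the "Determining the Number of Available Nodes" and "Checking Nodes in Chassis" items, the same filter can be narrowed to one cabinet or chassis. This is a minimal sketch, not a tested ALCF recipe: it assumes column 1 of the pbsnodes -avSj summary is the node name and column 2 is the state, and that node names follow the x<cabinet>c<chassis>s<slot>b<board>n<node> pattern, so a prefix such as x4703c2 means "cabinet 4703, chassis 2". The function names are hypothetical.

```shell
# List free nodes whose name starts with a given prefix.
# Reads `pbsnodes -avSj`-style summary lines on stdin.
# Assumption: field 1 is the node name, field 2 is the (first) state.
free_nodes_in() {
    prefix="$1"
    awk -v p="$prefix" '$2 == "free" && index($1, p) == 1 { print $1 }'
}

# Count the free nodes under the same prefix.
count_free_nodes_in() {
    free_nodes_in "$1" | wc -l | tr -d ' '
}

# Usage on a login node (prefix is an example only):
#   pbsnodes -avSj | count_free_nodes_in x4703c2
```

Because only the first state is shown in the summary view, a node listed as free here is free as far as that view can tell.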

You can select a specific node with

-l select=host=x4703c2s3b0n0

or request several specific nodes in one interactive job:

qsub -l select=host=x1922c6s3b0n0+1:host=x1922c7s6b0n0 -q workq-route -l walltime=00:20:00 -l filesystems=gila -A Aurora_deployment -I
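When the host list gets longer, typing the select string by hand is error-prone. A small sketch of a helper that builds it from a list of hostnames follows; build_host_select is a hypothetical name, and it writes the chunk count 1: explicitly on every chunk, which should be equivalent to the form above where the count on the first chunk is omitted.

```shell
# Build a PBS select string requesting one chunk per named host.
# For arguments A and B this prints: 1:host=A+1:host=B
build_host_select() {
    sel=""
    for h in "$@"; do
        if [ -z "$sel" ]; then
            sel="1:host=$h"          # first chunk, no leading '+'
        else
            sel="$sel+1:host=$h"     # later chunks are joined with '+'
        fi
    done
    printf '%s\n' "$sel"
}

# Usage (queue, filesystem and project taken from the example above):
#   qsub -l select="$(build_host_select x1922c6s3b0n0 x1922c7s6b0n0)" \
#        -q workq-route -l walltime=00:20:00 -l filesystems=gila \
#        -A Aurora_deployment -I
```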

Use watch qstat -was1 workq to see what is queued up and about to begin in a queue (here, workq).

To order by running and then waiting:

qstat -Twas1 lustre_scaling

To also sort by number of nodes, pipe the output through column and sort:

qstat -Twas1 lustre_scaling | column -t | sort -k 6 -n
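One wrinkle with sorting the whole listing is that the header lines get sorted in with the jobs. A hedged sketch that keeps non-job lines on top and sorts only the job lines by field 6 is shown here. It assumes job lines start with a numeric job ID and that field 6 is the node count, as in the sort above; sort_jobs_by_nodes is a hypothetical name.

```shell
# Sort qstat job lines by field 6 (assumed to be the node count),
# printing header and separator lines first, in their original order.
# Assumption: job lines begin with a digit (the job ID).
sort_jobs_by_nodes() {
    input="$(cat)"
    printf '%s\n' "$input" | grep -v '^[0-9]'                  # headers etc.
    printf '%s\n' "$input" | grep '^[0-9]' | sort -k 6,6n      # job lines
}

# Usage:
#   qstat -Twas1 lustre_scaling | sort_jobs_by_nodes | column -t
```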

qstat -fxw: check the comment field. If run_count is increasing, PBS is trying to offline nodes and bring in new ones.
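To watch just those two fields instead of reading the full record, the output can be filtered. This is a sketch that assumes the usual "name = value" layout of the full job listing, one attribute per line (which is what -w is for); job_field is a hypothetical helper name.

```shell
# Print the value of one attribute from `qstat -f`-style output on stdin.
# Assumption: attributes appear as "<indent>name = value", one per line.
job_field() {
    field="$1"
    awk -v f="$field" '
        {
            line = $0
            sub(/^[ \t]+/, "", line)          # strip leading indentation
            key = f " = "
            if (index(line, key) == 1)        # exact attribute name match
                print substr(line, length(key) + 1)
        }'
}

# Usage:
#   qstat -fxw 8997637.amn-0001 | job_field run_count
#   qstat -fxw 8997637.amn-0001 | job_field comment
```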

qstat -xwpau $USER shows a list of your recently submitted jobs, where you can compare Elap Time against Req'd Time.

Nodes can have more than one status (down,offline is pretty common, for instance). PBS will only show the first one on the list in a summary view like pbsnodes -avSj

You should also keep in mind that node statuses matter.

pbsnodes -avSj

and

pbsnodes -l

will help a lot with that (the first shows job IDs with nodes and their status; the second shows nodes that are considered 'down' and are in an unusable state).

so

qstat -was1 workq

will get you that info for workq. Also

qstat -Qf workq

will show full details on the queue; the resources_assigned.nodect entry shows how many nodes have jobs running on them.
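To pull out only that number, the queue listing can be filtered. This is a one-function sketch that assumes the entry appears as "resources_assigned.nodect = N" on its own line; queue_nodect is a hypothetical name.

```shell
# Print the resources_assigned.nodect value from `qstat -Qf` output on stdin.
# Assumption: the entry is printed as "resources_assigned.nodect = <number>".
queue_nodect() {
    awk -F' = ' '$1 ~ /resources_assigned\.nodect$/ { print $2 }'
}

# Usage:
#   qstat -Qf workq | queue_nodect
```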

qstat -fx 8997637.amn-0001 shows the full record for a single job, including jobs that have already finished (-x).

pbs_rstat shows reservations.

kaushikvelusamy commented 2 weeks ago

A reference to clush and pdsh would be helpful to new users.