LLNL / ATS

ATS - Automated Testing System - is an open-source, Python-based tool for automating the running of tests of an application across a broad range of high performance computers.
BSD 3-Clause "New" or "Revised" License

Salloc + Slurm on Toss 4 (rzwhippet) Fails #133

Closed dawson6 closed 11 months ago

dawson6 commented 1 year ago

On Toss 3, one could run like so:

salloc -N 3 -p pdebug --exclusive srun -n 1

This runs atswrapper on one of the allocated nodes; atswrapper then issues its own 'srun -n 1' commands from that node to submit all the jobs.

Although atswrapper is not an MPI application, launching it this way keeps the follow-up srun commands that atswrapper submits off the login node.

This works on TOSS 3.

But on TOSS 4 (rzwhippet) the follow-up srun jobs all fail with:

srun: error: CPU binding outside of job step allocation, allocated CPUs are: 0x00000000000003000000000000070000000000000300000000000007.
srun: error: CPU binding outside of job step allocation, allocated CPUs are: 0x00000000000003000000000000070000000000000300000000000007.
srun: error: Task launch for StepId=1932.2 failed on node rzwhippet40: Unable to satisfy cpu bind request
srun: error: Task launch for StepId=1932.2 failed on node rzwhippet41: Unable to satisfy cpu bind request
srun: error: Application launch failed: Unable to satisfy cpu bind request

Now, one can run like so:

salloc -N 3 -p pdebug --exclusive

That runs, but it issues the 'srun's from the login node, which looks bad.

Or one can split the work into two steps:

1) salloc the node somehow
2) run atswrapper

But combining the salloc ... srun into one line has issues now; it did not on TOSS 3.

ryanday36 commented 1 year ago

This seems to be a pretty clear bug in how Slurm is setting up binding when you specify cpus-per-task (rzwhippet is running a newer version of Slurm than rzgenie). I'll report it to the Slurm developers so that we can get it fixed.

It looks like you can work around this by setting any valid '--cpu-bind' argument. E.g. here I set SLURM_CPU_BIND=quiet (which is the default cpu-bind behavior), and it rescues my simple reproducer.

[day36@rzwhippet17:salloc_test]$ cat runstuff.sh 
#!/bin/sh
echo "#works"
SLURM_CPU_BIND=quiet srun --mpibind=off --nodes=1 --ntasks=1 --cpus-per-task=1 hostname
echo ""
echo "#fails"
srun --mpibind=off --nodes=1 --ntasks=1 --cpus-per-task=1 hostname

[day36@rzwhippet17:salloc_test]$ salloc -N2 --exclusive srun -N1 -n1 ./runstuff.sh 
salloc: Granted job allocation 2365
salloc: Waiting for resource configuration
salloc: Nodes rzwhippet[16,22] are ready for job
#works
rzwhippet22

#fails
srun: error: CPU binding outside of job step allocation, allocated CPUs are: 0x00000000000000000000000000010000000000000000000000000001.
srun: error: Task launch for StepId=2365.2 failed on node rzwhippet22: Unable to satisfy cpu bind request
srun: error: Application launch failed: Unable to satisfy cpu bind request
srun: Job step aborted
srun: error: rzwhippet16: task 0: Exited with exit code 192
salloc: Relinquishing job allocation 2365
[day36@rzwhippet17:salloc_test]$
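
The workaround in the reproducer prefixes a single srun with SLURM_CPU_BIND=quiet. A wrapper that issues many srun commands could instead export the variable once so that every nested srun inherits it. The following is a minimal sketch under that assumption, not something taken from ATS itself: `run_step` and the `LAUNCHER` override are hypothetical names introduced here so the function can be dry-run on a machine without Slurm, and the srun flags are copied from the reproducer above.

```shell
#!/bin/sh
# Sketch only: apply the SLURM_CPU_BIND=quiet workaround to every nested srun.
# LAUNCHER is a hypothetical override (not part of ATS or Slurm) that allows a
# dry run without Slurm, e.g. LAUNCHER=echo.
LAUNCHER="${LAUNCHER:-srun}"

# Export once so each later srun inherits the setting, instead of
# prefixing every command line individually as the reproducer does.
export SLURM_CPU_BIND=quiet

# run_step is a hypothetical helper; "$@" is the command to launch.
# The flags match the reproducer in this thread.
run_step() {
    "$LAUNCHER" --mpibind=off --nodes=1 --ntasks=1 --cpus-per-task=1 "$@"
}
```

Inside an allocation, `run_step hostname` would then issue the same command as the "#works" line of the reproducer. Setting `LAUNCHER=echo` first prints the command line instead of running it.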

dawson6 commented 11 months ago

Resolved; we just have to teach folks to add --interactive after the outer srun:

salloc -N --exclusive srun --interactive -n 1
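
For illustration, the one-line form can be assembled by a small helper. This is a sketch under stated assumptions: `ats_launch_cmd` is a hypothetical function, the node count and partition are taken from the TOSS 3 command at the top of this issue, and the wrapped command is a placeholder because the thread does not show it. The helper only prints the command line and does not run it.

```shell
#!/bin/sh
# Sketch only: print the one-line salloc + srun --interactive launch.
# ats_launch_cmd is a hypothetical helper; node count, partition and the
# wrapped command are caller-supplied placeholders, not values from ATS.
ats_launch_cmd() {
    nodes="$1"
    partition="$2"
    shift 2
    # --interactive on the outer srun is the fix reported in this thread.
    printf 'salloc -N %s -p %s --exclusive srun --interactive -n 1 %s\n' \
        "$nodes" "$partition" "$*"
}
```

For example, `ats_launch_cmd 3 pdebug ./my_wrapper` prints a command with the same shape as the original TOSS 3 invocation plus `--interactive`, where `./my_wrapper` stands in for whatever command starts atswrapper.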