Closed dawson6 closed 11 months ago
This seems to be a pretty clear bug in how Slurm is setting up binding when you specify cpus-per-task (rzwhippet is running a newer version of Slurm than rzgenie). I'll report it to the Slurm developers so that we can get it fixed.
It looks like you can work around this by setting any valid '--cpu-bind' argument. E.g. here I set SLURM_CPU_BIND=quiet (which is the default cpu-bind behavior), and it rescues my simple reproducer.
[day36@rzwhippet17:salloc_test]$ cat runstuff.sh
#!/bin/sh
echo "#works"
SLURM_CPU_BIND=quiet srun --mpibind=off --nodes=1 --ntasks=1 --cpus-per-task=1 hostname
echo ""
echo "#fails"
srun --mpibind=off --nodes=1 --ntasks=1 --cpus-per-task=1 hostname
[day36@rzwhippet17:salloc_test]$ salloc -N2 --exclusive srun -N1 -n1 ./runstuff.sh
salloc: Granted job allocation 2365
salloc: Waiting for resource configuration
salloc: Nodes rzwhippet[16,22] are ready for job
#works
rzwhippet22
#fails
srun: error: CPU binding outside of job step allocation, allocated CPUs are: 0x00000000000000000000000000010000000000000000000000000001.
srun: error: Task launch for StepId=2365.2 failed on node rzwhippet22: Unable to satisfy cpu bind request
srun: error: Application launch failed: Unable to satisfy cpu bind request
srun: Job step aborted
srun: error: rzwhippet16: task 0: Exited with exit code 192
salloc: Relinquishing job allocation 2365
[day36@rzwhippet17:salloc_test]$
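For scripts that issue many srun calls, the workaround above could be wrapped in a small helper. This is only a sketch: `with_cpu_bind_default` is a hypothetical name, not part of Slurm. It assumes, as in the reproducer, that having SLURM_CPU_BIND set to a valid value is enough to avoid the failure.

```shell
#!/bin/sh
# Hypothetical helper (not part of Slurm): run a command with
# SLURM_CPU_BIND defaulted to "quiet" unless the caller already set it.
with_cpu_bind_default() {
    SLURM_CPU_BIND="${SLURM_CPU_BIND:-quiet}" "$@"
}

# Example use inside an allocation, with the same srun flags as the reproducer:
#   with_cpu_bind_default srun --mpibind=off --nodes=1 --ntasks=1 --cpus-per-task=1 hostname
```

A value already present in the environment is passed through unchanged, so a user who wants a different cpu-bind setting keeps it.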
Resolved. We just have to teach folks to add --interactive right after srun.
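A minimal sketch of what that might look like for the one-line salloc/srun launch discussed in this thread. The salloc options are copied from the report; `run_in_alloc`, `DRY_RUN`, and the `./atswrapper` path are hypothetical names used for illustration, and placing `--interactive` directly after `srun` follows the comment above rather than anything verified here.

```shell
#!/bin/sh
# Sketch: launch a driver command on one node of a fresh allocation,
# adding --interactive right after the outer srun.
# run_in_alloc and DRY_RUN are hypothetical, not Slurm features.
run_in_alloc() {
    set -- salloc -N 3 -p pdebug --exclusive srun --interactive -n 1 "$@"
    if [ -n "${DRY_RUN:-}" ]; then
        # Print the command instead of running it.
        echo "$*"
    else
        "$@"
    fi
}

# Example use (the path to atswrapper is an assumption):
#   run_in_alloc ./atswrapper
```

Setting DRY_RUN=1 prints the assembled command line, which is handy for checking the flag order before requesting nodes.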
salloc -N
On TOSS 3, one could run like so:
salloc -N 3 -p pdebug --exclusive srun -n 1
And that will run the atswrapper on 1 of the allocated nodes, which would then run 'srun -n 1' commands on that node to submit all the jobs.
The benefit is that, although 'atswrapper' is not an MPI application, launching it this way keeps the follow-up srun jobs that atswrapper submits from running on the login node.
This works on TOSS 3.
But on TOSS 4 (rzwhippet) the follow-up srun jobs all fail with:
srun: error: CPU binding outside of job step allocation, allocated CPUs are: 0x00000000000003000000000000070000000000000300000000000007.
srun: error: CPU binding outside of job step allocation, allocated CPUs are: 0x00000000000003000000000000070000000000000300000000000007.
srun: error: Task launch for StepId=1932.2 failed on node rzwhippet40: Unable to satisfy cpu bind request
srun: error: Task launch for StepId=1932.2 failed on node rzwhippet41: Unable to satisfy cpu bind request
srun: error: Application launch failed: Unable to satisfy cpu bind request
Now, one can run like so:
salloc -N 3 -p pdebug --exclusive
That runs, but it does the 'srun's from the login node, which looks bad.
Or one can split it into two steps:
1) salloc the nodes somehow
2) run atswrapper
But combining the salloc ... srun into one line has issues now; it did not on TOSS 3.