Open ltimmerman3 opened 3 weeks ago
That's indeed one issue with srun (as opposed to mpirun): terminating srun processes requires using Slurm to terminate the step. https://github.com/SPARC-X/SPARC-X-API/blob/c419ce21b87fd8ddc795a5c7034be7a90162a641/sparc/calculator.py#L797 already implements a termination procedure, but there could be more going on in the actual srun process hierarchy. I'll take a look.
We could also close the socket on the C-SPARC side when it receives the EXIT message. This may actually be the safer thing to work on, since enumerating all possible combinations of MPI/Slurm is tedious.
To test
Describe the bug
Checking for socket compatibility currently requires running
srun path/to/sparc/executable
without the --name input, which produces stdout output that either includes or omits -socket. This fails when the run command launches more than one processor, because only the 0th task exits while all others hang.
To Reproduce
Expected behavior
All processes should exit
Actual output or error trace
Only task 0 exits
This can be handled by enforcing
srun -n 1 path/to/sparc
as the run command for the compatibility check. Need to decide how to implement this. Simplest: check whether "srun" is in the command and, if so, edit the command to use srun -n 1.
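A rough sketch of that simplest option, just to make the idea concrete. The helper name `force_single_task` is hypothetical and not part of SPARC-X-API. It assumes the run command is a plain string whose first token is the launcher, and it treats every `-n` / `--ntasks` token as belonging to srun, so it would need more care if the executable itself took a `-n` argument. `--ntasks-per-node` and other srun options are left untouched.

```python
import os
import shlex


def force_single_task(command: str) -> str:
    """Return `command` with srun limited to one task.

    Hypothetical helper: commands that do not start with srun are
    returned unchanged.
    """
    tokens = shlex.split(command)
    if not tokens or os.path.basename(tokens[0]) != "srun":
        return command  # not an srun launch, leave it alone

    cleaned = [tokens[0]]
    i = 1
    while i < len(tokens):
        tok = tokens[i]
        if tok in ("-n", "--ntasks"):
            i += 2  # drop the flag and its separate value
            continue
        if tok.startswith("--ntasks=") or (tok.startswith("-n") and tok[2:].isdigit()):
            i += 1  # drop combined forms such as --ntasks=8 or -n8
            continue
        cleaned.append(tok)
        i += 1

    # Re-insert a single explicit task count right after the launcher.
    return shlex.join([cleaned[0], "-n", "1", *cleaned[1:]])
```

With this, `force_single_task("srun -n 4 path/to/sparc")` gives `srun -n 1 path/to/sparc`, while an mpirun command passes through as-is. Stripping any existing task count before inserting `-n 1` avoids relying on how srun resolves a repeated option.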