SPARC-X / SPARC-X-API


check_socket_compatibility fails with srun on more than one processor #46

Open ltimmerman3 opened 3 weeks ago

ltimmerman3 commented 3 weeks ago

Describe the bug: The socket compatibility check currently runs srun path/to/sparc/executable without the --name input, which makes the executable print output to stdout that either contains -socket or does not. This fails when the run command launches more than one processor: only task 0 exits, while all other tasks hang.

To Reproduce

Expected behavior: All processes should exit.

Actual output or error trace: Only task 0 exits.

This can be handled by enforcing srun -n 1 path/to/sparc as the run command for the compatibility check. We still need to decide how to implement this. The simplest approach: if "srun" is in the command, edit the command to use srun -n 1.
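A minimal sketch of that "simplest" approach, to make the discussion concrete. The helper name force_single_task is hypothetical and is not part of SPARC-X-API. It assumes the command starts with srun and that the task count is given only as -n, -nN, --ntasks N or --ntasks=N; other ways of setting it (for example environment variables) are not handled.

```python
import os
import re
import shlex

# Inline forms of the task-count option: "-n8" or "--ntasks=8".
# The pattern requires digits, so program flags such as "-name" are untouched.
_NTASKS_INLINE = re.compile(r"^(-n\d+|--ntasks=\d+)$")


def force_single_task(command):
    """Hypothetical helper: rewrite an srun command to launch exactly one task.

    Commands that do not start with srun (e.g. mpirun) are returned unchanged.
    """
    tokens = shlex.split(command)
    if not tokens or os.path.basename(tokens[0]) != "srun":
        return command

    kept = []
    rest = iter(tokens[1:])
    for tok in rest:
        if tok in ("-n", "--ntasks"):
            next(rest, None)  # drop the separate value that follows the flag
            continue
        if _NTASKS_INLINE.match(tok):
            continue  # drop "-n8" / "--ntasks=8"
        kept.append(tok)

    # Re-insert a single "-n 1" right after srun.
    return shlex.join([tokens[0], "-n", "1"] + kept)


print(force_single_task("srun -n 8 path/to/sparc"))
```

One caveat with this kind of string editing: it scans the whole command, so an executable that itself accepts a bare -n argument would also lose it. That fragility is part of why the implementation still needs a decision.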

alchem0x2A commented 3 weeks ago

That's indeed one issue with srun (as opposed to mpirun): terminating srun processes requires using Slurm to terminate the step. https://github.com/SPARC-X/SPARC-X-API/blob/c419ce21b87fd8ddc795a5c7034be7a90162a641/sparc/calculator.py#L797 already implements the termination procedure, but there could be more happening in the actual srun hierarchy. I'll take a look.

We could also implement closing the socket upon receiving the EXIT message on the C-SPARC side. This may actually be the safer option to work on, since enumerating all possible combinations of MPI/Slurm is tedious.
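To illustrate the close-on-EXIT idea, here is a rough sketch of the control flow in Python. The real change would live in the C code of SPARC, so this is only a model of the logic. It assumes fixed-width, space-padded 12-byte message headers in the style of the i-PI protocol; the names HEADER_LEN, recv_exact and serve_until_exit are hypothetical.

```python
import socket

HEADER_LEN = 12  # assumption: i-PI-style fixed-width, space-padded headers


def recv_exact(sock, n):
    """Read exactly n bytes, or raise if the peer closes the connection early."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the socket")
        buf += chunk
    return buf


def serve_until_exit(sock):
    """Read message headers until EXIT arrives, then close the socket.

    Returns the list of headers seen before EXIT. Closing in `finally`
    means the socket is released even if the connection drops.
    """
    seen = []
    try:
        while True:
            header = recv_exact(sock, HEADER_LEN).decode("ascii").strip()
            if header == "EXIT":
                break  # stop serving; the socket is closed below
            seen.append(header)
    finally:
        sock.close()
    return seen


if __name__ == "__main__":
    # Local demonstration with a connected socket pair instead of a real server.
    server_side, client_side = socket.socketpair()
    server_side.sendall(b"STATUS".ljust(HEADER_LEN) + b"EXIT".ljust(HEADER_LEN))
    print(serve_until_exit(client_side))
    server_side.close()
```

The appeal of this route is that the client exits on its own once the socket is closed, regardless of which launcher (mpirun, srun, or another) started it.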

To test