~that's indeed one issue with srun (as opposed to mpirun): terminating srun processes requires using slurm to terminate the step. https://github.com/SPARC-X/SPARC-X-API/blob/c419ce21b87fd8ddc795a5c7034be7a90162a641/sparc/calculator.py#L797 has implemented the termination procedure, but there could be more things happening in the actual srun hierarchy. I'll take a look~
~we could also implement closing the socket on receiving the EXIT message on the C-SPARC side. This may actually be safer to work on, since enumerating all possible combinations of mpi/slurm is tedious.~
@ltimmerman3 It seems my previous response was too complicated. I may have been mistaken, but the following setup works for me with normal srun settings, could you check? To test:

```bash
module load gcc mvapich2 openblas fftw netlib-scalapack
make USE_MKL=0 USE_FFTW=1 USE_SCALAPACK=1 USE_SOCKET=1
export ASE_SPARC_COMMAND="srun -n 24 --export=ALL path/to/sparc"
python -c "from sparc.calculator import SPARC; print(SPARC(use_socket=True).detect_socket_compatibility())"
```
since the detect compatibility function only executes whatever sparc command is available on the system, without an actual -name suffix (https://github.com/alchem0x2A/SPARC-X-API/blob/9136ce832cbfc2fb6519751409721036a9bcacc2/sparc/calculator.py#L967), the subprocess should return regardless of whether any socket communication is started. I'm curious how to reproduce the scenario you've observed.
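For reference, the check boils down to something like the following (a minimal sketch, not the actual implementation; the command string and timeout value are illustrative):

```python
import subprocess

def detect_socket_compatibility(command="srun -n 24 path/to/sparc"):
    """Sketch of the compatibility check: run the bare sparc command
    (no -name suffix) and look for socket support in its output.
    The timeout guard is illustrative; the reported bug is that under
    some MPI stacks only rank 0 exits and the subprocess never returns.
    """
    try:
        proc = subprocess.run(
            command.split(),
            capture_output=True,
            text=True,
            timeout=60,
        )
    except subprocess.TimeoutExpired:
        # Hanging ranks (e.g. intel mpi with srun -n > 1) end up here
        return False
    return "-socket" in (proc.stdout + proc.stderr)
```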
@alchem0x2A I ran with the exact settings you provided and it returned just fine. However, recompiling with

```bash
ml intel-oneapi-compilers intel-oneapi-mpi intel-oneapi-mkl
make USE_MKL=1 USE_FFTW=0 USE_SCALAPACK=0 USE_SOCKET=1
```

allowed me to reproduce the error of srun hanging whenever the run command used `srun -n` with more than one task.
you're right! this issue stems from the check_inputs function in SPARC's initialization.c (https://github.com/SPARC-X/SPARC/blob/ef868ee6143bad3da9fd84aacb56981ce9ea3801/src/initialization.c#L484), where the check is handled on rank 0 and the exit signal is emitted on that single rank; that explains the different behavior on different mpi platforms.
I confirm this behavior also exists for the non-socket code: simply running `srun -n N sparc` without any command suffixes will leave every rank except 0 hanging on intel mpi.
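The failure pattern can be reproduced in miniature with mpi4py (a hedged illustration of the pattern, not SPARC's actual C code): rank 0 exits alone after a failed check, and whether the remaining ranks hang or get cleaned up depends on the MPI implementation.

```python
from mpi4py import MPI
import sys

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Problematic pattern: the input check runs on rank 0 only, and on
# failure rank 0 exits alone instead of aborting the whole communicator.
if rank == 0:
    print("input check failed on rank 0; exiting")
    sys.exit(1)  # comm.Abort(1) would terminate all ranks instead

# Ranks 1..N-1 reach a collective and wait for rank 0 forever; whether
# the launcher notices depends on the MPI stack, matching the
# mvapich2-vs-intel-mpi difference observed above.
comm.barrier()
```

Launching this with `srun -n 4 python repro.py` should show the same stack-dependent hang, while adding -K should make slurm kill the step as soon as rank 0 exits with a bad code.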
I believe the ultimate solution is to always call MPI_Abort instead of exit in the original SPARC code, but it could be a lot of work. One safer guardrail on the user side is the --kill-on-bad-exit option in srun (https://slurm.schedmd.com/srun.html#OPT_kill-on-bad-exit); one sample ASE_SPARC_COMMAND would thus be

```bash
export ASE_SPARC_COMMAND="srun -n 24 --export=ALL -K sparc"
```
I've tested that it works on both mvapich2 and intel mpi, could you take a look? If the -K option can cover most cases then I'd advise leaving the setting to the users, since parsing ASE_SPARC_COMMAND may not be very straightforward.
My recommendation would be to take a hybrid approach. Rather than relying on the user to figure this out or set ASE_SPARC_COMMAND themselves, we could generate a sparc_env.sh file at runtime or at sparc-x-api install time that detects the environment (slurm or not slurm) and writes the appropriate commands, or at least the appropriate ASE_SPARC_COMMAND; see the sketch below.
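Something along these lines, for instance (a rough sketch; the file name, detection logic, and defaults are all hypothetical):

```python
import os

def write_sparc_env(path="sparc_env.sh", sparc_bin="path/to/sparc", ntasks=24):
    """Hypothetical helper: detect whether we are inside a Slurm
    allocation and write an appropriate ASE_SPARC_COMMAND into a
    sourceable shell file."""
    if "SLURM_JOB_ID" in os.environ:
        # Under Slurm, include -K so a bad exit on any rank kills the step
        cmd = f"srun -n {ntasks} --export=ALL -K {sparc_bin}"
    else:
        cmd = f"mpirun -n {ntasks} {sparc_bin}"
    with open(path, "w") as f:
        f.write(f'export ASE_SPARC_COMMAND="{cmd}"\n')
```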
@ltimmerman3 Thx for the suggestions. ASE 3.23 introduces a profile system which is super cool: https://wiki.fysik.dtu.dk/ase/ase/calculators/calculators.html#calculator-configuration. That's closer to the approach you're mentioning, and we could clearly show a default template in our doc. In this case ASE_SPARC_COMMAND, SPARC_DOC_PATH etc. could be saved on a per-file basis. I'll bring it up in another issue.
For now let's add a warning in the doc for the user to add the -K option, but eventually in the v1.1 release (which will fully support the ase 3.23 style) we should move to the profile system.
@alchem0x2A Quick follow up. I just ran a multi-node job (2 nodes, 48 processors) for initial testing with PLUMED. The actual run time for DFT was 4 seconds, but the job took over an hour. According to the job accounting info, over an hour was spent in a "failed" sparc call (which I assume corresponds to the compatibility check). As such, despite the fact that the job eventually ran with the -K option, I don't think this is a viable solution. I "hacked" the calculator file to enforce the -n 1 flag in the detect compatibility function, and the code executed as expected. If the update to ase 3.23 is imminent then maybe this is not a priority, but it's something to be aware of.
thx for the updates. That could make sense: the -K switch relies solely on the slurm scheduler to kill the extra processes. It would make more sense to remove the compatibility check from the actual socket calculation part and reformat the sparc.quicktest module for more robust checks. Let's keep this issue open until #48 is done.
Updated SPARC's check_inputs, which rectifies this issue. The fix needs to be included in SPARC.
Describe the bug
Checking for socket compatibility currently requires running

```bash
srun path/to/sparc/executable
```

without the -name input, which produces stdout indicating whether the executable was compiled with/without -socket support. This fails when the run command invokes more than one processor, as only the 0th task exits while all others hang.

To Reproduce
Run the compatibility check with srun -n greater than 1.
Expected behavior
All processes should exit.
Actual output or error trace
Only task 0 exits.
This can be handled by enforcing

```bash
srun -n 1 path/to/sparc
```

as the run command for the compatibility check. Need to decide how to implement. Simplest: check if "srun" is in the command and, if so, edit the command to use srun -n 1, as in the sketch below.
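A minimal sketch of that simplest approach (hypothetical helper; a real implementation would also need to handle other launchers and flag spellings):

```python
import re

def force_single_task(command: str) -> str:
    """If the launcher is srun, rewrite the command so the compatibility
    check runs on a single task; leave other launchers untouched."""
    parts = command.split()
    if not parts or parts[0] != "srun":
        return command
    # Drop any existing task-count flags before pinning -n 1
    cleaned = re.sub(r"\s(?:-n|--ntasks)[= ]\S+", " ", command)
    parts = cleaned.split()
    return " ".join([parts[0], "-n", "1"] + parts[1:])
```

For example, `force_single_task("srun -n 24 --export=ALL -K sparc")` returns `"srun -n 1 --export=ALL -K sparc"`, while a plain `mpirun` command passes through unchanged.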