SPARC-X / SPARC-X-API

GNU General Public License v3.0
11 stars 10 forks source link

check_socket_compatibility fails with srun on more than one processor #46

Closed ltimmerman3 closed 3 weeks ago

ltimmerman3 commented 2 months ago

Describe the bug: Checking for socket compatibility currently requires running `srun path/to/sparc/executable` without the `-name` input and comparing the stdout with and without `-socket`. This fails when the run command invokes more than one processor: only the 0th task exits, while all the others hang.

To Reproduce

Expected behavior: All processes should exit.

Actual output or error trace: Only task 0 exits.

This can be handled by enforcing `srun -n 1 path/to/sparc` as the run command for the compatibility check. We need to decide how to implement it. The simplest option: if "srun" appears in the command, rewrite the command to use `srun -n 1`.
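A minimal sketch of that idea in Python (the function name and the flag handling are illustrative assumptions, not the actual SPARC-X-API code):

```python
import shlex


def normalize_check_command(command: str) -> str:
    """Force a single task for the socket-compatibility check.

    If the launcher is srun, drop any existing -n/--ntasks setting
    and insert "-n 1" so the check runs on exactly one task.
    Sketch only: a real implementation would also need to handle
    mpirun/mpiexec and other launcher-specific flags.
    """
    tokens = shlex.split(command)
    if not tokens or tokens[0] != "srun":
        return command  # leave non-srun commands untouched

    out = ["srun", "-n", "1"]
    skip_next = False
    for tok in tokens[1:]:
        if skip_next:
            skip_next = False  # this token was the old task count
            continue
        if tok in ("-n", "--ntasks"):
            skip_next = True
            continue
        if tok.startswith("--ntasks="):
            continue
        out.append(tok)
    return " ".join(out)
```

The compatibility check could then run `normalize_check_command(user_command)` instead of the raw user command, leaving the full `-n N` command intact for the actual calculation.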

alchem0x2A commented 2 months ago

~that's indeed one issue with srun (as opposed to mpirun): terminating srun processes requires using Slurm to terminate the step. https://github.com/SPARC-X/SPARC-X-API/blob/c419ce21b87fd8ddc795a5c7034be7a90162a641/sparc/calculator.py#L797 implements the termination procedure, but there could be more going on in the actual srun hierarchy; I'll take a look~

~we could also implement closing the socket on receiving the EXIT message on the C-SPARC side. This may actually be the safer approach, since enumerating all possible combinations of MPI/Slurm is tedious.~

To test

alchem0x2A commented 2 months ago

@ltimmerman3 It seems my previous response was overcomplicated.

I may be mistaken, but the following setup works for me with normal srun settings; could you check?

Since the detect-compatibility function only executes whatever sparc command is available on the system, without an actual -name suffix (https://github.com/alchem0x2A/SPARC-X-API/blob/9136ce832cbfc2fb6519751409721036a9bcacc2/sparc/calculator.py#L967), the subprocess should return regardless of whether any socket communication is started. I'm curious how to reproduce the scenario you've observed.

ltimmerman3 commented 2 months ago

@alchem0x2A I ran with the exact settings you provided and it returned just fine. However, recompiling with make USE_MKL=1 USE_FFTW=0 USE_SCALAPACK=0 USE_SOCKET=1 (after ml intel-oneapi-compilers intel-oneapi-mpi intel-oneapi-mkl) allowed me to reproduce the hang whenever the run command used srun -n N with N > 1.

alchem0x2A commented 2 months ago

you're right! This issue stems from the check_inputs function in SPARC's initialization.c (https://github.com/SPARC-X/SPARC/blob/ef868ee6143bad3da9fd84aacb56981ce9ea3801/src/initialization.c#L484): the check is handled on rank 0 and the exit signal is emitted on that single rank, which explains the different behavior across MPI platforms.

I can confirm this behavior also exists for the non-socket code: simply running srun -n N sparc without any command suffix will cause rank 0 to hang on Intel MPI.

I believe the ultimate solution is to always use MPI_Abort instead of exit in the original SPARC code, but that could be a lot of work. A safer guardrail on the user side is the --kill-on-bad-exit option in srun (https://slurm.schedmd.com/srun.html#OPT_kill-on-bad-exit); a sample ASE_SPARC_COMMAND would thus be

export ASE_SPARC_COMMAND="srun -n 24 --export=ALL -K sparc"

I've tested that it works on both MVAPICH2 and Intel MPI; could you take a look? If the -K option covers most cases, I'd advise leaving the setting to the users, since parsing ASE_SPARC_COMMAND may not be very straightforward.

ltimmerman3 commented 2 months ago

My recommendation would be a hybrid approach. Rather than relying on the user to figure this out or to set ASE_SPARC_COMMAND, we could generate a sparc_env.sh file at runtime or at SPARC-X-API install time that detects the environment (Slurm or not) and writes the appropriate commands, or at least the appropriate ASE_SPARC_COMMAND.
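One way such a generator could look, as a sketch (the helper names and the Slurm-detection heuristic are hypothetical; a real implementation would need to cover more launchers and schedulers):

```python
import os


def suggest_sparc_command(n_tasks: int = 24) -> str:
    """Return a suggested ASE_SPARC_COMMAND for the current environment.

    Hypothetical helper: detects a Slurm allocation via the standard
    SLURM_JOB_ID environment variable, and adds -K (--kill-on-bad-exit)
    so a bad exit on one rank kills the whole srun step.
    """
    if "SLURM_JOB_ID" in os.environ:
        return f"srun -n {n_tasks} --export=ALL -K sparc"
    return f"mpirun -np {n_tasks} sparc"


def write_sparc_env(path: str = "sparc_env.sh", n_tasks: int = 24) -> None:
    """Write a sparc_env.sh that exports the suggested run command."""
    with open(path, "w") as fh:
        fh.write(f'export ASE_SPARC_COMMAND="{suggest_sparc_command(n_tasks)}"\n')
```

The user would then `source sparc_env.sh` before launching, instead of hand-crafting the command per machine.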

alchem0x2A commented 2 months ago

@ltimmerman3 Thx for the suggestions. ASE 3.23 introduces a profile system, which is super cool: https://wiki.fysik.dtu.dk/ase/ase/calculators/calculators.html#calculator-configuration. That's close to the approach you're describing, and we could show a default template in our doc. In that case ASE_SPARC_COMMAND, SPARC_DOC_PATH, etc. could be saved on a per-file basis. I'll bring it up in another issue.

For now let's add a warning in the doc telling users to add the -K option, but eventually, in the v1.1 release (which will fully support the ASE 3.23 style), we should move to the profile system.


ltimmerman3 commented 1 month ago

@alchem0x2A Quick follow-up. I just ran a multi-node job (2 nodes, 48 processors) for initial testing with PLUMED. The actual DFT run time was 4 seconds, but the job took over an hour. According to the job accounting info, over an hour was spent in a "failed" sparc call (which I assume corresponds to the compatibility check). So, although the job eventually ran with the -K option, I don't think this is a viable solution. I "hacked" the calculator file to enforce the -n 1 flag in the detect-compatibility function, and the code executed as expected. If the update to ASE 3.23 is imminent then maybe this isn't a priority, but it's something to be aware of.

alchem0x2A commented 1 month ago

thx for the updates. That could make sense, since the -K switch relies solely on the Slurm scheduler to kill the extra processes. It would make more sense to remove the compatibility check from the actual socket-calculation path and reformat the sparc.quicktest module for more robust checks. Let's keep this issue open until #48 is done.


ltimmerman3 commented 3 weeks ago

Updated SPARC's check_inputs, which rectifies this issue. The fix needs to be included upstream in SPARC.