AntaresSimulatorTeam / AntaREST

API REST and WebUI for Antares_Simulator
Apache License 2.0

many ssh connections to slurm, even when the study is finished #2142

Open insatomcat opened 2 weeks ago

insatomcat commented 2 weeks ago

Description

When using a Slurm launcher, I can see AntaREST opening a lot of SSH connections to Slurm (about 10-12 per second). They start with the first launch of a study and never stop unless I shut down the AntaREST container.

That leads to 2 questions:

If this is specific to my setup, any advice on how I should debug this?

Thanks.

MartinBelthle commented 4 days ago

Hello, the number of SSH connections depends on how many workers you launch the app with. Currently, every worker has or creates its own tmp file inside the slurm_workspace, and each one handles its own study launches. There's a loop in the code, the method _loop inside slurm_launcher.py, that the worker uses to ask Slurm for the state of the running job. The loop runs every 2 seconds and makes only one SSH connection (I believe), so I don't really know why you have so many connections.
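As a rough sanity check on those numbers, here is a back-of-the-envelope sketch. It assumes every worker polls independently and opens exactly one SSH connection per iteration, as described above. The function names are hypothetical and are not part of AntaREST.

```python
def expected_ssh_rate(workers: int,
                      connections_per_poll: int = 1,
                      poll_interval_s: float = 2.0) -> float:
    """Connections per second if every worker polls on its own timer."""
    return workers * connections_per_poll / poll_interval_s


def workers_needed(observed_rate: float,
                   connections_per_poll: int = 1,
                   poll_interval_s: float = 2.0) -> float:
    """Number of workers that would explain an observed connection rate."""
    return observed_rate * poll_interval_s / connections_per_poll


# Four workers, one connection every 2 seconds each.
print(expected_ssh_rate(4))  # 2.0

# Worker counts that would explain 10 and 12 connections per second.
print(workers_needed(10), workers_needed(12))  # 20.0 24.0
```

Under these assumptions, 10-12 connections per second would require 20 to 24 workers, so either the worker count is much higher than expected or each iteration opens more than one connection.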

Also, I think the fact that it never stops is a bug, as the stop() method in the same file is supposed to stop the loop.
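For illustration, this is a minimal sketch of how a polling loop of this kind is usually made stoppable. It is not the actual code in slurm_launcher.py: the class `PollingLoop` and the `check_jobs` callback are hypothetical stand-ins for whatever queries Slurm over SSH.

```python
import threading
import time
from typing import Callable, Optional


class PollingLoop:
    """Calls check_jobs every interval_s seconds until stop() is called."""

    def __init__(self, check_jobs: Callable[[], None], interval_s: float = 2.0) -> None:
        self._check_jobs = check_jobs  # hypothetical: would query Slurm over SSH
        self._interval_s = interval_s
        self._stop_event = threading.Event()
        self._thread: Optional[threading.Thread] = None

    def start(self) -> None:
        self._stop_event.clear()
        self._thread = threading.Thread(target=self._loop, daemon=True)
        self._thread.start()

    def _loop(self) -> None:
        while not self._stop_event.is_set():
            self._check_jobs()
            # wait() returns early as soon as stop() sets the event
            self._stop_event.wait(self._interval_s)

    def stop(self) -> None:
        self._stop_event.set()
        if self._thread is not None:
            self._thread.join()
            self._thread = None


# Demo with a fake check that only records timestamps.
calls = []
loop = PollingLoop(lambda: calls.append(time.monotonic()), interval_s=0.01)
loop.start()
time.sleep(0.05)
loop.stop()
count_after_stop = len(calls)
time.sleep(0.05)
print(len(calls) == count_after_stop)  # no polls happen after stop()
```

If connections continue after the last study has finished, one thing to check is whether stop() is ever called on every worker's loop, or whether the loop is restarted afterwards.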

@sylvlecl, if you have an explanation, feel free to share it.