AntaresSimulatorTeam / AntaREST

API REST and WebUI for Antares_Simulator
Apache License 2.0

Random behavior when launching simulation and server slowness on some studies #1294

Open makdeuneuv opened 1 year ago

makdeuneuv commented 1 year ago

On both environments (PROD and RECETTE), some simulations never reach the "done" status, and some tasks remain in progress even though the simulation has already finished.

In addition, certain specific studies are slow to launch. When the same study is relaunched, the new run finishes in normal time, while the previous launch still has not ended and has been running for hours.

Contact me in the chat for details.

create-issue-branch[bot] commented 1 year ago

Branch feature/issue-1294-Random_behavior_when_launching_simulation_and_server_slowness_on_some_studies created!

laurent-laporte-pro commented 1 year ago

The behavior looks random, but it is probably not new; it may simply have gone unnoticed until now. I can reproduce the anomaly in my development environment and have isolated the portion of code that causes the problem. I am still analyzing it. The simulation processing itself works; however, the web application does not receive the notifications.

laurent-laporte-pro commented 1 year ago

New PR fix(api): unexpected behavior when launching simulations

laurent-laporte-pro commented 1 year ago

See: #1230

sylvlecl commented 1 year ago

Some diagnostic elements about the "unfinished" studies:

It seems that the cause is a failure to retrieve logs (in slurm launcher). The "ls" command executed by SSH to identify log files times out, which can be seen with the following logs:

    Command output:
    Output: None
    Error: Command timed out: ls  /<path>/*<id>*.txt

This is followed by "Logs not downloaded".

Later, the "done" check takes this log download status into account, so the study is never considered done:

    @staticmethod
    def check_if_study_is_done(study: StudyDTO):
        return study.with_error or (
            study.logs_downloaded
            and study.local_final_zipfile_path
            and study.remote_server_is_clean
            and study.final_zip_extracted
        )
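To make the effect of this check concrete, here is a minimal, self-contained sketch. The `StudyDTO` below is a simplified stand-in containing only the fields used by the check (not the real class), the zip path is a made-up value, and the result is wrapped in `bool()` for readability. It shows that a study whose results were fully retrieved is still not "done" when only the log download failed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StudyDTO:
    """Simplified stand-in: only the fields used by the check (assumption)."""

    with_error: bool = False
    logs_downloaded: bool = False
    local_final_zipfile_path: Optional[str] = None
    remote_server_is_clean: bool = False
    final_zip_extracted: bool = False


def check_if_study_is_done(study: StudyDTO) -> bool:
    # Same condition as the quoted snippet, coerced to a bool.
    return bool(
        study.with_error
        or (
            study.logs_downloaded
            and study.local_final_zipfile_path
            and study.remote_server_is_clean
            and study.final_zip_extracted
        )
    )


# Everything succeeded except the log download (the "ls" timed out).
stuck = StudyDTO(
    logs_downloaded=False,
    local_final_zipfile_path="/tmp/result.zip",  # hypothetical path
    remote_server_is_clean=True,
    final_zip_extracted=True,
)
# Same study, with the logs successfully retrieved.
complete = StudyDTO(
    logs_downloaded=True,
    local_final_zipfile_path="/tmp/result.zip",  # hypothetical path
    remote_server_is_clean=True,
    final_zip_extracted=True,
)

print(check_if_study_is_done(stuck))     # False
print(check_if_study_is_done(complete))  # True
```

A single timed-out `ls` is therefore enough to keep an otherwise complete study out of the "done" state.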

Solution to be decided. The easy quick fix is of course to increase the timeout, but there is surely something better to do, since the timeout is already 30 s!

sylvlecl commented 1 year ago

Further analysis: in that case, the study is left in the "finished" state but not "done". It is therefore counted as done but not handled as such, because of the following lines. The thread that monitors job status is then stopped, so we never retry retrieving the study data.

https://github.com/AntaresSimulatorTeam/AntaREST/blob/d6eae5342fd1f731bbc60392428a9e0529e1de32/antarest/launcher/adapters/slurm_launcher/slurm_launcher.py#L345-L349

As a quick fix, we should probably rely only on study.done instead of finished, so that we get another chance to download the study data.
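The difference between the two criteria can be sketched as follows. This is only an illustration of the idea, not the code at the linked lines: the `Study` class and both `keep_monitoring_*` helpers are hypothetical names introduced for the example.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Study:
    """Hypothetical stand-in for a launcher study record."""

    name: str
    finished: bool = False  # the job has ended on the remote side
    done: bool = False      # logs and results retrieved, remote cleaned


def keep_monitoring_on_finished(studies: Iterable[Study]) -> bool:
    # Stops as soon as every job has ended, even if retrieval failed.
    return any(not s.finished for s in studies)


def keep_monitoring_on_done(studies: Iterable[Study]) -> bool:
    # Keeps going until every study is fully handled, so a failed
    # retrieval gets another attempt on the next iteration.
    return any(not s.done for s in studies)


# A job that ended, but whose log download timed out.
studies = [Study("job-1", finished=True, done=False)]

print(keep_monitoring_on_finished(studies))  # False: monitoring stops
print(keep_monitoring_on_done(studies))      # True: monitoring continues
```

With the `done`-based criterion the monitoring thread stays alive for the stuck study. The trade-off is that a study whose retrieval keeps failing would be retried indefinitely unless some limit is added.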

laurent-laporte-pro commented 1 year ago

From the production server, we can see the current ls tasks using the top command, for instance:

    96015 run-ant+  20   0  113576   1692   1404 D   0,0  0,0   0:00.05 bash -c ls  /home/run-antares/REMOTE_root_antaresweb/*1521689*.txt
    96171 run-ant+  20   0  113576   1696   1404 D   0,0  0,0   0:00.12 bash -c ls  /home/run-antares/REMOTE_root_antaresweb/*1528980*.txt
    96442 run-ant+  20   0  113576   1696   1408 D   0,0  0,0   0:00.02 bash -c ls  /home/run-antares/REMOTE_root_antaresweb/*1526750*.txt
    97290 run-ant+  20   0  113576   1688   1404 D   0,0  0,0   0:00.02 bash -c ls  /home/run-antares/REMOTE_root_antaresweb/*1521689*.txt
    97567 run-ant+  20   0  113576   1692   1404 D   0,0  0,0   0:00.02 bash -c ls  /home/run-antares/REMOTE_root_antaresweb/*1526750*.txt
    98170 run-ant+  20   0  113576   1692   1404 D   0,0  0,0   0:00.02 bash -c ls  /home/run-antares/REMOTE_root_antaresweb/*1526750*.txt
    99402 run-ant+  20   0  113576   1688   1404 D   0,0  0,0   0:00.00 bash -c ls  /home/run-antares/REMOTE_root_antaresweb/*1521689*.txt