Open makdeuneuv opened 1 year ago
The behavior seems random, but it is probably not new. It may simply have gone unnoticed until now. I can reproduce the anomaly in my development environment and I have isolated the portion of code that causes the problem. I am still analyzing it. The simulation processing itself works; however, the web application does not receive the notifications.
See: #1230
Some diagnostic findings about the "unfinished" studies:
The cause appears to be a failure to retrieve the logs (in the Slurm launcher). The "ls" command executed over SSH to identify the log files times out, as the following logs show:
Command output:
Output: None
Error: Command timed out: ls /<path>/*<id>*.txt
followed by "Logs not downloaded".
Later, the "done" status takes this log download status into account, so the study is not considered done:
    @staticmethod
    def check_if_study_is_done(study: StudyDTO):
        return study.with_error or (
            study.logs_downloaded
            and study.local_final_zipfile_path
            and study.remote_server_is_clean
            and study.final_zip_extracted
        )
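To make the consequence concrete, here is a minimal sketch using a hypothetical stand-in for StudyDTO (only the fields read by the check are modeled, and the zip path is made up). It shows that a study whose log download failed is not reported as done, even when every other step succeeded.

```python
from dataclasses import dataclass


@dataclass
class StudyDTO:
    """Hypothetical stand-in: only the fields read by the check."""
    with_error: bool = False
    logs_downloaded: bool = False
    local_final_zipfile_path: str = ""
    remote_server_is_clean: bool = False
    final_zip_extracted: bool = False


def check_if_study_is_done(study: StudyDTO) -> bool:
    # Same condition as above, wrapped in bool() for a strict True/False.
    return bool(
        study.with_error
        or (
            study.logs_downloaded
            and study.local_final_zipfile_path
            and study.remote_server_is_clean
            and study.final_zip_extracted
        )
    )


# Everything succeeded except the log download (the "ls" timed out).
stuck = StudyDTO(
    logs_downloaded=False,
    local_final_zipfile_path="/tmp/result.zip",  # made-up path
    remote_server_is_clean=True,
    final_zip_extracted=True,
)
print(check_if_study_is_done(stuck))  # prints False
```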
The solution is still to be decided. The quick and easy fix is of course to increase the timeout, but there must be something better to do, given that the timeout is already 30 s!
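One possible alternative to a larger timeout is a bounded retry with backoff. The sketch below is only an illustration: it uses the standard library's subprocess as a stand-in for the launcher's SSH call, and run_with_timeout and retry are hypothetical names, not functions from the launcher. It does not address why the listing is slow in the first place.

```python
import subprocess
import time
from typing import Callable, List, Optional


def run_with_timeout(command: List[str], timeout: float) -> Optional[str]:
    """Run a command; return its stdout, or None if it times out or fails."""
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout, check=True
        )
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError, OSError):
        return None
    return result.stdout


def retry(
    action: Callable[[], Optional[str]],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Call action until it returns a value, waiting 1x, 2x, 4x... base_delay."""
    for attempt in range(attempts):
        output = action()
        if output is not None:
            return output
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return None
```

A caller would wrap the listing command, for example `retry(lambda: run_with_timeout(command, timeout=30))`, and treat a None result as "logs not available yet" rather than as a final failure.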
Further analysis:
In that case, the study is left in the finished state but not done. As a result, the study is counted as done but not handled as such, because of the following lines. The thread that monitors the job statuses is then stopped, so we never retry retrieving the study data.
As a quick fix we should probably rely only on study.done instead of finished, so that we get a chance to download the study data again.
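The idea of this quick fix can be sketched as follows. The Study class and studies_to_keep_monitoring are hypothetical names used for illustration only, not the launcher's actual code: the point is that the monitoring decision depends on done alone.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Study:
    """Hypothetical stand-in for the launcher's study record."""
    name: str
    finished: bool = False  # the job has ended on the cluster
    done: bool = False      # results and logs were fully retrieved


def studies_to_keep_monitoring(studies: List[Study]) -> List[Study]:
    # Keep every study that is not done, even if its job has finished,
    # so that a failed retrieval is attempted again on the next pass.
    return [study for study in studies if not study.done]
```

With this rule, a study that is finished but not done stays in the monitored set, so the monitoring thread is not stopped while data remains to be downloaded.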
From the production server, we can see the current ls tasks using the top command, for instance:
96015 run-ant+ 20 0 113576 1692 1404 D 0,0 0,0 0:00.05 bash -c ls /home/run-antares/REMOTE_root_antaresweb/*1521689*.txt
96171 run-ant+ 20 0 113576 1696 1404 D 0,0 0,0 0:00.12 bash -c ls /home/run-antares/REMOTE_root_antaresweb/*1528980*.txt
96442 run-ant+ 20 0 113576 1696 1408 D 0,0 0,0 0:00.02 bash -c ls /home/run-antares/REMOTE_root_antaresweb/*1526750*.txt
97290 run-ant+ 20 0 113576 1688 1404 D 0,0 0,0 0:00.02 bash -c ls /home/run-antares/REMOTE_root_antaresweb/*1521689*.txt
97567 run-ant+ 20 0 113576 1692 1404 D 0,0 0,0 0:00.02 bash -c ls /home/run-antares/REMOTE_root_antaresweb/*1526750*.txt
98170 run-ant+ 20 0 113576 1692 1404 D 0,0 0,0 0:00.02 bash -c ls /home/run-antares/REMOTE_root_antaresweb/*1526750*.txt
99402 run-ant+ 20 0 113576 1688 1404 D 0,0 0,0 0:00.00 bash -c ls /home/run-antares/REMOTE_root_antaresweb/*1521689*.txt
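In top's default layout, the eighth column is the process state, and D means uninterruptible sleep, usually a process waiting on I/O. As a rough aid for reading this kind of output, the sketch below counts the blocked ls processes per job id. count_blocked_ls is a hypothetical helper, and the column positions assume the default top layout shown above.

```python
import re
from collections import Counter

TOP_OUTPUT = """\
96015 run-ant+ 20 0 113576 1692 1404 D 0,0 0,0 0:00.05 bash -c ls /home/run-antares/REMOTE_root_antaresweb/*1521689*.txt
96171 run-ant+ 20 0 113576 1696 1404 D 0,0 0,0 0:00.12 bash -c ls /home/run-antares/REMOTE_root_antaresweb/*1528980*.txt
96442 run-ant+ 20 0 113576 1696 1408 D 0,0 0,0 0:00.02 bash -c ls /home/run-antares/REMOTE_root_antaresweb/*1526750*.txt
97290 run-ant+ 20 0 113576 1688 1404 D 0,0 0,0 0:00.02 bash -c ls /home/run-antares/REMOTE_root_antaresweb/*1521689*.txt
97567 run-ant+ 20 0 113576 1692 1404 D 0,0 0,0 0:00.02 bash -c ls /home/run-antares/REMOTE_root_antaresweb/*1526750*.txt
98170 run-ant+ 20 0 113576 1692 1404 D 0,0 0,0 0:00.02 bash -c ls /home/run-antares/REMOTE_root_antaresweb/*1526750*.txt
99402 run-ant+ 20 0 113576 1688 1404 D 0,0 0,0 0:00.00 bash -c ls /home/run-antares/REMOTE_root_antaresweb/*1521689*.txt
"""


def count_blocked_ls(top_output: str) -> Counter:
    """Count processes in state D per job id found in the ls glob."""
    counts: Counter = Counter()
    for line in top_output.splitlines():
        fields = line.split()
        # Columns: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
        if len(fields) < 12 or fields[7] != "D":
            continue
        match = re.search(r"\*(\d+)\*\.txt$", line)
        if match:
            counts[match.group(1)] += 1
    return counts


for job_id, count in sorted(count_blocked_ls(TOP_OUTPUT).items()):
    print(job_id, count)
```

On the sample above, two job ids each have three blocked ls processes at the same time, which suggests that listings for the same study are being started again while earlier ones are still stuck.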
On both environments (PROD and RECETTE), some simulations never get the done status, and some tasks remain in progress although the simulations are already finished.
In addition, certain specific studies are slow to launch. When the same study is relaunched, it completes in the normal time, while the previous launch still has not ended and has been running for hours.
Contact me in the chat for details.