ESMValGroup / ESMValTool

ESMValTool: A community diagnostic and performance metrics tool for routine evaluation of Earth system models in CMIP
https://www.esmvaltool.org
Apache License 2.0

Add timeout to running a recipe with ESMValBot #2059

Open bouweandela opened 3 years ago

bouweandela commented 3 years ago

Running the recipe examples/recipe_ncl.yml has caused the bot to hang on a few occasions so far; see e.g. https://github.com/ESMValGroup/ESMValTool/pull/2046#issuecomment-782190972, which has been running for a week now. The bot tries to run the NCL diagnostic and NCL just hangs. Here is the relevant output of ps jfx:

 PPID   PID  PGID   SID TTY      TPGID STAT   UID   TIME COMMAND
...
19564 20036 20043 20043 ?           -1 Sl     496 178:44          |           \_ /mnt/esmvaltool_disk1/esmvalbot-work/esmvalbotwork-release_2.2.0-klcipgso/conda/envs/esmvaltool/bin/python /mnt/esmvaltool_disk1/esmvalbot-work/esmvalbotwork-release_2.2.0-klcipgso/conda/envs/esmvaltool/bin/esmvaltool run examples/recipe_ncl.yml --output_dir /mnt/esmvaltool_disk1/shared/esmvaltool/esmvalbot-output/release_2.2.0-klcipgso
20036 20071 20043 20043 ?           -1 S      496   0:01          |               \_ /mnt/esmvaltool_disk1/esmvalbot-work/esmvalbotwork-release_2.2.0-klcipgso/conda/envs/esmvaltool/bin/ncl -n -p /mnt/esmvaltool_disk1/esmvalbot-work/esmvalbotwork-release_2.2.0-klcipgso/ESMValTool/esmvaltool/diag_scripts/examples/diagnostic.ncl

To avoid this situation, it would be good to impose a maximum duration on a recipe run with the bot, for example a few hours or a day.
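A minimal sketch of such a limit, assuming the bot launches the recipe as a subprocess from Python on a POSIX machine. The helper name `run_with_timeout` and the 10-second grace period are hypothetical and not taken from the bot's code. The run is started in its own session so that a hung child such as `ncl` can be terminated together with its parent:

```python
import os
import signal
import subprocess


def run_with_timeout(cmd, timeout_s):
    """Run cmd and kill its whole process group if it exceeds timeout_s.

    Returns the exit code, or None if the run had to be killed.
    Hypothetical helper for illustration; POSIX only.
    """
    # start_new_session=True makes the child a process-group leader, so
    # its own children (e.g. a hung ncl process) share its group ID.
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Ask the whole group to stop, then force it after a grace period.
        os.killpg(proc.pid, signal.SIGTERM)
        try:
            proc.wait(timeout=10)
        except subprocess.TimeoutExpired:
            os.killpg(proc.pid, signal.SIGKILL)
            proc.wait()
        return None


# Possible usage, with a six-hour limit chosen only as an example:
# run_with_timeout(["esmvaltool", "run", "examples/recipe_ncl.yml"], 6 * 3600)
```

Signalling the process group rather than only the `esmvaltool` process matters here, because in the `ps jfx` output above it is the `ncl` child that hangs.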

bettina-gier commented 3 years ago

I've actually encountered this with regular batch jobs as well: the job says it's been running for 7 hours, but looking at the output there was an NCL error 15 minutes in. I'm not sure whether the issue is NCL failing to propagate the error so that the process shuts down, or how it interacts with ESMValTool. Sadly, I haven't seen a pattern in the errors that distinguishes the cases where this occurs from the cases where the job shuts down properly.

bouweandela commented 3 years ago

Then maybe the problem is not limited to the machine that the bot is running on. The last few runs of NCL recipes on the bot machine all failed.

fserva commented 3 years ago

Hi @bouweandela, in #2230 I am trying to check the latest pushed changes with the bot. However, the last two attempts at tagging it did not work at all (no bot reply).

Do you know if there is a problem (e.g. disk space or queues) on the server the bot is running on? I'm not sure whether the processes timed out for some reason. Thanks

bouweandela commented 3 years ago

It looks like the machine ran out of disk space; I will try to clean up a bit.
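One way to catch this earlier would be a free-space check before the bot starts a run. The sketch below uses only the Python standard library; the function name `has_free_space` and the threshold in the usage comment are hypothetical choices, not part of the bot:

```python
import shutil


def has_free_space(path, min_free_bytes):
    """Return True if the filesystem containing path has enough free space."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return usage.free >= min_free_bytes


# Possible usage: refuse to start a run with less than 10 GiB free.
# if not has_free_space("/path/to/output", 10 * 1024**3):
#     raise RuntimeError("Not enough disk space to run the recipe")
```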