Open bouweandela opened 3 years ago
I've also encountered this with regular batch jobs: the job says it has been running for 7 hours, but the output shows an NCL error 15 minutes in. I'm not sure whether NCL fails to propagate the error and shut down the process, or whether the problem lies in how it works together with the esmvaltool. Sadly, I haven't seen a pattern in which errors cause a hang and which lead to a proper shutdown.
Maybe the problem is bigger than just on the machine that the bot is running on then. The last few runs of NCL recipes on the bot machine all failed.
Hi @bouweandela, in #2230 I am trying to check the latest pushed changes with the bot, but my last two attempts at tagging it did not work at all (no bot reply).
Do you know if there is a problem (e.g. disk space or queues) on the server the bot is running on? Not sure if the processes timed out for some reason. Thanks
It looks like the machine ran out of disk space, will try to clean up a bit.
Running the recipe `examples/recipe_ncl.yml` has caused the bot to take forever on a few occasions so far, see e.g. https://github.com/ESMValGroup/ESMValTool/pull/2046#issuecomment-782190972 (has been running for a week now), because it tries to run the NCL diagnostic and NCL just hangs. Here is the relevant output of `ps jfx`:

To avoid this situation, it would be good to impose a maximum duration of a recipe run with the bot. For example a few hours or a day.
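
One possible way to enforce such a limit is sketched below. This is only an illustration and not how the bot is actually implemented: the helper name `run_with_timeout` is hypothetical, and it assumes the bot starts the recipe as a subprocess on a POSIX machine. The command is started in its own process group, so that a hung child process (such as NCL) is killed together with its parent once the time limit is reached.

```python
import os
import signal
import subprocess


def run_with_timeout(cmd, timeout_seconds):
    """Run cmd and return its exit code, or None if it was killed on timeout.

    Hypothetical helper, not part of ESMValTool or the bot.
    """
    # start_new_session=True makes the command the leader of a new session
    # and process group, so its children can be signalled along with it.
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        try:
            # The process group ID equals proc.pid here (POSIX only).
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # the group exited just before we tried to kill it
        proc.wait()  # reap the killed process so it does not linger as a zombie
        return None
```

Assuming the bot invokes the recipe from the command line, a limit of six hours could then look like `run_with_timeout(["esmvaltool", "run", "examples/recipe_ncl.yml"], 6 * 60 * 60)`, with a return value of `None` reported back as a timeout.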