ESMValGroup / ESMValCore

ESMValCore: A community tool for pre-processing data from Earth system models in CMIP and running analysis scripts.
https://www.esmvaltool.org
Apache License 2.0

ESMValTool can be wasteful with computational resources - so we should link to previous runs #1271

Open ledm opened 3 years ago

ledm commented 3 years ago

I'm working on a large and expensive recipe right now. This recipe makes an ensemble of the SST for every model, every ensemble member, and every scenarioMIP analysis.

It's great to have ESMValTool to do this, as it takes very little work from me to get this done. However, it's a ton of work for ESMValTool to perform these calculations. In the most recent run, I got through about a quarter of the recipe in 24 hours before JASMIN kicked me off.

While today the job ended after 24 hours due to a scheduled kill command, there could be many reasons for the preprocessor to end early: an SSH disconnect, memory issues, data issues, a fault in something hidden, a system restart. It could be anything really!

This kind of premature recipe end is frustrating, but also really wasteful. I don't want to throw away 24 hours of CPU time and re-run the recipe.

As we all know, when these jobs end early, it's very hard to recover the preprocessed data. @valeriupredoi advised me to "use those in a diag rerun, and just run the preproc for the ones it didnt finish (comment out models that finished)". This is fine, but it is quite fiddly!

For times when both the recipe and the esmvalcore git repository do not change between runs, it would be useful to have a flag to re-use data from a given previous incomplete run location.

i.e.:

esmvaltool recipe_name.yml --path-to-preproc /gws/nopw/j04/ukesm/ldemora/ESMValTool_output/recipe_name_20210811_140833

I envisage this working like this:

I suspect that this has been suggested before, but this kind of performance improvement will be crucial for reducing our energy/CPU usage, and would also be a huge step towards making ESMValTool compatible with monitoring ongoing model runs.
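
To make the re-use idea concrete, here is a minimal sketch of the check such a flag might perform. It is not how ESMValCore works. The function name `split_by_previous_run`, the `preproc` subdirectory layout, and the "non-empty file means complete" rule are all assumptions made for illustration.

```python
from pathlib import Path


def split_by_previous_run(expected, previous_run_dir):
    """Partition expected preprocessor outputs into reusable and missing.

    expected: relative paths of the files a recipe would write under
        ``<run_dir>/preproc`` (hypothetical layout).
    previous_run_dir: output directory of an earlier, possibly
        incomplete run.
    Returns (reusable, missing): full paths of files found in the
    previous run, and relative paths that still need computing.
    """
    preproc_dir = Path(previous_run_dir) / "preproc"
    reusable, missing = [], []
    for rel_path in expected:
        candidate = preproc_dir / rel_path
        # Assumption: a non-empty file is complete. A killed job can
        # leave a truncated file behind, so a real implementation
        # would need a stronger completeness check.
        if candidate.is_file() and candidate.stat().st_size > 0:
            reusable.append(candidate)
        else:
            missing.append(rel_path)
    return reusable, missing
```

Only the entries in `missing` would then need to go through the preprocessor again. This also assumes the recipe and the code are unchanged between runs, as described above.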

bouweandela commented 3 years ago

@ledm All very good points, but please note that some of these problems are fairly easy to avoid by making better use of existing tools:

ledm commented 3 years ago

In case anyone is interested, this recipe has now completed. It took a month of human time and about 20-25 iterations, of which 6 had useful data.

zklaus commented 3 years ago

@ledm, check out #1321. I think that makes some steps in the direction you want to go.

ledm commented 3 years ago

> @ledm, check out #1321. I think that makes some steps in the direction you want to go.

This is definitely encouraging and I applaud the effort! Thanks for sharing!