Distributed ESMValTool - Githubissues

bouweandela commented 3 years ago

In the IS-ENES3 project, one of the goals is to allow ESMValTool to run distributed across machines attached to multiple different ESGF nodes.

An idea by @nielsdrost to accomplish this would be to split an existing recipe into multiple smaller sub-recipes (containing one or more preprocessing tasks), these parts could then be run remotely and the results combined in a final recipe running the diagnostic tasks. Running remotely could be done conveniently by using a WPS, but initially, we aim for manually running the various sub-recipes from the command line.

This issue is to collect ideas on how this could be implemented. We will need at least the following:

code to split an existing recipe into smaller sub-recipes
code to check that the input of a run from a sub-recipe is identical to what is expected from the main recipe -> simplistic version implemented in #1321 (now checks that entire recipe is the same, may be improved to that preprocessing task is the same)
provenance of preprocessing tasks should be saved to disk in the preprocessing task directories, so it can be loaded for later use -> implemented in #1321
provenance of preprocessing tasks should be loaded from disk when using the data from a run that was done remotely -> implemented in #1321

As part of this distributed compute task, we would also like to implement improved support for intake-esm (#31), if needed augmented with support for OpenDAP access (#1131) and download support using esgf-pyclient (#1130).

senesis commented 3 years ago

We will need at least the following:

* code to split an existing recipe into smaller sub-recipes

* code to check that the ...

And code (+ data) to decide which machine should process which data ....

bouweandela commented 3 years ago

To find out which machine hosts which data, we could use either esgf-pyclient or a collection of intake catalogs, e.g. those hosted at https://github.com/NCAR/intake-esm-datastore/tree/master/catalogs

senesis commented 3 years ago

To find out which machine hosts which data, we could use either esgf-pyclient or a collection of intake catalogs, e.g. those hosted at https://github.com/NCAR/intake-esm-datastore/tree/master/catalogs

Using ESGF infrastructure alone seems more robust. ESFG compute logic would tend to favor sending compute tasks preferentially to those datanodes which host the original version of the data (rather than a duplicate at an index node), but only of course if they expose the compute function. This also allows to reach data which maybe is not duplicated. ANd the fallback solution would be to select a compute node whcih is network-ally close to the datanode

zklaus commented 3 years ago

This seems to vague at the moment to be included in 2.4.0. @bouweandela, are you ok with bumping this to 2.5.0?

bouweandela commented 3 years ago

I think so. It is deliverable 9.3 of the IS-ENES3 project, which is due by the end of the year. Therefore it would have been nice if it would have been ready in time for v2.4, because then it would be in a released version. But I guess they'll just have to accept it as in main and included in the next release. I will try, but it seems unlikely that I'll be able to completely implement this in time.

valeriupredoi commented 2 years ago

hey @bouweandela I'd like to help with this in the new year, man! :beer:

bouweandela commented 2 years ago

This is pretty far done, except for automatic splitting of recipes. Manual splitting is supported since #1264. I will try to do something about automatic splitting for v2.6, but I'm not sure how important this feature is.

bouweandela commented 10 months ago

Closing this issue as it is mostly done. Automatically splitting a recipe based on which machine hosts what data has so far not been requested by any users, so it is probably best to postpone developing such a feature until there is actual demand.

ESMValGroup / ESMValCore

Distributed ESMValTool #1128