slow dataload for input files with multiple variables

senesis commented 3 years ago

One of the DRS for the model data I use provides multi-variable files, which I accommodate using dedicated fix_metadata functions which filter all cubes but the interesting one

However the data load time for such files is very large : for a file of size 260 Mo and with ~ 300 variables covering a period of one month, it amounts to around 5 minutes, while in the case of a single variable file of size 100 Mo and covering a period of 164 years it amounts to 0.1 seconds; (the latter case is for another available DRS, which does not provide the same set of variables)

Does anybody have experience about how this data load step could be speed up ?

bsolino commented 3 years ago

I'm encountering the same issue.

My first suspecion is that the long loading times come from Iris, which is loading all the variables in the cubes. Iris allows to load only selected variables, which speeds up the process. Doing iris.load(files, constrain) on a Jupyter Notebook I've had the following results:

12 files, constrain = None
Time: 0:05:01.485681
732 cubes found

12 files, constrain = "tas"
Time: 0:01:53.310646
12 cubes found

12 files, constrain = ["tas", "pr"]
Time: 0:01:52.531757
24 cubes found

However, an ESMValTool run that loads a variable is taking a slightly shorter amount of time than loading one variable, while (apparently) loading all the variables.

Time for running the recipe was: 0:01:35.745669

Another issue is that if a recipe includes more variables that are located in the same files, those files are loaded again for each variable. For example, with 2 variables the time in the recipe doubles:

Time for running the recipe was: 0:02:49.668044

However, I'm not sure if it would be possible to change how that works. If I understood correctly, each variable is treated on a different task that are run in parallel, so sharing information among them may imply high structural changes.

senesis commented 3 years ago

The load times I quoted are those observed during an ESMValTool load, so I do not reach the same conclusion that loading multiple variables is less expensive than loading one (this letting aside the question of multiple load of the same set of variables). Does the 'time for running the recipe' really include the load time ?

Because, reversely, I suspected that loading mutiple variables was the issue (which sound sensible), I event tested to add a hard-coded Constraint in the iris.load_raw call in _io.load(). This did not change the load time

bsolino commented 3 years ago

Sorry, I may not have been very clear. What I was trying to say is that loading all the variables was indeed more expensive than asking Iris to constrain which variables to load. However, the ESMValTool was somehow faster than my tests with Iris and as far as I know it is not constraining the load. Note that those tests were with lazy data, just obtaining the cubes.

Another issue with multiple variables is that if there are many variables on the recipe, the files are loaded once for each variable. In my case, with two variables the time approximately doubled. I'm not sure if it's possible to isolate the loading time, that's why I included the total time for running the recipe. As far as I know, that time is measured from execution until completion.

valeriupredoi commented 3 years ago

cheers for raising the issue @senesis :+1: The load mechanism uses iris.load_raw since we want to load the file withoout any merging/concatenation performed in iris that, quite frequently, may fail - to this end we apply callback function that reduces some cube attributes and fixes some coordinates at load point so that we can perform the in-house developed concatenation (that is custom and takes care of a whole lot of corner cases like overlapping or incomplete data etc). load_raw indeed loads all variables inside a netCDF file into separate iris cubes but this is desired to a certain extent since we want, in exceptional cases, some cell measures or other auxiliary variables to be loaded. Note that the slowness from loading a raw file containing a large number of variables is not something we accounted for at design level, back when we made the design of the software - this is not something we'd face since working with CMOR-standard files means one variable per file. The specifications of the entire load procedure and its API inside the code would have to change to extract a single variable out the file and that would pose some issues too - some files need to have standard_names fixed before loading and extracting. My take on this would be to have the data in CMOR-format before running the tool ie perform variable extraction, CMORization and saving just the variable you need in a file, would that be possible?

valeriupredoi commented 3 years ago

@bsolino -

Another issue with multiple variables is that if there are many variables on the recipe, the files are loaded once for each variable. In my case, with two variables the time approximately doubled. I'm not sure if it's possible to isolate the loading time, that's why I included the total time for running the recipe. As far as I know, that time is measured from execution until completion.

That's because each variable and the ops performed on it are treated as an independent set of tasks, there is no communication between variables within one diagnostic - and if you use parallel tasks you will see why that is happening - it's much faster this way, implementing a set of mpi tasks that communicate with each other would be nice (and that will allow for data recycling) but that's a tad too complex a refactor for this version - maybe v3.0?

valeriupredoi commented 3 years ago

@senesis -

Because, reversely, I suspected that loading mutiple variables was the issue (which sound sensible), I event tested to add a hard-coded Constraint in the iris.load_raw call in _io.load(). This did not change the load time

Providing a constraint to the loader (whether it be load or load_raw or any other wrapper of iris.io.load_files() ) does not speed things up since the core loading function loads all the contents of a file first then extracts the constrained variable/metadata - if it's slow it means there are lots of stuffs inside the file :grin:

senesis commented 3 years ago

@valeriupredoi

My take on this would be to have the data in CMOR-format before running the tool ie perform variable extraction, CMORization and saving just the variable you need in a file, would that be possible?

One of the obvious requirements for fully using ESMValTool for model development is that it complies with in-house data formats, and what I discovered so far is that most of its design actually allows for that, as long as one makes full use of the (meta)data fix mechanism, and configuration features, and one allows for two fixes in ESMValCore (see #1083 and #1086)

Regarding the use of multiple variable files, the functional requirement is met, but not the performance requirement (see load time figures above).

If I understand well, this is due to this deadlock :

standard iris.load function is fast for partial load but does not match ESMvalCore requirements
iris.load_raw function used with a callback cannot yet handle partial load (or providing arguments to this callback is yet tricky)

The level of priority (and resources) for solving this issue should be gauged against the interest in reaching the model developers target, or at least the part of that community which runs a model with non-CMOR native output.

This would increase the use of ESMValTool within model development centers and will also be a step towards bringing in additional metrics/diagnostics from other tools which are currently used for model development as these other tools would then be merged with / replaced by ESMValTool (sic)

I could devote some working time to that topic.

bsolino commented 3 years ago

@valeriupredoi

That's because each variable and the ops performed on it are treated as an independent set of tasks, there is no communication between variables within one diagnostic - and if you use parallel tasks you will see why that is happening - it's much faster this way, implementing a set of mpi tasks that communicate with each other would be nice (and that will allow for data recycling) but that's a tad too complex a refactor for this version - maybe v3.0?

Yes, I understand that it's very complex, and it seems it's an uncommon issue, as it requires a specific type of dataset and a recipe performing multiple tasks on such a dataset. My naïve solution would be to have a common cache for all the tasks, but that introduces its own set of issues once the tasks start working with them.

bouweandela commented 3 years ago

Maybe the slow load times for files with many variables could be reported with iris? This sounds like an iris issue.

senesis commented 3 years ago

Maybe the slow load times for files with many variables could be reported with iris? This sounds like an iris issue.

Iris slow load issue is documented and reported here

senesis commented 3 years ago

Update : Iris 3.0.2 should improve speed on this case, albeit only by a factor of 10 which may be still slow.

senesis commented 3 years ago

@jservonnat Update : because even Iris 3.0.2 will be still quite slow, I managed to select the desired variable using an external process upstream of Iris load (by a fix_file method). CDO,selvar is very fast at selecting variables.

ESMValGroup / ESMValTool

slow dataload for input files with multiple variables #2141