Remove use of CDO for selecting the right variable in native model fixes

zklaus commented 3 years ago

Out of performance concerns about Iris' loading of individual variables from multi-variable files, the IPSLCM support as implemented in the corresponding fix file, makes optional use of CDO via a pipe interface.

Ideally, we would achieve acceptable performance with Iris alone and remove CDO from ESMValCore. If that turns out to be impossible, we might rewrite the use of CDO with their Python bindings to address concerns about the use of the shell type interface.

This issue is precipitated by a discussion in the PR that introduced IPSLCM support to ESMValCore, #1153.

zklaus commented 3 years ago

@pp-mo wrote:

Hi @senesis @zklaus I've been listening in here, but it's not really my area to comment. Except that, I just wanted to remind you that maybe the ideas at our https://github.com/SciTools/iris/pull/4176 might make the use of CDO unnecessary here ?

I was prompted to investiagte that approach by @abooton, who I think discussed it with @zklaus . I think we will be adopting something like this for our own purposes, but it's not quite clear to me which type of input (constraints) it must support, to be of use in your workflows. So we'd really appreciate some feedback on how it could fit in with your needs here. 😄

zklaus commented 3 years ago

@senesis replied:

Hi @pp-mo, @zklaus , @abooton

Sorry for this late reply.

I am not sure I fully understand everything in SciTools/iris#4176.

Assuming that the question still is "would a simple (and fast) load of the single targeted variable fit the need ?" , I am afraid the answer is no, because the variables which are auxiliary coordinates for the targeted variables should be loaded too, as they might be useful at a later stage (and by the way, I agree that using CDO for selecting the variable is then not correct either).

Further, @valeriupredoi wrote in #ESMValTool/2141 that the use of iris.load_raw is a strong design principle for ESMValTool, and, if I understand well, this seems to complicate finding a working solution.

zklaus commented 3 years ago

@pp-mo wrote:

Assuming that the question still is "would a simple (and fast) load of the single targeted variable fit the need ?" , I am afraid the answer is no, because the variables which are auxiliary coordinates for the targeted variables should be loaded too, as they might be useful at a later stage (and by the way, I agree that using CDO for selecting the variable is then not correct either).

That is absolutely not what this is doing : The approach in SciTools/iris#4176 does still analyse the whole file, but it limits which actual cube(s) are generated -- which basically is the slow bit.
The design goal here is that any particular constrained load will produce exactly the same result as before (only faster).

So I do think that this approach (if done right) should be a safe bet, and a great fit for your needs as described.

To be clear on the method, each loaded Iris cube relates to a specific CF data-variable, and it is those which we are selecting on -- so cubes will still have all the usual information from other (types of) file variables, such as coordinates, ancillary variables etc etc. The equivalence also means that we don't need a control to switch this on+off

The remaining questions for me are ...

whether we can interpret + speedup the types of constraint expression which you need to use, and
whether the speedup works adequately for your actual usecases, and
can we get this into an Iris release by the time you need

zklaus commented 3 years ago

@senesis wrote:

These are good news ! From my point of view, :

the single constraint type would be on the NetCDF variable name
the speed up figures you quoted there are fine, and impressive
the feasibility of using iris.load with a constraint instead of using iris.load_raw in ESMValTool is a question that I cannot answer, and that should be directed to @valeriupredoi ; I understood that this should anyway take some development time, and have no idea of the relative priority of this use case in the general development plan; I ping @bouweandela and @zklaus in case they can help on this subject

zklaus commented 3 years ago

@bouweandela wrote:

the feasibility of using iris.load with a constraint instead of using iris.load_raw in ESMValTool is a question that I cannot answer, and that should be directed to @valeriupredoi ; I understood that this should anyway take some development time, and have no idea of the relative priority of this use case in the general development plan; I ping @bouweandela and @zklaus in case they can help on this subject

If this is a feature you need, could you please open an issue for it? A discussion in a merged pull request is easily lost over time. Regarding the relative priority of new features: from experience, I can tell that those features that have someone who is actively working on them, tend to get the highest priority.

zklaus commented 3 years ago

@pp-mo wrote:

using iris.load with a constraint instead of using iris.load_raw in ESMValTool

If it helps, I can affirm that 'iris.load_raw' is almost never any different from 'iris.load', for netcdf source data. The essential difference between the two is : 'load' attempts a CubeList.merge on the 'raw' cubes, and 'load_raw' does not. It is actually quite unusual for there to be mergeable data in a netcdf load : it requires that there are "raw" cubes which differ only in the values of scalar coordinates (which here means coords with no dimension : a coord mapped to a length-1 dimension is not a candidate). Such cubes also can't come from the same file, because they must have the same _varname (recall : each (raw) cube comes from a single data-var).

Aside : it might seem that 'iris.load' should be capable of merging e,g, data supplied in several files each containing a 'month' of daily data. But it can't. because that would be a concatenate and not a merge operation. We do have quite a few outstanding issues suggesting this could/should change, but no progress at present : https://github.com/SciTools/iris/issues/3344 https://github.com/SciTools/iris/issues/2587 https://github.com/SciTools/iris/issues/3234

zklaus commented 3 years ago

To move forward with this: @senesis, what are the load times you experience with Iris 3.0.2 for a single variable (in the data-variable sense, i.e. including all coordinates, auxiliary variables, etc.) from a multi-variable file? How does that compare with CDO? What speed do we need to achieve?

senesis commented 3 years ago

what are the load times you experience with Iris 3.0.2

Using a typical IPSL-CM multi-variable file (the one with 314 variables used by pp-mo there) in a basic recipe (namely a simplified example_python.yml, with a single dataset, and only the map part) , I get :

with CDO :
- CDO extraction : 1.09s,
- iris load : 0.18s
without CDO :
- iris load : 7.4s

What speed do we need to achieve?

The figures quoted by pp-mo there, namely dividing by 90 the load time for Iris, would be excellent. Staying above 1 s would be a burden : because they are monthly files, a lot of files have to loaded for each recipe.

zklaus commented 3 years ago

Great, thanks. @pp-mo, sorry if you already said this, but things have become a bit scattered and I may have missed something. Is there an easy way for us to test your suggested fix?

pp-mo commented 3 years ago

Is there an easy way for us to test your suggested fix?

Well, there's nothing in a channel ATM. But the code in the draft PR https://github.com/SciTools/iris/pull/4176 has only one commit so far, so if you can test with a checkout you should be able to merge it easily into some branch ?

As given, it can only give speedup for a load with a single NameConstraint, like iris.load(file, NameConstraint(var_name='xx')) Standard or long names will also work : nothing else.
( But there are some obvious extensions we may include - see PR )

zklaus commented 3 years ago

Sounds good. I will have a look at this after the pending release of ESMValTool at the end of this month.

pp-mo commented 3 years ago

N.B. @senesis @bouweandela @zklaus we just cut the Iris 3.1 release candidate.

The https://github.com/SciTools/iris/pull/4176 solution is in there ... https://scitools-iris.readthedocs.io/en/v3.1.0rc0/whatsnew/3.1.html#performance-enhancements

So please check out : This should appear in Iris 3.1, which we will finalise in a couple of weeks' time.

schlunma commented 2 years ago

Moving this to v2.6 since there is not open PR yet.

senesis commented 2 years ago

According to this Iris issue comment (and the few following ones), there is no solution left for optimizing Iris load on multi-variable files. So, regarding IPSLCM fixes, I propose to carry on using CDO but using its python API, through this patch : use_cdo_api.patch.txt , which I tested. Unfortunately, the execution time is 50% to 100% larger when using the CDO API (w.r.t. launching CDO as a subprocess)

zklaus commented 2 years ago

Hm. This execution time penalty sounds less than ideal. Let's keep this issue open with no concrete plans for now and see if we cannot coax better performance out of iris.

pp-mo commented 2 years ago

there is no solution left for optimizing Iris load on multi-variable files

I'm a little confused, are you still not benefitting from https://github.com/SciTools/iris/pull/4176 function ? released since Iris 3.1 as noted above ?

If I understood the problem right, that really ought to "just work" :tm: + fix this problem. If it's still too slow after that, I agree we may have run out of quick wins.

zklaus commented 2 years ago

Thanks, @pp-mo, for pointing that out. I had not done the time testing myself before, but did so now, using the notebook and file that @senesis posted in the iris issue discussion.

It was indeed rapid at 0.5s. (I used iris version 3.2.1).

Since I cannot exclude the possibility that I did something strange with an oldish notebook or mixed up the files, @senesis, could you please check once more the timing?

senesis commented 2 years ago

I'm a little confused, are you still not benefitting from SciTools/iris#4176 function ?

There is still a debate regarding the applicability of using iris.load with constraints in ESMValCore, between @valeriupredoi , who raised concerns here and @zklaus , who answered here

zklaus commented 2 years ago

Using your notebook, @senesis, I confirmed the fast times for all three functions (load, load_raw, and load_cube), so that should not be an issue. Are you still getting slow times?

senesis commented 2 years ago

Not sure I understand your goal : is it

to use iris.load in a file fix upstream of standard ESMValTool data load, or
to use it only in the standard ESMValTool data load (where presently iris.load_raw is used, and without any constraint on variable name) ?

zklaus commented 2 years ago

My main goal is to get rid of CDO. As I understand it, the only reason to introduce it, was a performance problem within Iris. This does not seem to exist any longer, hence no need to resort to command line trickery.

Can you confirm that the problem is gone?

senesis commented 2 years ago

Using Iris3.1 and the quoted notebook, the load time for a single var from an IPSL-CM-like multi-var file is :

8 seconds using iris.load without any iris.constraint on varname (which does not match the ESMValTool behaviour)
3 seconds using iris.load_raw without any iris.constraint on varname (which matches the ESMValTool behaviour)
1.5 seconds using iris.load _raw with an iris.constraint on varname (which does not match the ESMValTool behaviour)
0.4 seconds when first selecting the variable using a CDO subprocess, then iris.load_raw without any constraint (which matches the ESMValTool behaviour with its current fix for IPSL-CM)

Using ESMValTool 2.5, the time for running a simple recipe loading a variable from an actual IPSL-CM mult-var file is

21s without the selection by a CDO subprocess
14 seconds with such a subprocess Because actual files are decadal files, a 7s penalty per file is definitely detrimental.

The explanation for the difference in the penalty between the notebook context and the ESMValTool real use context is that the notebook case uses a one-month long file

zklaus commented 2 years ago

Let's focus first on Iris itself, and then see if ESMValTool throws further spanners into the machine. This may be an opportunity to improve the experience for all users of ESMValTool.

I just repeated the test with the same iris version of 3.1.0, using the file you provided in the description of SciTools/iris#4134 (this one). Using the NameConstraint that already was in the notebook I consistently get times well below 1s for all three iris load methods. It depends a little bit on the load on my computer from other programs, but is between 0.5s and 0.7s. For the CDO version, I get 0.46-0.5. The first access to the file takes a little bit longer, i.e. with a freshly started notebook, whichever method I try first, either one of the iris ones or the CDO one, it takes around 2s. This may be due to filesystem caches or similar.

So overall, it seems iris and CDO performance are more or less on par. At these timescales, I think it is a bit difficult to say more. It might be interesting to look at a more realistic test file to get a better understanding.

Your numbers, @senesis, seem to suggest that the constraint is quite helpful, but I don't really understand what you mean by "without constraint". Isn't the whole point to use a constraint to get only the part you are interested in? Could you elaborate a bit on what you mean by "without constraint"?

senesis commented 2 years ago

@zklaus wrote :

It might be interesting to look at a more realistic test file to get a better understanding.

Settings : working on

a realistic machine: the IPSL ciclad cluster, with
a realistic (decadal, 15Gb) file
and in the notebook context (i.e. using Iris outside of ESMValTool, and with a NameConstraint),

When locating the file on a local file system, and for the first load, the load time is only 20% more with Iris alone than when first selecting using CDO , namely 3.7s vs 3.1s; for next loads on the same file, the difference is even smaller (1.6s vs 1.4s)
When locating the file on a realistic file system for IPSLCM model analysis at IPSL (namely /thredds/tgcc/store, which is the large filesystem somewhat remote and accessible using a high speed WLAN), the difference between both methods for first load is negligible w.r.t. the time needed for getting the data across the network

This means that, for IPSL operations, either on Ciclad cluster or on a system closer to the data store, we could get rid of using CDO if using Iris with a NameConstraint.

By the way : loading with no constraint takes 11s at first load, and ~9.5 s on next loads; using the constraint is thus certainly highly valuable on such large files; thanks to @pp-mo and the Iris team !

I don't really understand what you mean by "without constraint". Isn't the whole point to use a constraint to get only the part you are interested in? Could you elaborate a bit on what you mean by "without constraint"?

@valeriupredoi wrote there :

... working with CMOR-standard files means one variable per file. The specifications of the entire load procedure and its API inside the code would have to change to extract a single variable out the file and that would pose some issues ...

So until further notice I take for granted that using a NameConstraint is not an allowed option for the standard data load in ESMValTool. This is the reason for the question in that earlier post

zklaus commented 2 years ago

Thanks, @senesis, I think things are clear enough now. Iris is performing well enough, we would need to change parts of ESMValTool to take full advantage of this.

I believe that may be worthwhile, since the use-case of extracting variables from large files with may variables is common enough, usually occurring for native model output at least in all of the models that we are starting to support (IPSL, ICON, EC-Earth) and also in some observational datasets, so that support for this might allow native obs support for those. That, is a different discussion though, so I will start an issue about putting in support for this kind of thing and if and when that is present, we can return to this issue to make use of it for IPSL.

bouweandela commented 2 years ago

I changed the title to be a bit more specific, because I believe CDO is also used for other purposes.

Note that in some cases all cubes need to be loaded from a file and not just the requested variable. This happens if files do not follow the CF conventions and a coordinate is interpreted as a cube. In the fix preprocessor step, the coordinate cube is then changed to a coordinate and attached to the cube containing the variable. Therefore we would need something a bit smarter than just applying a constraint at load time.

ESMValGroup / ESMValCore

Remove use of CDO for selecting the right variable in native model fixes #1180