ESMValGroup / ESMValCore

ESMValCore: A community tool for pre-processing data from Earth system models in CMIP and running analysis scripts.
https://www.esmvaltool.org
Apache License 2.0

Added auxiliary coordinates lead to (time) efficiency loss #195

Closed BenMGeo closed 4 years ago

BenMGeo commented 5 years ago

I recently realized that we experience quite an efficiency loss due to the time-related auxiliary coordinates added by the backend.

The backend adds the auxiliary coordinates day_of_year, day_of_month, year and month_number. If you leave them in and later collapse along the time axis in a diagnostic, this leads to a significant increase in processing time!

Here is a "simple" example that shows the increase in run time:

import iris
import iris.coord_categorisation
from scipy.stats import linregress

import functools
import time

def timer(func):
    # from https://realpython.com/primer-on-python-decorators/#simple-decorators
    @functools.wraps(func)
    def wrapper_timer(*args, **kwargs):
        start_time = time.perf_counter()    
        value = func(*args, **kwargs)
        end_time = time.perf_counter()      
        run_time = end_time - start_time    
        print(f"Finished {func.__name__!r} in {run_time:.4f} secs")
        return value
    return wrapper_timer

filename = iris.sample_data_path('E1_north_america.nc')
air_temp = iris.load_cube(filename, 'air_temperature')

# removing extra coordinates
air_temp_small = air_temp.copy()

air_temp_small.remove_coord("forecast_reference_time")
air_temp_small.remove_coord("forecast_period")

# adding extra coordinates
air_temp_big = air_temp.copy()

iris.coord_categorisation.add_month_number(air_temp_big, "time", "month")
iris.coord_categorisation.add_day_of_year(air_temp_big, "time", "doy")

@timer
def Mean_Cube(cube, coords):
    # replacement for a more complex function
    return cube.collapsed(coords, iris.analysis.MEAN)
small_time = Mean_Cube(air_temp_small, "time")
# Finished 'Mean_Cube' in 0.0035 secs
print(small_time)
# ...
big_time = Mean_Cube(air_temp_big, "time")
# Finished 'Mean_Cube' in 2.2839 secs
print(big_time)
# ...

The processing time increases by a factor of ~650!

print(big_time.data - small_time.data)
# = all zeroes!

The content itself is the same.

This does not occur for collapsing along dimensions without auxiliary coordinates.

small_lat = Mean_Cube(air_temp_small, "latitude")
# Finished 'Mean_Cube' in 0.0046 secs
print(small_lat)
# ...

big_lat = Mean_Cube(air_temp_big, "latitude")
# Finished 'Mean_Cube' in 0.0052 secs
print(big_lat)
# ...

Diagnostic developers need to be told about this if they use iris in their diagnostics (it is unknown whether this is relevant for other packages besides iris). I'm not sure whether it is relevant for the backend, though.

I will also raise this issue with the iris community, as I think it is a solvable performance issue (as discussed with @bjoernbroetz).

mattiarighi commented 5 years ago

Transferred to core since it affects the preprocessor.

BenMGeo commented 5 years ago

Transferred to core since it affects the preprocessor.

But do not forget to add any relevant outcome to the documentation (therefore, I raised the issue in the Main Tool repository).

BenMGeo commented 5 years ago

The accompanying issue at iris (https://github.com/SciTools/iris/issues/3363) @bjlittle

valeriupredoi commented 5 years ago

nice one @BenMGeo :beer:

valeriupredoi commented 5 years ago

also +1 for using decorators :grin: Will you be in touch with the iris guys to find out when the patch will go in? I suppose iris 2.3.

BenMGeo commented 5 years ago

I don't know if we should/can push (as in "make it urgent") the pull request over at iris. I'm not sure I have the capacity to test the changes rcomer implemented. I also assume that the changes will be in an upcoming release, as tests are implemented and they have realized that it is a severe issue.

BenMGeo commented 5 years ago

It seems to be included in 2.3 as the pull request was merged for this milestone: https://github.com/SciTools/iris/milestone/40

bouweandela commented 5 years ago

The backend adds the auxiliary coordinates day_of_year, day_of_month, year, month_number.

Maybe we could move this into a separate preprocessor function, for those users who need it? I don't think this is part of CMOR or cf-conventions, so I'm not sure why we're doing this anyway. @jvegasbsc Can you comment on this?

jvegreg commented 5 years ago

We decided to add them because they are useful in several preprocessor functions, and adding them at this step was thought to be better than dealing with them at each point of use.

It is not a problem at all to remove that part of the code and change the functions to add them when needed.

bouweandela commented 5 years ago

It is not a problem at all to remove that part of the code and change the functions to add them if needed

I think that might be better; it makes the code easier to understand if all code that does one thing is in one place. Also, we already have a preprocessor function that needs to remove these extra coordinates before it can do its job: https://github.com/ESMValGroup/ESMValCore/blob/08eb1e4c0b03d811de54fb1f2209349302ebbdbb/esmvalcore/preprocessor/_time.py#L316-L320

We could add a new preprocessor function to add any of the extra coordinates provided by iris.coord_categorisation.add_* if diagnostics need them, preferably quite late in the preprocessing chain to minimize the risk of accidentally deleting those coordinates again.

BenMGeo commented 5 years ago

Just make sure that the diagnostic editors know about that change. Most of my code currently assumes that these coordinates are there, as I set them as a standard. I will happily change that, but others might run into issues if they are not aware.

BenMGeo commented 4 years ago

@mattiarighi do you want me to test whether this is still an issue with iris 2.3?

mattiarighi commented 4 years ago

Yes, that would be great. Thank you!

BenMGeo commented 4 years ago

So... The good news: it is faster! The bad news: it is still a little slower than without the extra coordinates!

small_time = Mean_Cube(air_temp_small, "time")
# Finished 'Mean_Cube' in 0.0035 secs
print(small_time)
# ...
big_time = Mean_Cube(air_temp_big, "time")
# Finished 'Mean_Cube' in 0.0231 secs
print(big_time)
# ...

If a slowdown of a factor of ~10 is acceptable for you, this is OK. I will still delete the coordinates, as deleting costs less time.

mattiarighi commented 4 years ago

Would you be willing to implement @jvegasbsc's and @bouweandela's suggestion?

BenMGeo commented 4 years ago

I need to dive into this part of the code and make sure it stays backward compatible. If neither of the original authors is able to do it, I'll give it a go.

jvegreg commented 4 years ago

I can do it. It is not especially complicated, and I need something easy to get back on track after the holidays.

BenMGeo commented 4 years ago

Thank you @jvegasbsc. Tell me if you need something for support.

jvegreg commented 4 years ago

One question: do we really need a preprocessor function to add those coordinates? It is fairly easy (and maybe clearer) to add the coordinate in the diagnostic if needed.

The only case in which I think we may need it is if they are used in diagnostics that are not iris based, but I don't know if that is the case.

valeriupredoi commented 4 years ago

Quite a few that are not iris based, man, either xarray or cdo (see @earnone 's and a few others)

jvegreg commented 4 years ago

Quite a few that are not iris based, man, either xarray or cdo (see @earnone 's and a few others)

But are they using them? You don't need those coordinates to do a selmon with CDO...

valeriupredoi commented 4 years ago

That I dunno, too complicated a question given I'm still on holidays :grin: :beer:

bouweandela commented 4 years ago

One question, do we really need a preprocessor function to add those coordinates it? It is fairly easy (and maybe clearer) to add the coordinate in the diagnostic if needed.

Maybe you can grep for those coordinate names in the existing diagnostics? That will give you a good impression of how commonly they are used.
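
The suggested grep could be done, for instance, with a small Python scan over a diagnostics tree (a sketch; `find_coord_uses` is a hypothetical helper, and the word-boundary pattern is an assumption about how the names appear in scripts):

```python
import os
import re

# The auxiliary coordinate names added by the backend.
COORD_NAMES = ("day_of_year", "day_of_month", "year", "month_number")
PATTERN = re.compile(r"\b(" + "|".join(COORD_NAMES) + r")\b")


def find_coord_uses(root):
    """Yield (path, line number, line) for each mention of a coordinate name."""
    for dirpath, _, filenames in os.walk(root):
        for fname in filenames:
            path = os.path.join(dirpath, fname)
            try:
                with open(path, errors="ignore") as handle:
                    for lineno, line in enumerate(handle, 1):
                        if PATTERN.search(line):
                            yield path, lineno, line.strip()
            except OSError:
                continue
```

Note that a bare word-boundary match on `year` will also flag locally defined variables, so the hits still need a manual check, as mattiarighi points out below for the NCL scripts.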

earnone commented 4 years ago

I don't think any of us is using them. I agree running cdo does not need them.

bouweandela commented 4 years ago

@mattiarighi Do you know of any NCL diagnostics using this? I quickly checked and couldn't find anything, but the year coordinate is difficult to spot at a glance.

mattiarighi commented 4 years ago

year seems to be used quite often in NCL scripts, however it is hard to tell whether it is internally defined or required from the preprocessor. I will test the affected recipes to make sure.

The other coordinates are never used.

BenMGeo commented 4 years ago

In my diagnostics I just delete all aux coordinates if there are any, so I have no compatibility problems.

mattiarighi commented 4 years ago

year seems to be used quite often in NCL scripts, however it is hard to tell whether it is internally defined or required from the preprocessor. I will test the affected recipes to make sure.

See checklist.