Cache for area-weighted regridder

corinnebosley commented 7 years ago

This has been a major issue for several people, here are a few links to provide some context: https://github.com/SciTools/iris/issues/2370 https://exxconfigmgmt:6391/browse/EPM-1542

@abooton - Updating the description 03/12/2019. As a user of the area-weighted regridder I would like to cache the area-weights (as well as snapshot the grid info) so that I can reduce the time taken to regrid multiple cubes.

Description: As described in the iris documentation: https://scitools.org.uk/iris/docs/latest/userguide/interpolation_and_regridding.html?highlight=regridding#caching-a-regridder "If you need to regrid multiple cubes with a common source grid onto a common target grid you can ‘cache’ a regridder to be used for each of these regrids. This can shorten the execution time of your code as the most computationally intensive part of a regrid is setting up the regridder."

Unfortunately the weights are not currently cached, so the benefit described is not realised when carrying out area-weighted regridding. It is noted that although, at present, the majority of time is currently spent calculating the weighted-mean, computing the weights can be significant e.g ~25% for non-masked arrays (not masked for which stats are reported on below).

See #2370 for a good example of setting the regridder up.

Acceptance Criteria:

[x] Main API: regrid_area_weighted_rectilinear_src_and_grid API should be maintained as it is
[x] Optimisation: Regridding one cube should not take longer than it does at present. (According to the ASV benchmarking)
[x] The area-weights should only be computed once when using the regridder method iris.analysis.AreaWeighted().regridder(cube1, cube2) (see example in #2370)
[x] If using masked data, but the mask is all false, treat as if it is non-masked during the meaning calculation.
[x] The regridder should retain grid checking for compatibility with the src_cube
[x] The current infrastructure should be followed (i.e. __prepare and __perform)
[x] The _regrid_area_weighted code should remain in 'experimental/regrid.py' module (moving it into the main code is out of scope)
[x] Weights should be calculated akin to now (i.e. reading from a file is out-of-scope)
[x] Weights should be included as part of the regridder cache (i.e. reading from a file, or making the weights available for further use is out-of-scope)
[x] If weights computation is refactored, ASV testing is required
[ ] If numba is used for code speed-up, it should be implemented on an "if available - use it" basis
[x] Any other performance results should preferably reference the examples in this or the associated tickets (see above).

Note: The weights are currently computed alongside the weighted mean calculation, in the loop. If the weights calculation is refactored, the grid points will be looped over twice instead of once. Therefore, it is suggested that the work is developed in a feature branch, and implemented once code refactoring and optimisation are both complete.

cpelley commented 7 years ago

The same is true for all of the regridders currently in iris I believe (Linear/Nearest/AreaWeighted). No caching of weights and indices takes place.

marqh commented 7 years ago

i would like to put this together into a use case for some planned enhancement work we have for later this year

further comments and thoughts on this, to keep it in my notifications, are much appreciated

tv3141 commented 7 years ago

Use case: The AreaWeighted is currently the only conservative scheme.

AreaWeighted is ~1000 time slower than Linear or Nearest.

Is it necessary to have two ways of doing regridding?

pass a regridding scheme to the regrid method of a cube
create a Regridder object

If the weight creation is the most expensive step, a regridding function could return the weights explicitly, and feed them into the regridding function the next time. This would not add more complexity to the user than using Regridder objects, but would potentially be much simpler.

Testing the current caching method seems difficult. (Is it caching? Is it using the cache? it there a benefit using the cache?)

cpelley commented 7 years ago

@tv3141 the current UI allows for a generic interface to regridding/interpolation and for the implementation details be hidden from the user. This framework is what allows caching to be developed without changing the UI for users. I think the issue you raise would be better approached by extending the docstrings which should highlight current capabilities and limitations. help(iris.analysis.Linear) for example is rather lacking in necessary details.

tv3141 commented 7 years ago

The bottleneck is averaging the source grid points using np.ma.average. The weight calculation takes <10%.

The area-weighted regridding is done in a large loop over the target grid points. The weights are calculated inside the loop for each target grid point. The regridding is done in one step for all dimensions (time, levels).

https://github.com/SciTools/iris/blob/v1.12.x/lib/iris/experimental/regrid.py#L503

def _regrid_area_weighted_array(src_data, x_dim, y_dim,
                                src_x_bounds, src_y_bounds,
                                grid_x_bounds, grid_y_bounds,
                                grid_x_decreasing, grid_y_decreasing,
                                area_func, circular=False, mdtol=0):

    [...]

    # Simple for loop approach.
    for j, (y_0, y_1) in enumerate(grid_y_bounds):

        [...]

        for i, (x_0, x_1) in enumerate(grid_x_bounds):

            [...]

                data = src_data[tuple(indices)]

                # Calculate weights based on areas of cropped bounds.
                weights = area_func(y_bounds, x_bounds)

                [...]

                # Calculate weighted mean taking into account missing data.
                new_data_pt = _weighted_mean_with_mdtol(data, 
                    weights=weights, axis=axis, mdtol=mdtol)

                # Insert data (and mask) values into new array.

                [...]

                new_data[tuple(indices)] = new_data_pt

    [...]

    return new_data

def _weighted_mean_with_mdtol(data, weights, axis=None, mdtol=0):
    """
    ...
    """
    res = ma.average(data, weights=weights, axis=axis)

    [...]

    return res

Profiling

around 3/4 of the time is spent in np.ma.average
weight calculation takes currently less than 10% of the time
linear complexity
largest factor: number of target grid points
run time is almost independent of source grid
increasing the amount of data (more time steps or levels) increases the run with a small linear factor

import cProfile
import pstats
import iris

areaweighted_regridder = iris.analysis.AreaWeighted().regridder(create_cube(*resolution['N768']), \
                                                                create_cube(*resolution['N96'])) 
cube = create_cube(*resolution['N768'], levels=3)
cProfile.run('areaweighted_regridder(cube)', 'aw_regrid.prof')
stats = pstats.Stats('aw_regrid.prof')
stats.strip_dirs()
stats.sort_stats('cumulative')
stats.print_stats()

Tue May  2 17:05:24 2017    aw_regrid.prof

         9573273 function calls (9573143 primitive calls) in 12.114 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   12.114   12.114 <string>:1(<module>)
        1    0.000    0.000   12.114   12.114 _area_weighted.py:89(__call__)
        1    0.000    0.000   12.114   12.114 regrid.py:627(regrid_area_weighted_rectilinear_src_and_grid)
        1    0.487    0.487   12.110   12.110 regrid.py:405(_regrid_area_weighted_array)
    27648    0.070    0.000    9.490    0.000 regrid.py:353(_weighted_mean_with_mdtol)
--> 27648    0.270    0.000    9.400    0.000 extras.py:464(average)
   138241    1.139    0.000    3.060    0.000 core.py:2865(__array_finalize__)
    27648    0.066    0.000    2.861    0.000 core.py:3908(__truediv__)
    55296    0.230    0.000    2.842    0.000 core.py:1002(reduce)
   248833    1.224    0.000    2.785    0.000 core.py:2839(_update_from)
    27648    0.774    0.000    2.769    0.000 core.py:1109(__call__)
    ...
    27648    0.041    0.000    0.801    0.000 regrid.py:324(_spherical_area)  # weights

`np.ma.average` 10x slower than `np.average`

import numpy as np
print np.__version__

arr = np.random.rand(1,8,8,3)
w = np.random.rand(1,8,8,3)

print 'np.ma.average axis'
print np.ma.average(arr, weights=w, axis=(1,2))
%timeit a = np.ma.average(arr, weights=w, axis=(1,2))
print

print 'np.average axis'
print np.ma.average(arr, weights=w, axis=(1,2))
%timeit a = np.average(arr, weights=w, axis=(1,2))

1.12.1
np.ma.average axis
[[0.5558622851924222 0.5015630215812361 0.512327031679091]]
1000 loops, best of 3: 240 µs per loop

np.average axis
[[0.5558622851924222 0.5015630215812361 0.512327031679091]]
10000 loops, best of 3: 27.4 µs per loop

Even without mask np.ma.average is ~10x slower than the normal array version np.average. I do not understand why, identical lines of code have large differences in execution time.

import numpy as np
print np.__version__  # 1.12.1
from line_profiler import LineProfiler

arr = np.random.rand(1,8,8,3)
w = np.random.rand(1,8,8,3)

lp = LineProfiler()
lp_wrapper = lp(np.ma.average)
lp_wrapper(arr, weights=w)
lp.print_stats()

lp = LineProfiler()
lp_wrapper = lp(np.average)
lp_wrapper(arr, weights=w)
lp.print_stats()

Timer unit: 1e-06 s

Total time: 0.000392 s
File: ~/localdata/python_test/miniconda/envs/iris_new_np/lib/python2.7/site-packages/numpy/ma/extras.py
Function: average at line 521

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   521                                           def average(a, axis=None, weights=None, returned=False):

   572         1          105    105.0     26.8      a = asarray(a)
   573         1            4      4.0      1.0      m = getmask(a)
   574                                           
   575                                               # inspired by 'average' in numpy/lib/function_base.py
   576                                           
   577         1            1      1.0      0.3      if weights is None:
   578                                                   avg = a.mean(axis)
   579                                                   scl = avg.dtype.type(a.count(axis))
   580                                               else:
   581         1            6      6.0      1.5          wgt = np.asanyarray(weights)
   582                                           
   583         1            3      3.0      0.8          if issubclass(a.dtype.type, (np.integer, np.bool_)):
   584                                                       result_dtype = np.result_type(a.dtype, wgt.dtype, 'f8')
   585                                                   else:
   586         1            3      3.0      0.8              result_dtype = np.result_type(a.dtype, wgt.dtype)
   587                                           
   588                                                   # Sanity checks
   589         1            2      2.0      0.5          if a.shape != wgt.shape:

   605         1            1      1.0      0.3          if m is not nomask:
   606                                                       wgt = wgt*(~a.mask)
   607                                           
   608         1           35     35.0      8.9          scl = wgt.sum(axis=axis, dtype=result_dtype)
   609         1          230    230.0     58.7          avg = np.multiply(a, wgt, dtype=result_dtype).sum(axis)/scl
   610                                           
   611         1            1      1.0      0.3      if returned:
   612                                                   if scl.shape != avg.shape:
   613                                                       scl = np.broadcast_to(scl, avg.shape).copy()
   614                                                   return avg, scl
   615                                               else:
   616         1            1      1.0      0.3          return avg

Timer unit: 1e-06 s

Total time: 9.7e-05 s
File: ~/localdata/python_test/miniconda/envs/iris_new_np/lib/python2.7/site-packages/numpy/lib/function_base.py
Function: average at line 1021

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  1021                                           def average(a, axis=None, weights=None, returned=False):

  1095                                               # 3/19/2016 1.12.0:
  1096                                               # replace the next few lines with "a = np.asanyarray(a)"
  1097         1            3      3.0      3.1      if (type(a) not in (np.ndarray, np.matrix) and

  1106         1            3      3.0      3.1      if not isinstance(a, np.matrix):
  1107         1            7      7.0      7.2          a = np.asarray(a)
  1108                                           
  1109         1            1      1.0      1.0      if weights is None:
  1110                                                   avg = a.mean(axis)
  1111                                                   scl = avg.dtype.type(a.size/avg.size)
  1112                                               else:
  1113         1            4      4.0      4.1          wgt = np.asanyarray(weights)
  1114                                           
  1115         1            4      4.0      4.1          if issubclass(a.dtype.type, (np.integer, np.bool_)):
  1116                                                       result_dtype = np.result_type(a.dtype, wgt.dtype, 'f8')
  1117                                                   else:
  1118         1            3      3.0      3.1              result_dtype = np.result_type(a.dtype, wgt.dtype)
  1119                                           
  1120                                                   # Sanity checks
  1121         1            2      2.0      2.1          if a.shape != wgt.shape:

  1137         1           32     32.0     33.0          scl = wgt.sum(axis=axis, dtype=result_dtype)
  1138         1           16     16.0     16.5          if (scl == 0.0).any():
  1139                                                       raise ZeroDivisionError(
  1140                                                           "Weights sum to zero, can't be normalized")
  1141                                           
  1142         1           20     20.0     20.6          avg = np.multiply(a, wgt, dtype=result_dtype).sum(axis)/scl
  1143                                           
  1144         1            1      1.0      1.0      if returned:
  1145                                                   if scl.shape != avg.shape:
  1146                                                       scl = np.broadcast_to(scl, avg.shape).copy()
  1147                                                   return avg, scl
  1148                                               else:
  1149         1            1      1.0      1.0          return avg

tv3141 commented 7 years ago

To learn more about np.ma.average, I copied its code in a Jupyer notebook, and started removing non-essential code. I noticed that running a copy of the function np.ma.average in the current namespace speeds it up more than 10x. It also produces single precision results instead of double precision.

Any ideas what makes np.ma.average slow, and what causes the speed-up? What determines the precision of the calculation?

import inspect
import numpy as np
print np.__version__  # 1.12.1

# imports for copied code
from numpy import asarray
from numpy.ma import getmask, nomask

def ma_average(a, axis=None, weights=None, returned=False):
    # copied code from
    # print inspect.getsourcefile(np.ma.average)

arr = np.random.rand(1,8,8,3)
w = np.random.rand(1,8,8,3)

print 'np.ma.average'
print np.ma.average(arr, weights=w, axis=(1,2))
%timeit a = np.ma.average(arr, weights=w, axis=(1,2))
print

print 'ma_average'
print ma_average(arr, weights=w, axis=(1,2))
%timeit a = ma_average(arr, weights=w, axis=(1,2))
print

print 'np.average'
print np.ma.average(arr, weights=w, axis=(1,2))
%timeit a = np.average(arr, weights=w, axis=(1,2))
print

np.ma.average
[[0.5798026614723218 0.45410607639507333 0.4378237178069648]]
1000 loops, best of 3: 238 µs per loop

ma_average
[[ 0.57980266  0.45410608  0.43782372]]
10000 loops, best of 3: 21.9 µs per loop

np.average
[[0.5798026614723218 0.45410607639507333 0.4378237178069648]]
10000 loops, best of 3: 27.4 µs per loop

tv3141 commented 7 years ago

np.ma.average calls a = asarray(a) to convert a into a masked array. When I copied the code I imported the wrong asarray function:

from numpy import asarray

when it should have been

from numpy.ma import asarray

This also makes the 'local' ma.average slow again, so things make sense again.

np.ma.average is much slower than np.average when applied to small unmasked arrays, because of the expensive conversion to a masked array np.ma.asarray(a), and np.multiply(arr) being much slower when called with a masked array.

cpelley commented 6 years ago

I realised I perhaps neglected to mention that I would also really like to see caching implemented in iris :) Really interesting observations on masked arrays. Feels like quite a significant refactor to make this happen. To make weights really useful, these should also be accessible to the end user (example: users might use them directly to sparse arrays) and secondly, perhaps having these cached to disk or something. A lot to think about I think...

abooton commented 4 years ago

Re-order to be [..., y_dim, _dim]. See PR #3594 to feature branch (was originally PR #3587)
Ensure x_dim and y_dim. See PR #3595 to feature branch. (was originally PR #3588)
Optimise averaging: Move averaging out of loop and refactor weights. See PR #3596 to feature branch (was originally PR #3589)

This section has been reviewed in the "area_weighted_regridding" feature branch. It is self-contained and gives the same results as current master: See PR #3598

abooton commented 4 years ago

Update from refinement meeting:

The current status of the work was discussed. We think completing it is achievable with approx: • ½ day review work associated with re-jigging the current regrid_area_weighted_rectilinear_src_and_grid function • 4 days refactoring the tests (1 day for weights function, 1 day for prepare, perform, 1 day regridder, 1 day review comments) • 1 day review work for the above • Assistance from the sprint team (e.g. setting up an Iris build, updating test data) • (This is additional to the ~week of scoping/dev work already undertaken by AVD)

This nicely aligns with the 5 days dev work that Emma/Andy are providing.

abooton commented 4 years ago

Optimisation AC - time to process a single cube is quicker than previously see #2370

abooton commented 4 years ago

The feature branch, upstream/area_weighted_regridding can be merged. There are four outstanding items: 1) Some extra comments describing the regrid_info weight variables would be useful 2) The "jit" could be added in the future for further speed up 3) The ASV test needs further investigation as this should be demonstrated a speed up. (Our other testing shows the optimization delivers speed-up) 4) The dtype of the data returned remains unchanged. However, we think the current regridding should return the original data dtype, but it isn't necessarily doing this at present.

cpelley commented 4 years ago

Hi @abooton apologies if this isn't useful or I'm too late, but for what it's worth:

Awesome stuff! The issue of offline weights that you rightly keep off the table for this ticket might be worth talking about when you are ready - we have done this for our own ESMF regridder scheme.
I wonder what the implications are for memory usage here from these changes. Do you have any thoughts? - I think my gut would be that it would be that it should still be reasonable given the alternative approaches available which certainly give iris a lot more wiggle in this regard :)

...However, we think the current regridding should return the original data dtype...

I'm not 100% sure I have understood the context here but I would have thought it reasonable/flexible that area weighted regridding return data which relates to the source dtype by the following relation:

tgt_dtype = np.promote_types(src_dtype, 'float16')

Hope this helps in some way.

Merry Christmas!

abooton commented 4 years ago

The following quick check indicates we expect the ASV tests should have seen a noticeable improvement:

# commit  738093f1:  time: areawieghted_regridder 13.124 s  pre-optimisation
# commit  0d1ddb40:  time: areawieghted_regridder 1.575 s   post-optimisation

from iris import tests
import iris
from time import time

def regridding_test():
    file_path = tests.get_data_path(
                ["NetCDF", "global", "xyt", "SMALL_hires_wind_u_for_ipcc4.nc"]
            )
    cube = iris.load_cube(file_path)
    scheme_area_w = iris.analysis.AreaWeighted()
    ts = time()
    cube.regrid(cube, scheme_area_w)
    tf = time()
    print('areawieghted_regridder',round(tf - ts, 3), 's')

regridding_test()

abooton commented 4 years ago

Hi @cpelley,

Thanks for your thoughts.

Regarding the memory use:

I wonder what the implications are for memory usage here from these changes. Do you have any thoughts? - I think my gut would be that it would be that it should still be reasonable given the alternative approaches available which certainly give iris a lot more wiggle in this regard :)

We'd not specifically looked at this as the user requirement was for reducing computation time. The regridder object now retains the weights (which may be similar in size to the 2D data array, or might be bigger as they are float64). I would think the memory requried to average the data would be more, particularly for 2D diagnostics, because 1) the weights are pushed into a additional array and 2) the src data is initially stored in an array larger than that of the target grid prior to computing the mean. (Previously the weights and mean were computed independently by looping through every point on the target grid, so I think the memory would have been released between each calculation). It should be noted that the total memory useage of the new code could easily be optimised in a future PR, by initally storing the weights in the numpy array structure required by the regridding "perform" code. This further refactoring hasn't been done yet, as we didn't wish to introduce a bug into the weights calculations.

Regarding the data type:

I'm not 100% sure I have understood the context here but I would have thought it reasonable/flexible that area weighted regridding return data which relates to the source dtype by the following relation: tgt_dtype = np.promote_types(src_dtype, 'float16')

To clarify, the changes here retain the current dtype behaviour. However I was concerned that integer source data can be returned as type float.

src_dtype = np.int32
tgt_dtype = np.promote_types(src_dtype, 'float16')
print(tgt_dtype)
>> float64

cpelley commented 4 years ago

...could easily be optimised in a future PR, by initally storing the weights in the numpy array structure required by the regridding "perform" code...

Awesome! Yes this is how we made ESMF regridding approachable (the memory overhead there really is too large). Do feel free to give me a shout for a chat if/when you think it might be useful to do so.

To clarify, the changes here retain the current dtype behaviour...

I don't think AreaWeighted/bilinear regridding should ever return integer types. Performing area conservative regridding implies to me floats in the return since area calculations should be float. The result needs to be always both locally and globally conservative.

ehogan commented 4 years ago

I wonder what the implications are for memory usage here from these changes. Do you have any thoughts? - I think my gut would be that it would be that it should still be reasonable given the alternative approaches available which certainly give iris a lot more wiggle in this regard :)

@cpelley, using memory_profiler I generated the following plots by loading a cube (where cube.data.shape = (31, 160, 320)), then performing regridding 10 times:

using 738093f, profiling regrid_area_weighted_rectilinear_src_and_grid: 738093f

using the feature branch, profiling _regrid_area_weighted_rectilinear_src_and_grid__prepare and _regrid_area_weighted_rectilinear_src_and_grid__perform: updated

Edit: There is an ~15% increase in the average memory usage, but an ~90% decrease in the total time taken to perform the regridding.

cpelley commented 4 years ago

Generating data using float16 vs float64 reveals huge differences. That is, in hindsight, I don't think data recorded as integers should be treated as anything but the highest precision (float64). Integer data doesn't mean precision only to an integer. Example here is the land cover type fraction ancillary generation (int8->float64). I think I would make sense to always promote to float64 on this basis as not doing so assumes what the users data represents. Sorry for going back and fourth on this.

pp-mo commented 4 years ago

I think this is probably addressed by #3617 It's complicated, so need to spend some effort to confirm this is the case ?

trexfeathers commented 4 years ago

The following quick check indicates we expect the ASV tests should have seen a noticeable improvement: commit 738093f1: time: areawieghted_regridder 13.124 s pre-optimisation commit 0d1ddb40: time: areawieghted_regridder 1.575 s post-optimisation

This has been addressed by #3660 - asv now demonstrates a change in performance at the appropriate commit.

abooton commented 4 years ago

closed by #3623

Outstanding (numba) acceptance criteria is "nice to have".

SciTools / iris