SciTools / iris

A powerful, format-agnostic, and community-driven Python package for analysing and visualising Earth science data
https://scitools-iris.readthedocs.io/en/stable/
BSD 3-Clause "New" or "Revised" License

Masked array performance #3470

Closed: bjlittle closed this issue 9 months ago

bjlittle commented 4 years ago

Given https://github.com/SciTools/iris/pull/3301 and https://github.com/SciTools-incubator/iris-agg-regrid/pull/37 we should consider whether masked arrays with no masked points should be automatically converted to ordinary ndarrays for obvious performance benefits.
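For illustration, a minimal sketch of the kind of check-and-convert helper being suggested here (the function name is hypothetical, not an existing iris API):

```python
import numpy as np
import numpy.ma as ma


def demote_unmasked(array):
    """Return a plain ndarray when a masked array has no masked points."""
    if ma.isMaskedArray(array) and not ma.is_masked(array):
        # Nothing is actually masked, so the mask carries no information and
        # the underlying data can be used directly.
        return array.data
    return array


# A masked array with an all-False mask is demoted to a plain ndarray.
unmasked = ma.masked_array(np.arange(4.0), mask=[False] * 4)
print(type(demote_unmasked(unmasked)))  # <class 'numpy.ndarray'>
```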

Note that python-netcdf4 now always returns masked arrays.

This could be considered a breaking change in iris behaviour, and so puts it plumb in 3.0.0 territory.


As a user of iris, I want comparable regridding performance when using masked array data where no points are masked (i.e. mask=False) and non-masked arrays. I would also like this capability to be available so that I can apply it elsewhere in my code if I deem it necessary.

Acceptance Criteria

pp-mo commented 4 years ago

masked array data where no points are masked (i.e. mask=False)

That "i.e." isn't quite an equation : You can still have a mask array, but with no True values. For operations where masks are costly, like regridding, it will still be worth the cost of a test + convert. But it probably needs to be on a case-by-case basis. We should have some common code for it, though.

pp-mo commented 4 years ago

... meanwhile, could we not routinely convert the mask=False cases on loading, if that is generally useful ??

rcomer commented 3 years ago

This week I learned that a whole new implementation of masked arrays is afoot: https://github.com/ahaldane/ndarray_ducktypes/blob/master/doc/MaskedArray.md

Note that the owner of that repo is a numpy "member" and another member indicated a consensus for replacing numpy masked arrays "at some point" with something developed outside numpy: https://github.com/numpy/numpy/pull/16022#issuecomment-618616577

trexfeathers commented 1 year ago

First steps of investigation

The netcdf4 package represents unmasked data using a masked array whose mask is a full array of False values with a shape matching the underlying data array. Some performance can be gained by setting the mask to a single False value, thanks to optimisations within NumPy itself for 'non-masked masked arrays' (thanks to @rcomer for this example).
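For reference, the 'single False value' trick can be expressed with numpy.ma's existing shrink_mask method (a minimal sketch, not the netcdf4 or iris code):

```python
import numpy as np
import numpy.ma as ma

shape = (1000, 1000)
arr = ma.masked_array(np.zeros(shape), mask=np.zeros(shape, dtype=bool))
print(arr.mask.shape)         # (1000, 1000): a full mask array of False values

# shrink_mask() collapses an all-False mask down to the single nomask value,
# letting NumPy's 'non-masked masked array' fast paths apply.
arr.shrink_mask()
print(arr.mask is ma.nomask)  # True
```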

There is, however, a larger amount of performance to be gained if we can operate on pure NumPy arrays. This may be because implementations in numpy.ma are sometimes piecemeal, so the non-masked optimisation might not be applied throughout. I unfortunately don't have the resources available to investigate within NumPy itself, so I will instead pursue the idea of converting to NumPy arrays while operations are being performed.


How I investigated

SciTools/iris-agg-regrid#37 seemed like a good lead, so I used necromancy to get iris-agg-regrid working again and ran the below script.

Expand for the script

```python
from time import time

import iris
from iris.cube import Cube
import numpy as np
from numpy.random import random
from numpy import ma

from agg_regrid import AreaWeighted


def main():
    global_air_temp = iris.load_cube(iris.sample_data_path('air_temp.pp'))
    rotated_psl = iris.load_cube(iris.sample_data_path('rotated_pole.nc'))

    def upsample_cube(cube_in: Cube):
        upsample_factor = 10
        new_dim_coords = []
        for coord_in in cube_in.dim_coords:
            points = coord_in.points
            new_points = np.linspace(points.min(), points.max(), len(points) * upsample_factor)
            new_dim_coords.append(coord_in.copy(points=new_points))
        new_shape = [len(c.points) for c in new_dim_coords]
        new_data = random(new_shape)
        return Cube(
            data=new_data,
            dim_coords_and_dims=[(c, ix) for ix, c in enumerate(new_dim_coords)]
        )

    global_air_temp = upsample_cube(global_air_temp)
    rotated_psl = upsample_cube(rotated_psl)

    for coord_name in ("longitude", "latitude"):
        src_coord = global_air_temp.coord(coord_name)
        tgt_coord = rotated_psl.coord(f"grid_{coord_name}")
        for coord in (src_coord, tgt_coord):
            coord.guess_bounds()

    area_weighted = AreaWeighted()

    def time_regridding(src_cube: Cube):
        regridder = area_weighted.regridder(src_cube, rotated_psl)
        time_start = time()
        _ = regridder(src_cube)
        time_elapsed = time() - time_start
        print(f"Elapsed seconds: {time_elapsed}")

    print("Basic NumPy ...")
    global_air_temp.data = np.asarray(global_air_temp.data)
    time_regridding(global_air_temp)

    print("Simple masked ...")
    global_air_temp.data = ma.asarray(global_air_temp.data)
    time_regridding(global_air_temp)

    print("Full masked ...")
    full_mask = np.broadcast_to(np.array(False), global_air_temp.data.shape)
    global_air_temp.data = ma.array(global_air_temp.data.data, mask=full_mask)
    time_regridding(global_air_temp)


if __name__ == "__main__":
    main()
```

Results:

Basic NumPy ...
Elapsed seconds: 1.890709638595581
Simple masked ...
Elapsed seconds: 1.89862060546875
Full masked ...
Elapsed seconds: 1.6773123741149902

Then I disabled the conversion introduced in SciTools/iris-agg-regrid#37, which produced the below results:

Basic NumPy ...
Elapsed seconds: 1.7257087230682373
Simple masked ...
Elapsed seconds: 4.838908910751343
Full masked ...
Elapsed seconds: 5.676675796508789

trexfeathers commented 1 year ago

Possible approaches

Convert all inputs of a function, re-convert on output

Pros

Cons
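A rough sketch of how this first approach ('convert all inputs of a function, re-convert on output') could look, using hypothetical helper names rather than any existing iris code:

```python
import functools

import numpy as np
import numpy.ma as ma


def unmask_for_speed(func):
    """Strip redundant masks from array arguments, re-mask the result.

    Hypothetical decorator sketching 'convert all inputs, re-convert on output'.
    """
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        had_redundant_mask = False
        converted = []
        for arg in args:
            if ma.isMaskedArray(arg) and not ma.is_masked(arg):
                had_redundant_mask = True
                arg = arg.data  # operate on the plain ndarray
            converted.append(arg)
        result = func(*converted, **kwargs)
        # Re-convert so callers still see the array type they passed in.
        if had_redundant_mask and isinstance(result, np.ndarray):
            result = ma.asarray(result)
        return result
    return wrapper


@unmask_for_speed
def double(array):
    return array * 2
```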

Array sub-classes

Pros

Cons

A 'register' of objects to re-convert

I don't know how to implement this. A register of Cubes would work, since we can assign to my_cube.data, but this concept is most useful if it can also work on plain arrays, and those are not contained by another object, so assignment isn't possible.
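For the Cube-only part of the idea, one possible (entirely hypothetical) shape is sketched below; as noted above, it does nothing for bare arrays, which are not contained by another object:

```python
import numpy.ma as ma

# Hypothetical registry of (cube, original_mask) pairs whose data has been
# temporarily demoted to plain ndarrays.
_unmask_registry = []


def register_unmask(cube):
    data = cube.data
    if ma.isMaskedArray(data) and not ma.is_masked(data):
        _unmask_registry.append((cube, data.mask))
        cube.data = data.data  # assignment works because the Cube owns its data
    return cube


def restore_masks():
    while _unmask_registry:
        cube, mask = _unmask_registry.pop()
        cube.data = ma.masked_array(cube.data, mask=mask)
```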

trexfeathers commented 9 months ago

I believe #5178 proves that, while there is a case for optimising specific operations, there isn't much benefit in a generic way of temporarily un-masking. Happy to re-open if others think differently.