SciTools / iris

A powerful, format-agnostic, and community-driven Python package for analysing and visualising Earth science data
https://scitools-iris.readthedocs.io/en/stable/
BSD 3-Clause "New" or "Revised" License

Allow cubes/coords/etc to share data #3172

Open · DPeterK opened this issue 6 years ago

DPeterK commented 6 years ago

In some cases it is highly advantageous for cubes to be able to share a single data object, which Iris currently cannot handle. Supporting this would mean that Iris could, in some cases, produce views of data rather than copies.

Here's @pp-mo's take on this topic:

IMHO we should aim to be "like numpy". In this context, that means, in the worst cases (e.g. indexing):

"Result is usually a view, but in some cases a copy."
"It's too complicated to explain exactly when."
"It might change in future releases."

There is some prior work on this topic, including #2261, #2549, #2584, #2681 and #2691, which reflects its importance. However, given the potential for unexpected behaviour that this change would bring, further thought is still required.

pp-mo commented 6 years ago

> unexpected behaviour

Some key points from my prior thought on this ...

pp-mo commented 2 years ago

Iris 3.2 and the unstructured data model

Since v3.2 and the unstructured data model, we do finally get cubes which share some components: namely, any cube.mesh

Summary of some relevant facts about the new data-model objects

Basic relevant facts + ideas

Mesh

MeshCoords

These are a sort of "convenience" component ..

Mesh location coordinates and Connectivities

Sharing of dimensional components (potentially big arrays)

This is a relevant issue, simply because unstructured data comes with a lot of associated mesh information: large coordinate and connectivity arrays, typically much larger than the structured equivalents for the same size of data.
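
To give a rough sense of scale (illustrative numbers, not taken from the issue), compare the coordinate payload of a structured grid with a quad mesh covering the same cells:

```python
# Illustrative size comparison; the mesh layout assumed here is a simple quad
# mesh, and all numbers are hypothetical.
nx = ny = 1000                      # structured grid: 1000 x 1000 cells
structured_coord_values = nx + ny   # two 1-D dimension coords: 2,000 values

n_faces = nx * ny                   # equivalent unstructured mesh: 1e6 faces
n_nodes = (nx + 1) * (ny + 1)       # ~1e6 nodes for a quad mesh
mesh_values = (
    4 * n_faces    # face-node connectivity: 4 node indices per quad
    + 2 * n_nodes  # node x/y coordinates
    + 2 * n_faces  # face-centre x/y coordinates
)                  # ~8 million values

print(structured_coord_values, mesh_values)  # 2000 8004002
```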

Mesh coordinates and connectivities are effectively shared between cubes, since they belong to the Mesh, which is itself shared (see the sketch below). However, identical meshes loaded from different files cannot currently be identified and shared.

Any related AuxCoord/CellMeasure/AncillaryVariable on the unstructured dimension cannot be shared. They can be lazy, of course, but each Cube will have its own copy.
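
For illustration, a minimal sketch of the existing mesh sharing, using the Iris 3.2-era iris.experimental.ugrid loading API (the filename is hypothetical, and this assumes the file holds several data variables on one mesh):

```python
import iris
from iris.experimental.ugrid import PARSE_UGRID_ON_LOAD

# "unstructured_file.nc" is a hypothetical UGRID file containing several
# data variables defined on the same mesh.
with PARSE_UGRID_ON_LOAD.context():
    cubes = iris.load("unstructured_file.nc")

c1, c2 = cubes[0], cubes[1]
# Both cubes reference the *same* Mesh object, so its coordinate and
# connectivity arrays exist only once in memory:
print(c1.mesh is c2.mesh)  # True, when loaded from the same file
```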

pp-mo commented 1 year ago

Discussed briefly offline with @hdyson, since he and IIRC @cpelley were the original users most concerned about the inefficiency of this.

His recollection of the problem to be addressed was somewhat different ... He thinks it arose in the context of combining multiple results into a single array to then be saved, rather than the sharing of components in loaded data.

The thing is, sharing of partial data arrays by multiple cubes is already possible. For example:

```python
>>> import numpy as np
>>> from iris.cube import Cube
>>> data = np.zeros((10,))
>>> c1, c2, c99 = Cube(data[:5]), Cube(data[5:]), Cube(data[4:8])
>>> c1.data[3] = 7
>>> c2.data[:4] = 99
>>> c99.data[:] = 50
>>> data
array([ 0.,  0.,  0.,  7., 50., 50., 50., 50., 99.,  0.])
>>> c1.data
array([ 0.,  0.,  0.,  7., 50.])
>>> c2.data
array([50., 50., 50., 99.,  0.])
>>> c99.data
array([50., 50., 50., 50.])
```
pp-mo commented 1 year ago

In the course of the above discussion, I rather revised my thoughts.

My understanding is that the major opportunity for inefficiency is where multiple cubes contain identical components, such as aux-coords, ancillary-variables or cell measures. It doesn't really apply to cube data, since we don't generally expect cube data to be linked.

If all those cube-components' data may be realised, then there is an obvious inefficiency (e.g. there was a period when saving cubes realised all aux-coords, though that is now fixed).
If these contain real data, then this could easily be shared, as the cube data examples above show. However, normally, when loaded from a file, these components contain multiple lazy arrays, all referencing the same data in the file.
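
A minimal sketch of the consequence (read_block is a hypothetical stand-in for a file fetch, such as a NetCDFDataProxy performs): independent lazy arrays over the same underlying data each fetch it separately.

```python
import dask
import dask.array as da
import numpy as np

reads = {"n": 0}

def read_block():
    # Hypothetical stand-in for a file read (e.g. a NetCDFDataProxy fetch).
    reads["n"] += 1
    return np.arange(4.0)

# Two independent lazy arrays over the "same" underlying data:
lazy1 = da.from_delayed(dask.delayed(read_block)(), shape=(4,), dtype=float)
lazy2 = da.from_delayed(dask.delayed(read_block)(), shape=(4,), dtype=float)

lazy1.compute()
lazy2.compute()
print(reads["n"])  # 2 -- each lazy array fetched the data separately
```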

So, in the lazy case, it is quite possible that some cube operations might load all that data, or at least transiently fetch it multiple times (e.g. within computation of a lazy result, or a save).

I think there is no clean way to "link" the separate lazy arrays, but it should be possible for the cubes to share either the cube components themselves -- i.e. the objects, such as aux-coords -- or, within those, their DataManagers. Effectively, this is already happening with Meshes. With that provision, realising the components would "cache" the data and not re-read it (still less allocate additional array space).

However, that in itself would still not improve lazy operations -- including lazy streaming during netcdf writes -- since dask does not cache results, and the lazy content would still be re-fetched multiple times. To address that, it would be possible to implement a caching feature within NetCDFDataProxy objects, but that approach is not very controllable, and could itself cause problems if the total data size of a single object is large (in which case, storing only one chunk at a time may be highly desirable).
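
A minimal sketch of that component-sharing idea, done by hand (this is not a proposed Iris API, just existing public calls; the coordinate name is illustrative): attaching the same AuxCoord object to several cubes means realising its lazy points once realises them for all.

```python
import dask.array as da
import numpy as np
from iris.coords import AuxCoord
from iris.cube import Cube

# One lazy coordinate object, shared by reference across several cubes.
shared = AuxCoord(da.zeros((1000,), chunks=(100,)), long_name="shared_aux")

cubes = [Cube(np.zeros((1000,))) for _ in range(3)]
for cube in cubes:
    cube.add_aux_coord(shared, 0)  # attaches the same object, not a copy

print(cubes[0].coord("shared_aux") is cubes[1].coord("shared_aux"))  # True

# Realising the points once caches them in the shared object ...
cubes[0].coord("shared_aux").points
# ... so every cube now sees realised points, with no further re-read:
print([c.coord("shared_aux").has_lazy_points() for c in cubes])
# [False, False, False]
```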

In short, we may need to focus more carefully on what the common problem cases actually are, since I think there has been some confusion here in the past, and all of the solutions proposed so far may have drawbacks.