SciTools / iris

A powerful, format-agnostic, and community-driven Python package for analysing and visualising Earth science data
https://scitools-iris.readthedocs.io/en/stable/
BSD 3-Clause "New" or "Revised" License

Allow cubes/coords/etc to share data #3172

Open · DPeterK opened this issue 6 years ago

DPeterK commented 6 years ago

In some cases it is highly advantageous for cubes to be able to share a single data object, which Iris currently cannot handle. Supporting this would mean that Iris could, in some cases, produce views of data rather than copies.

Here's @pp-mo's take on this topic:

IMHO we should aim to be "like numpy". In this context, that means, in the worst cases (e.g. indexing):

"Result is usually a view, but in some cases a copy."
"It's too complicated to explain exactly when."
"It might change in future releases."

There is some prior work on this topic, including #2261, #2549, #2584, #2681 and #2691, which reflects its importance. However, given the potential for unexpected behaviour that this change would bring, further thought is still required.

pp-mo commented 6 years ago

> unexpected behaviour

Some key points from my prior thought on this ...

pp-mo commented 2 years ago

Iris 3.2 and the unstructured data model

Since v3.2 and the unstructured data model, we do finally get cubes which share some components: namely, any cube.mesh

Summary of some relevant facts about the new data-model objects

Basic relevant facts + ideas

Mesh

MeshCoords

These are a sort of "convenience" component ..

Mesh location coordinates and Connectivities

Sharing of dimensional components (potentially big arrays)

This is a relevant issue, simply because unstructured data comes with a lot of associated mesh information: large coordinate and connectivity arrays, typically much larger than the structured equivalents for the same size of data.
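
To give a rough sense of scale (illustrative numbers, not taken from the issue), compare the coordinate payload of a structured grid with a quad mesh covering the same cells:

```python
# Illustrative size comparison; the mesh layout assumed here is a simple quad
# mesh, and all numbers are hypothetical.
nx = ny = 1000                      # structured grid: 1000 x 1000 cells
structured_coord_values = nx + ny   # two 1-D dimension coords: 2,000 values

n_faces = nx * ny                   # equivalent unstructured mesh: 1e6 faces
n_nodes = (nx + 1) * (ny + 1)       # ~1e6 nodes for a quad mesh
mesh_values = (
    4 * n_faces    # face-node connectivity: 4 node indices per quad
    + 2 * n_nodes  # node x/y coordinates
    + 2 * n_faces  # face-centre x/y coordinates
)                  # ~8 million values

print(structured_coord_values, mesh_values)  # 2000 8004002
```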

Mesh coordinates and connectivities are effectively shared between cubes, since they belong to the Mesh, which is itself shared (see the sketch below). However, identical meshes loaded from different files cannot currently be identified and shared.

Any related AuxCoord/CellMeasure/AncillaryVariable on the unstructured dimension cannot be shared. They can be lazy, of course, but each Cube will have its own copy.
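
For illustration, a minimal sketch of the existing mesh sharing, using the Iris 3.2-era iris.experimental.ugrid loading API (the filename is hypothetical, and this assumes the file holds several data variables on one mesh):

```python
import iris
from iris.experimental.ugrid import PARSE_UGRID_ON_LOAD

# "unstructured_file.nc" is a hypothetical UGRID file containing several
# data variables defined on the same mesh.
with PARSE_UGRID_ON_LOAD.context():
    cubes = iris.load("unstructured_file.nc")

c1, c2 = cubes[0], cubes[1]
# Both cubes reference the *same* Mesh object, so its coordinate and
# connectivity arrays exist only once in memory:
print(c1.mesh is c2.mesh)  # True, when loaded from the same file
```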

pp-mo commented 1 year ago

Discussed briefly offline with @hdyson, since he and IIRC @cpelley were the original users most concerned about the inefficiency of this.

His recollection of the problem to be addressed was somewhat different ... He thinks it arose in the context of combining multiple results into a single array to then be saved, rather than the sharing of components in loaded data.

The thing is, sharing of partial data arrays by multiple cubes is already possible. For example:

```python
>>> import numpy as np
>>> from iris.cube import Cube
>>> data = np.zeros((10,))
>>> c1, c2, c99 = Cube(data[:5]), Cube(data[5:]), Cube(data[4:8])
>>> c1.data[3] = 7
>>> c2.data[:4] = 99
>>> c99.data[:] = 50
>>> data
array([ 0.,  0.,  0.,  7., 50., 50., 50., 50., 99.,  0.])
>>> c1.data
array([ 0.,  0.,  0.,  7., 50.])
>>> c2.data
array([50., 50., 50., 99.,  0.])
>>> c99.data
array([50., 50., 50., 50.])
```
pp-mo commented 1 year ago

In the course of the above discussion, I rather revised my thoughts.

My understanding is that the major opportunity for inefficiency is where multiple cubes contain identical components, such as aux-coords, ancillary-variables or cell measures. It doesn't really apply to cube data, since we don't generally expect cube data to be linked.

If all those cube-components' data may be realised, then there is an obvious inefficiency (e.g. there was a period when saving cubes realised all aux-coords, though that is now fixed).
If these contain real data, then this could easily be shared, as the cube data examples above show. However, normally, when loaded from a file, these components contain multiple lazy arrays, all referencing the same data in the file.
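
A minimal sketch of the consequence (read_block is a hypothetical stand-in for a file fetch, such as a NetCDFDataProxy performs): independent lazy arrays over the same underlying data each fetch it separately.

```python
import dask
import dask.array as da
import numpy as np

reads = {"n": 0}

def read_block():
    # Hypothetical stand-in for a file read (e.g. a NetCDFDataProxy fetch).
    reads["n"] += 1
    return np.arange(4.0)

# Two independent lazy arrays over the "same" underlying data:
lazy1 = da.from_delayed(dask.delayed(read_block)(), shape=(4,), dtype=float)
lazy2 = da.from_delayed(dask.delayed(read_block)(), shape=(4,), dtype=float)

lazy1.compute()
lazy2.compute()
print(reads["n"])  # 2 -- each lazy array fetched the data separately
```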

So, in the lazy case, it is quite possible that some cube operations might load all that data, or at least transiently fetch it multiple times (e.g. within computation of a lazy result, or a save).

I think there is no clean way to "link" the separate lazy arrays, but it should be possible for the cubes to share either the cube components themselves -- i.e. the objects, such as aux-coords -- or, within those, their DataManagers. Effectively, this is already happening with Meshes. With that provision, realising the components would "cache" the data and not re-read it (still less allocate additional array space).

However, that in itself would still not improve lazy operations -- including lazy streaming during netcdf writes -- since dask does not cache results, and the lazy content would still be re-fetched multiple times. To address that, it would be possible to implement a caching feature within NetCDFDataProxy objects, but that approach is not very controllable, and could itself cause problems if the total data size of a single object is large (in which case, storing only one chunk at a time may be highly desirable).
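
A minimal sketch of that component-sharing idea, done by hand (this is not a proposed Iris API, just existing public calls; the coordinate name is illustrative): attaching the same AuxCoord object to several cubes means realising its lazy points once realises them for all.

```python
import dask.array as da
import numpy as np
from iris.coords import AuxCoord
from iris.cube import Cube

# One lazy coordinate object, shared by reference across several cubes.
shared = AuxCoord(da.zeros((1000,), chunks=(100,)), long_name="shared_aux")

cubes = [Cube(np.zeros((1000,))) for _ in range(3)]
for cube in cubes:
    cube.add_aux_coord(shared, 0)  # attaches the same object, not a copy

print(cubes[0].coord("shared_aux") is cubes[1].coord("shared_aux"))  # True

# Realising the points once caches them in the shared object ...
cubes[0].coord("shared_aux").points
# ... so every cube now sees realised points, with no further re-read:
print([c.coord("shared_aux").has_lazy_points() for c in cubes])
# [False, False, False]
```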

In short, we may need to focus more carefully on what the common problem cases actually are, since I think there has been some confusion here in the past, and all of the solutions proposed so far may have drawbacks.