DPeterK opened 6 years ago
unexpected behaviour
Some key points from my prior thoughts on this ...
Since v3.2 / unstructured support, we do finally get cubes which share some components: that is, any `cube.mesh` is a sort of shared "convenience" component ..
This is a relevant issue, simply because unstructured data comes with a lot of associated mesh information: large coordinate + connectivity arrays, typically much larger than the structured equivalents for the same size of data.
Mesh Coordinates and Connectivities are effectively shared between cubes, since they belong to the Mesh, which is itself shared -- though identical meshes loaded from different files cannot currently be identified and shared.
Any related AuxCoord/CellMeasure/Ancil on the unstructured dimension cannot be shared. They can be lazy, of course, but each Cube will have its own copy.
Discussed briefly offline with @hdyson, since he and IIRC @cpelley were the original users most concerned about the inefficiency of this.
His recollection of "the problem" to be addressed was somewhat different: he thinks it was in the context of combining multiple results into a single array to then be saved, rather than to do with sharing of components in loaded data.
The thing is, sharing of partial data arrays by multiple cubes is already possible. For example:
>>> import numpy as np
>>> from iris.cube import Cube
>>> data = np.zeros((10,))
>>> c1, c2, c99 = Cube(data[:5]), Cube(data[5:]), Cube(data[4:8])
>>> c1.data[3] = 7
>>> c2.data[:4] = 99
>>> c99.data[:] = 50
>>> data
array([ 0., 0., 0., 7., 50., 50., 50., 50., 99., 0.])
>>> c1.data
array([ 0., 0., 0., 7., 50.])
>>> c2.data
array([50., 50., 50., 99., 0.])
>>> c99.data
array([50., 50., 50., 50.])
In the course of the above discussion, I rather revised my thoughts.
My understanding is that the major opportunity for inefficiency is where multiple cubes contain identical components, such as aux-coords, ancillary-variables or cell measures. It doesn't really apply to cube data, since we don't generally expect cube data to be linked.
If all those cube-components' data may be realised, then there is an obvious inefficiency.
(e.g. there was a period when saving cubes realised all aux-coords -- though that is now fixed).
If these contain real data, then this could easily be shared, as the above cube data examples show.
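To make the "easily shared" point concrete, here is a minimal numpy sketch (the variable names are hypothetical, not Iris API): real arrays can be handed to multiple owners as views rather than copies, and `np.shares_memory` confirms which is which.

```python
import numpy as np

# Hypothetical coordinate payload that two cubes' aux-coords could hold.
node_x = np.linspace(0.0, 360.0, 1_000_000)

# Copying per cube duplicates the memory ...
copy_for_cube2 = node_x.copy()
assert not np.shares_memory(node_x, copy_for_cube2)

# ... whereas handing each cube the same array (or a view) does not.
shared_for_cube2 = node_x.view()
assert np.shares_memory(node_x, shared_for_cube2)

# A write through one reference is visible through the other,
# which is exactly the "linked" behaviour shown in the REPL example above.
shared_for_cube2[0] = -999.0
assert node_x[0] == -999.0
```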
However, normally, when loaded from file, these components would contain multiple lazy arrays, referencing the same data in the file.
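The consequence can be sketched with a toy stand-in for a file-backed data proxy (the `CountingProxy` class here is purely illustrative, not an Iris class): because each loaded cube holds its own proxy onto the same variable, realising each one triggers a separate read of identical data.

```python
import numpy as np

class CountingProxy:
    """Illustrative stand-in for a file-backed lazy data proxy.

    Counts how many times the underlying "file" is actually read.
    """
    reads = 0  # class-level counter across all proxies

    def __init__(self, payload):
        self._payload = payload

    def fetch(self):
        CountingProxy.reads += 1
        return np.asarray(self._payload)

file_values = [1.0, 2.0, 3.0]

# Two cubes loaded from the same file each get their *own* proxy
# onto the same variable -- there is no link between them.
coord_on_cube1 = CountingProxy(file_values)
coord_on_cube2 = CountingProxy(file_values)

# Realising each component fetches the same bytes separately.
a = coord_on_cube1.fetch()
b = coord_on_cube2.fetch()
assert (a == b).all()
assert CountingProxy.reads == 2  # identical data read twice
```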
So, in the lazy case, it is quite possible that some cube operations might load all that data, or at least transiently fetch it multiple times (e.g. within computation of a lazy result, or a save).

I think there is no clean way to "link" the separate lazy arrays, but it should be possible for the cubes to share either the cube components themselves -- i.e. the objects, such as aux-coords -- or, within those, their DataManagers. Effectively, this is already happening with Meshes. With that provision, realising the components would "cache" the data and not re-read it (still less allocate additional array space).

However, that in itself would still not improve lazy operations -- including lazy streaming during netcdf writes -- since dask does not cache results, and the lazy content would still be re-fetched multiple times. To address that, it would be possible to implement a caching feature within NetCDFDataProxy objects, but that approach is not very controllable -- and could itself cause problems if the total data size of a single object is large (in which case, storing only one chunk at a time may be highly desirable).
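The caching idea can be sketched as follows. This is only an illustration of the concept (the `CachingProxy` class is hypothetical, not a proposed Iris implementation): the first realisation stores the result, so later ones reuse the same array instead of re-reading.

```python
import numpy as np

class CachingProxy:
    """Hypothetical sketch of caching inside a data proxy:
    the first access reads and keeps the data; later accesses reuse it."""

    def __init__(self, fetch_fn):
        self._fetch_fn = fetch_fn  # callable standing in for a file read
        self._cache = None

    def __call__(self):
        if self._cache is None:
            # First access: actually read (and keep) the data.
            self._cache = self._fetch_fn()
        return self._cache

reads = []  # record of how often the underlying "file" was touched
proxy = CachingProxy(lambda: (reads.append(1), np.arange(5))[1])

first = proxy()
second = proxy()
assert first is second   # one shared array, no re-allocation
assert len(reads) == 1   # the underlying data was read only once
```

As noted above, the trade-off is memory: caching whole objects is risky when a single object is large, so caching one chunk at a time may be preferable.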
In short, we may need to focus more carefully on what the common problem cases actually are, since I think there has been some confusion here in the past, and all the solutions so far proposed have potential drawbacks.
In some cases it would be highly advantageous for cubes to be able to share a data object, which Iris currently cannot handle. Sharing would mean Iris could, in some cases, produce views of data rather than copies.
Here's @pp-mo's take on this topic:
There is some prior work on this topic, including #2261, #2549, #2584, #2681 and #2691, which reflects its importance. However, given the potential for unexpected behaviour that this change would bring, further thought is still required.