SciTools / iris

A powerful, format-agnostic, and community-driven Python package for analysing and visualising Earth science data
https://scitools-iris.readthedocs.io/en/stable/
BSD 3-Clause "New" or "Revised" License

A Dataless Cube #4447

Open bjlittle opened 2 years ago

bjlittle commented 2 years ago

✨ Feature Request

I think it's healthy to challenge established norms...

I want the ability to create a dataless cube. By this I mean the ability to create a hyper-space defined only by metadata, i.e. with no data payload.

Once data is added to the cube, then the dimensionality is established and locked down, as we traditionally know and accept.

Motivation

Such hyper-spaces could be used in various ways e.g.,

I'm sure there are more concrete use cases... Please do share them on this issue if you know of any 🙏

Traditionally, there are many situations where a cube's insistence on having data is simply an inconvenience. Given the natural progression of model resolutions, it seems "just wrong" to abuse dask to create lazy data that will never be used. It reeks of something not being quite right to me.

Let's do something about that 😉

Please upvote this issue if you'd like to see this happen 👍

Steps

DPeterK commented 2 years ago

@bjlittle supermegahypercubes! That is, a cube that describes how huge numbers of incoming datasets would tile together to make an n-dimensional hyperstructure - think, for example, of representing an entire model run in a single object. This would ideally be represented as a metadata-only cube, with individual data payloads very much fetched on demand only, given the vast quantities of data such an object would represent.

We've considered this idea from a variety of different perspectives in the Informatics Lab, and we think it has legs. We've also given the idea a bunch of different names, but supermegahypercubes is the best, most whimsical and original name we came up with for the concept 🙂

pp-mo commented 2 years ago

@bjlittle are you including here the idea that possibly only some of the data might be "filled", with some of it left unidentified? That might be closer to the idea previously suggested, which I think was maybe called a "hypercube", probably in the Informatics Lab? IIRC that was certainly raised before, but we never managed to get around to seriously considering it. (@DPeterK I can't find an issue link for this -- maybe you can help?)

P.S. as a name, for that idea at least, I think "hypothicube" is neater (though for language purists that should probably be "hypothecube" 😉)

edmundhenley-mo commented 2 years ago

@bjlittle - re your concrete use-cases: If useful to see some (~pedestrian, non hyp[er|o]cube-y) code-in-wild examples of target hyperspace for interpolation/regridding, I've got a couple here (sorry, only viewable internally@MO). Almost certainly not optimal, but guessing poss still useful to see non-expert usage!

edmundhenley-mo commented 2 years ago

@pp-mo - dunno re issue, but wonder if you're recalling the part-filled example in Jacob's hypotheticube article? Or poss another Informatics Lab article? (Here's @DPeterK's one on supermegahypercubes.)

philip-brohan commented 1 year ago

I feed streams of cubes through Machine Learning software (TensorFlow - TF). This requires throwing away the metadata and operating only on the data arrays, and then laboriously reconstructing metadata around the output data. It would be great to be able to cut a cube into data and metadata components, process them separately, and recombine them later.
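The split/process/recombine workflow described above can be sketched with a toy stand-in for a cube (the MiniCube class here is hypothetical, purely for illustration, and not part of the iris API):

```python
import numpy as np

class MiniCube:
    """Toy stand-in for an iris Cube: a metadata dict plus a data payload.

    Hypothetical class for illustration only -- not part of iris.
    """
    def __init__(self, data, metadata):
        self.data = data
        self.metadata = metadata

def split(cube):
    # Separate the payload from a "dataless" metadata-only skeleton.
    return cube.data, MiniCube(None, cube.metadata)

def recombine(data, skeleton):
    # Re-attach a (possibly transformed) payload to the saved metadata.
    return MiniCube(data, skeleton.metadata)

cube = MiniCube(np.arange(6.0).reshape(2, 3),
                {"name": "air_temperature", "units": "K"})
data, skeleton = split(cube)
# Stand-in for an ML transformation of the raw array:
result = recombine(data * 2.0, skeleton)
print(result.metadata["name"], result.data.sum())
```

A real implementation would carry the full cube metadata (coords, attributes, etc.) in the skeleton, which is exactly what a dataless cube would provide.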

pp-mo commented 3 months ago

In Dragon Taming :tm: discussion today, I suggested that we should AFAP "contain" code changes within the DataManager class, i.e. no or minimal change should be required in Cube code.

Just as a hint for implementation, it is also very simple to make a lazy array which has no data, so it can participate normally in any lazy operations, but can't be fetched. You just need an object which supports shape, dtype, ndim and __getitem__, and you wrap it with dask.array.from_array. I've written code like this a few times now!

Here's a simple working example.

import dask.array as da
import numpy as np

class FakeArray:
    def __init__(self, shape, dtype):
        if not isinstance(dtype, np.dtype):
            dtype = np.dtype(dtype)
        self.dtype = dtype
        self.shape = shape
        self.ndim = len(shape)  # Dask requires ndim as well as shape, for some reason

    def __getitem__(self, keys):
        raise ValueError("FakeArray cannot be read.")

def lazy_fake(shape, dtype=np.float64):
    """A functional lazy array with known shape and dtype, but no actual data."""
    arr = FakeArray(shape, dtype)
    # Note: must pass 'meta' to from_array, to prevent it making a test data access
    meta = np.zeros((), dtype=arr.dtype)
    return da.from_array(arr, meta=meta)
>>> my_fake = lazy_fake((3, 4), 'i2')
>>> print('fake = ', my_fake)
fake =  dask.array<array, shape=(3, 4), dtype=int16, chunksize=(3, 4), chunktype=numpy.ndarray>
>>> print('fake.meta = ', repr(my_fake._meta))
fake.meta =  array([], shape=(0, 0), dtype=int16)
>>> print('fake[0] = ', my_fake[0])
fake[0] =  dask.array<getitem, shape=(4,), dtype=int16, chunksize=(4,), chunktype=numpy.ndarray>
>>> print(my_fake.compute())
Traceback (most recent call last):
  File "/home/h05/itpp/Support/periods/period_20240710_ugridsprintx1/dev/fake_arrays.py", line 29, in <module>
    print(my_fake.compute())
          ^^^^^^^^^^^^^^^^^
  File "/tmp/persistent/newconda-envs/iris3/lib/python3.11/site-packages/dask/base.py", line 342, in compute
    (result,) = compute(self, traverse=False, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/persistent/newconda-envs/iris3/lib/python3.11/site-packages/dask/base.py", line 628, in compute
    results = schedule(dsk, keys, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/h05/itpp/Support/periods/period_20240710_ugridsprintx1/dev/fake_arrays.py", line 14, in __getitem__
    raise ValueError("FakeArray cannot be read.")
ValueError: FakeArray cannot be read.
edmundhenley-mo commented 3 months ago

To clarify my (mis)understanding of what you mean @pp-mo - the DataManager class is in user-space code? i.e. user-written and maintained, not part of iris?

pp-mo commented 3 months ago

> To clarify my (mis)understanding of what you mean @pp-mo - the DataManager class is in user-space code? i.e. user-written and maintained, not part of iris?

Ah no, not that actually. The DataManager is absolutely a part of Iris. It encapsulates the different types of array content that we can have in a cube.data or coord.points/bounds, and gives them a common API. For now, that basically means a real or lazy array.

So I was just hoping that, since we already have this class encapsulating the possible array types, it would be neat if we can support "dataless" purely by extending what a DataManager can do, rather than by making a bunch of changes elsewhere, e.g. in the Cube class.

pp-mo commented 3 months ago

P.S. Further clarification (hopefully): my previous code example is also suggesting that it might be possible to implement dataless content as "just a special lazy array".
It's not yet clear if it can be quite that simple, though.
And even if it can, we might still want to distinguish "dataless" content in a more definite way.

ESadek-MO commented 1 week ago

We have looked into dataless cubes, and decided that the first step towards them is to create a cube with coords but no data. You can create a cube with nothing in it, but creating an empty cube with coords throws an error: coords need dimensions.

This is checked via ndim, which has no setter; it is calculated in the DataManager, using shape.

We believe that shape should be settable, but only (and non-optionally) if data hasn't been set. This will require changing the DataManager.

DataManager(data, shape=None)
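A minimal sketch of how that signature might behave (a hypothetical simplification, not the real iris DataManager): shape is accepted only when data is absent, and ndim is derived from whichever is present.

```python
import numpy as np

class DataManager:
    """Hypothetical simplified sketch, NOT the real iris DataManager:
    holds either a data payload or, for a dataless cube, just a shape."""

    def __init__(self, data, shape=None):
        # Exactly one of data / shape must be supplied.
        if (data is None) == (shape is None):
            raise ValueError("Provide either data or shape, but not both.")
        self._data = None if data is None else np.asarray(data)
        self._shape = shape

    @property
    def shape(self):
        # Dataless managers report the declared shape; otherwise defer to the data.
        return self._shape if self._data is None else self._data.shape

    @property
    def ndim(self):
        return len(self.shape)

    @property
    def is_dataless(self):
        return self._data is None

dm = DataManager(None, shape=(3, 4))
print(dm.shape, dm.ndim, dm.is_dataless)
```

With shape settable in this way, a cube could establish its dimensionality (and so accept coords) before any data exists, which is the behaviour the comment above asks for.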