Open-EO / openeo-processes

Interoperable processes for openEO's big Earth observation cloud processing.
https://processes.openeo.org

Compute new labels (e.g. bands) easily #233

Open m-mohr opened 3 years ago

m-mohr commented 3 years ago

A common use case is to compute one or more new labels (e.g. bands) while maintaining the source data cube. For example, compute the NDVI and EVI for a data cube with 5 bands (1,2,3,4,5), which results in a data cube with 7 bands (1,2,3,4,5,ndvi,evi).

There are currently at least two ways to achieve this in openEO (pseudo-code):

// Alternative 1: apply_dimension
datacube = load_collection(...)
labels = dimension_labels(datacube, 'bands')
process = function(data) {
  return data.concat([ndvi(data), evi(data)])
}
datacube = apply_dimension(datacube, process, 'bands')
labels = labels.concat(['ndvi', 'evi'])
result = rename_labels(datacube, 'bands', labels) // Rename required as apply_dimension discards labels in this case

// Alternative 2: reduce + merge
datacube = load_collection(...)
process = function(data) { return ndvi(data) }
ndvi = reduce_dimension(datacube, process, 'bands')
ndvi = add_dimension(ndvi, 'bands', 'ndvi', 'bands')
datacube = merge_cubes(datacube, ndvi);
process = function(data) { return evi(data) } // Assuming there's an evi process
evi = reduce_dimension(datacube, process, 'bands')
evi = add_dimension(evi, 'bands', 'evi', 'bands')
result = merge_cubes(datacube, evi);
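
For reference, a rough sketch of how Alternative 2 could look with the openeo Python client (back-end URL, collection id and band names are placeholders; an 'evi' band would need a custom reducer, as there's no dedicated process for it):

# Sketch of Alternative 2 with the openeo Python client; names below are placeholders.
import openeo

connection = openeo.connect("https://example.openeo.test")  # hypothetical back-end
cube = connection.load_collection("SENTINEL2_EXAMPLE", bands=["B02", "B03", "B04", "B08", "B11"])

# Reduce the band dimension to NDVI, re-add a band dimension with the single label 'ndvi',
# and merge the result back into the original cube.
ndvi = cube.ndvi(nir="B08", red="B04")
ndvi = ndvi.add_dimension(name="bands", label="ndvi", type="bands")
result = cube.merge_cubes(ndvi)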

Both approaches are overly complex and annoying to implement. Therefore, two new ideas for discussion:

// Proposal A: add_labels
datacube = load_collection(...)
datacube = add_labels(datacube, ['ndvi', 'evi'], 'bands') // Adds new labels with no-data values
process = function(data) {
  return array_modify(data, [ndvi(data), evi(data)], count(data, true) - 2, 2) // overwrite the two no-data labels just added
}
datacube = apply_dimension(datacube, process, 'bands')

// Proposal B: add_computed_labels
datacube = load_collection(...)
process = function(data) {
  return array_create([ndvi(data), evi(data)])
}
result = add_computed_labels(datacube, process, ['ndvi', 'evi'], 'bands') // Computes new values for two bands and adds them with the given labels

What are your thoughts on this? Which proposal is better? Should we add one of them? Both?

I feel like proposal B is the easiest to work with, but add_labels on its own could be useful, too.

mkadunc commented 3 years ago

A (minor) possible issue with this use case is that it abuses the 'bands' dimension to introduce a different kind of variable into the data cube, producing a dimension that no longer represents spectral bands but rather arbitrary variable values. The result is a cube containing inhomogeneous data types: the original band values represent reflectance (or radiance or some other measured physical quantity), which is non-negative and can be larger than 1, while the newly computed "pseudo" bands contain values of remote sensing indices (typically a ratio of two physical quantities, e.g. reflectances) that span the range [-1, 1].

When discussing the openEO data model, we said that the 'standard' way to store different types of variables was to use separate data cubes.

Philosophical considerations aside, if I wanted to do something like that, 'Alternative 1' would feel the most natural, but I would expect apply_dimension to be capable of producing a data cube with labeled dimensions when they are provided by the nested process, e.g. like this:

datacube = load_collection(...)
process = function(labeled_bands) {
  return labeled_bands.concat([{'ndvi': ndvi(labeled_bands)}, {'evi': evi(labeled_bands)}])
}
datacube = apply_dimension(datacube, process, 'bands')

m-mohr commented 3 years ago

Yes, indeed. People just see bands as layers in a file that they can use for whatever they like; it's not restricted to spectral bands. But that seems to be an issue throughout the whole EO community. We had the same discussion in STAC, with me being the only person voting for separating spectral bands and "layers" in a file. Also, in almost all use cases that we've done in openEO or will do in Platform, people use the bands as described above. So the question is whether it's worth the effort to teach them to do it the other ("strict") way. In other areas we are also less strict ourselves, e.g. putting quality layers into eo:bands in collection metadata. That seems to be common ground across a whole lot of the EO community, too.

With those labeled arrays we have the issue that there's no native way to transmit them through JSON, so we've tried to design them so that they only come up in callbacks, which back-ends handle internally. In an object you can't transmit the order, in an array you can't transmit the labels. So almost all of our processes are designed to accept labeled arrays from callbacks, but to return normal arrays. That's basically why this is so difficult and why issues like this one come up. I'm really not happy about all this, but introducing labeled arrays throughout the whole set of processes leads to its own challenges and may make several processes and implementations more difficult.
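
To illustrate the JSON limitation with a minimal, self-contained example (plain Python, not openEO-specific):

# A JSON object carries labels but no significant order; a JSON array carries order but no labels.
import json

as_object = json.dumps({"B04": 0.12, "B08": 0.56})  # labels survive, order is not significant per the JSON spec
as_array = json.dumps([0.12, 0.56])                 # order survives, labels are lost
# Carrying both would require an extra convention, e.g. parallel lists:
as_labeled = json.dumps({"labels": ["B04", "B08"], "values": [0.12, 0.56]})
print(as_object, as_array, as_labeled)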

mkadunc commented 3 years ago

I see. Maybe we could nudge our users a bit by calling such dimensions something other than 'bands' in our own examples, e.g. 'variable'?

Not to push anything, but another option for fixing the problem with apply_dimension would be to allow more axis information to be passed as the fourth parameter, not just the name of the new dimension, e.g.:

// Alternative 3
band_labels = dimension_labels(datacube, 'bands')
datacube = apply_dimension(
    datacube,
    process,
    dimension = 'bands',
    target_dimension = {name: 'variable', labels: band_labels.concat(['ndvi', 'evi'])}
)

I also have an option 'C' for solving the specific use-case without apply_dimension: we add a process that appends a label and all its data along a specified dimension (similar to Javascript push, but creating a new data cube rather than modifying the existing object):

// Proposal C: append_label
datacube = load_collection(...)
result = datacube

process_ndvi = function(data) { return ndvi(data) }
ndvi = reduce_dimension(datacube, process_ndvi, 'bands')
result = append_label(result, dimension = 'bands', appended_label = 'ndvi', appended_data = ndvi)

process_evi = function(data) { return evi(data) }
evi = reduce_dimension(datacube, process_evi, 'bands')
result = append_label(result, dimension = 'bands', appended_label = 'evi', appended_data = evi)

To me option 'C' is closer to my mental picture of this use-case than either 'A' or 'B'. Also, this naturally extends to the possibility of append_label without the fourth parameter, which just adds the new label filled with no-data values (analogous to add_labels in proposal A).

jdries commented 3 years ago

So I finally arrived at the point where I actually need to do this. Option C does look a lot like using merge_cubes, or am I overlooking something?

So I'm somewhat inclined to use 'alternative 1' now, for adding computed bands.

I also believe that what I need in this use case is not only adding labels to the band dimension; I also reduce the time dimension, producing multiple outputs per band instead of one. This currently looks like this in pseudocode:

features = extended.reduce_dimension(dimension="t", reducer=lambda timeseries: create_array(quantiles(timeseries, q=[0.9, 0.5, 0.1]), sd(timeseries)))

The quantiles function is interesting in the sense that it returns an array by design, so it seems like we could never really compute multiple quantiles inside a reduce_dimension?

Sidenote: this is relatively easy to implement in the backend, mostly because the bands dimension in my datacube is a list of variables that I can grow and shrink easily.

But it seems like I also don't have an immediate alternative here...

mkadunc commented 3 years ago

So I finally arrived at the point where I actually need to do this. Option C does look a lot like using merge_cubes, or am I overlooking something?

You're right, it does; it's basically the same as Alternative 2, with a single step covering add_dimension(evi, 'bands', 'evi', 'bands') and merge_cubes(datacube, evi).

The quantiles function is interesting in the sense that it returns an array by design, so it seems like we could never really compute multiple quantiles inside a reduce_dimension?

According to the process documentation, you should use apply_dimension whenever the reducer returns an array. The scalar vs. array return type (of the reducer process graph) also seems to be the only difference between the two processes, apply_dimension and reduce_dimension.
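
As a plain-Python analogy of that distinction (not openEO code, just the shape of the callback results):

# A reducer collapses a series to a single value; an apply_dimension callback may return an array.
timeseries = [0.2, 0.5, 0.3, 0.9]

def reducer(data):            # reduce_dimension-style: one value per input series
    return sum(data) / len(data)

def dimension_process(data):  # apply_dimension-style: an array, possibly of a different length
    ordered = sorted(data)
    return [ordered[0], ordered[len(ordered) // 2], ordered[-1]]

print(reducer(timeseries))            # 0.475
print(dimension_process(timeseries))  # [0.2, 0.5, 0.9]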

m-mohr commented 3 years ago

I see. Maybe we could nudge our users a bit by calling such dimensions something other than 'bands' in our own examples, e.g. 'variable'?

We have another dimension type for that, called other. It's not supported by VITO, for example, so they use bands for it. bands works on any back-end, other only on some. So the question is whether we go with the practical approach or the one that is conceptually better but doesn't work everywhere.
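
For illustration, the difference would mostly show up in the cube:dimensions metadata of a collection (a hand-written example, not taken from a real collection):

# Two ways to describe the same extra dimension in cube:dimensions metadata (hand-written example).
bands_style = {"bands": {"type": "bands", "values": ["B04", "B08", "ndvi", "evi"]}}
other_style = {"variable": {"type": "other", "values": ["B04", "B08", "ndvi", "evi"]}}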

another option that could be used to fix the problem with apply_dimension could be to allow more axis information to be passed as the fourth parameter, not just the name of the new dimension

Yes, this could also be an option, although I've never seen anyone actually use target_dimension; people usually just write back to the bands dimension etc.

I also have an option 'C' for solving the specific use-case without apply_dimension: we add a process that appends a label and all its data along a specified dimension (similar to Javascript push, but creating a new data cube rather than modifying the existing object):

Yes, that's pretty much what I also had in mind with add_labels in the first post, but append_label additionally allows setting a data cube for the data. So I like that approach, too, as it's very generic and could be useful in general. I think I'll work on something in this direction, but of course @jdries can start with the workarounds we have right now.

The quantiles function is interesting in the sense that it returns an array by design, so it seems like we could never really compute multiple quantiles inside a reduce_dimension?

Yes, quantiles (also: extrema) is not a reducer, but needs apply_dimension to be used. That's why they are not in the reducer category. We decided at some point during discussions that a reducer strictly returns a single value (which in theory could be an array, but that's probably not a good idea).

But it seems like I also don't have an immediate alternative here...

I've asked for feedback for weeks, but apart from @mkadunc there has been no response, so no "immediate" solution for sure. I can provide proposals in a day or two (outside of vacations), but that needs some consensus and input from others.

jdries commented 3 years ago

Another way to put it is as an apply_neighborhood that works on the full temporal and band dimensions, and only maintains one label for the temporal dimension and multiple for bands.

There seems to be some inconsistency in the definition of apply_neighborhood related to that. In the description:

The process must not add new dimensions, or remove entire dimensions, but the result can have different dimension labels.

In the return type description:

The dimension properties (name, type, labels, reference system and resolution) remain unchanged.

m-mohr commented 3 years ago

It feels to me that this is too much of a stretch for a process that is not meant to be used in this way, as you already mentioned. It's not meant to reduce dimensions.

jdries commented 3 years ago

Dev telco conclusion: I will try to use apply_dimension with target_dimension set to the existing band dimension. It seems that this should have the effect of removing the time dimension. I will concatenate the produced values, resulting in the new band dimension.
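
For concreteness, a rough sketch of that plan with the openeo Python client (back-end URL, collection, band and label names are placeholders, and the helpers are assumed to be exposed in openeo.processes); how the resulting labels interact with an existing band dimension is exactly what still needs to be checked:

# Sketch of the dev-telco plan: collapse 't' per pixel and write the results into the band dimension.
import openeo
from openeo.processes import array_append, quantiles, sd

connection = openeo.connect("https://example.openeo.test")  # hypothetical back-end
cube = connection.load_collection("SENTINEL2_EXAMPLE", bands=["B08"])  # placeholder collection/band

def timeseries_features(data):
    # Three quantiles plus the standard deviation of each time series -> 4 output values.
    return array_append(quantiles(data, probabilities=[0.1, 0.5, 0.9]), sd(data))

features = cube.apply_dimension(process=timeseries_features, dimension="t", target_dimension="bands")
# apply_dimension does not define the new labels, so they have to be set explicitly afterwards:
features = features.rename_labels("bands", ["q10", "q50", "q90", "sd"])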

m-mohr commented 2 years ago

I wasn't really aware that we use apply_dimension for flattening purposes; we'll need to check whether that makes sense long-term and is covered by the documentation. Related issues:

m-mohr commented 2 years ago

Just posting the old add(_computed)_labels proposals here for completeness. These are likely to be superseded by another proposal.

{
    "id": "add_labels",
    "summary": "Adds new labels to a dimension.\n\nThis is especially useful to compute new bands with ``apply_dimension()``.",
    "description": "Adds one or more new labels to the given `dimension`.",
    "categories": [
        "cubes"
    ],
    "parameters": [
        {
            "name": "data",
            "description": "A data cube to add the dimension to.",
            "schema": {
                "type": "object",
                "subtype": "raster-cube"
            }
        },
        {
            "name": "labels",
            "description": "The labels to add to the given dimension. Fails with a `LabelExists` exception if one of the specified labels already exists.",
            "schema": {
                "type": "array",
                "minItems": 1,
                "items": {
                    "type": "string"
                }
            }
        },
        {
            "name": "dimension",
            "description": "The name of the dimension to add the labels to. Fails with a `DimensionNotAvailable` exception if the specified dimension does not exist.",
            "schema": {
                "type": "string"
            }
        },
        {
            "name": "value",
            "description": "The value to set for the given labels. Defaults to `null` (no-data).",
            "schema": {
                "description": "Any data type."
            },
            "default": null,
            "optional": true
        }
    ],
    "returns": {
        "description": "The data cube with a newly added dimension. The new dimension has exactly one dimension label. All other dimensions remain unchanged.",
        "schema": {
            "type": "object",
            "subtype": "raster-cube"
        }
    },
    "exceptions": {
        "DimensionExists": {
            "message": "A dimension with the specified name already exists."
        },
        "LabelExists": {
            "message": "A label with the specified name already exists."
        }
    }
}
{
    "id": "add_computed_labels",
    "summary": "Adds new labels with new values",
    "description": "Adds one or more new labels with newly computed values to the given `dimension`.",
    "categories": [
        "cubes"
    ],
    "parameters": [
        {
            "name": "data",
            "description": "A data cube to add the dimension to.",
            "schema": {
                "type": "object",
                "subtype": "raster-cube"
            }
        },
        {
            "name": "process",
            "description": "Process to be applied on pixel values. The specified process needs to accept an array and must return an array with exactly the number of elements that are given to the parameters `labels`. A process may consist of multiple sub-processes.",
            "schema": {
                "type": "object",
                "subtype": "process-graph",
                "parameters": [
                    {
                        "name": "data",
                        "description": "A labeled array with elements of any type.",
                        "schema": {
                            "type": "array",
                            "subtype": "labeled-array",
                            "items": {
                                "description": "Any data type."
                            }
                        }
                    },
                    {
                        "name": "context",
                        "description": "Additional data passed by the user.",
                        "schema": {
                            "description": "Any data type."
                        },
                        "optional": true,
                        "default": null
                    }
                ],
                "returns": {
                    "description": "The value to be set in the new data cube.",
                    "schema": {
                        "description": "Any data type."
                    }
                }
            }
        },
        {
            "name": "labels",
            "description": "The name of the dimension over which to reduce. Fails with a `LabelExists` exception if one of the specified label exists already.",
            "schema": {
                "type": "array",
                "minItems": 1,
                "items": {
                    "type": "string"
                }
            }
        },
        {
            "name": "dimension",
            "description": "The name of the dimension to apply the process on and to add the labels to. Fails with a `DimensionNotAvailable` exception if the specified dimension does not exist.",
            "schema": {
                "type": "string"
            }
        },
        {
            "name": "context",
            "description": "Additional data to be passed to the process.",
            "schema": {
                "description": "Any data type."
            },
            "optional": true,
            "default": null
        }
    ],
    "returns": {
        "description": "The data cube with a newly added dimension. The new dimension has exactly one dimension label. All other dimensions remain unchanged.",
        "schema": {
            "type": "object",
            "subtype": "raster-cube"
        }
    },
    "exceptions": {
        "DimensionExists": {
            "message": "A dimension with the specified name already exists."
        },
        "LabelExists": {
            "message": "A label with the specified name already exists."
        }
    }
}