Open-EO / openeo-processes

Interoperable processes for openEO's big Earth observation cloud processing.
https://processes.openeo.org
Apache License 2.0
48 stars 15 forks source link

rank band composite #271

Closed jdries closed 4 months ago

jdries commented 3 years ago

Currently, in openEO, when creating composites, we apply a reducer function over the timeseries for a band. This can result in unobserved values in the sense that a particular combination of band values may not occur in the original dataset, because bands are composited independently. reduce_dimension(reducer='max', dimension='t')

A variation on this topic is computing the 'medoid' over all bands. It is more complex in the sense that you need all bands available, and the medoid process would take a 2D (or higher dimensional?) matrix as input.

GEE seems to have explicit support for this: https://developers.google.com/earth-engine/apidocs/ee-imagecollection-qualitymosaic https://developers.google.com/earth-engine/apidocs/ee-algorithms-landsat-simplecomposite?hl=en

Original feedback:

Just a few things to keep in mind here. For the percentiles for example, there are percentile metrics that we can calculate on a per-band basis, but more relevant are often percentile (ranked) composites: here we look at the distribution of a rank band (e.g. SWIR2) and then identify the percentile of interest (e.g. p90), then the select the respective acquisition and use all band values of that acquisition to fill the percentile (ranked) composite.. This is an essential difference. Science users would not want to use the P90 identified independently across bands as it results in synthetic/unobserved data (similar issues with mean metrics). Both things (metrics and ranked composites) would be good to have (and are easy to implement on the front end side). This post might illustrate these concepts a bit more: https://glad.umd.edu/book/export/html/656

m-mohr commented 3 years ago

What priority / SRR assignment does this have? @jdries

jonathom commented 3 years ago

I'm going to try and advance this further. However I'm not sure I exactly understand the medoid example.

I made an example calculation for a single pixel for understanding reduce_dimension(reducer = "max", dim = "t"):

Normally this would result into the maximum per band band1 band2 band3
time1 5 9 32
time2 8 0 6
time3 16 3 18

But what we want is the maximum of combined band values, i.e. the maximum that existed at the same point in time:

band1 band2 band3
time1 5 9 32
time2 8 0 6
time3 16 3 18

Correct me if I am getting this wrong, please.

Now what I don't understand is how the timestep to use is determined in a percentile (e.g. median) computation. A median as it exists now would yield this result:

band1 band2 band3
time1 5 9 32
time2 8 0 6
time3 16 3 18

What is the idea of a medoid here?

The given feedback says

here we look at the distribution of a rank band (e.g. SWIR2) and then identify the percentile of interest (e.g. p90), then the select the respective acquisition and use all band values of that acquisition to fill the percentile (ranked) composite

but how exactly is this acquisition selected?

jonathom commented 3 years ago

@jdries or could you link the author of the feedback you cited?

jdries commented 3 years ago
I think there's two different cases, your example, (and also the medoid function) looks at all band values to select a single observation. The other case, referred to as 'rank band composite', composites all other bands based on the value of a rank band. So if band 1 is my rank band, and use the 'max' function to compsite, then we get: band1 band2 band3
time1 5 9 32
time2 8 0 6
time3 16 3 18

If band 3 is the rank band, a max composite gives:

band1 band2 band3
time1 5 9 32
time2 8 0 6
time3 16 3 18

A 'max-NDVI' composite is one well-known example of such an operation.

m-mohr commented 3 years ago

As a start, @jonathom and I tried to come up with a solution for a criteria-based observed composite process that re-uses existing processes:

{
  "process_graph": {
    "2": {
      "process_id": "reduce_dimension",
      "arguments": {
        "data": {
          "from_parameter": "data"
        },
        "reducer": {
          "from_parameter": "criteria"
        },
        "dimension": "bands"
      }
    },
    "3": {
      "process_id": "apply_dimension",
      "arguments": {
        "data": {
          "from_node": "2"
        },
        "process": {
          "process_graph": {
            "2": {
              "process_id": "array_find",
              "arguments": {
                "value": {
                  "from_node": "3"
                },
                "data": {
                  "from_parameter": "data"
                }
              }
            },
            "3": {
              "process_id": "max",
              "arguments": {
                "data": {
                  "from_parameter": "data"
                }
              }
            },
            "6": {
              "process_id": "array_apply",
              "arguments": {
                "data": {
                  "from_parameter": "data"
                },
                "process": {
                  "process_graph": {
                    "1": {
                      "process_id": "constant",
                      "arguments": {
                        "x": null
                      },
                      "result": true
                    }
                  }
                }
              }
            },
            "7": {
              "process_id": "array_modify",
              "arguments": {
                "data": {
                  "from_node": "6"
                },
                "values": {
                  "from_node": "3"
                },
                "index": {
                  "from_node": "2"
                }
              },
              "description": "Create an array with null values",
              "result": true
            }
          }
        },
        "dimension": "t"
      }
    },
    "4": {
      "process_id": "merge_cubes",
      "arguments": {
        "cube2": {
          "from_node": "3"
        },
        "cube1": {
          "from_parameter": "data"
        },
        "overlap_resolver": {
          "process_graph": {
            "2": {
              "process_id": "if",
              "arguments": {
                "value": {
                  "from_node": "4"
                },
                "accept": {
                  "from_parameter": "x"
                }
              },
              "result": true
            },
            "4": {
              "process_id": "neq",
              "arguments": {
                "x": {
                  "from_parameter": "y"
                },
                "y": null
              }
            }
          }
        }
      }
    },
    "5": {
      "process_id": "reduce_dimension",
      "arguments": {
        "data": {
          "from_node": "4"
        },
        "reducer": {
          "process_graph": {
            "1": {
              "process_id": "first",
              "arguments": {
                "data": {
                  "from_parameter": "data"
                }
              },
              "result": true
            }
          }
        },
        "dimension": "t"
      },
      "result": true
    }
  },
  "parameters": [
    {
      "schema": {
        "type": "object",
        "subtype": "raster-cube",
        "title": "Raster data cube",
        "description": "A raster data cube, an image collection stored at the back-end. Different back-ends have different internal representations for this data structure."
      },
      "name": "data"
    },
    {
      "name": "criteria",
      "schema": {
        "type": "object",
        "subtype": "process-graph",
        "title": "User-defined process",
        "description": "An process graph that is passed as an argument and is expected to be executed by the process. Parameters passed to the process graph are specified in the `parameters` property of the corresponding schema."
      },
      "description": "A reducer to select a timestamp, e.g. ``sum`` or ``mean()``."
    }
  ],
  "id": "composite_observed"
}

For the medoid-based composite, we'd probably need a separate process that works on data cubes.

jonathom commented 3 years ago

.. and yet another process for the rank-based composite. We wondered which of these processes would be the most important to you @jdries?

jonathom commented 3 years ago

I made up some short use case descriptions that I hope can be helpful here:

pixel integrity composite

  1. I want to create a cloud-free mosaic of SAR data that only contains observed pixel values. The mosaic should be created by selecting the minimum pixel value. Only one point in time shall be selected to fill the different band values (VV, VH) of a single pixel. To achieve this, the total minimum (sum) of combined band values per timestep should be calculated for the overall minimum calculation.

  2. An optical satellite time series shall be comprised to a mosaic by selecting the maximum value per pixel. Only one point in time shall be selected to fill the different band values (B03, B04, B05 ..) of a single pixel. To achieve this, the median of combined band values (maybe after normalizaton) should be calculated to then select the maximum timestep overall.

  3. I want to create an optical mosaic of an area of interest. The mosaic should be calculated using the median of the timeseries and only contain a) observed values (no calculated mean between two values) and b) observed combinations of band values (all band values from the same point in time). (This aims for describing the medoid)

rank band composite

  1. I want to create a mosaic of a collection (containing a band called 'ndvi') where the chosen pixels correspond to the point in time of maximum NDVI. Other then time being reduced there are no other changes.

  2. I want to create a mosaic of a collection (containing multiple sensors (as dimension), all with ndvi bands) where the chosen pixels correspond to the point in time of maximum combined NDVI. The combined NDVI shall be computed via a median computation.

m-mohr commented 2 years ago

Thanks, @jonathom. So the next step would be to move these into process graphs and figure out what is missing and whether we can provide UDPs.

jonathom commented 2 years ago

The graph above satisfies UC 1.1 and 1.2 (except normalization) (Will look into medoid (1.3) separately). I would change the id to "observed_pixel_composite" and add a description like "Create a composite over time that contains only observed combinations of band values. A single point in time is selected per pixel to fill in all band values.".

Here's a graph for the rank band composite, the previous graph only needed minor changes. With this, we also have a graph for 2.1. I believe that 2.2 is not needed yet, and I would like to know if this is actually needed IRL.

{
  "process_graph": {
    "3": {
      "process_id": "apply_dimension",
      "arguments": {
        "data": {
          "from_node": "8"
        },
        "process": {
          "process_graph": {
            "2": {
              "process_id": "array_find",
              "arguments": {
                "value": {
                  "from_node": "3"
                },
                "data": {
                  "from_parameter": "data"
                }
              }
            },
            "3": {
              "process_id": "max",
              "arguments": {
                "data": {
                  "from_parameter": "data"
                }
              }
            },
            "6": {
              "process_id": "array_apply",
              "arguments": {
                "data": {
                  "from_parameter": "data"
                },
                "process": {
                  "process_graph": {
                    "1": {
                      "process_id": "constant",
                      "arguments": {
                        "x": null
                      },
                      "result": true
                    }
                  }
                }
              }
            },
            "7": {
              "process_id": "array_modify",
              "arguments": {
                "data": {
                  "from_node": "6"
                },
                "values": {
                  "from_node": "3"
                },
                "index": {
                  "from_node": "2"
                }
              },
              "description": "Create an array with null values",
              "result": true
            }
          }
        },
        "dimension": "t"
      }
    },
    "4": {
      "process_id": "merge_cubes",
      "arguments": {
        "cube2": {
          "from_node": "3"
        },
        "cube1": {
          "from_parameter": "data"
        },
        "overlap_resolver": {
          "process_graph": {
            "2": {
              "process_id": "if",
              "arguments": {
                "value": {
                  "from_node": "4"
                },
                "accept": {
                  "from_parameter": "x"
                }
              },
              "result": true
            },
            "4": {
              "process_id": "neq",
              "arguments": {
                "x": {
                  "from_parameter": "y"
                },
                "y": null
              }
            }
          }
        }
      }
    },
    "5": {
      "process_id": "reduce_dimension",
      "arguments": {
        "data": {
          "from_node": "4"
        },
        "reducer": {
          "process_graph": {
            "1": {
              "process_id": "first",
              "arguments": {
                "data": {
                  "from_parameter": "data"
                }
              },
              "result": true
            }
          }
        },
        "dimension": "t"
      },
      "result": true
    },
    "6": {
      "process_id": "filter_bands",
      "arguments": {
        "data": {
          "from_parameter": "data"
        },
        "bands": {
          "from_parameter": "rank_band_name"
        },
        "wavelengths": []
      }
    },
    "8": {
      "process_id": "drop_dimension",
      "arguments": {
        "data": {
          "from_node": "6"
        },
        "name": "bands"
      }
    }
  },
  "parameters": [
    {
      "schema": {
        "type": "object",
        "subtype": "raster-cube",
        "title": "Raster data cube",
        "description": "A raster data cube, an image collection stored at the back-end. Different back-ends have different internal representations for this data structure."
      },
      "name": "data"
    },
    {
      "schema": {
        "type": "string"
      },
      "name": "rank_band_name",
      "description": "Name of the rank band."
    },
    {
      "schema": {
        "type": "object",
        "subtype": "process-graph",
        "title": "User-defined process",
        "description": "An process graph that is passed as an argument and is expected to be executed by the process. Parameters passed to the process graph are specified in the `parameters` property of the corresponding schema.",
        "required": [
          "process_graph"
        ],
        "properties": {
          "process_graph": {
            "type": "object",
            "additionalProperties": {
              "type": "object",
              "required": [
                "process_id",
                "arguments"
              ],
              "properties": {
                "process_id": {
                  "type": "string"
                },
                "arguments": {}
              }
            }
          }
        }
      },
      "name": "timestep_selector",
      "description": "A process to select a timestep (e.g. `min`, `max`)."
    }
  ],
  "id": "rank_band_composite",
  "description": "Create a composite according to a rank band and a process, e.g. NDVI and maximum."
}
jonathom commented 2 years ago

With the previously mentioned changes, the process graph for an observed pixel composite is this:

{
  "process_graph": {
    "2": {
      "process_id": "reduce_dimension",
      "arguments": {
        "data": {
          "from_parameter": "data"
        },
        "reducer": {
          "from_parameter": "criteria"
        },
        "dimension": "bands"
      },
      "position": [
        0,
        0
      ]
    },
    "3": {
      "process_id": "apply_dimension",
      "arguments": {
        "data": {
          "from_node": "2"
        },
        "process": {
          "process_graph": {
            "2": {
              "process_id": "array_find",
              "arguments": {
                "value": {
                  "from_node": "3"
                },
                "data": {
                  "from_parameter": "data"
                }
              }
            },
            "3": {
              "process_id": "max",
              "arguments": {
                "data": {
                  "from_parameter": "data"
                }
              }
            },
            "6": {
              "process_id": "array_apply",
              "arguments": {
                "data": {
                  "from_parameter": "data"
                },
                "process": {
                  "process_graph": {
                    "1": {
                      "process_id": "constant",
                      "arguments": {
                        "x": null
                      },
                      "result": true
                    }
                  }
                }
              }
            },
            "7": {
              "process_id": "array_modify",
              "arguments": {
                "data": {
                  "from_node": "6"
                },
                "values": {
                  "from_node": "3"
                },
                "index": {
                  "from_node": "2"
                }
              },
              "description": "Create an array with null values",
              "result": true
            }
          }
        },
        "dimension": "t",
        "context": {
          "from_parameter": "timestep_selector"
        }
      },
      "position": [
        240,
        0
      ]
    },
    "4": {
      "process_id": "merge_cubes",
      "arguments": {
        "cube2": {
          "from_node": "3"
        },
        "cube1": {
          "from_parameter": "data"
        },
        "overlap_resolver": {
          "process_graph": {
            "2": {
              "process_id": "if",
              "arguments": {
                "value": {
                  "from_node": "4"
                },
                "accept": {
                  "from_parameter": "x"
                }
              },
              "result": true
            },
            "4": {
              "process_id": "neq",
              "arguments": {
                "x": {
                  "from_parameter": "y"
                },
                "y": null
              }
            }
          }
        }
      },
      "position": [
        480,
        0
      ]
    },
    "5": {
      "process_id": "reduce_dimension",
      "arguments": {
        "data": {
          "from_node": "4"
        },
        "reducer": {
          "process_graph": {
            "1": {
              "process_id": "first",
              "arguments": {
                "data": {
                  "from_parameter": "data"
                }
              },
              "result": true
            }
          }
        },
        "dimension": "t"
      },
      "result": true,
      "position": [
        720,
        0
      ]
    }
  },
  "parameters": [
    {
      "schema": {
        "type": "object",
        "subtype": "raster-cube",
        "title": "Raster data cube",
        "description": "A raster data cube, an image collection stored at the back-end. Different back-ends have different internal representations for this data structure."
      },
      "name": "data"
    },
    {
      "name": "criteria",
      "description": "A reducer to summarize the bands dimension, e.g. ``sum`` or ``mean()``.",
      "schema": {
        "type": "object",
        "subtype": "process-graph",
        "title": "User-defined process",
        "description": "An process graph that is passed as an argument and is expected to be executed by the process. Parameters passed to the process graph are specified in the `parameters` property of the corresponding schema."
      }
    },
    {
      "schema": {
        "type": "object",
        "subtype": "process-graph",
        "title": "User-defined process",
        "description": "An process graph that is passed as an argument and is expected to be executed by the process. Parameters passed to the process graph are specified in the `parameters` property of the corresponding schema.",
        "required": [
          "process_graph"
        ],
        "properties": {
          "process_graph": {
            "type": "object",
            "additionalProperties": {
              "type": "object",
              "required": [
                "process_id",
                "arguments"
              ],
              "properties": {
                "process_id": {
                  "type": "string"
                },
                "arguments": {}
              }
            }
          }
        }
      },
      "name": "timestep_selector",
      "description": "A process to select a timestep, e.g. `max`, `median`. Can't be a process that returns unobserved values, e.g. `mean`."
    }
  ],
  "id": "observed_pixel_composite",
  "description": "Create a composite over time that contains only observed combinations of band values. A single point in time is selected per pixel to fill in all band values."
}

A remaining problem is the inability to execute a process graph inside a process graph.

jdries commented 2 years ago

I've been looking at the process graphs a bit, and used the editor to get something more visual.

I don't entirely understand every aspect yet, but looks promising. Having the same in python code would also be nice. Relying on merge_cubes for data that comes from the same source can always be a performance issue, but perhaps we could try to measure that. In any case, we would need helper libraries to construct graphs like this, as they don't look trivial to come up with.

Nice work!

jonathom commented 2 years ago

Having the same in python code would also be nice.

I threw something together. It throws a TypeError (TypeError: Object of type 'function' is not JSON serializable) that I can't figure out, but it should help understand whats going on at the observed_pixel_composite.

from openeo.processes import array_find, array_modify, max, array_apply, constant, if_, neq, nan, first

# the reducer here is the parameter "summarize", used to summarize the bands dimension in a certain way, e.g. "max", "median", "mean"
cube_reduced_bands = cube.reduce_dimension(reducer = "sum", dimension = "bands")

def write_nan(x):
    return constant(nan)

def select_timesteps(data):
    selected = max(data) # This should be the parameter "timestep_selector", could be "max", "min", "median", cannot be "mean"
    selected_index = array_find(data, value = selected)
    empty_array = array_apply(data, process = write_nan) # create an array full of nans..
    # .. and write a value at the index where the "timestep_selector" identified it, e.g. the max
    selected_timesteps = array_modify(data = empty_array, values = selected, index = selected_index)
    return selected_timesteps

# create a cube that has nans at all timesteps except the ones selected by the "timestep_selector" process
cube_selected = cube_reduced_bands.apply_dimension(process = select_timesteps, dimension = "t")

def write_original_data(x, y):
    # if cube_s2_selected is != nan, then the original values should be written and null otherwise
    return if_(neq(y, nan), x)

# merge cubes and replace the placeholders with the actual data for the selected dates
cube_selected_original = cube.merge_cubes(cube_selected, overlap_resolver = write_original_data)

# reduce the time dimension full of nans by selection the only value present in that dimension
cube_composite = cube_selected_original.reduce_dimension(reducer = first, dimension = "t")

res = cube_composite.save_result(format = "GTiff")

When I tried to assign the functions to parameters and then call them,

summarize_bands = sum
timestep_selector = max

another TypeError occurred (TypeError: 'ProcessBuilder' object is not iterable), so I added descriptions where parameters should be.

jdries commented 1 year ago

My (working) proposal for this one is now available, basically using the ideas introduced here: https://github.com/Open-EO/openeo-community-examples/blob/rankComposite/python/RankComposites/rank_composites.ipynb

m-mohr commented 1 year ago

As this is working solely on existing processes, can we close this issue then? :-)

jdries commented 4 months ago

yes, closing!