ML Data Cube Regularization

PondiB commented 1 year ago

Regularized datacubes are a necessity for machine learning and deep learning on EO time series data. This process aims to spare users from chaining several processes just to obtain a consistent data cube.

PondiB commented 1 year ago

@m-mohr, I would appreciate your eyes on this whenever you have a moment. I have fixed most of the failures, but it is taking me way longer to trace the rest.

m-mohr commented 1 year ago

fyi: I won't get to it anytime soon, sorry.

PondiB commented 1 year ago

> fyi: I won't get to it anytime soon, sorry.

Thanks for getting back. It's fine. I'll figure it out soon.

soxofaan commented 10 months ago

I'm not sure I understand why this process is necessary. The description talks about "irregular" data, but if your data is in an openEO data cube, then it's pretty regular already. Your time instants could be spaced unevenly, but that doesn't mean that an ML model could not handle that.

This process looks like a combination of aggregate_temporal_period and resample_spatial, but:

- aggregate_temporal_period uses a different period specification format
- aggregate_temporal_period has a reducer argument, which ml_regularize_data_cube is missing, I guess
- resample_spatial has projection and method arguments (and some more), which are also missing here

In this state, I think ml_regularize_data_cube is missing quite a few parameters.

More generally: is there a compelling reason to define ml_regularize_data_cube if we already have aggregate_temporal_period and resample_spatial?
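To make the comparison concrete, here is a minimal sketch of chaining the two existing processes as a plain openEO process graph, built as a Python dict. The collection id `SENTINEL2_L2A`, the helper name `regularize_graph`, and all default values are assumptions for illustration and are not taken from the PR.

```python
import json

def regularize_graph(collection_id, period="month", reducer="median",
                     resolution=10, projection=32631, method="bilinear"):
    """Build a process graph that chains aggregate_temporal_period and
    resample_spatial. All defaults are illustrative assumptions."""
    return {
        "load": {
            "process_id": "load_collection",
            # None (JSON null) means "no restriction" for both extents
            "arguments": {"id": collection_id,
                          "spatial_extent": None,
                          "temporal_extent": None},
        },
        "temporal": {
            "process_id": "aggregate_temporal_period",
            "arguments": {
                "data": {"from_node": "load"},
                "period": period,
                # the reducer is itself a small process graph (callback)
                "reducer": {"process_graph": {"reduce": {
                    "process_id": reducer,
                    "arguments": {"data": {"from_parameter": "data"}},
                    "result": True,
                }}},
            },
        },
        "spatial": {
            "process_id": "resample_spatial",
            "arguments": {
                "data": {"from_node": "temporal"},
                "resolution": resolution,
                "projection": projection,
                "method": method,
            },
            "result": True,
        },
    }

# Hypothetical collection id, used only to show the shape of the graph
graph = regularize_graph("SENTINEL2_L2A")
print(json.dumps(graph, indent=2))
```

The sketch exposes exactly the arguments listed above as missing (reducer, projection, method), which is one way to see how much a dedicated process would have to replicate.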

jdries commented 10 months ago

The use case has even been explored quite extensively in openEO platform, and made it into public examples:

https://github.com/Open-EO/openeo-community-examples/blob/main/python/BasicSentinelMerge/sentinel_merge.ipynb
https://github.com/openEOPlatform/openeo-classification/blob/main/src/openeo_classification/features.py#L117

PondiB commented 10 months ago

@soxofaan thanks for the feedback. In the OEMC project we are planning to build a new openEO backend with more focus on ML and DL capabilities for Satellite Image Time Series.

A regular data cube in our case means that: (a) there is a unique field function; (b) the spatial support is georeferenced; (c) temporal continuity is assured; (d) all spatiotemporal locations share the same set of attributes; and (e) there are no gaps or missing values in the spatiotemporal extent.
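As a rough illustration of criteria (c), (d) and (e), here is a minimal sketch of a regularity check on a single pixel's time series. The function name `is_regular`, the data layout, and the choice to treat temporal continuity as a fixed step in days are all assumptions made for this example. Criteria (a) and (b) are not covered.

```python
from datetime import date

def is_regular(timestamps, values):
    """Check a single-pixel time series against criteria (c), (d), (e).

    timestamps: sorted list of datetime.date
    values: one list of band values per timestamp; None marks a gap
    """
    if len(timestamps) != len(values) or len(timestamps) < 2:
        return False
    # (c) temporal continuity, simplified here to one constant positive step
    steps = {(b - a).days for a, b in zip(timestamps, timestamps[1:])}
    evenly_spaced = len(steps) == 1 and min(steps) > 0
    # (d) every time step carries the same number of attributes (bands)
    same_attributes = len({len(v) for v in values}) == 1
    # (e) no missing values anywhere
    no_gaps = all(x is not None for v in values for x in v)
    return evenly_spaced and same_attributes and no_gaps

days = [date(2023, 1, 1), date(2023, 1, 11), date(2023, 1, 21)]
print(is_regular(days, [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]))
print(is_regular(days, [[0.1, 0.2], [None, 0.4], [0.5, 0.6]]))
```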

In our discussion, there were two philosophies, as shown in the image below, and we would like to support both, i.e. (1) allowing users to define their own processes before ML/DL operations and (2) not bothering the users with the underlying processes.

[Image: Screenshot 2023-09-25 at 14 54 41]

@jdries cool, I will check out the examples.

jdries commented 10 months ago

Nice, this is exactly what I happen to be working on at the moment, in support of a couple of projects using ML.

Maybe you already know, but openEO has a mechanism to build this kind of convenience function as a combination of existing processes: the openEO 'user-defined processes' (UDP). Using this has a couple of advantages:

- The process definition is very formal and falls back to the definitions of the individual processes, so there is less specification work to be done.
- Backends that support the individual processes can easily support the convenience process, even without requiring an explicit implementation. This is extremely important if we want to reach the goal of cross-backend compatibility.
- Backends that do not support the individual processes can still support the convenience process.
- If you want a special (e.g. faster) implementation of the convenience process, that's also possible.

I see this case arising more often, so maybe we can create an open source GitHub repo with the definitions of these UDPs. That would allow users to reference the central repo, or allow backends to import those definitions.
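For illustration, a UDP definition of this kind could look roughly like the sketch below, written as a Python dict that serializes to the JSON a backend would store. The id `regularize_cube`, the parameter names, the defaults, and the `datacube` schema subtype are assumptions for this example and not an agreed definition. The small helper only checks that every `from_parameter` reference is declared.

```python
# Hypothetical UDP; id, parameter names, and defaults are illustrative only.
udp = {
    "id": "regularize_cube",
    "summary": "Temporal aggregation followed by spatial resampling",
    "parameters": [
        {"name": "data", "description": "Input data cube",
         "schema": {"type": "object", "subtype": "datacube"}},
        {"name": "period", "description": "Aggregation period",
         "schema": {"type": "string"}, "optional": True, "default": "month"},
        {"name": "resolution", "description": "Target resolution",
         "schema": {"type": "number"}, "optional": True, "default": 10},
    ],
    "process_graph": {
        "temporal": {
            "process_id": "aggregate_temporal_period",
            "arguments": {
                "data": {"from_parameter": "data"},
                "period": {"from_parameter": "period"},
                # inside the callback, "data" is the reducer's own parameter
                "reducer": {"process_graph": {"reduce": {
                    "process_id": "median",
                    "arguments": {"data": {"from_parameter": "data"}},
                    "result": True}}},
            },
        },
        "spatial": {
            "process_id": "resample_spatial",
            "arguments": {"data": {"from_node": "temporal"},
                          "resolution": {"from_parameter": "resolution"}},
            "result": True,
        },
    },
}

def referenced_parameters(node):
    """Collect every name referenced via from_parameter, recursively."""
    found = set()
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "from_parameter" and isinstance(value, str):
                found.add(value)
            else:
                found |= referenced_parameters(value)
    elif isinstance(node, list):
        for item in node:
            found |= referenced_parameters(item)
    return found

declared = {p["name"] for p in udp["parameters"]}
undeclared = referenced_parameters(udp["process_graph"]) - declared
print("undeclared parameters:", sorted(undeclared))
```

Because the definition only refers to existing processes, a shared repo of such files would need no new process specifications.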

Now about the actual process:

- Spatial regularization is something that openEO already does by default, without requiring any process. If a user loads a mix of Sentinel-2 bands at different resolutions, we for instance return a datacube with the right UTM zone as the projection system, and the highest resolution. So I'm not sure if we need this.
- Cloud masking is tricky, and unfortunately still needs sensor-specific implementations to do it right. Not sure how that would work with a convenience process? The most generic approach I can think of is some kind of binarized cloud mask, and then using a 'distance to cloud' metric in the compositing. The sits regularize (1) method mentions sorting images by cloud percentage, but I'm not sure how this translates to openEO datacubes.
- There are different methods possible to select the best available pixel from a given compositing interval. The best choice depends somewhat on the length of the interval and the number of observations per interval. A method that's relatively generic is using distance to the middle of the interval, combined with distance to cloud. It has the advantage over (1) that you try to ensure that the selected observations are spaced as evenly in time as possible.

(1) https://rdrr.io/cran/sits/man/sits_regularize.html
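The "distance to the middle of the interval, combined with distance to cloud" idea can be sketched for a single pixel as follows. This is only a minimal illustration: the function name `pick_best`, the equal weighting, the cap of 50 on cloud distance, and the tuple layout are assumptions, not how any backend implements it.

```python
def pick_best(observations, start, end, max_cloud_distance=50.0, weight_time=0.5):
    """Pick the best observation of one pixel within [start, end].

    observations: list of (day, cloud_distance, value) tuples, where
    cloud_distance == 0 means the pixel itself is cloudy.
    Weights and the distance cap are illustrative assumptions.
    """
    mid = (start + end) / 2
    half = (end - start) / 2

    def score(obs):
        day, cloud_distance, _ = obs
        # 1.0 at the middle of the interval, 0.0 at its edges
        time_score = 1.0 if half == 0 else 1.0 - abs(day - mid) / half
        # 1.0 when far from any cloud, capped at max_cloud_distance
        cloud_score = min(cloud_distance, max_cloud_distance) / max_cloud_distance
        return weight_time * time_score + (1 - weight_time) * cloud_score

    # keep only cloud-free observations inside the compositing interval
    candidates = [o for o in observations if start <= o[0] <= end and o[1] > 0]
    return max(candidates, key=score) if candidates else None

obs = [(1, 40, "a"), (5, 30, "b"), (9, 50, "c")]
print(pick_best(obs, 0, 10))
```

In this example the observation on day 5 wins: it is slightly closer to a cloud than the others, but it sits exactly in the middle of the interval.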

m-mohr commented 8 months ago

@PondiB I think it would make sense to make PRs against the ml branch because otherwise all changes from the ML branch will also appear in this PR. This leads to confusion. Please rebase your changes against the ML branch if necessary and set the base branch of the PR to ml.

PondiB commented 8 months ago

> @PondiB I think it would make sense to make PRs against the ml branch because otherwise all changes from the ML branch will also appear in this PR. This leads to confusion. Please rebase your changes against the ML branch if necessary and set the base branch of the PR to ml.

Sure.

Open-EO / openeo-processes

ML Data Cube Regularization #444