earthpulse / eotdl

Earth Observation Training Datasets
https://eotdl.com
MIT License

Feature Engineering with OpenEO - Use Case 1 #190

Open earthpulse opened 5 months ago

earthpulse commented 5 months ago

Feature engineering for parcels in EuroCrops (for example, temporal aggregation of selected indices).

Patrick1G commented 1 month ago

@jdries @juansensio @jamesemwheeler Here is a more detailed specification of this use case:

As a user, I want to make use of the EuroCrops dataset in EOTDL, create a filtered subset (EOTDL functionality), use openEO from within EOTDL to generate predictive features from S1 and S2 time series, then train a model in EOTDL and run inference with that model in CDSE.

  1. Find and explore the EuroCropsDataset and stage it in the EOTDL workspace.
  2. Filter the EuroCropsDataset using EOTDL functionality to create a subset of parcels, e.g. 8 crop classes, each with 1000 examples, for one country.
  3. Run feature engineering with openEO, creating temporal metrics from S1 and S2 time series (temporally optimised for the crop classes of interest). Store the feature engineering process graph with the training datasets in EOTDL.
  4. Use EOTDL functionality to train a model (for this, the features need to be retrieved). Store the model along with the openEO process graph in EOTDL.
  5. Use the model to run inference (from within EOTDL?) on an openEO backend such as CDSE or openEO platform. Make use of the feature engineering process graph stored along with the EOTDL model.

juansensio commented 3 weeks ago

Define the list of features that we want to compute for this task.

We can reuse the S1 and S2 pipelines from world cereal (features already validated).

HansVRP commented 3 weeks ago

Below I share an example of how we typically access custom STAC collections:

openeo-community-examples/python/LoadStac/load-stac-item-example.ipynb

HansVRP commented 3 weeks ago

The example provided in https://github.com/earthpulse/eotdl/blob/main/tutorials/notebooks/forest-map.ipynb feels like a more natural approach and a workflow we could provide as well.

@juansensio could you clarify whether you want openEO to access the EuroCropsDataset, or whether we want to extract S1 and S2 data that match the spatio-temporal bounds of the EuroCropsDataset?

I believe openEO would be better suited to:

1) select a region of interest

2) define a desired preprocessing methodology (save it as a process graph)

3) download the preprocessed data

4) train the desired model on the data

5) combine the standardized preprocessing with the model to run inference

HansVRP commented 2 weeks ago

@juansensio @Patrick1G any feedback on how best to steer this use-case?

juansensio commented 2 weeks ago

Patrick knows more about the use case, but as far as I understand the EuroCrops dataset contains crop classes for parcel polygons, so the goal would be to pair it with additional variables derived from S1/S2 (for example yearly mean NDVI).
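To make "yearly mean NDVI" concrete, here is a minimal pure-Python sketch. It assumes each parcel already has a series of parcel-averaged (red, NIR) reflectance pairs for the year; the function names and the toy values are hypothetical, and in the actual use case this computation would run inside the openEO process graph rather than locally.

```python
from statistics import fmean


def ndvi(nir: float, red: float) -> float:
    """Normalized Difference Vegetation Index: (NIR - red) / (NIR + red)."""
    denom = nir + red
    # Guard against division by zero, e.g. for no-data observations.
    return (nir - red) / denom if denom != 0 else float("nan")


def yearly_mean_ndvi(observations):
    """Mean NDVI over (red, nir) pairs, skipping undefined values."""
    values = [ndvi(nir, red) for red, nir in observations]
    valid = [v for v in values if v == v]  # NaN is the only value not equal to itself
    return fmean(valid) if valid else float("nan")


# Hypothetical parcel-averaged (red, nir) reflectances within one year.
parcel_series = [(0.10, 0.30), (0.08, 0.40), (0.05, 0.45)]
mean_ndvi = yearly_mean_ndvi(parcel_series)
```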

openEO should be used to get these variables through a feature engineering pipeline, so we can use them to train a model and then re-use the pipeline at inference time.

Here we can delegate the entire process to openEO, or rely on EOTDL to retrieve the geometries from the STAC catalog and pass them to openEO... I guess the second option is better since we do not need openEO to access the dataset in EOTDL directly (just pass the resulting STAC catalog with geometries).

Patrick1G commented 2 weeks ago

@HansVRP @juansensio the use case is described in detail above; let's follow those steps, please.

Next steps then:

I'm not quite sure how step #2 above should be done. EuroCrops contains millions of parcel polygons, and to train a model we only need a subset, e.g. constrained to a country and selected crop types, with a random selection of n polygons within that selection. I don't think openEO provides good functionality for this, so it could be done in EOTDL with Python libraries. As a first step, this could also be done offline. To be discussed at the next meeting.
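As a starting point for that discussion, the subset selection could look roughly like the sketch below: filter by country and crop type, then draw n random parcels per class. This is only an illustration using the standard library; the record fields (`id`, `country`, `crop`) and the function name `sample_parcels` are hypothetical and do not reflect the actual EuroCrops schema or any EOTDL API.

```python
import random
from collections import defaultdict


def sample_parcels(parcels, country, crop_classes, n_per_class, seed=42):
    """Return up to n_per_class randomly chosen parcels per crop class for one country."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    by_class = defaultdict(list)
    for parcel in parcels:
        if parcel["country"] == country and parcel["crop"] in crop_classes:
            by_class[parcel["crop"]].append(parcel)
    subset = []
    for crop in sorted(by_class):
        candidates = by_class[crop]
        # Take fewer parcels if a class has less than n_per_class candidates.
        k = min(n_per_class, len(candidates))
        subset.extend(rng.sample(candidates, k))
    return subset


# Toy records standing in for parcel attributes (hypothetical fields).
parcels = [
    {"id": i, "country": "AT" if i % 2 == 0 else "BE", "crop": crop}
    for i, crop in enumerate(["wheat", "maize", "barley"] * 20)
]
subset = sample_parcels(parcels, "AT", {"wheat", "maize"}, n_per_class=5)
```

With real data the same logic would more likely run on a GeoDataFrame, but the selection steps stay the same.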

HansVRP commented 2 weeks ago

Okay, I already have a first version up on https://github.com/earthpulse/eotdl/tree/hv_openeoexample

Todo

@juansensio Does EOTDL have dedicated CDSE S3 storage that we can use to save the results into?

HansVRP commented 2 weeks ago

@Patrick1G @jdries

For S2 I used Best Available Pixel (BAP) composites, which create monthly composites with minimal cloud cover. Afterwards I calculated some typical features (percentiles): https://github.com/earthpulse/eotdl/blob/hv_openeoexample/tutorials/notebooks/openeo/generate_s3_UDP.py

For S1 I used a similar approach: https://github.com/earthpulse/eotdl/blob/hv_openeoexample/tutorials/notebooks/openeo/generate_s1_UDP.py

Please let me know your thoughts

Patrick1G commented 2 weeks ago

@HansVRP the resources above are not accessible.

But it's important to keep the EO science aspects in mind here: we need to generate features/metrics at short temporal intervals, as this is the critical information for crop type prediction, so metrics at 5-, 7- or 10-day intervals, not monthly BAP composites. Therefore I would suggest using a feature engineering approach similar to the one in the S1metrics notebook above: {min, mean, max, stddev, Q25, Q50, Q75, Q90}, generated at e.g. 10-day intervals for the year of the EuroCrops dataset.
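For illustration, the suggested metric set on 10-day composites can be sketched in plain Python as below. This is not the S1metrics notebook; it assumes a per-parcel time series of (date, value) pairs, uses mean compositing with three periods per month (days 1-10, 11-20, 21-end), population standard deviation, and linearly interpolated percentiles. All of these choices and all names are assumptions made for the sketch.

```python
from collections import defaultdict
from datetime import date
from statistics import fmean, pstdev


def percentile(sorted_vals, q):
    """Linearly interpolated percentile (q in 0..100) of a sorted, non-empty list."""
    pos = (len(sorted_vals) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def dekad_index(day: date) -> int:
    """Index 0..35 of the roughly 10-day period within the year (3 per month)."""
    return (day.month - 1) * 3 + min((day.day - 1) // 10, 2)


def dekadal_composite(series):
    """Average all observations that fall into the same 10-day period."""
    buckets = defaultdict(list)
    for day, value in series:
        buckets[dekad_index(day)].append(value)
    return [fmean(buckets[k]) for k in sorted(buckets)]


def temporal_metrics(values):
    """Compute {min, mean, max, stddev, Q25, Q50, Q75, Q90} over the composites."""
    s = sorted(values)
    return {
        "min": s[0], "mean": fmean(s), "max": s[-1], "stddev": pstdev(s),
        "Q25": percentile(s, 25), "Q50": percentile(s, 50),
        "Q75": percentile(s, 75), "Q90": percentile(s, 90),
    }


# Hypothetical single-band time series for one parcel.
series = [(date(2021, 1, 3), 1.0), (date(2021, 1, 8), 3.0),
          (date(2021, 1, 15), 4.0), (date(2021, 1, 28), 6.0),
          (date(2021, 2, 2), 8.0)]
composites = dekadal_composite(series)
metrics = temporal_metrics(composites)
```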

HansVRP commented 6 days ago

@Patrick1G @jdries please review the current version.

Here I used weekly composites, from which I calculate the P10, P25, P50, P75 and P90 percentiles.

The statistics can easily be expanded if required. However, for now I kept them limited, as I run the statistics across 10 S2 bands and 2 S1 bands, which already results in a netCDF with 60 bands.
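The band count follows from (10 + 2) input bands × 5 percentiles = 60. A small sketch of how the output bands could be enumerated is shown below; the specific band names and the `<band>_P<percentile>` naming scheme are assumptions for illustration, not necessarily what the UDP produces.

```python
from itertools import product

# Assumed band selection: 10 Sentinel-2 bands and 2 Sentinel-1 polarisations.
S2_BANDS = ["B02", "B03", "B04", "B05", "B06", "B07", "B08", "B8A", "B11", "B12"]
S1_BANDS = ["VV", "VH"]
PERCENTILES = [10, 25, 50, 75, 90]

# One output band per (input band, percentile) combination.
feature_names = [
    f"{band}_P{p}" for band, p in product(S2_BANDS + S1_BANDS, PERCENTILES)
]
n_features = len(feature_names)
```

Adding the further statistics suggested earlier (min, mean, max, stddev) would grow the output to 12 × 9 = 108 bands under the same scheme.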