WorldCereal / worldcereal-classification

This repository contains the classification module of the WorldCereal system.
https://esa-worldcereal.org/
MIT License

Investigate possible data distribution shifts between Phase I and Phase II #116

Closed: kvantricht closed this 3 weeks ago

kvantricht commented 1 month ago

For a well-known region (e.g. Flanders), query the existing public geoparquet for samples and take a subset of ~2500 points. Then use scripts/extractions/point_extractions/point_extractions.py to extract the Phase II preprocessed time series for those points.
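A minimal sketch of the subsetting step, assuming each sample is a record with a longitude and a latitude. The field names (`sample_id`, `lon`, `lat`), the rough Flanders bounding box and the synthetic samples are all illustrative assumptions; the actual geoparquet schema and the query used here are not shown in this issue.

```python
import random


def subset_points(samples, bbox, n=2500, seed=42):
    """Keep samples inside bbox, then draw at most n of them at random.

    bbox is (min_lon, min_lat, max_lon, max_lat) in degrees.
    """
    min_lon, min_lat, max_lon, max_lat = bbox
    inside = [
        s for s in samples
        if min_lon <= s["lon"] <= max_lon and min_lat <= s["lat"] <= max_lat
    ]
    if len(inside) <= n:
        return inside
    # Seeded generator so the same subset can be re-extracted later.
    return random.Random(seed).sample(inside, n)


# Rough, illustrative bounding box around Flanders (an assumption).
FLANDERS_BBOX = (2.5, 50.7, 5.9, 51.5)

# Synthetic stand-in for the rows of the public geoparquet.
_rng = random.Random(0)
samples = [
    {"sample_id": i, "lon": _rng.uniform(0.0, 8.0), "lat": _rng.uniform(49.0, 53.0)}
    for i in range(50_000)
]

subset = subset_points(samples, FLANDERS_BBOX)
print(len(subset))
```

Fixing the seed keeps the subset reproducible, so the same points can be compared between Phase I and Phase II extractions.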

kvantricht commented 1 month ago

Image

kvantricht commented 1 month ago

More points, now with density coloring:

Image

For now I'd say only Sentinel-1 looks a bit suspicious. It's confirmed that we are not comparing exactly the same pixels, so scatter is definitely a possibility. Sentinel-1, though, shows a consistent bias, which a pixel mismatch alone would not explain. That means that not training on openEO-specific Sentinel-1 (Orfeo) is likely going to lead to deteriorated model performance.
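To make the distinction between scatter and bias concrete, here is a small sketch on synthetic paired values. The numbers (a backscatter-like signal around -12 with a 0.8 offset) are made up for illustration and are not measurements from this comparison. The idea is that a pixel mismatch should give differences centered on zero, while a systematic offset shows up as a mean difference that is large relative to its standard error.

```python
import math
import random
import statistics


def bias_and_scatter(phase1, phase2):
    """Return (bias, scatter, t) for paired values.

    bias    = mean of (phase2 - phase1)
    scatter = sample standard deviation of the differences
    t       = bias divided by its standard error
    """
    diffs = [b - a for a, b in zip(phase1, phase2)]
    bias = statistics.fmean(diffs)
    scatter = statistics.stdev(diffs)
    t = bias / (scatter / math.sqrt(len(diffs)))
    return bias, scatter, t


rng = random.Random(0)
phase1 = [rng.gauss(-12.0, 3.0) for _ in range(2500)]

# Case A: same signal plus noise only (e.g. neighbouring pixels).
noise_only = [v + rng.gauss(0.0, 1.0) for v in phase1]
# Case B: same noise level plus a constant offset (hypothetical value).
biased = [v + 0.8 + rng.gauss(0.0, 1.0) for v in phase1]

bias_a, scatter_a, t_a = bias_and_scatter(phase1, noise_only)
bias_b, scatter_b, t_b = bias_and_scatter(phase1, biased)
print(f"noise only: bias={bias_a:.3f} scatter={scatter_a:.3f} t={t_a:.1f}")
print(f"with bias:  bias={bias_b:.3f} scatter={scatter_b:.3f} t={t_b:.1f}")
```

Both cases have similar scatter, but only the second has a mean difference far from zero, which is the pattern described for Sentinel-1.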

kvantricht commented 1 month ago

Some findings:

So when excluding 2017 for now, we continue with other findings:

Then, when training on Phase I embeddings (no 2017) and applying the CatBoost model to Phase II embeddings, there is a significant drop in performance. So while Presto embeddings seem to hold the same predictive power for Phase I and Phase II, they do carry a signature related to the source of the data (Phase I vs Phase II), and training on one source and running inference on the other does lead to a significant performance drop. The observed S1 bias could be one of the reasons for this.
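The effect can be reproduced on a toy example. The sketch below is not the setup used here: it swaps CatBoost for a simple nearest-centroid classifier and Presto embeddings for synthetic 2-D points, with a hypothetical constant offset standing in for the source signature. It only illustrates how embeddings can be equally separable within each source while a model trained on one source degrades on the other.

```python
import random


def centroid(rows):
    """Column-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def fit(X, y):
    """Nearest-centroid 'model': one centroid per class label."""
    return {lab: centroid([x for x, l in zip(X, y) if l == lab]) for lab in set(y)}


def predict(model, x):
    """Label of the centroid closest to x (squared Euclidean distance)."""
    return min(model, key=lambda lab: sum((a - b) ** 2 for a, b in zip(x, model[lab])))


def accuracy(model, X, y):
    return sum(predict(model, x) == l for x, l in zip(X, y)) / len(y)


def make_embeddings(rng, n, offset):
    """Synthetic 2-D 'embeddings'; offset shifts the first dimension per source."""
    X, y = [], []
    for _ in range(n):
        label = rng.choice(["crop", "other"])
        center = 1.0 if label == "crop" else -1.0
        X.append([rng.gauss(center + offset, 1.0), rng.gauss(0.0, 1.0)])
        y.append(label)
    return X, y


rng = random.Random(0)
p1_train, p1_test = make_embeddings(rng, 2000, 0.0), make_embeddings(rng, 2000, 0.0)
p2_train, p2_test = make_embeddings(rng, 2000, 1.5), make_embeddings(rng, 2000, 1.5)

model_p1 = fit(*p1_train)
model_p2 = fit(*p2_train)

acc_p1_on_p1 = accuracy(model_p1, *p1_test)  # train Phase I, test Phase I
acc_p2_on_p2 = accuracy(model_p2, *p2_test)  # train Phase II, test Phase II
acc_p1_on_p2 = accuracy(model_p1, *p2_test)  # train Phase I, test Phase II
print(acc_p1_on_p1, acc_p2_on_p2, acc_p1_on_p2)
```

In this toy case the within-source accuracies are similar, while the cross-source accuracy is clearly lower because the decision boundary learned on one source sits in the wrong place for the other.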

kvantricht commented 3 weeks ago

The main issues were related to fixes in #111. Although the S1 distribution shift likely contributes to the prediction deterioration, training on Phase II data in the future should fix this. Closing as there is nothing more to be done for now.