Closed kvantricht closed 3 weeks ago
More points, now with density coloring:
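A density-colored comparison scatter like the one referenced above can be sketched as follows. This is a minimal, self-contained illustration with synthetic stand-in data, not the issue's actual extraction values; the histogram-based density estimate is a cheap alternative to a full KDE.

```python
import numpy as np

# Synthetic stand-in for a 1:1 band comparison (e.g. reference vs openEO values).
rng = np.random.default_rng(42)
x = rng.normal(0.0, 1.0, 5000)
y = x + rng.normal(0.0, 0.3, 5000)  # correlated, as in a pixel-wise comparison

# Estimate a per-point density from a 2D histogram, then look up each
# point's bin count to use as its color value.
counts, xedges, yedges = np.histogram2d(x, y, bins=50)
ix = np.clip(np.digitize(x, xedges) - 1, 0, counts.shape[0] - 1)
iy = np.clip(np.digitize(y, yedges) - 1, 0, counts.shape[1] - 1)
density = counts[ix, iy]

# With matplotlib one would then render:
#   plt.scatter(x, y, c=density, s=2, cmap="viridis")
print(density.shape)
```

Coloring by density keeps dense clouds of overlapping points readable, which is what makes a consistent offset (as opposed to symmetric scatter) visible in the plot.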
For now I'd say only Sentinel-1 is a bit suspicious. It's confirmed that we are not comparing exactly the same pixels, so some scatter is definitely expected. Sentinel-1, however, shows a consistent bias. That means that training without openEO-specific Sentinel-1 (Orfeo) data is likely to degrade model performance.
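The distinction drawn above (pixel-mismatch scatter vs a consistent bias) can be checked numerically: scatter from imperfect pixel matching centers the differences near zero, while a processing-chain bias shows up as a mean offset well outside the noise floor. A minimal sketch with synthetic backscatter values (the arrays and the 0.8 dB offset are illustrative assumptions, not the issue's data):

```python
import numpy as np

# Hypothetical Sentinel-1 backscatter (dB) from two processing chains over
# approximately matching pixels.
rng = np.random.default_rng(0)
s1_phase1 = rng.normal(-12.0, 3.0, 10_000)                 # reference chain
s1_phase2 = s1_phase1 + 0.8 + rng.normal(0, 0.5, 10_000)   # openEO/Orfeo chain

diff = s1_phase2 - s1_phase1
bias = diff.mean()     # systematic offset between the chains
spread = diff.std()    # pixel-mismatch / noise component

# A consistent bias is a mean offset much larger than the standard error;
# pure scatter would leave the mean near zero.
print(f"bias = {bias:.2f} dB, spread = {spread:.2f} dB")
```

Running the same differencing on the real paired extractions would make the "consistent bias" claim quantitative per band and orbit.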
Some findings:
Excluding 2017 for now, we continue with the other findings:
Then, when training on Phase I embeddings (no 2017) and applying the CatBoost model to Phase II embeddings, there is a significant drop in performance. So while Presto embeddings seem to hold the same predictive power for Phase I and Phase II, they do carry a signature related to the source of the data (Phase I vs Phase II), and training on one source while running inference on the other leads to a significant performance drop. The observed S1 bias could be one of the reasons for this.
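The cross-source effect described above can be reproduced in miniature. This is a numpy-only stand-in for the CatBoost experiment (all data synthetic, a nearest-centroid classifier instead of CatBoost): train on "Phase I" embeddings, then evaluate on "Phase II" embeddings that carry a constant source-related offset, as the observed S1 bias would induce.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_embeddings(n, offset=0.0):
    """Synthetic 8-d embeddings for two classes; `offset` mimics a source signature."""
    y = rng.integers(0, 2, n)
    X = rng.normal(0, 1, (n, 8)) + y[:, None] * 1.5 + offset
    return X, y

def fit_centroids(X, y):
    # Per-class mean embedding (the "model").
    return np.stack([X[y == c].mean(axis=0) for c in (0, 1)])

def accuracy(cents, X, y):
    d = np.linalg.norm(X[:, None, :] - cents[None], axis=2)
    return (d.argmin(axis=1) == y).mean()

X1, y1 = make_embeddings(2000)               # Phase I source
X2, y2 = make_embeddings(2000, offset=1.0)   # Phase II source, biased signature

cents = fit_centroids(X1, y1)                # train on Phase I only
acc1 = accuracy(cents, X1, y1)
acc2 = accuracy(cents, X2, y2)
print(f"Phase I accuracy:  {acc1:.2f}")
print(f"Phase II accuracy: {acc2:.2f}")
```

Within each source the classes stay equally separable (same predictive power), yet the constant offset moves the Phase II points across the decision boundary learned on Phase I, which is exactly the train-on-one-source, infer-on-the-other failure mode.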
The main issues were addressed by the fixes in #111. Although the S1 distribution shift likely contributes to prediction deterioration, training on Phase II data in the future should fix this. Closing, as nothing more is to be done for now.
Based on a well-known region (e.g. Flanders), we query the existing public geoparquet for samples and make a subset of ~2500 points. We then use scripts/extractions/point_extractions/point_extractions.py to extract Phase II preprocessed time series.
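The sampling step above can be sketched as follows. In practice the public geoparquet would be loaded with geopandas (`gpd.read_parquet`) and clipped to a Flanders geometry before feeding the subset to `scripts/extractions/point_extractions/point_extractions.py`; here the archive is mocked with synthetic lon/lat points and a bounding box, so everything below the constants is a stdlib-only stand-in.

```python
import random

# Approximate Flanders bounding box (min_lon, min_lat, max_lon, max_lat) --
# an illustrative assumption, not the project's actual AOI definition.
FLANDERS_BBOX = (2.5, 50.7, 5.9, 51.5)
TARGET_SIZE = 2500

random.seed(7)
# Stand-in for the public sample archive: (lon, lat, sample_id) tuples.
archive = [(random.uniform(0, 8), random.uniform(49, 53), i) for i in range(50_000)]

def in_bbox(pt, bbox):
    lon, lat, _ = pt
    min_lon, min_lat, max_lon, max_lat = bbox
    return min_lon <= lon <= max_lon and min_lat <= lat <= max_lat

# Clip to the region of interest, then draw a random subset of ~2500 points.
flanders = [p for p in archive if in_bbox(p, FLANDERS_BBOX)]
subset = random.sample(flanders, k=min(TARGET_SIZE, len(flanders)))
print(len(subset))
```

The resulting point list (IDs plus geometries) is what would then be handed to the point-extraction script to pull the Phase II preprocessed time series.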