Investigate possible data distribution shifts between Phase I and Phase II

kvantricht commented 1 month ago

For a well-known region (e.g. Flanders), we query the existing public GeoParquet for samples and take a subset of ~2500 points. We then use scripts/extractions/point_extractions/point_extractions.py to extract the Phase II preprocessed time series.
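A minimal sketch of the subsetting step, assuming the samples expose longitude/latitude values. The bounding box is only a rough, hypothetical approximation of Flanders, and the coordinates are synthetic stand-ins for the real GeoParquet contents; `subset_points` is an illustrative helper, not a function from the repository.

```python
import numpy as np

# Hypothetical bounding box roughly covering Flanders (lon/lat, degrees).
# These numbers are an assumption for illustration only.
LON_MIN, LON_MAX = 2.5, 5.9
LAT_MIN, LAT_MAX = 50.7, 51.5


def subset_points(lon, lat, n_points=2500, seed=42):
    """Return sorted indices of up to n_points samples inside the bounding box."""
    lon = np.asarray(lon)
    lat = np.asarray(lat)
    inside = (
        (lon >= LON_MIN) & (lon <= LON_MAX) & (lat >= LAT_MIN) & (lat <= LAT_MAX)
    )
    candidates = np.flatnonzero(inside)
    rng = np.random.default_rng(seed)  # fixed seed keeps the subset reproducible
    n = min(n_points, candidates.size)
    return np.sort(rng.choice(candidates, size=n, replace=False))


# Synthetic sample locations standing in for the real GeoParquet contents.
rng = np.random.default_rng(0)
lon = rng.uniform(2.0, 7.0, size=20_000)
lat = rng.uniform(50.0, 52.0, size=20_000)
idx = subset_points(lon, lat)
```

Storing the selected indices (or the matching sample IDs) alongside the Phase I subset makes it possible to request exactly the same points for the Phase II extraction.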

[x] Save subset of Phase I parquet file
[x] Generate identical Phase II parquet file
[ ] Sanity checks on different input channels
[ ] Make identical CAL/VAL/TEST set for Phase I and Phase II
[ ] Already finetuned Presto: compute Phase I and Phase II embeddings
[ ] Train CatBoost on Phase I CAL/VAL
[ ] Compare trained CatBoost on Phase I TEST vs Phase II TEST
[ ] Train CatBoost on Phase II CAL/VAL
[ ] Deploy model and run inference for Flemish AOI and compare to Phase I CatBoost on this AOI
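One way to get an identical CAL/VAL/TEST set for both phases is to derive the split from a hash of a shared sample ID, so the assignment does not depend on row order or on which parquet file is read. The sketch below assumes such a shared ID exists; the ID format, the split fractions and the `assign_split` helper are all hypothetical.

```python
import hashlib


def assign_split(sample_id, val_frac=0.15, test_frac=0.15):
    """Map a sample ID to 'CAL', 'VAL' or 'TEST' deterministically."""
    digest = hashlib.sha256(str(sample_id).encode("utf-8")).hexdigest()
    # First 8 hex digits -> integer in [0, 2**32), scaled to [0, 1).
    u = int(digest[:8], 16) / 2**32
    if u < test_frac:
        return "TEST"
    if u < test_frac + val_frac:
        return "VAL"
    return "CAL"


# Hypothetical sample IDs shared by the Phase I and Phase II extractions.
phase1_ids = [f"sample_{i}" for i in range(2500)]
phase2_ids = list(reversed(phase1_ids))  # same samples, different row order

phase1_split = {sid: assign_split(sid) for sid in phase1_ids}
phase2_split = {sid: assign_split(sid) for sid in phase2_ids}
```

Because the split is a pure function of the ID, a sample that lands in TEST for Phase I also lands in TEST for Phase II, which keeps the Phase I TEST vs Phase II TEST comparison free of leakage between sets.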

kvantricht commented 1 month ago

More points, now including density coloring:

For now I'd say only Sentinel-1 looks a bit suspicious. It's confirmed that we are not comparing exactly the same pixels, so scatter is definitely a possibility. Sentinel-1, though, shows a consistent bias. That means that not training on openEO-specific Sentinel-1 (Orfeo) is likely to lead to deteriorated model performance.
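The distinction between scatter and a consistent bias can be made numerical by looking at the paired differences: pixel mismatch should give differences centered on zero, while a bias shifts their mean. A minimal sketch on synthetic data follows; the value ranges and the 1.5 offset are made up for illustration and are not measurements from this comparison.

```python
import numpy as np


def bias_and_scatter(phase1, phase2):
    """Return (mean difference, std of differences) for paired values."""
    diff = np.asarray(phase2, dtype=float) - np.asarray(phase1, dtype=float)
    return diff.mean(), diff.std(ddof=1)


rng = np.random.default_rng(0)
n = 5000
truth = rng.normal(-12.0, 3.0, size=n)  # synthetic backscatter-like values (dB)

# Case 1: pixel mismatch only -> zero-mean noise around the 1:1 line.
scatter_only = truth + rng.normal(0.0, 1.0, size=n)
# Case 2: same kind of noise plus a constant offset (hypothetical 1.5 dB).
biased = truth + rng.normal(0.0, 1.0, size=n) + 1.5

for name, values in [("scatter only", scatter_only), ("biased", biased)]:
    bias, scatter = bias_and_scatter(truth, values)
    print(f"{name}: bias={bias:+.2f}, scatter={scatter:.2f}")
```

Running the same two statistics per input channel on the real Phase I and Phase II values would show which channels only scatter and which carry an offset.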

kvantricht commented 1 month ago

Some findings:

When training CatBoost on the original (non-Presto) inputs for Phase I vs Phase II, results are close to each other when 2017 is left out. So the same "amount of predictive signal" is present in both.
2017 is a weird year. Training and predicting on this year alone gives significantly lower performance in Phase II, but only for optical data. There is no clear no-data pattern and no difference in band statistics, so for the moment we have no clue why this is the case.

Excluding 2017 for now, we continue with the other findings:

Using the self-supervised Presto that was trained on Phase I data, the CatBoost trainings on Phase I and Phase II embeddings give results close to each other. However, performance is considerably worse than CatBoost on the original preprocessed inputs (no Presto).
Using the finetuned Presto trained on Phase I data, CatBoost trained on Phase II embeddings actually performs even a bit better. So this confirms that, for this case, 2017 is indeed somehow the year that has a detrimental effect on performance.

Then, when training on Phase I embeddings (no 2017) and applying the CatBoost model to Phase II embeddings, there is a significant drop in performance. So while Presto embeddings seem to hold the same predictive power for Phase I and Phase II, they do carry a signature related to the source of the data (Phase I vs Phase II), and training on one source and running inference on the other leads to a significant performance drop. The observed S1 bias could, for example, be one of the reasons for this.
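The pattern described here (similar in-source performance, lower cross-source performance) can be laid out as a train/test matrix over the two sources. The sketch below is illustrative only: it uses synthetic two-class "embeddings" with a made-up source offset and a nearest-centroid classifier as a lightweight stand-in for CatBoost, so none of the numbers it prints relate to the actual experiments.

```python
import numpy as np


def make_source(rng, n_per_class, offset):
    """Two-class synthetic 'embeddings'; offset mimics a source-specific shift."""
    dim = 8
    X0 = rng.normal(0.0, 0.5, size=(n_per_class, dim))
    X1 = rng.normal(0.0, 0.5, size=(n_per_class, dim))
    X0[:, 0] -= 1.0  # class 0 sits at -1 on the first axis
    X1[:, 0] += 1.0  # class 1 sits at +1 on the first axis
    X = np.vstack([X0, X1])
    X[:, 0] += offset  # the same shift is applied to both classes
    y = np.repeat([0, 1], n_per_class)
    return X, y


def fit_centroids(X, y):
    """Per-class mean vectors (row c is the centroid of class c)."""
    return np.stack([X[y == c].mean(axis=0) for c in (0, 1)])


def predict(centroids, X):
    """Assign each row of X to the nearest centroid."""
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)


rng = np.random.default_rng(0)
sources = {"phase1": 0.0, "phase2": 1.5}  # hypothetical offset for the second source
train = {name: make_source(rng, 1000, off) for name, off in sources.items()}
test = {name: make_source(rng, 1000, off) for name, off in sources.items()}

accuracy = {}
for train_name, (X_tr, y_tr) in train.items():
    centroids = fit_centroids(X_tr, y_tr)
    for test_name, (X_te, y_te) in test.items():
        hits = predict(centroids, X_te) == y_te
        accuracy[(train_name, test_name)] = float(hits.mean())

for (train_name, test_name), acc in accuracy.items():
    print(f"train={train_name} test={test_name} accuracy={acc:.3f}")
```

In this toy setup both sources are equally separable on their own, yet a model fitted on one places its decision boundary where the other source's classes overlap it. That is the kind of source signature the cross-phase evaluation points to.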

kvantricht commented 3 weeks ago

The main issues were covered by the related fixes in #111. Although the S1 distribution shift likely contributes to the deterioration in predictions, training on Phase II data in the future should fix this. Closing, as there is nothing more to be done for now.

WorldCereal / worldcereal-classification #116