WorldCereal / worldcereal-classification

This repository contains the classification module of the WorldCereal system.
https://esa-worldcereal.org/
MIT License

Setup experiments on seasons handling for multiclass classification #13

Closed cbutsko closed 5 months ago

cbutsko commented 8 months ago

Here's what we start with:

And here are the hypotheses to test and things to do:

cbutsko commented 7 months ago

Quite a few regions have a large number of samples where either no vegetation peaks or multiple peaks were detected (figure attached).

Here are a few notable examples.

Zero peaks - the season start date is probably wrong here:

Burkina Faso cropland_BFA_peaks=0

Mexico cropland_MEX_peaks=0

Mali cropland_MLI_peaks=0

Kazakhstan cropland_KAZ_peaks=0

Nigeria cropland_NGA_peaks=0

Sudan cropland_SDN_peaks=0

Senegal cropland_SEN_peaks=0

Uzbekistan cropland_UZB_peaks=0

Zero peaks - the vegetation pattern doesn't really look like a crop; a label error is possible:

Spain cropland_ESP_peaks=0

Turkey cropland_TUR_peaks=0

Multiple peaks - a single croptype label with no explicit masking will likely confuse the model:

Tanzania cropland_TZA_peaks=2

Sri Lanka cropland_LKA_peaks=2

Brazil cropland_BRA_peaks=2

Burundi cropland_BDI_peaks=2

Egypt cropland_EGY_peaks=2

Italy cropland_ITA_peaks=3
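For illustration, here is a minimal sketch of how per-sample peak counts like the ones above could be derived from an NDVI time series. The use of scipy.signal.find_peaks and the smoothing/height/prominence settings are assumptions for this sketch, not the exact detection procedure used in the experiments:

```python
import numpy as np
from scipy.signal import find_peaks

def count_vegetation_peaks(ndvi, min_height=0.4, min_prominence=0.15, smooth_window=3):
    """Count vegetation peaks in a (roughly monthly) NDVI time series.

    Thresholds are illustrative assumptions, not the settings used in the experiments.
    """
    ndvi = np.asarray(ndvi, dtype=float)
    # Light moving-average smoothing to suppress single-step noise
    kernel = np.ones(smooth_window) / smooth_window
    smoothed = np.convolve(ndvi, kernel, mode="same")
    # A "vegetation peak" must be high enough and stand out from its surroundings
    peaks, _ = find_peaks(smoothed, height=min_height, prominence=min_prominence)
    return len(peaks)

# Example: a single-season profile vs. a double-season profile
single = [0.2, 0.25, 0.4, 0.6, 0.75, 0.7, 0.5, 0.3, 0.2, 0.2, 0.2, 0.2]
double = [0.2, 0.5, 0.7, 0.4, 0.2, 0.3, 0.6, 0.75, 0.5, 0.3, 0.2, 0.2]
print(count_vegetation_peaks(single))  # -> 1
print(count_vegetation_peaks(double))  # -> 2
```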

PROPOSAL:

  1. Exclude samples with a clearly mismatched season start date from training and validation
  2. Use a subset of samples with clear multiple seasons as a separate validation set
  3. Communicate these findings to Valencia
cbutsko commented 7 months ago

Maize/not_maize setup

When training models in a country leave-one-out (LOO) way, both finetuned Presto features and raw TS features spectacularly fail to generalize in ~50% of cases. Two major groups can be seen here:

  1. All samples in the country show equally poor performance: Argentina, Brazil, Canada 😪, Egypt, Latvia, Mozambique, Rwanda, Uganda, Ukraine 😪
  2. Labels with 0 detected peaks show significantly worse performance in: Austria, Germany, Spain, France, Italy, USA. This can be a signal of bad labels!
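For context, a minimal sketch of a leave-one-country-out evaluation loop on a synthetic stand-in dataset (the column names, RandomForest model, and F1 metric are illustrative assumptions, not the exact pipeline used here):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import LeaveOneGroupOut

# Stand-in table: one row per sample with feature columns, a binary
# "is_maize" label and a "country" column (names are assumptions).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(300, 8)), columns=[f"f{i}" for i in range(8)])
df["is_maize"] = rng.integers(0, 2, size=len(df))
df["country"] = rng.choice(["ARG", "CAN", "UKR"], size=len(df))

X = df.drop(columns=["is_maize", "country"]).values
y = df["is_maize"].values
groups = df["country"].values

# Leave-one-country-out: train on all countries except one, test on the held-out one
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups):
    model = RandomForestClassifier(n_estimators=200, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    pred = model.predict(X[test_idx])
    print(groups[test_idx][0], round(f1_score(y[test_idx], pred), 3))
```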

One hypothesis is that the model overfits to a particular peak location (at least in the case of a single peak) and is unable to predict maize correctly. If this is so, one of the following can help:

With the first approach, the relative location of valid_date with respect to season_start was added to training. While certain countries got a performance boost, it didn't affect Argentina, Egypt, Rwanda, Ukraine... (figure attached)
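A minimal sketch of how such a relative-position feature could be computed, assuming per-sample valid_date and season_start columns (the column names and date format are assumptions):

```python
import pandas as pd

# Hypothetical per-sample metadata; column names are assumptions for illustration.
meta = pd.DataFrame({
    "valid_date": ["2021-07-15", "2021-10-01"],
    "season_start": ["2021-03-01", "2021-03-01"],
})
meta["valid_date"] = pd.to_datetime(meta["valid_date"])
meta["season_start"] = pd.to_datetime(meta["season_start"])

# Relative position of the validity date within the 12-month window that
# starts at season_start: 0.0 = at season start, ~1.0 = a year later.
meta["valid_date_rel"] = (meta["valid_date"] - meta["season_start"]).dt.days / 365.0
print(meta["valid_date_rel"])
```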

Using aggregated features (min, max, mean, std) instead of time series gives much better performance for many regions (though not all), which somewhat confirms the hypothesis (figure attached).
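A minimal sketch of the aggregated-feature variant, collapsing each band's time series into min/max/mean/std statistics (the (samples, timesteps, bands) array layout is an assumption):

```python
import numpy as np

# Hypothetical input: (n_samples, n_timesteps, n_bands) time-series array.
n_samples, n_timesteps, n_bands = 500, 12, 10
ts = np.random.default_rng(0).normal(size=(n_samples, n_timesteps, n_bands))

# Collapse the temporal dimension into per-band summary statistics,
# so the classifier no longer sees where in the year the peak happens.
aggregated = np.concatenate(
    [ts.min(axis=1), ts.max(axis=1), ts.mean(axis=1), ts.std(axis=1)],
    axis=1,
)
print(aggregated.shape)  # (500, 40): 4 statistics x 10 bands
```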

Training only on samples that were confidently predicted as crop (p > 0.7) gives a very slight but quite consistent boost. Of all cropland datapoints, 23% were not confidently predicted as crop by the best cropland model; this is the amount of data excluded from training in this case. (figure attached)
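A minimal sketch of this confidence-based filtering, with stand-ins for the best cropland model and the cropland-labelled samples (the model choice and names are assumptions; only the p > 0.7 rule comes from the experiment above):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Stand-ins for the real objects: a crop/no_crop model and cropland-labelled samples
# (names and the RandomForest choice are assumptions for illustration).
X_all = rng.normal(size=(1000, 8))
y_all = rng.integers(0, 2, size=1000)                # 1 = crop, 0 = no_crop
cropland_model = RandomForestClassifier(random_state=0).fit(X_all, y_all)

df_crop = pd.DataFrame(rng.normal(size=(400, 8)))    # samples carrying a cropland label
p_crop = cropland_model.predict_proba(df_crop.values)[:, 1]

# Keep only samples the cropland model itself confidently calls crop (p > 0.7)
confident = p_crop > 0.7
df_crop_filtered = df_crop[confident]
print(f"excluded {100 * (1 - confident.mean()):.1f}% of cropland-labelled samples")
```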

Detailed results here

PROPOSAL:

  1. Exclude from training those samples that have a cropland label but are not confidently predicted as cropland by the best cropland model
  2. Exclude these samples from the curated test set too
cbutsko commented 7 months ago

If generalization to a new region is not happening, the next question is how many samples need to be added to the training set to boost performance. The following experiment was conducted: in a country LOO setup, maize and not_maize "injections" from the omitted country were added to the train set, and the performance change was measured. The following "injection" sizes were tested: [1, 10, 20, 30, 40, 50, 100, 200, 250, 300, 350, 400, 450, 500, 550, 600, 650, 700, 750, 800, 900, 1000, 1100, 1200]. The choice of sizes was somewhat intuitive, and many of them are irrelevant for certain countries where the number of maize samples is only ~200 points. Equal amounts of maize and not_maize samples were added to the train set.
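A minimal sketch of the injection loop for a single held-out country, on a synthetic stand-in dataset (column names, model choice, and the injection sizes shown are illustrative assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)

# Stand-in dataset: feature columns plus "country" and "is_maize" (names are assumptions).
df = pd.DataFrame(rng.normal(size=(3000, 8)), columns=[f"f{i}" for i in range(8)])
df["is_maize"] = rng.integers(0, 2, size=len(df))
df["country"] = rng.choice(["ARG", "CAN", "UKR", "FRA", "USA"], size=len(df))
features = [f"f{i}" for i in range(8)]

holdout = "ARG"
train_base = df[df.country != holdout]
test = df[df.country == holdout]

for size in [1, 10, 50, 100, 200]:
    # "Inject" an equal number of maize and not_maize samples from the held-out country
    maize = test[test.is_maize == 1].sample(n=size, random_state=0)
    not_maize = test[test.is_maize == 0].sample(n=size, random_state=0)
    injection = pd.concat([maize, not_maize])
    train = pd.concat([train_base, injection])
    eval_set = test.drop(injection.index)   # evaluate on the remaining held-out samples

    model = RandomForestClassifier(n_estimators=200, random_state=0).fit(
        train[features], train.is_maize
    )
    score = f1_score(eval_set.is_maize, model.predict(eval_set[features]))
    print(f"{holdout}: injection={size:4d}  F1={score:.3f}")
```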

Of the countries with poor performance defined in the previous comment, the following three are of most interest, as the number of maize samples there exceeds 1.5K points and the injection sizes therefore constitute only a small part of the whole dataset: Argentina, Canada, Ukraine. For all of them, adding samples works extremely well: as little as ~200 points is enough to bring the metrics to a decent level. (figure attached)

cbutsko commented 5 months ago

Based on the analysis above, the following actions are proposed:

  1. Remove samples from countries where the end-of-season date doesn't allow capturing a valid vegetation season in a 12-month window: bad_eos_coutries = ['BFA','MEX','MLI','KAZ','NGA','UZB','SDN','SEN','DMA'].

  2. Create the following validation splits:

    • Sanity check set. This set consists of a random 10% split of all data, stratified by crop_type label and country (see the sketch after this list).

    • Seasonality capturing set. This set consists of samples with various numbers of detected peaks. F1 scores should be considered with respect to the number of vegetation peaks detected in the NDVI time series (0, 1, >1). Several countries with clear multi-seasonal behavior were selected for this subset; they can be grouped as follows:

      1) Egypt, Brazil, Ethiopia: very clear double seasons; the croptype datasets are highly concentrated in a small area, so spatial autocorrelation can skew the results of adding "injections"; seemingly high quality of the datasets and corresponding valid_dates.

      2) Italy: spatially extensive croptype data; double-season behavior is observed in maize only (but it's quite an extensive and important class); the influence of valid_dates is close to negligible, as most of them are June 1st.

      3) Rwanda: both single- and double-season maize are present; results are poor for both.

    • Spatial generalization set. This set consists of several "problematic" regions. Only a small fraction of samples from these regions is left in the main train set: 1-10% of all country samples, stratified by croptype label. They can also be removed from the train set entirely to test performance in a completely new region. F1 scores should be computed per country. Countries for this set were selected as follows: the best crop/no_crop classification model and a 3-fold cross-validation procedure were used to mark every sample with a crop/no_crop label; F1 scores were then computed per country and per dataset; the countries with the lowest scores were selected (the assumption here is that the labels are of good quality and simply hard to classify, as no extra suspicious labels were used): more_difficult_countries = ['FIN','GRC','IDN','ITA','MAR','MDG','MOZ','PRT','SOM']. Among countries with both high and low F1 scores on different datasets, a few were also selected for inclusion in this set: less_difficult_countries = ['ETH','ESP','AUT','BRA','TZA'].
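A minimal sketch of step 1 and of the sanity-check split from step 2, on a stand-in sample table (the column names and the train_test_split call are assumptions; the bad-EOS country list and the 10% stratified split come from the proposal above):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-in sample table; "crop_type" and "country" column names are assumptions.
df = pd.DataFrame({
    "crop_type": rng.choice(["maize", "wheat", "other"], size=2000),
    "country": rng.choice(["ESP", "ITA", "BRA", "ETH"], size=2000),
})

# 1. Drop countries whose end-of-season date cannot be captured in a 12-month window
#    (the proposed exclusion list, called bad_eos_coutries in the comment above).
bad_eos_countries = ["BFA", "MEX", "MLI", "KAZ", "NGA", "UZB", "SDN", "SEN", "DMA"]
df = df[~df.country.isin(bad_eos_countries)]

# 2. Sanity check set: random 10% split, stratified by crop_type label and country
strata = df.crop_type + "_" + df.country
train_df, sanity_df = train_test_split(df, test_size=0.10, stratify=strata, random_state=0)
print(len(train_df), len(sanity_df))
```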