Closed: cbutsko closed this issue 5 months ago
Quite a few regions have a large number of samples where either no or multiple vegetation peaks were detected:
Here are a few notable examples.

Zero peaks - the season start date is probably wrong here:
Burkina Faso
Mexico
Mali
Kazakhstan
Nigeria
Sudan
Senegal
Uzbekistan
Zero peaks - the vegetation pattern doesn't really look like crop, so a label error is possible:
Spain
Turkey
Multiple peaks - a single crop-type label with no explicit masking will likely confuse the model:
Tanzania
Sri Lanka
Brazil
Burundi
Egypt
Italy
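The exact peak detection used in the analysis isn't shown here, but grouping samples by peak count (0, 1, >1) can be sketched as follows. The thresholds and the crude dip-based prominence check are illustrative assumptions, not the actual procedure:

```python
import numpy as np

def count_vegetation_peaks(ndvi, min_height=0.3, min_dip=0.1):
    """Count local NDVI maxima above `min_height`, merging maxima that are
    not separated by a dip of at least `min_dip` (a crude prominence check)."""
    x = np.asarray(ndvi, dtype=float)
    # candidate local maxima: strictly higher than both neighbours
    cand = [i for i in range(1, len(x) - 1)
            if x[i] > x[i - 1] and x[i] > x[i + 1] and x[i] >= min_height]
    peaks = []
    for i in cand:
        if peaks and x[peaks[-1]:i + 1].min() > min(x[peaks[-1]], x[i]) - min_dip:
            # no real dip between the two maxima -> keep only the higher one
            if x[i] > x[peaks[-1]]:
                peaks[-1] = i
        else:
            peaks.append(i)
    return len(peaks)

def peak_group(ndvi):
    """Bucket a sample into the '0' / '1' / '>1' peak groups discussed above."""
    n = count_vegetation_peaks(ndvi)
    return "0" if n == 0 else ("1" if n == 1 else ">1")
```

With a monthly 12-step NDVI series, a double-season sample would land in the ">1" bucket and a flat (possibly non-crop) sample in "0".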
PROPOSAL:
Maize/not_maize setup

When training models in a country-LOO way, both finetuned Presto features and raw TS features spectacularly fail to generalize in ~50% of cases. Two major groups can be seen here:
One hypothesis is that the model overfits to a particular peak location (at least in the single-peak case) and is therefore unable to predict maize correctly. If so, one of the following may help:
With the first approach, the location of valid_date relative to season_start was added to training. While certain countries got a performance boost, it didn't affect Argentina, Egypt, Rwanda, Ukraine...
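A minimal sketch of how such a feature could be derived, assuming valid_date and season_start are date columns in the samples dataframe (the actual feature encoding may differ):

```python
import pandas as pd

def add_relative_valid_date(df):
    """Add the offset of valid_date from season_start, in days, as a feature.
    Column names are assumptions based on the discussion above."""
    out = df.copy()
    out["valid_date_offset_days"] = (
        pd.to_datetime(out["valid_date"]) - pd.to_datetime(out["season_start"])
    ).dt.days
    return out
```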
Using aggregated features (min, max, mean, std) instead of time series gives much better performance for many (though not all) regions, which somewhat confirms the hypothesis:
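A minimal sketch of the aggregation, assuming each sample is a (time, bands) array; the statistics listed above (min, max, mean, std) are computed per band, which removes the explicit dependence on peak position in time:

```python
import numpy as np

def aggregate_time_series(ts):
    """Collapse a (time, bands) array into per-band min/max/mean/std features."""
    ts = np.asarray(ts, dtype=float)
    return np.concatenate([ts.min(axis=0), ts.max(axis=0),
                           ts.mean(axis=0), ts.std(axis=0)])
```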
Training only on samples that were confidently predicted as crop (p>0.7) gives a very slight but quite consistent boost. Of all cropland datapoints, 23% were not confidently predicted as crop by the best cropland model; this is the amount of data excluded from training in this case.
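The filtering step can be sketched like this; `crop_prob` as the column holding the best cropland model's predicted crop probability is an assumed name:

```python
import pandas as pd

def filter_confident_crop(df, threshold=0.7):
    """Keep only samples confidently predicted as crop (p > threshold).
    `crop_prob` is an assumed column name for the cropland model's output."""
    return df[df["crop_prob"] > threshold].copy()
```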
Detailed results here
PROPOSAL:
If generalization to a new region is not happening, the next question is how many samples need to be added to the train set to boost performance. The following experiment was conducted: in a country-LOO setup, maize and not_maize "injections" from the omitted country were added to the train set, and the performance change was measured. The following injection sizes were tested:
[1,10,20,30,40,50,100,200,250,300,350,400,450,500,550,600,650,700,750,800,900,1000,1100,1200]
The choice of sizes to test was somewhat intuitive, and many of them are irrelevant for countries where the number of maize samples is only ~200 points. Equal amounts of maize and not_maize samples were added to the train set.
Of the countries with poor performance defined in the previous comment, the following three are of most interest, since the number of maize samples there exceeds 1.5K points and the injection sizes therefore constitute only a small part of the whole dataset: Argentina, Canada, Ukraine. For all of them, adding samples works extremely well: as little as ~200 points is enough to bring metrics to a decent level.
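One injection step of the experiment can be sketched as follows; the column names (`country`, `label`) and the seed are illustrative assumptions:

```python
import pandas as pd

def loo_with_injection(df, country, n_inject, seed=42):
    """Country-LOO split where `n_inject` maize and `n_inject` not_maize
    samples from the held-out country are moved into the train set."""
    held_out = df[df["country"] == country]
    train = df[df["country"] != country]
    injected = pd.concat([
        held_out[held_out["label"] == lbl].sample(
            n=min(n_inject, (held_out["label"] == lbl).sum()), random_state=seed)
        for lbl in ("maize", "not_maize")
    ])
    # evaluate on the rest of the held-out country
    test = held_out.drop(injected.index)
    return pd.concat([train, injected]), test
```

Sweeping `n_inject` over the size list above and recording the test metric per size reproduces the performance-vs-injection-size curves.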
Following the analysis above, the following actions are proposed:
Remove samples from countries where the end-of-season date doesn't allow capturing a valid vegetation season in a 12-month window: bad_eos_countries = ['BFA','MEX','MLI','KAZ','NGA','UZB','SDN','SEN','DMA']
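The proposed removal is a simple filter; the country codes are copied from the proposal, while `country` as the ISO-3 column name is an assumption:

```python
import pandas as pd

# Countries whose end-of-season date doesn't allow capturing a valid
# vegetation season in a 12-month window (codes from the proposal above).
bad_eos_countries = ['BFA','MEX','MLI','KAZ','NGA','UZB','SDN','SEN','DMA']

def drop_bad_eos(df):
    """Drop all samples from the bad-EOS countries."""
    return df[~df["country"].isin(bad_eos_countries)].copy()
```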
Create the following validation splits:
Sanity check set. This set consists of a random 10% split of all data, stratified by crop_type label and country.
Seasonality capturing set. This set consists of samples with various numbers of detected peaks. F1 scores should be considered with respect to the number of vegetation peaks detected in the NDVI time series (0, 1, >1). Several countries with clear multi-seasonal behavior were selected for this subset; they can be grouped as follows:
1) Egypt, Brazil, Ethiopia: very clear double seasons; croptype datasets are highly concentrated in a small area, so spatial autocorrelation can skew the results of adding "injections"; seemingly high-quality datasets and corresponding valid_dates.
2) Italy: spatially extensive croptype data; double-season behavior is observed in maize only (but it's quite an extensive and important class); the influence of valid_dates is close to negligible, as most of them are June 1st.
3) Rwanda: both single- and double-season maize is present; results are poor for both.
Spatial generalization set. This set consists of several "problematic" regions. Only a small fraction of samples from these regions is left in the main train set: 1-10% of all country samples, stratified by croptype label. They can also be removed from the train set entirely to test performance in a completely new region. F1 scores should be computed per country. Countries for this set were selected as follows: the best crop/no_crop classification model and a 3-fold cross-validation procedure were used to mark every sample with a crop/no_crop label; then F1 scores were computed per country and per dataset; then the countries with the lowest scores were selected (the assumption here is that the labels are of good quality and are just hard to classify, as no extra suspicious labels were used): more_difficult_countries = ['FIN','GRC','IDN','ITA','MAR','MDG','MOZ','PRT','SOM']
Among countries with both high and low F1 scores on different datasets, a few were additionally selected for this set: less_difficult_countries = ['ETH','ESP','AUT','BRA','TZA']
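The sanity check split could be produced like this; the column names (`crop_type`, `country`) are assumed, and this is a pandas-only sketch rather than the actual splitting code:

```python
import pandas as pd

def make_sanity_split(df, frac=0.1, seed=42):
    """Random `frac` split of all data, stratified by crop_type and country."""
    sanity = (df.groupby(["crop_type", "country"], group_keys=False)
                .apply(lambda g: g.sample(frac=frac, random_state=seed)))
    train = df.drop(sanity.index)
    return train, sanity
```

The seasonality and spatial generalization sets would be carved out similarly, keyed on the detected peak count and the country lists above respectively.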
Here's what we start with:
And here are the hypotheses to test and things to do: