Multi class pretraining

gabrieltseng commented 4 months ago

As discussed. All the balancing implemented in #46 is ignored now, and can be removed if we merge this in.

| Task | Model | Head | F1 | Recall | Precision |
| --- | --- | --- | --- | --- | --- |
| Crop vs. non crop | Crop vs. non crop | Random Forest | 0.8661 | 0.8424 | 0.8910 |
| Crop vs. non crop | MultiClass | Random Forest | 0.8552 | 0.8207 | 0.8927 |
| Crop vs. non crop | MultiClass + `valid_date` as token | Random Forest | 0.8555 | 0.8167 | 0.8982 |
| Crop vs. non crop | Crop vs. non crop | Logistic Regression | 0.8581 | 0.9098 | 0.8119 |
| Crop vs. non crop | MultiClass | Logistic Regression | 0.8179 | 0.8764 | 0.7667 |
| Crop vs. non crop | MultiClass + `valid_date` as token | Logistic Regression | 0.8189 | 0.8770 | 0.7681 |
| Crop vs. non crop | Crop vs. non crop | CatBoost | 0.8631 | 0.9076 | 0.8228 |
| Crop vs. non crop | MultiClass | CatBoost | 0.8567 | 0.8992 | 0.8181 |
| Crop vs. non crop | MultiClass + `valid_date` as token | CatBoost | 0.8591 | 0.8982 | 0.8232 |
| Maize | Maize | Random Forest | 0.8068 | 0.7455 | 0.8791 |
| Maize | MultiClass | Random Forest | 0.7548 | 0.6944 | 0.8266 |
| Maize | MultiClass + `valid_date` as token | Random Forest | 0.7663 | 0.7059 | 0.8380 |
| Maize | Maize | Logistic Regression | 0.7276 | 0.9207 | 0.6015 |
| Maize | MultiClass | Logistic Regression | 0.5730 | 0.9126 | 0.4176 |
| Maize | MultiClass + `valid_date` as token | Logistic Regression | 0.6034 | 0.9153 | 0.4500 |
| Maize | Maize | CatBoost | 0.7717 | 0.8912 | 0.6805 |
| Maize | MultiClass | CatBoost | 0.6472 | 0.8970 | 0.5063 |
| Maize | MultiClass + `valid_date` as token | CatBoost | 0.6560 | 0.9005 | 0.5159 |

kvantricht commented 4 months ago

I saw the change in e4d2979 and was wondering about the strategy for handling the valid_month in training vs. inference. It gives us an idea of the timing of the ground truth label, which is sometimes inferred from expert rules as the middle of the target season, and sometimes is an exact observation date that could be the start or end of the season or anything in between. At inference time, we don't have the observation date, only the season which we're trying to map. So would we then again try to feed the "center" month of the season to steer the embeddings to focus on that season?

gabrieltseng commented 4 months ago

> So would we then again try to feed the "center" month of the season to steer the embeddings to focus on that season?

yes this was the flow I imagined, although I imagine it will require some iteration. What do you think?

kvantricht commented 4 months ago

> yes this was the flow I imagined, although I imagine it will require some iteration. What do you think?

I think in general this is a good line of thought. I guess the valid_month in training not necessarily being in the center of the season could be seen as some sort of indirect augmentation, so the model becomes less sensitive to the exact specification of the month as the real center. Already eager to see what feeding the token does to the model! We definitely have to compare results with and without it.

kvantricht commented 4 months ago

tagging @cbutsko so she can have some thoughts about this too!

gabrieltseng commented 4 months ago

One thing to be careful of is whether the valid_date could be a source of leakage in the validation set. It would probably make the most sense to fix it based on what crop we want to identify and what region the point is in.

kvantricht commented 4 months ago

> One thing to be careful of is whether the valid_date could be a source of leakage in the validation set. It would probably make the most sense to fix it based on what crop we want to identify and what region the point is in.

Not entirely sure if I get what you mean, but this could then be our crop calendars. From Phase I we have, for each AEZ, the start and end of season for up to three seasons. Note that in Phase II these will undergo significant changes (both the calendars and the AEZs; this work is ongoing at the University of Valencia), but for now we could work with Phase I data. We know for each sample in which AEZ it is located, so we should overlap the valid_date with the seasons to determine for which season(s) the sample is valid. Then we can get valid_month (probably should find another name) based on those externally described season(s) and feed that as the token instead of the actual valid_date of the sample.

Am I getting it right?
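
A minimal sketch of the lookup described above, for illustration only. The calendar layout (`CROP_CALENDAR`, month-based season bounds), the AEZ ids and the function names are all assumptions made for this example, not the Phase I data format or anything in the repository.

```python
from datetime import date

# Hypothetical crop calendar: AEZ id -> list of (start_month, end_month) seasons.
# Months are 1-12; a season may wrap around the year end (e.g. Nov to Mar).
CROP_CALENDAR = {
    101: [(4, 9)],           # one season, April to September
    202: [(11, 3), (5, 8)],  # two seasons, the first wraps over New Year
}


def season_months(start, end):
    """List the months of a season in order, handling year wrap-around."""
    length = (end - start) % 12 + 1
    return [(start - 1 + i) % 12 + 1 for i in range(length)]


def center_month(start, end):
    """Middle month of a season (the earlier one if the length is even)."""
    months = season_months(start, end)
    return months[(len(months) - 1) // 2]


def token_months(aez_id, valid_date):
    """Center month of every season in the AEZ whose span contains valid_date."""
    return [
        center_month(start, end)
        for start, end in CROP_CALENDAR[aez_id]
        if valid_date.month in season_months(start, end)
    ]
```

An empty result means the valid_date falls outside every season listed for that AEZ; how to handle such samples is left open here.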

gabrieltseng commented 4 months ago

> Not entirely sure if I get what you mean,

I agree with the approach for extracting valid_date at inference time. My concern is about leakage via valid_date between the train and val sets, since (as best as I understand) it tells us when the data point was collected.

kvantricht commented 4 months ago

> > Not entirely sure if I get what you mean,
>
> I agree with the approach for extracting valid_date at inference time. My concern is about leakage via valid_date between the train and val sets, since (as best as I understand) it tells us when the data point was collected.

Indeed, that's why I think for the validation data we need to compute the date inferred from the crop calendars (based on the AEZ in which the validation sample is located). This to me seems to be the fairest way of validating how the approach would work during inference, when valid_date is not known. In the end, we could also do it like this for the training data, using valid_date only to find the growing season(s) we are mapping.
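
To make the proposed protocol concrete, here is a rough sketch of how the month token could be chosen per split. `Sample`, `SEASON_CENTER_MONTH` and `month_token` are hypothetical names invented for this example, and the per-AEZ center months are placeholder values rather than real calendar data.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Sample:
    aez_id: int
    valid_date: date
    split: str  # "train" or "val"


# Hypothetical lookup: AEZ id -> center month of the target season,
# assumed to be precomputed from the crop calendars.
SEASON_CENTER_MONTH = {101: 6, 202: 1}


def month_token(sample, use_calendar_for_train=False):
    """Pick the month fed to the model as a token.

    Validation samples always get the calendar-derived month, mimicking
    inference where valid_date is unknown. Training samples keep their
    observed month unless use_calendar_for_train is set.
    """
    if sample.split == "val" or use_calendar_for_train:
        return SEASON_CENTER_MONTH[sample.aez_id]
    return sample.valid_date.month
```

With this split, the validation token depends only on the AEZ and the target season, so it cannot carry information about when an individual point was collected.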

gabrieltseng commented 3 months ago

Some more todos:

- [ ] Train without location metadata
- [ ] Use the crop calendars to truncate the time series
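
One possible reading of the second todo, sketched under strong assumptions: the time series has one value per month, each timestep's calendar month is known, and the season is given as start and end months. The function names are hypothetical, and the real pipeline may prefer masking over dropping timesteps.

```python
def in_season(month, start, end):
    """True if month (1-12) lies within [start, end], allowing year wrap-around."""
    return (month - start) % 12 <= (end - start) % 12


def truncate_to_season(values, months, start, end):
    """Keep only the timesteps whose calendar month falls inside the season."""
    return [v for v, m in zip(values, months) if in_season(m, start, end)]
```

For example, with twelve monthly values for January to December and a November to March season, only the January, February, March, November and December timesteps are kept, in their original order.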

gabrieltseng commented 1 week ago

Closing - superseded by #71

WorldCereal / presto-worldcereal

Multi class pretraining #49