Closed gabrieltseng closed 1 week ago
I saw the change in e4d2979, and I was wondering about the strategy for dealing with valid_month in training vs. inference. It gives us an idea of the timing of the ground truth label, which is sometimes inferred from expert rules as the middle of the target season, and sometimes is an exact observation date, which could be the start or end of the season or anything in between. At inference time, we don't have the observation date, only the season we're trying to map. So would we then again try to feed the "center" month of the season to steer the embeddings to focus on that season?
So would we then again try to feed the "center" month of the season to steer the embeddings to focus on that season?
yes this was the flow I imagined, although I imagine it will require some iteration. What do you think?
I think in general this is a good line of thought. I guess the valid_month in training not necessarily being in the center of the season could be seen as a sort of indirect augmentation, so the model becomes less sensitive to whether the specified month is the exact center of the season. Already eager to see what feeding the token does to the model! We definitely have to compare results with and without.
tagging @cbutsko so she can have some thoughts about this too!
One thing to be careful of is whether the valid_date could be a source of leakage in the validation set. It would probably make the most sense to fix it based on what crop we want to identify and what region the point is in.
Not entirely sure if I get what you mean, but this could then be our crop calendars. From Phase I we have, for each AEZ, the start and end of season for up to three seasons. Note that in Phase II these will undergo significant changes (both the calendars and the AEZ; this work is ongoing at the University of Valencia), but for now we could work with the Phase I data. We know for each sample in which AEZ it is located, so we could overlap the valid_date with the seasons to determine for which season(s) the sample is valid. Then we can get valid_month (we should probably find another name) based on those externally described season(s) and feed that as the token instead of the actual valid_date of the sample.
Am I getting it right?
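To make the idea above concrete, here is a minimal sketch of how the month token could be derived from crop calendars rather than from the observation date itself. Everything in it is an assumption for illustration: the `CROP_CALENDARS` layout, the season names, the day-of-year values, and the function names are hypothetical and do not reflect the actual Phase I calendars or any code in the repository.

```python
from datetime import date, timedelta

# Hypothetical crop calendar: AEZ id -> {season name: (start DOY, end DOY)}.
# The values are made up for illustration; they are not real Phase I calendars.
CROP_CALENDARS = {
    1: {"season_a": (91, 273), "season_b": (305, 59)},
}


def in_season(doy, start, end):
    """Return True if day-of-year `doy` falls inside the season (inclusive)."""
    if start <= end:
        return start <= doy <= end
    # The season wraps around the end of the year (e.g. November to February).
    return doy >= start or doy <= end


def season_center_month(start, end, year_length=365):
    """Return the calendar month (1-12) of the midpoint of a season."""
    length = (end - start) % year_length
    center_doy = (start - 1 + length // 2) % year_length + 1
    # Convert the day-of-year to a month using a non-leap reference year.
    return (date(2021, 1, 1) + timedelta(days=center_doy - 1)).month


def month_tokens(aez_id, valid_date):
    """Map each season that contains `valid_date` to its center month.

    The observation date is used to select the season(s); the returned
    month comes from the calendar, not from the observation date.
    """
    doy = valid_date.timetuple().tm_yday
    return {
        name: season_center_month(start, end)
        for name, (start, end) in CROP_CALENDARS[aez_id].items()
        if in_season(doy, start, end)
    }
```

With this sketch, a sample observed in late August in the hypothetical AEZ 1 would get the center month of `season_a` as its token, and a sample whose date falls outside every season would get an empty result, which would need a separate decision (drop it, or fall back to the nearest season).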
Not entirely sure if I get what you mean,
I agree with the approach for extracting valid_date at inference time. My concern is about leakage via valid_date between the train and val sets, since (as best as I understand) it tells us when the data point was collected.
Indeed, that's why I think for the validation data we need to compute the date inferred from the crop calendars (based on the AEZ in which the validation sample is located). This seems to me the fairest way of validating how the approach would work during inference, when valid_date is not known. In the end, we could also do it like this for the training data, using valid_date to find the growing season(s) we are mapping.
Some more todos:
Closing - superseded by #71
As discussed. All the balancing implemented in #46 is ignored now, and can be removed if we merge this in.
valid_date as token