WorldCereal / presto-worldcereal


Accept optional date in `process_parquet` for more precise subsetting #95

Open kvantricht opened 2 months ago

kvantricht commented 2 months ago

Currently we take the `end_date` of the training data and go back one year to subset the training time series: https://github.com/WorldCereal/presto-worldcereal/blob/e8d5bbc173c581d197c3810fdcdf3a8768e9bc9a/presto/inference.py#L397-L410

However, if we train a dedicated CatBoost model on a subset of the data for a small AOI, we may benefit from subsetting based on the `start_date` and `end_date` requested by the user (which have to span one year). Can we adapt the method to accept an optional argument, e.g. `end_date`, which, if given, dictates the subsetting of the time series?

We then need to be resilient to training data from different years, and also drop samples that don't fall entirely within the requested time frame (shifted to the year of the sample).
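To make the idea concrete, here is a rough sketch of what the subsetting could look like. It is not the current `process_parquet` code: the function name `subset_by_end_date`, the long-format layout and the column names `sample_id` and `timestamp` are all assumptions for illustration, and it assumes monthly time steps. Without `end_date` it keeps the last 12 months of each sample (today's behaviour); with `end_date` it moves the requested end month into the sample's own year and drops samples that don't cover the full window.

```python
from typing import Optional

import pandas as pd


def subset_by_end_date(
    df: pd.DataFrame, end_date: Optional[str] = None, n_months: int = 12
) -> pd.DataFrame:
    """Hypothetical helper: keep one n_months window per sample.

    Assumes a long-format frame with `sample_id` and `timestamp` columns
    and monthly time steps (both are assumptions, not the repo's schema).
    """
    kept = []
    for _, ts in df.groupby("sample_id"):
        months = ts["timestamp"].dt.to_period("M")
        first, last = months.min(), months.max()
        if end_date is None:
            # Current behaviour: the window ends at the last observation.
            window_end = last
        else:
            # Move the requested end month into the year of this sample.
            requested = pd.Period(end_date, freq="M")
            window_end = pd.Period(year=last.year, month=requested.month, freq="M")
            if window_end > last:
                # Not reached in the sample's last year: try one year earlier.
                window_end -= 12
        window_start = window_end - (n_months - 1)
        # Drop samples that don't fall entirely within the window.
        if window_start < first:
            continue
        kept.append(ts[(months >= window_start) & (months <= window_end)])
    if not kept:
        return df.iloc[0:0]
    return pd.concat(kept, ignore_index=True)


def _make_sample(sample_id: str, start: str, periods: int) -> pd.DataFrame:
    # Toy monthly series used only for the usage example below.
    stamps = pd.date_range(start, periods=periods, freq="MS")
    return pd.DataFrame({"sample_id": sample_id, "timestamp": stamps})


toy = pd.concat(
    [
        _make_sample("a", "2020-01-01", 24),  # two full years
        _make_sample("b", "2018-01-01", 12),  # one calendar year, older
        _make_sample("c", "2021-01-01", 6),   # too short for any window
    ],
    ignore_index=True,
)

default = subset_by_end_date(toy)                        # last 12 months per sample
shifted = subset_by_end_date(toy, end_date="2022-08-31")  # Sep-Aug window
```

In this toy data the requested window ending in August only fits sample `a`; sample `b` covers January to December of a single year, so it is dropped rather than silently truncated. Whether dropping or padding is the right call for real training data is open for discussion.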

kvantricht commented 3 weeks ago

@cbutsko I think we can close this? It's been tackled on the worldcereal-classification side?