BigDataWUR / AgML-CY-Bench

CY-Bench (Crop Yield Benchmark) is a comprehensive dataset and benchmark for forecasting crop yields at the subnational level. CY-Bench standardizes the selection, processing, and spatio-temporal harmonization of public subnational yield statistics with relevant predictors. Contributors include agronomists, climate scientists, and machine learning researchers.
https://cybench.agml.org/

detrending options and criteria for selecting one #82

Open · krsnapaudel opened this issue 6 months ago

krsnapaudel commented 5 months ago

Suggestion from @ritviksahajpal : Use harvest year as a feature.
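For illustration, a minimal sketch of what this could look like; the choice of a scikit-learn random forest is an illustrative assumption, not something the benchmark prescribes:

import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Sketch only: pass the harvest year alongside the environmental predictors so
# the model can learn the location-specific trend itself.
df = pd.read_csv("data/maize/NL/yield_maize_NL.csv")
feature_cols = ["harvest_year"]  # plus environmental predictors when available
model = RandomForestRegressor(n_estimators=200, random_state=42)
model.fit(df[feature_cols], df["yield"])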

WestonAnderson commented 5 months ago

As Ritvik suggested, we can leave the identification of the trend up to the modelers by e.g. including the harvest year as a feature.

Importantly, I'd advocate for including model assessment metrics that have the trend removed. For example, we can remove the observed trend from both the observations and the predictions after the cross-validation predictions are made, then calculate the mean absolute error on the yield anomalies. The reason for doing this is to separate the value the model gains from predicting the location-specific trend from the value added by the environmental variables. If we consider the variance of the absolute values without detrending, the vast majority of the variance in locations with strong trends will be a result of the trend. Which method we use to detrend is of secondary importance; alternatively, as Dilli has done in the past, we can compare against a benchmark linear trend model.
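As a sketch of that evaluation step, assuming cross-validated predictions per region are collected in a DataFrame with year, yield_obs and yield_pred columns (the column names are assumptions) and using a simple linear fit as one possible detrending choice:

import numpy as np

def mae_on_anomalies(df):
    # df: cross-validated predictions for one admin region with columns
    # "year", "yield_obs" and "yield_pred" (column names are assumptions).
    # Remove the same linear trend from observations and predictions,
    # then compute the mean absolute error on the anomalies.
    slope, intercept = np.polyfit(df["year"], df["yield_obs"], deg=1)
    trend = slope * df["year"] + intercept
    obs_anom = df["yield_obs"] - trend
    pred_anom = df["yield_pred"] - trend
    return float(np.mean(np.abs(obs_anom - pred_anom)))

Because the same trend is subtracted from both series, the metric only rewards skill in predicting year-to-year anomalies, not skill in reproducing the trend.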

WestonAnderson commented 5 months ago

I'm looking into the trend question as well because I'm not sure that including the year in the model would prevent information leakage in a leave-one-out cross-validation.

WestonAnderson commented 5 months ago

Gouache et al. (2015) used the year as a covariate in their models, but found vastly different model performance in leave-one-out cross-validation versus when predicting future years in an independent dataset. It sounds like this may be because they violated the cross-validation in their variable selection procedure, but it also raises the question of whether the model uses future information even when the year is included as a predictor. From this, I'd suggest that in addition to the leave-one-out cross-validation we withhold a small number of years from the dataset to evaluate model performance using both approaches. This will be important to assess information leakage due to trends when using different model evaluation metrics.
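One possible way to set this up, where the three-year holdout size and the single "year" column are assumptions for this sketch:

import numpy as np

def split_holdout_years(df, n_holdout=3):
    # Withhold the last n_holdout years entirely from cross-validation so that
    # forward prediction can be compared against leave-one-year-out scores.
    years = np.sort(df["year"].unique())
    holdout_years = years[-n_holdout:]
    return df[~df["year"].isin(holdout_years)], df[df["year"].isin(holdout_years)]

def leave_one_year_out_splits(df):
    # One (train, test) split per remaining year.
    for yr in np.sort(df["year"].unique()):
        yield df[df["year"] != yr], df[df["year"] == yr]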

krsnapaudel commented 1 month ago

As discussed via email, LOO and trend estimation will probably lead to information leakage. If we want to detrend during data preparation, here is an approach that focuses on avoiding data loss. @WestonAnderson, @ritviksahajpal, @mmeronijrc, @mzachow: Please share your thoughts here or via email. Thanks.

import numpy as np
import pandas as pd
from statsmodels.regression.linear_model import OLS
from statsmodels.tools.tools import add_constant
import pymannkendall as trend_mk

def estimate_trend(years, values, target_year):
    """Fit an OLS linear trend to (years, values) and predict its value for target_year."""
    assert len(values) >= 5

    trend_x = add_constant(years)
    linear_trend_est = OLS(values, trend_x).fit()

    pred_x = np.array([target_year]).reshape((1, 1))
    pred_x = add_constant(pred_x, has_constant="add")

    return linear_trend_est.predict(pred_x)

def find_optimal_trend_window(years_values, window_years, extend_forward=False):
    """Search window sizes of 5 to 10 years and return the window with the most
    significant Mann-Kendall trend (lowest p-value), or None if no window shows
    a significant trend."""
    min_p = float("inf")
    opt_trend_years = None
    for i in range(5, min(10, len(window_years)) + 1):
        # should the search window be extended forward, i.e. towards later years
        if extend_forward:
            trend_x = window_years[:i]
        else:
            trend_x = window_years[-i:]

        trend_y = years_values[np.in1d(years_values[:, 0], trend_x)][:, 1]
        result = trend_mk.original_test(trend_y)

        # select based on p-value, lower the better
        if result.h and (result.p < min_p):
            min_p = result.p
            opt_trend_years = trend_x

    return opt_trend_years

def detrend_values(df, value_col="yield"):
    """Estimate a per-year trend for a single admin region and return a data frame
    with the original values, the estimated trend and the residuals."""
    adm_id = df["adm_id"].unique()[0]
    original_values = df[["year", value_col]].values
    trend_values = np.zeros(original_values.shape)
    years = sorted(df["year"].unique())
    for i, sel_yr in enumerate(years):
        lt_sel_yr = [yr for yr in years if yr < sel_yr]
        gt_sel_yr = [yr for yr in years if yr > sel_yr]

        # Case 1: Not enough years to estimate trend
        if (len(lt_sel_yr) < 5) and (len(gt_sel_yr) < 5):
            trend = original_values[:, 1].mean()
        else:
            trend = None
            # Case 2: Estimate trend using years before
            window_years = find_optimal_trend_window(original_values, lt_sel_yr, extend_forward=False)
            if window_years is not None:
                window_values = original_values[np.in1d(original_values[:, 0], window_years)][:, 1]
                trend = estimate_trend(window_years, window_values, sel_yr)[0]
            else:
                # Case 3: Estimate trend using years after
                window_years = find_optimal_trend_window(original_values, gt_sel_yr, extend_forward=True)
                if window_years is not None:
                    window_values = original_values[np.in1d(original_values[:, 0], window_years)][:, 1]
                    trend = estimate_trend(window_years, window_values, sel_yr)[0]

            # Case 4: No significant trend exists
            if trend is None:
                trend = original_values[:, 1].mean()

        trend_values[i, 0] = sel_yr
        trend_values[i, 1] = trend

    trend_cols = ["year", value_col + "_trend"]
    trend_df = pd.DataFrame(trend_values, columns=trend_cols)
    trend_df["adm_id"] = adm_id
    trend_df = trend_df.astype({"year" : int})
    detrended_df = trend_df.merge(df, on=["adm_id", "year"])
    detrended_df[value_col + "_res"] = detrended_df[value_col] - detrended_df[value_col + "_trend"]
    detrended_df = detrended_df[["adm_id", "year", value_col, value_col + "_trend", value_col + "_res"]]

    # print(detrended_df.head(20))
    return detrended_df

yield_df = pd.read_csv("data/maize/NL/yield_maize_NL.csv")
yield_df = yield_df[yield_df["harvest_year"] >= 2001]
yield_df = yield_df.rename(columns={"harvest_year" : "year"})
yield_df = yield_df[["adm_id", "year", "yield"]]
print(len(yield_df.index))

admin_ids = yield_df["adm_id"].unique()
yield_res_df = pd.DataFrame()
for adm_id in admin_ids:
    yield_sel_df = yield_df[yield_df["adm_id"] == adm_id]
    trend_df = detrend_values(yield_sel_df)
    yield_res_df = pd.concat([yield_res_df, trend_df], axis=0)

print(len(yield_res_df.index))
print(yield_res_df.head(20))

mzachow commented 1 month ago

Thanks Dilli! The more I think about it, the more I wonder about the operational usability of a residual model. The MARS bulletins and USDA WASDE reports, for example, compare absolute yield to last year's output or a 3-year average, so that users with different recency biases and levels of anticipation can evaluate whether this year's output translates into a positive or negative yield anomaly. @WestonAnderson I saw you mention the practical usability of residual models in the e-mail discussion. Could you share more information on that?
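For reference, the kind of comparison described there could be computed along these lines (a sketch, assuming one region's records with "year" and "yield" columns):

def add_reference_anomalies(df):
    # Anomalies relative to last year's yield and to the trailing 3-year average,
    # mirroring the comparisons made in MARS bulletins and WASDE reports.
    df = df.sort_values("year").copy()
    df["anom_vs_last_year"] = df["yield"] - df["yield"].shift(1)
    df["anom_vs_3yr_avg"] = df["yield"] - df["yield"].shift(1).rolling(3).mean()
    return df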

If there is little to no operational usability for a residual model, we shouldn't be so concerned about data leakage in the calculation of the target data, right? It's just a "toy problem" that we define to improve residual model development. The better the residuals are separated from the trend, the more we facilitate model development. So detrending across the entire time series by fitting a slope may be acceptable.
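A sketch of that simpler option, fitting one slope over the full series per region (column naming follows the snippet above; note that this uses future information by construction, which is the leakage being discussed):

import numpy as np

def detrend_full_series(df, value_col="yield"):
    # Fit a single linear trend over the entire time series of one admin region
    # and subtract it from the observed values.
    slope, intercept = np.polyfit(df["year"], df[value_col], deg=1)
    df = df.copy()
    df[value_col + "_trend"] = slope * df["year"] + intercept
    df[value_col + "_res"] = df[value_col] - df[value_col + "_trend"]
    return df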

WestonAnderson commented 1 month ago

Residual model: FEWS NET is interested in a residual model for two reasons:

1. They are looking to isolate the effects of climate from other impacts. They use other sources of information to develop expectations about how socioeconomic factors influence yields.
2. They apply a 'deviation from baseline' modeling approach within a wider food security modeling framework (the household economy approach). This modeling effort would require translating even an absolute yield forecast into a deviation from some definition of normal.

The one point I’d make about the trend vs residuals is that we again need to clearly define our interest. If we are specifically interested in the influence of weather on yields, we need to recognize that for an absolute yield model it will be impossible to separate model skill related to the accurate prediction of a non-climate trend from the skill related to predicting the influence of climate. The best model may not include any climate variables whatsoever. If we are simply interested in the best model, that is okay

krsnapaudel commented 1 month ago

@WestonAnderson, @ritviksahajpal, @mmeronijrc, @mzachow: Let's move this discussion to the Discussion Forum: https://github.com/BigDataWUR/AgML-CY-Bench/discussions/292