krsnapaudel opened 6 months ago
Suggestion from @ritviksahajpal : Use harvest year as a feature.
As Ritvik suggested, we can leave the identification of the trend up to the modelers by e.g. including the harvest year as a feature.
Importantly, I'd advocate for including model assessment metrics that have the trend removed. For example, after the cross-validation predictions are made, we can remove the observed trend from both the observations and the predictions, then calculate the mean absolute error on the yield anomalies. The reason for doing this is to separate the value added to the model by the predicted location-specific trend from the value added by the environmental variables. If we consider the variance of the absolute values without detrending, the vast majority of the variance in locations with strong trends will be a result of the trend. Which method we use to detrend is of secondary importance; alternatively, as Dilli has done in the past, we can compare against a benchmark linear trend model.
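To make the variance point concrete, here is a minimal sketch (function names like `detrended_r2` are illustrative, and R² stands in for whichever skill score we settle on): a trend-only model scores well on absolute yields but shows no skill once the observed linear trend is removed from both series.

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Fraction of variance explained: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def detrended_r2(years, y_true, y_pred):
    """Remove the observed linear trend from both observations and
    predictions, then score the anomalies."""
    years = np.asarray(years, dtype=float)
    slope, intercept = np.polyfit(years, np.asarray(y_true, dtype=float), deg=1)
    trend = slope * years + intercept
    return r2_score(np.asarray(y_true) - trend, np.asarray(y_pred) - trend)

rng = np.random.default_rng(42)
years = np.arange(2000, 2020)
# Observed yields: strong linear trend plus weather-driven noise.
y_true = 2.0 + 0.1 * (years - 2000) + rng.normal(0.0, 0.3, years.size)
# A "trend-only" model that ignores the weather entirely.
y_pred = 2.0 + 0.1 * (years - 2000)
print(r2_score(y_true, y_pred))             # high: the trend dominates the variance
print(detrended_r2(years, y_true, y_pred))  # near zero or below: no anomaly skill
```

Note that subtracting the same trend from both series leaves absolute errors (e.g. MAE) per location unchanged; the difference shows up in variance-based scores like the one above, which is exactly the point about trend-dominated variance.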
I'm looking into the trend question as well, because it's not clear to me that including the year in the model would prevent information leakage in a leave-one-out cross-validation.
Gouache et al. (2015) used the year as a covariate in their models, but found vastly different model performance in leave-one-out cross-validation versus predicting future years in an independent dataset. This may be because their variable selection procedure appears to have violated the cross-validation, but it also raises the question of whether the model uses future information even when the year is included as a predictor. I'd therefore suggest that, in addition to the leave-one-out cross-validation, we withhold a small number of years from the dataset and evaluate model performance using both approaches. This will be important for assessing information leakage due to trends under different model evaluation metrics.
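A minimal sketch of that evaluation protocol (function names are illustrative, not from the CY-Bench codebase): hold out the most recent years as an out-of-time test set, and run leave-one-year-out cross-validation on the rest.

```python
import pandas as pd

def split_holdout_years(df, year_col="year", n_holdout=3):
    """Withhold the most recent n_holdout years as an out-of-time test set."""
    test_years = sorted(df[year_col].unique())[-n_holdout:]
    holdout_df = df[df[year_col].isin(test_years)]
    cv_df = df[~df[year_col].isin(test_years)]
    return cv_df, holdout_df

def leave_one_year_out(df, year_col="year"):
    """Generate (train, validation) splits, leaving out one year at a time."""
    for yr in sorted(df[year_col].unique()):
        yield df[df[year_col] != yr], df[df[year_col] == yr]

# Toy frame: two regions, years 2001-2010.
df = pd.DataFrame({
    "adm_id": ["NL11"] * 10 + ["NL12"] * 10,
    "year": list(range(2001, 2011)) * 2,
    "yield": range(20),
})
cv_df, holdout_df = split_holdout_years(df, n_holdout=2)
print(sorted(holdout_df["year"].unique()))   # [2009, 2010]
print(len(list(leave_one_year_out(cv_df))))  # 8 folds
```

Comparing the leave-one-year-out scores on `cv_df` against the scores on `holdout_df` would then quantify how much the CV estimate is inflated by trend-related leakage.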
As discussed via email, LOO and trend estimation will probably lead to information leakage. If we want to detrend during data preparation, here is an approach that focuses on avoiding data loss. @WestonAnderson, @ritviksahajpal, @mmeronijrc, @mzachow: please share your thoughts here or via email. Thanks.
```python
import numpy as np
import pandas as pd
from statsmodels.regression.linear_model import OLS
from statsmodels.tools.tools import add_constant
import pymannkendall as trend_mk


def estimate_trend(years, values, target_year):
    """Fit a linear trend on (years, values) and predict it for target_year."""
    assert len(values) >= 5
    trend_x = add_constant(np.asarray(years))
    linear_trend_est = OLS(values, trend_x).fit()
    pred_x = np.array([target_year]).reshape((1, 1))
    pred_x = add_constant(pred_x, has_constant="add")
    return linear_trend_est.predict(pred_x)


def find_optimal_trend_window(years_values, window_years, extend_forward=False):
    """Search 5-10 year windows for the most significant Mann-Kendall trend.

    Returns the window years, or None if no significant trend is found.
    """
    min_p = float("inf")
    opt_trend_years = None
    for i in range(5, min(10, len(window_years)) + 1):
        # Should the search window be extended forward, i.e. towards later years?
        if extend_forward:
            trend_x = window_years[:i]
        else:
            trend_x = window_years[-i:]
        trend_y = years_values[np.isin(years_values[:, 0], trend_x)][:, 1]
        result = trend_mk.original_test(trend_y)
        # Select based on p-value; lower is better.
        if result.h and (result.p < min_p):
            min_p = result.p
            opt_trend_years = trend_x
    return opt_trend_years


def detrend_values(df, value_col="yield"):
    """Estimate a per-year trend for one admin region and add trend and residual columns."""
    adm_id = df["adm_id"].unique()[0]
    original_values = df[["year", value_col]].values
    trend_values = np.zeros(original_values.shape)
    years = sorted(df["year"].unique())
    for i, sel_yr in enumerate(years):
        lt_sel_yr = [yr for yr in years if yr < sel_yr]
        gt_sel_yr = [yr for yr in years if yr > sel_yr]
        # Case 1: not enough years on either side to estimate a trend.
        if (len(lt_sel_yr) < 5) and (len(gt_sel_yr) < 5):
            trend = original_values[:, 1].mean()
        else:
            trend = None
            # Case 2: estimate trend using years before sel_yr.
            window_years = find_optimal_trend_window(original_values, lt_sel_yr, extend_forward=False)
            if window_years is not None:
                window_values = original_values[np.isin(original_values[:, 0], window_years)][:, 1]
                trend = estimate_trend(window_years, window_values, sel_yr)[0]
            else:
                # Case 3: estimate trend using years after sel_yr.
                window_years = find_optimal_trend_window(original_values, gt_sel_yr, extend_forward=True)
                if window_years is not None:
                    window_values = original_values[np.isin(original_values[:, 0], window_years)][:, 1]
                    trend = estimate_trend(window_years, window_values, sel_yr)[0]
            # Case 4: no significant trend exists; fall back to the mean.
            if trend is None:
                trend = original_values[:, 1].mean()
        trend_values[i, 0] = sel_yr
        trend_values[i, 1] = trend
    trend_cols = ["year", value_col + "_trend"]
    trend_df = pd.DataFrame(trend_values, columns=trend_cols)
    trend_df["adm_id"] = adm_id
    trend_df = trend_df.astype({"year": int})
    detrended_df = trend_df.merge(df, on=["adm_id", "year"])
    detrended_df[value_col + "_res"] = detrended_df[value_col] - detrended_df[value_col + "_trend"]
    detrended_df = detrended_df[["adm_id", "year", value_col, value_col + "_trend", value_col + "_res"]]
    return detrended_df


yield_df = pd.read_csv("data/maize/NL/yield_maize_NL.csv")
yield_df = yield_df[yield_df["harvest_year"] >= 2001]
yield_df = yield_df.rename(columns={"harvest_year": "year"})
yield_df = yield_df[["adm_id", "year", "yield"]]
print(len(yield_df.index))

admin_ids = yield_df["adm_id"].unique()
yield_res_df = pd.DataFrame()
for adm_id in admin_ids:
    yield_sel_df = yield_df[yield_df["adm_id"] == adm_id]
    trend_df = detrend_values(yield_sel_df)
    yield_res_df = pd.concat([yield_res_df, trend_df], axis=0)
print(len(yield_res_df.index))
print(yield_res_df.head(20))
```
Thanks Dilli! The more I think about it, the more I wonder about the operational usability of a residual model. The MARS bulletins and USDA WASDE reports, for example, compare absolute yield to last year's output or a 3-year average, so that users with different recency biases and levels of anticipation can evaluate whether this year's output translates into a positive or negative yield anomaly. @WestonAnderson, I saw you mention the practical usability of residual models in the e-mail discussion. Could you share more information on that?
If there is little to no operational usability for a residual model, we shouldn't be so concerned about data leakage in the calculation of the target data, right? It's just a "toy problem" that we define to improve residual model development. The better the residuals are separated from the trend, the more we facilitate model development. So detrending across the entire time series by fitting a slope may be acceptable.
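For reference, the whole-series slope fit mentioned above is only a few lines (a sketch, assuming one linear fit per region; `detrend_full_series` is an illustrative name, not a CY-Bench function):

```python
import numpy as np
import pandas as pd

def detrend_full_series(df, value_col="yield"):
    """Fit one linear trend per adm_id over the full time series and keep
    the residuals. Note: this uses future years when estimating the trend
    for past years, so it leaks information in a forecasting setting."""
    parts = []
    for adm_id, g in df.groupby("adm_id"):
        slope, intercept = np.polyfit(g["year"], g[value_col], deg=1)
        trend = slope * g["year"] + intercept
        parts.append(g.assign(**{value_col + "_trend": trend,
                                 value_col + "_res": g[value_col] - trend}))
    return pd.concat(parts, axis=0)

# Toy example: a perfectly linear series leaves (numerically) zero residuals.
df = pd.DataFrame({"adm_id": ["NL11"] * 6,
                   "year": range(2001, 2007),
                   "yield": [5.0, 5.2, 5.4, 5.6, 5.8, 6.0]})
out = detrend_full_series(df)
print(out["yield_res"].abs().max())  # ~0
```

Whether the leakage it introduces matters then depends entirely on whether the residual target is treated as a research construct or an operational product.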
Residual model: FEWS NET is interested in a residual model for two reasons: 1. They are looking to isolate the effects of climate from other impacts; they use other sources of information to develop expectations about how socioeconomic factors influence yields. 2. They apply a 'deviation from baseline' modeling approach within a wider food security modeling framework (the household economy approach), which would require translating even an absolute yield forecast into a deviation from some definition of normal.
The one point I'd make about trend vs. residuals is that we again need to clearly define our interest. If we are specifically interested in the influence of weather on yields, we need to recognize that for an absolute yield model it will be impossible to separate model skill related to accurately predicting a non-climate trend from skill related to predicting the influence of climate; the best model may not include any climate variables whatsoever. If we are simply interested in the best model, that is okay.
@WestonAnderson, @ritviksahajpal, @mmeronijrc, @mzachow: Let's move this discussion to the Discussion Forum: https://github.com/BigDataWUR/AgML-CY-Bench/discussions/292