Overfitting Due to Trend Feature

The model overfits because it includes the trend feature, which is computed from the entire dataset, including the target variable. The resulting performance metrics are misleading: the model effectively "memorizes" the target through this feature instead of learning generalizable patterns.

The trend feature is calculated on the full dataset, including the target variable (cleaned_cases), before the dataset is split into training and test sets. The model is then trained on these features, so information about the test-set targets has already leaked into its inputs. Consequently, the model predicts almost perfectly, because the trend and seasonal_strength features correlate closely with the target variable.

[Figure: relation between the trend feature and the cleaned_cases target variable]

Note: The trend feature is essentially a processed version of the target variable cleaned_cases.
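The note above can be illustrated with a small sketch. The issue does not show how the repository computes trend, so a centered moving average stands in for it here as an assumption, and the cleaned_cases series is synthetic data made up for the example. Only NumPy is assumed.

```python
import numpy as np

# Synthetic weekly case counts (made up for illustration, not the real data):
# slow growth + yearly seasonality + noise.
rng = np.random.default_rng(0)
n_weeks = 200
t = np.arange(n_weeks)
cleaned_cases = 100 + 0.5 * t + 30 * np.sin(2 * np.pi * t / 52) + rng.normal(0, 5, n_weeks)

# Assumed stand-in for the trend feature: a centered moving average
# computed over the whole series, i.e. using past AND future target values.
window = 7
half = window // 2
trend = np.convolve(cleaned_cases, np.ones(window) / window, mode="valid")

# mode="valid" drops `half` points at each end, so align the target with it.
aligned_target = cleaned_cases[half:n_weeks - half]

# Correlation between the "feature" and the value it is supposed to predict.
leak_corr = np.corrcoef(trend, aligned_target)[0, 1]
print(f"correlation between trend and cleaned_cases: {leak_corr:.3f}")
```

A correlation close to 1 means the feature is little more than a smoothed copy of the target, so any model given this feature can reproduce cleaned_cases without learning anything that transfers to unseen data.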

Problem

The trend feature is derived from the entire dataset, including the target variable (cleaned_cases), which causes data leakage.
This leads to overfitting: the model recovers the target values too easily, so its performance metrics are optimistically biased.

Impact

The current preprocessing approach leads to:

Overfitting, as the model relies heavily on features derived from the target variable.
Biased performance metrics that misrepresent the model's ability to generalize to new data.

Once this issue is addressed, the model will provide more reliable and meaningful predictions.
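One possible way to address the leakage, sketched under assumptions rather than taken from the repository: build the trend feature from past observations only, and split the data chronologically. The helper past_only_trend, the window of 7, and the 80/20 split are hypothetical choices for illustration, and the series is again synthetic. The sketch assumes a one-step-ahead setting, where earlier values of cleaned_cases are known at prediction time.

```python
import numpy as np

def past_only_trend(values, window=7):
    """Trailing mean of the previous `window` observations.

    The value at index i uses only values[:i], never values[i] or later,
    so the feature cannot contain the target it is paired with.
    (Hypothetical helper, not part of the repository.)
    """
    values = np.asarray(values, dtype=float)
    trend = np.full(values.shape, np.nan)  # index 0 has no history
    for i in range(1, len(values)):
        trend[i] = values[max(0, i - window):i].mean()
    return trend

# Synthetic weekly case counts (made up for illustration).
rng = np.random.default_rng(0)
t = np.arange(200)
cleaned_cases = 100 + 0.5 * t + 30 * np.sin(2 * np.pi * t / 52) + rng.normal(0, 5, 200)

trend_lagged = past_only_trend(cleaned_cases)

# Chronological split: earlier rows for training, later rows for testing.
split = int(0.8 * len(cleaned_cases))
X_train, y_train = trend_lagged[1:split], cleaned_cases[1:split]  # skip the NaN row
X_test, y_test = trend_lagged[split:], cleaned_cases[split:]

# Sanity check: changing the target from week 100 onward must not change
# any feature value up to and including week 100.
perturbed = cleaned_cases.copy()
perturbed[100:] += 1000
unchanged = np.allclose(
    past_only_trend(perturbed)[:101], trend_lagged[:101], equal_nan=True
)
print("features independent of current and future targets:", unchanged)
```

With a feature built this way, a near-perfect test score can no longer come from the feature copying the target, so the metrics give a fairer picture of how the model generalizes. If a full decomposition is preferred instead, the same principle applies: fit it on the training portion only.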

Ashfinn / Diarrhea-Prediction-Model
