Ashfinn / Diarrhea-Prediction-Model

This notebook presents a comprehensive analysis and machine learning prediction model based on datasets from five divisions in Bangladesh.

Overfitting Due to Trend Feature #1

Open DryBoss opened 4 days ago

DryBoss commented 4 days ago

The model is overfitting because the trend feature is derived from the entire dataset, including the target variable. The resulting performance metrics are misleading: the model effectively "memorizes" a feature that already encodes the target instead of learning generalizable patterns.

The trend feature is calculated from the full dataset, including the target variable (cleaned_cases), before the dataset is split into training and test sets. The model is then trained with the trend and seasonal_strength features, and it predicts almost perfectly because both correlate closely with the target variable.

[Figure: Relation between the trend feature and the cleaned_cases target variable]

Note: The trend feature is essentially a processed version of the target variable cleaned_cases.
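A quick way to check the note above is to measure how strongly a trend-style feature correlates with the series it was derived from. The sketch below is only an illustration: the notebook's actual trend calculation is not shown in this issue, so a centered moving average stands in for it, and the synthetic `cases` series, `centered_moving_average`, and `pearson` are all hypothetical names rather than code from the repository.

```python
import math


def centered_moving_average(values, window):
    """Smooth a series with a centered window (shrinks at the edges).

    Assumption: this stands in for the notebook's trend feature. Because the
    window is centered, each value also depends on *later* observations,
    which is how test-period targets can leak into the feature.
    """
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


# Synthetic monthly series (made-up data): upward drift plus a yearly cycle.
cases = [10 + 0.5 * i + 5 * math.sin(2 * math.pi * i / 12) for i in range(120)]
trend = centered_moving_average(cases, window=12)

r = pearson(trend, cases)
print(f"correlation between trend and target: {r:.3f}")
```

A correlation close to 1 between a feature and the target it was computed from is a strong hint that the feature is carrying the answer into the model.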


Problem


Suggested Solution

  1. Remove the trend feature entirely, OR
  2. Calculate these features (trend and seasonal_strength) on the training set only, so that no information about the test set's target values reaches the model.
  3. Reassess model performance after applying either correction to obtain a realistic evaluation of its predictive power.
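Option 2 could look roughly like the sketch below: split chronologically first, fit the trend on the training portion only, then extrapolate that fitted trend over the test period. This is a sketch under stated assumptions, not the notebook's method. A least-squares line is swapped in as the trend, the sample numbers are made up, and `chronological_split`, `fit_linear_trend`, and `trend_feature` are hypothetical helpers.

```python
def chronological_split(values, train_fraction=0.8):
    """Split a time-ordered sequence without shuffling."""
    cut = int(len(values) * train_fraction)
    return values[:cut], values[cut:]


def fit_linear_trend(train_values):
    """Least-squares line through (0, y0), (1, y1), ... from training data only."""
    n = len(train_values)
    if n < 2:
        raise ValueError("need at least two training points")
    mean_t = (n - 1) / 2
    mean_y = sum(train_values) / n
    sxx = sum((t - mean_t) ** 2 for t in range(n))
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in enumerate(train_values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_t
    return slope, intercept


def trend_feature(slope, intercept, start, length):
    """Evaluate the fitted trend at time steps start .. start + length - 1."""
    return [intercept + slope * t for t in range(start, start + length)]


# Made-up values standing in for the cleaned_cases column.
cleaned_cases = [12, 15, 14, 18, 21, 19, 24, 26, 25, 30]

train, test = chronological_split(cleaned_cases)
slope, intercept = fit_linear_trend(train)  # test targets are never seen here

train_trend = trend_feature(slope, intercept, 0, len(train))
test_trend = trend_feature(slope, intercept, len(train), len(test))
```

Because the trend parameters come from `train` alone, changing the test targets leaves `test_trend` unchanged, which is the property the leaked feature lacks. Metrics recomputed on features built this way should give the more realistic evaluation that step 3 asks for.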

Impact

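The current preprocessing approach leads to overfitting and misleading performance metrics, because features derived from the target before the split leak test-set information into training.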

Addressing this issue will allow the model to provide more reliable and meaningful predictions.

Ashfinn commented 4 days ago

Thanks for the feedback. You can create a PR mentioning this issue with all the fixes.