WFP-VAM / HRM

High Resolution Mapping of Food Security
https://wfp-vam.github.io/HRM/
MIT License

Scale, Normalize or Standardize? #14

Open pasquierjb opened 6 years ago

pasquierjb commented 6 years ago

At the moment the features are standardized before the evaluation loops (mean removal and division by the standard deviation) with the following in master.py:

data_features = (data_features - data_features.mean()) / data_features.std()

They are then also normalized (mean removal and division by the l2-norm) inside each cross-validation fold with the following in modeller.py:

model = Ridge(normalize=True)
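As an aside, the two rescalings are closely related: for a mean-centered column, the l2-norm equals sqrt(n_samples) times the (population) standard deviation, so they differ only by a constant factor per feature, and a constant rescaling of X is exactly the kind of change that shifts the effective Ridge penalty. A quick check on synthetic data (illustrative only, not from the repo):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)            # a synthetic feature column
x_centered = x - x.mean()
# l2-norm of a centered column == sqrt(n) * std (with ddof=0)
print(np.linalg.norm(x_centered))   # these two prints agree
print(np.sqrt(x.size) * x.std())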

This is not optimal: the features end up re-scaled twice, with two different transformations. Strangely, for some configs (2000, for example), removing the normalization in the Ridge regression has a large impact on the results (R2 drops from 20% to 0%)!

A way to implement more complex transformations within each cross-validation fold is to use scikit-learn's Pipeline class. For example, to perform scaling (between 0 and 1) followed by Ridge, we would do:

from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# the scaler is re-fit on the training split of every fold
pipeline = make_pipeline(MinMaxScaler(), Ridge())
scores = cross_val_score(pipeline, X, y)

However, my attempts to combine normalization and Ridge in a pipeline have led to very different results compared to using the normalize=True argument of the Ridge regression...
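A plausible explanation for the mismatch (my reading of the scikit-learn docs, to be double-checked): the Normalizer transformer rescales each sample (row) to unit norm, whereas Ridge(normalize=True) centers each feature (column) and divides it by its l2-norm, so the two are not interchangeable. And since the l2-norm of a centered column is sqrt(n_samples) times its std, normalize=True should be reproducible with a StandardScaler pipeline if alpha is rescaled accordingly. A sketch, assuming dense X and the default fit_intercept=True (the helper name is mine):

from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def ridge_like_normalize_true(alpha, n_samples):
    # StandardScaler divides by std; normalize=True divides by
    # sqrt(n_samples) * std. Shrinking X inflates the coefficients
    # and hence the effective penalty, so compensate by scaling
    # alpha up by n_samples.
    return make_pipeline(StandardScaler(), Ridge(alpha=alpha * n_samples))

pipeline = ridge_like_normalize_true(alpha=1.0, n_samples=len(X))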

pasquierjb commented 5 years ago

@lorenzori I changed the standardization (dividing by the std) of the features to a max normalization (dividing by the max) to fix the problem of outliers in the features. The impact on R2 in Mali was minimal, but this does not solve the problem of applying a different re-scaling between the evaluation and the scoring set.
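For what it's worth, one way to keep the max-style rescaling while guaranteeing that the training and scoring sets get exactly the same transformation would be to move it inside the pipeline, so the scaling factors are learned on the training fold only and reused on the held-out fold. A minimal sketch, assuming non-negative features so that MaxAbsScaler (division by the per-feature maximum absolute value) matches the max normalization described above; X and y are placeholders:

from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MaxAbsScaler

# the per-feature max is learned on the training fold only,
# then the same factors are applied to the held-out fold
pipeline = make_pipeline(MaxAbsScaler(), Ridge())
scores = cross_val_score(pipeline, X, y)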