pasquierjb opened 6 years ago
@lorenzori I changed the standardization of the features (dividing by the std) to a max normalization (dividing by the max) to mitigate the problem of outliers in the features. The impact on R2 in Mali was minimal, but this does not solve the problem of applying different re-scalings to the evaluation and the scoring sets.
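For reference, a minimal sketch of the two rescalings on a toy DataFrame (the column names and values here are made up, not the project's actual features):

```python
import pandas as pd

# Hypothetical features; f1 contains an outlier (100.0).
df = pd.DataFrame({"f1": [1.0, 2.0, 100.0], "f2": [3.0, 4.0, 5.0]})

# Standardization: sensitive to the outlier through both mean and std.
standardized = (df - df.mean()) / df.std()

# Max normalization: every value ends up in [-1, 1].
max_normalized = df / df.abs().max()
```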
At the moment the features are standardized before the evaluation loops (mean removal and division by the standard deviation) with the following:
data_features = (data_features - data_features.mean()) / data_features.std()
in master.py. And they are also normalized (mean removal and division by the l2-norm) in each cross-validation fold with the following:
model = Ridge(normalize=True)
in modeller.py. This is not optimal because:
Strangely, for some configs (2000, for example), removing the normalization in the Ridge regression strongly impacts the results (R2 drops from 20% to 0%)!
A possibility for implementing more complex transformations inside each cross-validation fold is to use sklearn's Pipeline class. For example, to perform scaling (between 0 and 1) and Ridge, we would do:
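Something along these lines (a sketch with synthetic data standing in for the actual feature matrix and target):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-ins for the real data_features / target.
rng = np.random.RandomState(0)
X = rng.rand(100, 5) * 100.0
y = X @ rng.rand(5) + rng.randn(100)

# The scaler is re-fitted inside each CV fold on the training split only,
# so the validation fold cannot leak into the scaling parameters.
model = Pipeline([
    ("scaler", MinMaxScaler()),  # rescales each feature to [0, 1]
    ("ridge", Ridge()),
])
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
```

The key benefit over pre-scaling the whole dataset is that the same fitted scaler can later be applied, unchanged, to the scoring set.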
However, my attempts to combine normalization and Ridge in a pipeline have led to very different results compared to using the normalize=True argument of the Ridge regression...
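One likely explanation for the discrepancy, sketched below with synthetic data: sklearn's preprocessing Normalizer rescales each sample (row) to unit norm, whereas Ridge's normalize=True centered each feature (column) and divided it by its l2-norm before fitting. These are different transformations, so putting a Normalizer in a Pipeline is not a drop-in replacement.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

rng = np.random.RandomState(0)
X = rng.rand(50, 3)

# Normalizer acts on ROWS: each sample ends up with unit l2-norm.
row_normed = Normalizer().fit_transform(X)

# Ridge(normalize=True) acted on COLUMNS: each feature was centered
# and divided by the l2-norm of the centered column.
X_centered = X - X.mean(axis=0)
col_normed = X_centered / np.linalg.norm(X_centered, axis=0)
```

If the goal is to reproduce normalize=True inside a Pipeline, the column-wise transform above is the one to mimic, not the Normalizer.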