3.7-predicting-house-prices: Potential contamination of the validation data

In section 3.7, the numerical predictors are centered and scaled (separately for the training and test data) outside the k-fold cross-validation loop. Only afterward is the training data split further into training and validation subsets when k-fold cross-validation is set up.

I believe all data-dependent transformations should be performed within the cross-validation loop. (See this blog post for additional information.)

While this may not affect performance in this case, performing data-dependent transformations outside the cross-validation loop is potentially dangerous: statistics such as the mean and standard deviation are then computed on the full training data, including the samples that later serve as each fold's validation subset, so information leaks across the cross-validation folds. (See e.g. this example using the Boston housing dataset.)

As the audience of this book includes beginners in the field of machine learning, it would be good to point out this potential pitfall (or, even better, to move this step into the cross-validation loop).
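To illustrate what moving the step into the loop could look like, here is a minimal sketch. It is not the notebook's code: the function name `k_fold_standardized` is hypothetical, the data is synthetic rather than the Boston housing set, and the fold slicing (dropping any remainder samples) is an assumption modeled on a plain contiguous k-fold split. The point is only that the mean and standard deviation are computed from each fold's training subset and then reused for that fold's validation subset.

```python
import numpy as np


def k_fold_standardized(data, targets, k=4):
    """Yield (x_train, y_train, x_val, y_val) for each of k contiguous folds.

    Hypothetical helper: centering and scaling statistics are computed
    from the training subset of each fold only.
    """
    num_val = len(data) // k  # remainder samples are never used for validation
    for i in range(k):
        start, stop = i * num_val, (i + 1) * num_val

        # Validation subset for this fold.
        x_val = data[start:stop]
        y_val = targets[start:stop]

        # Everything else is the training subset for this fold.
        x_train = np.concatenate([data[:start], data[stop:]], axis=0)
        y_train = np.concatenate([targets[:start], targets[stop:]], axis=0)

        # Statistics come from the fold's training subset only.
        mean = x_train.mean(axis=0)
        std = x_train.std(axis=0)
        std[std == 0] = 1.0  # guard against constant columns

        # The same statistics are applied to the validation subset.
        yield (x_train - mean) / std, y_train, (x_val - mean) / std, y_val


# Synthetic stand-in for the real predictors (assumption, not the book's data).
rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=3.0, size=(40, 13))
targets = rng.normal(size=40)

folds = list(k_fold_standardized(data, targets, k=4))
```

Inside the loop, each fold's model would be trained on `x_train` and evaluated on `x_val`. The validation subset is then scaled with statistics it did not contribute to, which mirrors how unseen data would be treated.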

fchollet / deep-learning-with-python-notebooks, issue #24