fchollet / deep-learning-with-python-notebooks

Jupyter notebooks for the code samples of the book "Deep Learning with Python"
MIT License

3.7-predicting-house-prices: Potential contamination of the validation data #24

Open tomsing1 opened 6 years ago

tomsing1 commented 6 years ago

In section 3.7, the numerical predictors are centered and scaled (separately for the training and test data) outside the k-fold cross-validation loop. Only afterward, when the k-fold cross-validation is set up, is the training data split further into training and validation subsets.

I believe the recommended practice is to perform all data-dependent transformations within the cross-validation loop. (See this blog post for additional information.)

While this may not affect performance in this case, performing data-dependent transformations outside the cross-validation loop is potentially dangerous. It allows information learned from the full training data to leak across the cross-validation folds, because each validation fold is scaled with statistics that were computed partly from that fold itself. (See e.g. this example using the Boston housing dataset.)

As the audience of this book includes beginners in the field of machine learning, it would be good to point out this potential pitfall (or, even better, to move this step into the cross-validation loop).
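For illustration, here is a minimal sketch of what moving the scaling step into the loop could look like. It is not the notebook's code: the helper name `kfold_normalized_splits` is hypothetical, the data is synthetic (a random array with 13 features standing in for the housing predictors), and model training is left out. The point is only that the mean and standard deviation are computed from the training portion of each fold and then applied to that fold's validation portion.

```python
import numpy as np


def kfold_normalized_splits(data, targets, k=4):
    """Yield (train_x, train_y, val_x, val_y) for each fold.

    Hypothetical helper: the scaling statistics are computed from the
    training portion of the fold only, so nothing from the validation
    portion leaks into them.
    """
    num_val = len(data) // k
    for i in range(k):
        # Contiguous validation block for fold i; everything else is training.
        val_idx = np.arange(i * num_val, (i + 1) * num_val)
        train_mask = np.ones(len(data), dtype=bool)
        train_mask[val_idx] = False

        train_x, val_x = data[train_mask], data[val_idx]

        # Statistics from the training portion only.
        mean = train_x.mean(axis=0)
        std = train_x.std(axis=0)
        std[std == 0] = 1.0  # guard against constant columns

        yield (
            (train_x - mean) / std,
            targets[train_mask],
            (val_x - mean) / std,  # validation scaled with training statistics
            targets[val_idx],
        )


# Synthetic stand-in data (assumption: 100 samples, 13 numeric features).
rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=3.0, size=(100, 13))
targets = rng.normal(size=100)

folds = list(kfold_normalized_splits(data, targets, k=4))
```

Each fold's training portion ends up with zero mean and unit variance per feature, while the validation portion generally does not, which is what you would expect when it has not contributed to the statistics. In a real run, the model would be built and fit inside the same loop on each yielded split.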

yf704475209 commented 6 years ago

Appreciate your point! It's really helpful!