cs109 / a-2017

Public Repository for cs109a, 2017 edition
http://cs109.github.io/a-2017

Standardization of test data in Lab 6 should use training mean and standard deviation #11

Open covuworie opened 6 years ago

covuworie commented 6 years ago

Observed behavior

Hi, there are bugs in classification-and-pca-lab.ipynb (https://github.com/cs109/a-2017/blob/master/Labs/Lab6_Classification_PCA/classification-and-pca-lab.ipynb) for Lab 6, in the do_classify and classify_from_dataframe functions. When standardizing the testing data, the test set's own mean and standard deviation are used. This is incorrect for several reasons:

  • No information from the testing data should be used in fitting or prediction, as that is a form of data snooping; it contaminates the testing dataset.
  • The training and testing transformations are different, so the standardized test features are not the same variables the model was trained on.

Expected behavior

The training data mean and standard deviation should be used for standardizing the testing data like so:

# In do_classify (cell 18):
dftest = (subdf.iloc[itest] - subdf.iloc[itrain].mean()) / subdf.iloc[itrain].std()

# In classify_from_dataframe (cell 20):
Xte = (subdf.iloc[itest] - subdf.iloc[itrain].mean()) / subdf.iloc[itrain].std()

I think this was mentioned in one of the earlier lectures, and here are some more references:

- https://stats.stackexchange.com/questions/202287/why-standardization-of-the-testing-set-has-to-be-performed-with-the-mean-and-sd
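For comparison, here is a minimal sketch of the same fix using scikit-learn's StandardScaler, which stores the training parameters and re-applies them to the test split (the DataFrame below is a made-up stand-in for the lab's subdf):

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Stand-in for the lab's feature DataFrame (values are made up).
rng = np.random.default_rng(0)
subdf = pd.DataFrame(rng.normal(size=(354, 3)), columns=["f1", "f2", "f3"])

itrain, itest = train_test_split(range(subdf.shape[0]), train_size=0.6)

scaler = StandardScaler()
# fit_transform learns the mean/std from the training rows only...
Xtr = scaler.fit_transform(subdf.iloc[itrain])
# ...and transform re-applies those same training parameters to the test rows.
Xte = scaler.transform(subdf.iloc[itest])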

pavlosprotopapas commented 6 years ago

Is it true, though? I never managed to convince myself that it is not right to use the mean and std of the whole dataset. It is obvious that we should not use the test set mean and std, but I never managed to prove that using the whole dataset is harmful (and I have never seen a proof anywhere); it seems to be an accepted precaution. On the contrary, I have many examples where normalizing/standardizing on the training set and applying that to the rest can lead to many problems. Think of a large dataset where the train (and test) sets are just a small subset. Pavlos
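As a rough illustration of both halves of that point (a hypothetical simulation, not from the lab): when the training split is large, its statistics are nearly identical to the whole-dataset statistics, so the choice barely matters; when the training split is just a small subset, its statistics are noisy estimates, which is the failure mode described above.

import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=100_000)

big_train = data[:70_000]  # a generous training split
small_train = data[:50]    # train as "just a small subset"

# The big split's estimates nearly coincide with the full data's;
# the small split's estimates are visibly noisier.
print("full data  :", data.mean(), data.std(ddof=1))
print("big train  :", big_train.mean(), big_train.std(ddof=1))
print("small train:", small_train.mean(), small_train.std(ddof=1))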


covuworie commented 6 years ago

Hi Pavlos,

Thanks for the response. Am I missing something here? As you say, "It is obvious that we should not use the test set mean and std". However, that is precisely the bug I am reporting (note the use of the itest indices): it is what is being done in cell 18, in the do_classify function:

itrain, itest = train_test_split(range(subdf.shape[0]), train_size=train_size)
if standardize:
    # the training set is standardized with its own (training) statistics -- fine
    dftrain=(subdf.iloc[itrain] - subdf.iloc[itrain].mean())/subdf.iloc[itrain].std()
    # BUG: the test set is standardized with the test set's own statistics
    dftest=(subdf.iloc[itest] - subdf.iloc[itest].mean())/subdf.iloc[itest].std()

The same is also done in cell 20 in the classify_from_dataframe function.

Now, regarding whether it is correct to use the mean and std deviation of the whole dataset: as Sebastian Raschka says in his article on this topic:

'Note that in practice, if the dataset is sufficiently large, we wouldn’t notice any substantial difference between the scenarios 1-3 because we assume that the samples have all been drawn from the same distribution.'

In this case there are only 212 observations in the training set and 142 in the test set, which is not a lot (especially compared with 63 predictors).

I think the main point the various authors are making is one of data leakage / data snooping when the entire dataset's mean and std are used. The example used in the article mentioned above makes a lot of sense:

'Again, why Scenario 3? The reason is that we want to pretend that the test data is “new, unseen data.” We use the test dataset to get a good estimate of how our model performs on any new data. Now, in a real application, the new, unseen data could be just 1 data point that we want to classify. (How do we estimate mean and standard deviation if we have only 1 data point?) That’s an intuitive case to show why we need to keep and use the training data parameters for scaling the test set.'
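To make the one-data-point case concrete, here is a small sketch (X_train and x_new are hypothetical, not from the lab):

import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training data standing in for the lab's training split.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 3))

scaler = StandardScaler().fit(X_train)  # mean/std learned at training time

x_new = np.array([[0.7, 1.3, -0.2]])  # a single new, unseen observation
# A sample std of one point is undefined (and its mean is the point itself),
# so the only sensible choice is to re-use the stored training parameters:
print(scaler.transform(x_new))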

Yes, I agree that in practice it may not make much of a difference compared to using the training set mean and standard deviation, if the sample size is large and the observations are drawn independently from the same distribution. And yes, we could check this before deciding. But why even take the chance?

I think the answer to this question provides a great explanation and also links to further reputable resources which discuss the issue:

https://stats.stackexchange.com/questions/174823/how-to-apply-standardization-normalization-to-train-and-testset-if-prediction-i

Chuk