Test data leakage in step 05a (PyTorch and TensorFlow)

The two notebooks 05a - Deep Neural Networks (PyTorch).ipynb and 05a - Deep Neural Networks (TensorFlow).ipynb contain the following piece of code:

     # The dataset is too small to be useful for deep learning
     # So we'll oversample it to increase its size
     for i in range(1,3):
         penguins = penguins.append(penguins)

This creates a new dataframe that contains four copies of each row of the original dataframe. Since this happens before the training/test split, the probability that a given row of the original dataframe ends up in both the training and the test set is approximately 0.75. With a 70/30 split, for instance, the chance that all four copies land on the same side is only about 0.7^4 + 0.3^4 ≈ 0.25. In other words, one can expect about 3/4 of the original rows to be present in both sets.
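The estimate can be checked with a short, standard-library-only sketch. Everything in it is an assumption made for illustration: the 70/30 split, the 300 synthetic row ids standing in for the penguin rows, and the helper names `prob_in_both` and `simulate`. It is not code from the notebooks.

```python
import random


def prob_in_both(copies: int, test_frac: float) -> float:
    """Chance that the copies of one row land in both sets, treating each
    copy's assignment as independent (an approximation of a shuffled split)."""
    train_frac = 1.0 - test_frac
    return 1.0 - train_frac ** copies - test_frac ** copies


def simulate(n_rows=300, copies=4, test_frac=0.30, trials=200, seed=0):
    """Average fraction of original rows that appear on both sides of a
    random split of the duplicated data."""
    rng = random.Random(seed)
    # One id per original row, repeated once per copy (hypothetical data).
    ids = [i for i in range(n_rows) for _ in range(copies)]
    n_test = round(len(ids) * test_frac)
    total = 0.0
    for _ in range(trials):
        rng.shuffle(ids)
        test_ids = set(ids[:n_test])
        train_ids = set(ids[n_test:])
        total += len(test_ids & train_ids) / n_rows
    return total / trials


analytic = prob_in_both(copies=4, test_frac=0.30)
simulated = simulate()
print(f"analytic: {analytic:.4f}, simulated: {simulated:.4f}")
```

Under these assumptions both numbers should come out close to 0.75; a different split ratio would change the value.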

This constitutes a leakage of information from the test set into the training set, so the test set can no longer assess how well the trained model generalizes. In the case of the penguin toy dataset, this does not matter much: the three species appear to be well separated in feature space, so overfitting is not an immediate concern. Still, mixing training and test data is bad practice and should not be taught to ML beginners.
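One way to make the leak visible is to count how many test rows have an identical twin in the training set. The sketch below uses only the standard library and synthetic rows; the helper names `split` and `leaked_fraction`, the 70/30 ratio, and the data are hypothetical stand-ins, not the notebooks' code.

```python
import random


def split(rows, test_frac, rng):
    """Shuffle a copy of the rows and return (train, test)."""
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_frac)
    return shuffled[n_test:], shuffled[:n_test]


def leaked_fraction(train, test):
    """Fraction of test rows that also occur verbatim in the training set."""
    train_set = set(train)
    return sum(row in train_set for row in test) / len(test)


rng = random.Random(0)

# Synthetic stand-in for the original rows; the id makes every row unique.
original = [(i, round(rng.uniform(30.0, 60.0), 2)) for i in range(300)]

# Duplicate first, then split: this mirrors the order used in the issue.
duplicated = original * 4
train_d, test_d = split(duplicated, 0.30, rng)

# Split the original rows without duplicating them.
train_o, test_o = split(original, 0.30, rng)

leak_dup = leaked_fraction(train_d, test_d)
leak_orig = leaked_fraction(train_o, test_o)
print(f"duplicated before split: {leak_dup:.2%} of test rows also in training")
print(f"no duplication:          {leak_orig:.2%} of test rows also in training")
```

With unique rows and no duplication the second figure is zero by construction, while in the duplicated case most test rows should have a copy in the training set.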

I therefore suggest removing the piece of code shown above. Since the model is then no longer exposed to multiple copies of each row in one epoch of training, the number of epochs has to be increased to achieve the same test set accuracy. Training for 100 instead of 50 epochs worked well in my tests.
