BME1478H / Fall2022class


Assignment 11 - Incorrect ML Preprocessing Procedure #748

Open Jerrryyy opened 11 months ago

Jerrryyy commented 11 months ago

Hi,

For Assignment 11, the .ipynb scales the data before the train-test split (screenshot below). However, this is incorrect: scaling and centering should be done after splitting, with the scaler fit only on the training set (scaler.fit_transform(X_train)). The parameters derived from the training set should then be applied to the test set (scaler.transform(X_test)) to prevent data leakage, which biases the model evaluation. The test set should be treated as completely new/unseen data to the model; otherwise the test score no longer reflects how well the model generalizes.
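To make the order concrete, here is a minimal sketch of split-then-scale. It assumes scikit-learn's StandardScaler and train_test_split; the synthetic data, array shapes, and random seeds are placeholders I made up, not the assignment's actual dataset or code.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: the assignment's real dataset is not shown here).
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))  # feature matrix, uppercase
y = rng.integers(0, 2, size=100)                   # label vector, lowercase

# 1. Split first, so the test rows play no part in preprocessing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# 2. Fit the scaler on the training set only, then transform it.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# 3. Reuse the training mean and standard deviation on the test set.
X_test_scaled = scaler.transform(X_test)
```

The scaled training columns end up with mean 0 and unit variance; the scaled test columns generally will not, which is expected, since the test set is being treated as unseen data.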

image

Also wanted to bring up a super minor nitpick about variable naming conventions. I believe ML and linear algebra typically keep X uppercase and y lowercase, since X is a matrix while y is (often) a vector.

Thank you for the fun semester so far, Jerry

RayNele commented 11 months ago

Yeah, that's correct. For anyone else interested, you can read about this data leakage phenomenon in the scikit-learn documentation itself.
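For reference, one common way to avoid this kind of leakage in scikit-learn is to wrap the scaler and the model in a pipeline, so the scaler is refit on the training portion of every split. This is only a sketch: the synthetic data and the choice of LogisticRegression are assumptions for illustration, not what the assignment uses.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption, not the assignment's dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
y = (X[:, 0] + 0.5 * rng.normal(size=120) > 0).astype(int)

# The pipeline fits the scaler on each training fold only and
# applies those fitted parameters to the matching held-out fold.
model = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(model, X, y, cv=5)  # one accuracy score per fold
```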

I think we kept it simple for the sake of the "intro" aspect of the assignment. Learning what ML does and the concept of training and testing sets is complicated enough for students who touched Python for the first time 8 weeks ago.