CBICA / CaPTk

Cancer Imaging Phenomics Toolkit (CaPTk) is a software platform to perform image analysis and predictive modeling tasks. Documentation: https://cbica.github.io/CaPTk
https://www.cbica.upenn.edu/captk

[TrainingModule] Investigate approaches to feature Z-scoring for cross-validation #1377

Open AlexanderGetka-cbica opened 3 years ago

AlexanderGetka-cbica commented 3 years ago

Is your feature request related to a problem? Please describe.

Currently, the z-scoring (or, as it's referred to in the code, feature scaling) of features is performed as soon as the features are loaded from disk. That is, the user-specified feature file is loaded into memory and then immediately transformed: each feature is scaled according to its mean and standard deviation, as calculated from all samples of that feature.
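
Roughly, the current flow amounts to something like the following (a minimal NumPy sketch purely for illustration; CaPTk's actual implementation is in C++ and the names here are made up):

```python
import numpy as np

# Feature matrix as loaded from the user's feature file:
# rows = samples, columns = features.
features = np.array([
    [2.0, 10.0],
    [4.0, 20.0],
    [6.0, 30.0],
    [8.0, 40.0],
])

# Current behavior: z-score each feature using the mean and standard
# deviation computed over ALL samples, before any train/holdout split.
mean = features.mean(axis=0)
std = features.std(axis=0)
scaled = (features - mean) / std
```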

But when we perform cross-validation, this raises a question: doesn't scaling across the whole set introduce data leakage? Since the z-scores are calculated from the whole set of samples, the scaled training set includes information about the holdout samples (especially so if the holdout has significantly different values).

Describe the solution you'd like

What I think makes the most sense: when splitting the dataset for cross-validation, compute the mean/std-dev information from only the training set, then scale the holdout/testing set according to the training-set statistics before continuing with training and prediction. I think this is more consistent with how independent training and testing would work, much like testing your trained model on completely unknown data.
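
A sketch of that proposed flow (scikit-learn is used here purely for illustration; this is not CaPTk code, and the data and names are placeholders):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))  # toy feature matrix: 20 samples, 3 features

for train_idx, test_idx in KFold(n_splits=5).split(X):
    scaler = StandardScaler()
    # Fit the scaler (per-feature mean/std) on the training fold only...
    X_train = scaler.fit_transform(X[train_idx])
    # ...then reuse those training-fold statistics to scale the holdout fold,
    # so no information from the holdout samples leaks into the scaling.
    X_test = scaler.transform(X[test_idx])
    # ...continue with training on X_train and prediction on X_test.
```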

Describe alternatives you've considered

Keep it as it is now. OR: scale the training set and testing set wholly independently... but then their scaled values do not actually represent the same data, as in this simple example:
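
Made-up values for feature X, sketched in NumPy purely for illustration:

```python
import numpy as np

# Hypothetical raw values for feature X, chosen so the effect is obvious.
x_train = np.array([2.0, 4.0, 6.0])     # mean 4, sample std 2
x_hold = np.array([20.0, 40.0, 60.0])   # mean 40, sample std 20

# Scale each set wholly independently, each using its own mean and std.
z_train = (x_train - x_train.mean()) / x_train.std(ddof=1)
z_hold = (x_hold - x_hold.mean()) / x_hold.std(ddof=1)

print(z_train[-1], z_hold[-1])  # both 1.0: raw 6 and raw 60 get the same z-score
```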

After z-scoring, the raw value of 60 for X in the holdout set would become the z-score "1"... but a raw value of 6 for X in the training set would also become z-score "1". I am not certain about this, but it seems like this would generate a meaningless result.

Additional context

It was noted during discussion that, at some level, there does need to be normalization of unknown testing data to the training data when it comes to e.g. image intensities, to avoid bias from institutional or scanner-specific effects. But this is (or should be) handled by preprocessing tools (z-scoring normalization, N3/N4 bias correction) that should(?) be run before feature extraction.

We need to do some experimentation to figure out what works best.

AlexanderGetka-cbica commented 3 years ago

Pinging @sbakas @sarthakpati for any further thoughts on this based on today's discussion.

For now, I want to leave this as is to avoid any complications on the algorithmic side, but longer term this should be addressed.