CBICA / CaPTk

Cancer Imaging Phenomics Toolkit (CaPTk) is a software platform to perform image analysis and predictive modeling tasks. Documentation: https://cbica.github.io/CaPTk
https://www.cbica.upenn.edu/captk

[TrainingModule] Investigate approaches to feature Z-scoring for cross-validation #1377

Open AlexanderGetka-cbica opened 3 years ago

AlexanderGetka-cbica commented 3 years ago

Is your feature request related to a problem? Please describe.

Currently, the z-scoring (or, as it's referred to in the code, feature scaling) of features is performed as soon as the features are loaded from disk. That is, the user-specified feature file is loaded into memory and then immediately transformed: each feature is scaled according to its mean and standard deviation, as calculated from all samples of that feature.
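
Roughly, the current flow amounts to something like the following (a minimal NumPy sketch purely for illustration; CaPTk's actual implementation is in C++ and the names here are made up):

```python
import numpy as np

# Feature matrix as loaded from the user's feature file:
# rows = samples, columns = features.
features = np.array([
    [2.0, 10.0],
    [4.0, 20.0],
    [6.0, 30.0],
    [8.0, 40.0],
])

# Current behavior: z-score each feature using the mean and standard
# deviation computed over ALL samples, before any train/holdout split.
mean = features.mean(axis=0)
std = features.std(axis=0)
scaled = (features - mean) / std
```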

But when we perform cross-validation, this raises a question: doesn't scaling across the whole set introduce data leakage? Since the z-scores are calculated from the whole set of samples, the scaled training set includes information about the holdout samples (especially so if the holdout has significantly different values).

Describe the solution you'd like

What I think makes the most sense: when splitting the dataset for cross-validation, compute the mean/std-dev information from only the training set, then scale the holdout/testing set according to the training-set statistics before continuing with training and prediction. I think this is more consistent with how independent training and testing would work, much like testing your trained model on completely unknown data.
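
A sketch of that proposed flow (scikit-learn is used here purely for illustration; this is not CaPTk code, and the data and names are placeholders):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))  # toy feature matrix: 20 samples, 3 features

for train_idx, test_idx in KFold(n_splits=5).split(X):
    scaler = StandardScaler()
    # Fit the scaler (per-feature mean/std) on the training fold only...
    X_train = scaler.fit_transform(X[train_idx])
    # ...then reuse those training-fold statistics to scale the holdout fold,
    # so no information from the holdout samples leaks into the scaling.
    X_test = scaler.transform(X[test_idx])
    # ...continue with training on X_train and prediction on X_test.
```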

Describe alternatives you've considered

Keep it as it is now. OR: scale the training set and testing set wholly independently... but then their scaled values do not actually represent the same data, as in this simple example:
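
Made-up values for feature X, sketched in NumPy purely for illustration:

```python
import numpy as np

# Hypothetical raw values for feature X, chosen so the effect is obvious.
x_train = np.array([2.0, 4.0, 6.0])     # mean 4, sample std 2
x_hold = np.array([20.0, 40.0, 60.0])   # mean 40, sample std 20

# Scale each set wholly independently, each using its own mean and std.
z_train = (x_train - x_train.mean()) / x_train.std(ddof=1)
z_hold = (x_hold - x_hold.mean()) / x_hold.std(ddof=1)

print(z_train[-1], z_hold[-1])  # both 1.0: raw 6 and raw 60 get the same z-score
```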

After z-scoring, the raw value of 60 for X in the holdout set would become the z-score "1"... but a raw value of 6 for X in the training set would also become z-score "1". I am not certain about this, but it seems like this would generate a meaningless result.

Additional context

It was noted during discussion that, at some level, there does need to be normalization of unknown testing data to the training data when it comes to e.g. image intensities, to avoid bias from institutional or scanner-specific effects. But this is (or should be) handled by preprocessing tools (z-scoring normalization, N3/N4 bias correction) that should(?) be run before feature extraction.

We need to do some experimentation to figure out what works best.

AlexanderGetka-cbica commented 3 years ago

Pinging @sbakas @sarthakpati for any further thoughts on this based on today's discussion.

For now, I want to leave this as is to avoid any complications on the algorithmic side, but longer term this should be addressed.