Open MikeLippincott opened 2 months ago
This is something that I can implement. Do I need approval prior to opening a PR?
Thanks for adding this issue @MikeLippincott ! No special approval required before opening a PR, please feel free to propose changes and open when ready.
On the enhancement development: based on how you outlined the issue, would it be possible to add a test that checks for the leakage you mentioned? I imagine this would help prove the new capabilities and also make sure future changes don't reintroduce leakage. Totally open to your thoughts here (this isn't a requirement).
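One possible shape for such a test, as a rough sketch only: perturb the held-out rows and check that the transformed training rows do not change. Everything here is hypothetical and stdlib-only for illustration (`normalize_then_split`, `split_then_normalize`, `train_output_depends_on_test`, and the z-score normalization are stand-ins, not this project's API).

```python
from statistics import mean, pstdev


def _standardize(values, mu, sigma):
    # Z-score with a guard for constant features.
    return [(v - mu) / sigma if sigma > 0 else 0.0 for v in values]


def normalize_then_split(values, n_train):
    """Leaky order: statistics are computed on all rows before splitting."""
    mu, sigma = mean(values), pstdev(values)
    scaled = _standardize(values, mu, sigma)
    return scaled[:n_train], scaled[n_train:]


def split_then_normalize(values, n_train):
    """Leak-free order: statistics come from the training rows only."""
    train, test = values[:n_train], values[n_train:]
    mu, sigma = mean(train), pstdev(train)
    return _standardize(train, mu, sigma), _standardize(test, mu, sigma)


def train_output_depends_on_test(pipeline, values, n_train, shift=100.0):
    """Return True if perturbing held-out rows changes the training output."""
    baseline_train, _ = pipeline(values, n_train)
    perturbed = values[:n_train] + [v + shift for v in values[n_train:]]
    perturbed_train, _ = pipeline(perturbed, n_train)
    return baseline_train != perturbed_train


def test_no_leakage():
    values = [1.0, 2.0, 3.0, 4.0, 10.0, 20.0]
    # Fitting on the training split only: held-out rows cannot influence it.
    assert not train_output_depends_on_test(split_then_normalize, values, n_train=4)
    # Sanity check that the detector actually flags the leaky ordering.
    assert train_output_depends_on_test(normalize_then_split, values, n_train=4)
```

The second assertion is there so the test cannot pass vacuously: if the detector stopped flagging the leaky ordering, the test would fail.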
related to #154
Feature type
[X] Add new functionality
[X] Change existing functionality
General description of the proposed functionality
This functionality is multi-part: Currently, machine learning data splits are performed after normalization and feature selection. This poses a potential for data leakage into the models, because the normalization statistics and the selected features are derived from samples that later end up in the validation and testing splits. The proposed fix is to implement functions that perform data splits prior to normalization. To implement this, the normalization and feature selection would need to be fit on the training split and then propagated to the validation, testing, [*holdout] splits. I envision most of this functionality needing to be carried out by the user. To do so, I suggest updating the normalize and feature selection functions and implementing two more functions:
Feature example
change function: def normalize(*args, save_xform: bool = False, **kwargs): normalize_function_magic_that_is_happening
new function: def apply_xform(*args, **kwargs): apply the saved xform to the test data splits here
change function: def feature_select(*args, save_selected_feature_list: bool = False, **kwargs): feature selection magic
new function: def apply_selected_features(*args, **kwargs): apply the selected features from the train split to the test split
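To make the intended flow concrete, here is a minimal stdlib-only sketch of fitting on the training split and propagating to another split. It is not the library's implementation: `apply_xform` and `apply_selected_features` borrow the names proposed above, while `fit_standardizer`, `select_features`, the z-score normalization, and the variance-threshold selection are assumptions chosen purely for illustration.

```python
from statistics import mean, pstdev


def fit_standardizer(train_rows, features):
    """Estimate per-feature mean and std from the training split only."""
    return {
        feat: (mean(r[feat] for r in train_rows), pstdev(r[feat] for r in train_rows))
        for feat in features
    }


def apply_xform(rows, params):
    """Standardize rows with previously saved parameters (no refitting)."""
    out = []
    for row in rows:
        new_row = dict(row)
        for feat, (mu, sigma) in params.items():
            new_row[feat] = (row[feat] - mu) / sigma if sigma > 0 else 0.0
        out.append(new_row)
    return out


def select_features(train_rows, features, min_std=1e-8):
    """Keep features whose training-split std exceeds min_std."""
    return [f for f in features if pstdev(r[f] for r in train_rows) > min_std]


def apply_selected_features(rows, selected, keep=()):
    """Subset rows to the metadata columns in `keep` plus the selected features."""
    return [{k: row[k] for k in (*keep, *selected)} for row in rows]


# Toy data: "well" is metadata, "f1"/"f2" are features; "f2" is constant.
train = [{"well": "A1", "f1": 1.0, "f2": 5.0}, {"well": "A2", "f1": 3.0, "f2": 5.0}]
test = [{"well": "B1", "f1": 5.0, "f2": 5.0}]
features = ["f1", "f2"]

params = fit_standardizer(train, features)        # fit on train only
train_norm = apply_xform(train, params)
test_norm = apply_xform(test, params)             # reuse the saved transform

selected = select_features(train_norm, features)  # -> ["f1"]
train_final = apply_selected_features(train_norm, selected, keep=("well",))
test_final = apply_selected_features(test_norm, selected, keep=("well",))
# test_final -> [{"well": "B1", "f1": 3.0}]
```

The point of the design is that the test split never contributes to `params` or `selected`; it only receives them.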
Alternative Solutions
No response
Additional information
No response