cytomining / pycytominer

Python package for processing image-based profiling data
https://pycytominer.readthedocs.io
BSD 3-Clause "New" or "Revised" License

FeatureRequest: Machine Learning Data Split Functionality #427

Open MikeLippincott opened 2 months ago

MikeLippincott commented 2 months ago

Feature type

General description of the proposed functionality

This functionality is multi-part. Currently, machine learning data splits are performed after normalization and feature selection, which creates a potential for data leakage into the models. The proposed fix is to implement functions that perform data splits prior to normalization. To do this, normalization and feature selection would need to be fit on the training split and then propagated to the validation, testing, and [*holdout] splits. I envision most of this functionality needing to be carried out by the user. To support it, I suggest updating the normalize and feature selection functions and implementing two more functions:

  1. Normalize function: Save the transformation performed on the training data
  2. Implement a normalize transform function to propagate the saved transformation to testing data splits
  3. Feature select function: Save the feature selected columns list
  4. Implement a feature selection column propagation to each test data split.
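Steps 1 and 2 can be sketched roughly as follows. This is a minimal illustration, not pycytominer's API: the names `fit_standardize` and `apply_standardize` are hypothetical, plain lists of dicts stand in for profile DataFrames to keep the example dependency-free, and standardization (mean and standard deviation) is assumed as the normalization method.

```python
from statistics import mean, pstdev


def fit_standardize(train, features):
    """Learn per-feature mean and standard deviation from the training split only."""
    params = {}
    for f in features:
        values = [row[f] for row in train]
        params[f] = (mean(values), pstdev(values))
    return params


def apply_standardize(rows, params):
    """Apply previously learned parameters to any split without refitting."""
    out = []
    for row in rows:
        new_row = dict(row)
        for f, (mu, sigma) in params.items():
            # Guard against a feature with zero variance in the training split.
            new_row[f] = (row[f] - mu) / sigma if sigma > 0 else 0.0
        out.append(new_row)
    return out


# Hypothetical toy data: "well" is metadata, "area" is a feature.
train = [{"well": "A01", "area": 1.0}, {"well": "A02", "area": 3.0}]
test = [{"well": "B01", "area": 5.0}]

params = fit_standardize(train, ["area"])      # fit on train only
train_norm = apply_standardize(train, params)  # transform train
test_norm = apply_standardize(test, params)    # reuse the same parameters on test
```

The key design point is that the test split never contributes to `params`; it is only ever transformed with values learned from the training split.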

Feature example

change function:

def normalize(*args, save_xform: bool = False, **kwargs):
    # existing normalization logic happens here
    if save_xform:
        # save the transform (probably as a numpy array);
        # depending on the method it could be parameters too (mean, std)
        ...

new function:

def apply_xform(*args, **kwargs):
    # apply the saved xform to the test data splits here
    ...

change function:

def feature_select(*args, save_selected_feature_list: bool = False, **kwargs):
    # existing feature selection logic happens here
    if save_selected_feature_list:
        # save the list of selected features to apply to another dataset
        ...

new function:

def apply_selected_features(*args, **kwargs):
    # apply the features selected on the train split to the test split
    ...
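As a rough sketch of how the saved feature list could be propagated, assuming a simple variance threshold as the selection criterion (the real feature selection operations may differ), with `select_features`, `min_variance`, and the `metadata` argument being hypothetical and plain lists of dicts standing in for profile DataFrames:

```python
from statistics import pvariance


def select_features(train, features, min_variance=1e-8):
    """Return the features whose training-split variance exceeds a threshold."""
    return [
        f for f in features
        if pvariance([row[f] for row in train]) > min_variance
    ]


def apply_selected_features(rows, selected, metadata=("well",)):
    """Keep only metadata columns and the features chosen on the training split."""
    keep = list(metadata) + list(selected)
    return [{k: row[k] for k in keep} for row in rows]


# Hypothetical toy data: "flat" is constant in train, so it is dropped.
train = [
    {"well": "A01", "area": 1.0, "flat": 7.0},
    {"well": "A02", "area": 3.0, "flat": 7.0},
]
test = [{"well": "B01", "area": 5.0, "flat": 9.0}]

selected = select_features(train, ["area", "flat"])
train_sel = apply_selected_features(train, selected)
test_sel = apply_selected_features(test, selected)  # same columns as train
```

Because the column list is decided on the training split alone, the test split ends up with exactly the same feature columns, regardless of how those features behave in the test data.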

Alternative Solutions

No response

Additional information

No response

MikeLippincott commented 2 months ago

This is something that I can implement. Do I need approval prior to opening a PR?

d33bs commented 2 months ago

Thanks for adding this issue @MikeLippincott ! No special approval required before opening a PR, please feel free to propose changes and open when ready.

On the enhancement development: based on how you outlined the issue, would it be possible to add a test which checks for the leakage you mentioned? I imagine this would help prove the new capabilities, in addition to making sure future changes don't reintroduce leakage. Totally open to your thoughts here (this isn't a requirement).
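One possible shape for such a leakage check, sketched under assumptions: perturb the test split and confirm that the fitted normalization parameters do not change. All function names here are hypothetical, `fit_pooled` is included only as a deliberately leaky baseline for contrast, and plain lists of dicts stand in for profile DataFrames.

```python
from statistics import mean, pstdev


def fit_train_only(train, test, feature):
    """Leak-free fit: parameters come from the training split alone."""
    values = [row[feature] for row in train]
    return mean(values), pstdev(values)


def fit_pooled(train, test, feature):
    """Leaky baseline: parameters are computed from train and test together."""
    values = [row[feature] for row in train + test]
    return mean(values), pstdev(values)


def parameters_depend_on_test(fit, train, test, feature, shift=100.0):
    """Return True if shifting the test values changes the fitted parameters."""
    shifted = [{**row, feature: row[feature] + shift} for row in test]
    return fit(train, test, feature) != fit(train, shifted, feature)


train = [{"area": 1.0}, {"area": 3.0}]
test = [{"area": 5.0}]

leak_free = not parameters_depend_on_test(fit_train_only, train, test, "area")
leaky = parameters_depend_on_test(fit_pooled, train, test, "area")
```

A check like this could run in the test suite against whatever fit function the PR introduces, so that a later change that accidentally pools splits would fail the test.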

gwaybio commented 2 months ago

related to #154