Problem:
What it says on the tin: right now Data Managers are not very configurable, and a number of assumptions about what they can and cannot do are baked into the StudyManager.
This greatly limits what they can do, and how easily they can be extended going forward.
Proposed Solution:
To resolve this:
The base DataManager will be rewritten to explicitly require only bare-bones functionality (sampling, loading, and configuration). It will also provide two hooks by default:
Pre-split: Functions which should be fit (if applicable) and applied to the whole dataset before a train-test split occurs.
Post-split: Functions which should be fit only on the training data (if applicable), but applied to both the training and testing data.
DataManager subclasses now only need to denote how they should be configured, how samples should be accessed/loaded, and how to load the data based on the configuration.
All other functionality will be delegated to Pythonic mixins, which DataManager subclasses use to denote any capability beyond the bare-bones abilities detailed above.
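The split between a bare-bones base class and the two hook stages could look roughly like the sketch below. It is only an illustration of the idea, not the code in this PR: the `DataHook` interface, the method names (`load`, `get_samples`, `run_pre_split`, `run_post_split`), and the toy `ListDataManager`, `DropNone`, and `CenterOnMean` classes are all assumptions made for the example.

```python
from abc import ABC, abstractmethod


class DataHook(ABC):
    """Hypothetical hook interface: optionally fit, then apply."""

    def fit(self, data):
        # Stateless hooks have nothing to fit; stateful hooks override this.
        return self

    @abstractmethod
    def apply(self, data):
        """Return a transformed copy of the data."""


class DataManager(ABC):
    """Bare-bones base: configuration, loading, and sample access only."""

    def __init__(self, config, pre_split_hooks=(), post_split_hooks=()):
        self.config = config
        self.pre_split_hooks = list(pre_split_hooks)
        self.post_split_hooks = list(post_split_hooks)
        self.data = self.load(config)

    @abstractmethod
    def load(self, config):
        """Load the data described by the configuration."""

    @abstractmethod
    def get_samples(self, indices):
        """Return the samples at the given positions."""

    def run_pre_split(self):
        # Pre-split: fit on, and apply to, the whole dataset.
        for hook in self.pre_split_hooks:
            self.data = hook.fit(self.data).apply(self.data)

    def run_post_split(self, train, test):
        # Post-split: fit on the training data only, apply to both partitions.
        for hook in self.post_split_hooks:
            hook.fit(train)
            train, test = hook.apply(train), hook.apply(test)
        return train, test


class ListDataManager(DataManager):
    """Toy subclass backed by a plain list (illustration only)."""

    def load(self, config):
        return list(config["values"])

    def get_samples(self, indices):
        return [self.data[i] for i in indices]


class DropNone(DataHook):
    """Stateless hook: removes missing entries."""

    def apply(self, data):
        return [x for x in data if x is not None]


class CenterOnMean(DataHook):
    """Stateful hook: subtracts the mean learned during fit."""

    def fit(self, data):
        self.mean = sum(data) / len(data)
        return self

    def apply(self, data):
        return [x - self.mean for x in data]


manager = ListDataManager(
    {"values": [1.0, None, 3.0, 5.0]},
    pre_split_hooks=[DropNone()],
    post_split_hooks=[CenterOnMean()],
)
manager.run_pre_split()
train, test = manager.get_samples([0, 1]), manager.get_samples([2])
train, test = manager.run_post_split(train, test)
```

Because `CenterOnMean` is fit on the training samples alone, the test sample is shifted by the training mean and never contributes to it, which is the point of keeping the post-split stage separate.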
Changes Required:
For the initial commit which will integrate these changes, only one subclass and a handful of feature mixins will be implemented:
TabularDataManager: A DataManager subclass which handles tabular data, managed by Pandas on the backend
MultiFeatureMixin: Denotes that a manager handles multiple features, and can select/drop them as needed
TransformableMixin: Denotes that the data in the manager can be transformed without affecting its integrity
ImputableMixin: Denotes that a manager can have missing values imputed. It is itself a subclass of TransformableMixin
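As a rough sketch of how the mixins above might compose, the example below uses a dict of column lists as a stand-in for the Pandas-backed data. The method names (`select_features`, `drop_features`, `transform`, `impute`) and the `ColumnManager` class are hypothetical and chosen only for illustration; the mixin names and the ImputableMixin/TransformableMixin relationship come from the list above.

```python
class MultiFeatureMixin:
    """Capability marker: the manager holds several named features."""

    def select_features(self, names):
        self.data = {name: self.data[name] for name in names}
        return self

    def drop_features(self, names):
        dropped = set(names)
        self.data = {k: v for k, v in self.data.items() if k not in dropped}
        return self


class TransformableMixin:
    """Capability marker: the manager's data may be transformed."""

    def transform(self, fn):
        # fn receives a feature name and its column, and returns a new column.
        self.data = {name: fn(name, col) for name, col in self.data.items()}
        return self


class ImputableMixin(TransformableMixin):
    """Capability marker: missing values (None here) may be imputed."""

    def impute(self, fill_values):
        def fill(name, column):
            if name not in fill_values:
                return column
            return [fill_values[name] if v is None else v for v in column]

        return self.transform(fill)


class ColumnManager(MultiFeatureMixin, ImputableMixin):
    """Toy manager; a dict of lists stands in for a Pandas DataFrame."""

    def __init__(self, data):
        self.data = {name: list(col) for name, col in data.items()}


manager = ColumnManager({"age": [30, None, 50], "id": [1, 2, 3]})
manager.drop_features(["id"]).impute({"age": 40})

# Capabilities can be queried with ordinary isinstance checks.
can_impute = isinstance(manager, ImputableMixin)
can_transform = isinstance(manager, TransformableMixin)
```

Since ImputableMixin subclasses TransformableMixin, any manager that declares itself imputable is also reported as transformable, so callers only need a single `isinstance` check per capability.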
A number of data hooks will also be provided:
FeatureNullityCheck: Check and drop features whose null content is over a certain threshold.
SampleNullityCheck: Check and drop samples whose null content is over a certain threshold.
SimpleCategoricalImputation: Imputes categorical data in the dataset, by default via the mode.
SimpleContinuousImputation: Imputes continuous data in the dataset, by default via the mean.
ZNormStandardize: Standardizes a dataset so that all features have a mean of 0 and a standard deviation of 1.
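To make the behaviour of two of these hooks concrete, here is a minimal sketch over a dict of column lists (again a stand-in for the Pandas data). The function and class names, the `threshold` default of 0.5, the use of `None` for missing values, and the choice of population standard deviation are all assumptions for the example, not the PR's actual signatures or defaults.

```python
from statistics import fmean, pstdev


def feature_nullity_check(columns, threshold=0.5):
    """Drop features whose fraction of missing (None) values exceeds threshold."""
    return {
        name: col
        for name, col in columns.items()
        if sum(v is None for v in col) / len(col) <= threshold
    }


class ZNormStandardizeSketch:
    """Standardize each feature using statistics learned during fit."""

    def fit(self, columns):
        # Guard against constant features: a zero deviation is replaced by 1.0.
        self.stats = {
            name: (fmean(col), pstdev(col) or 1.0)
            for name, col in columns.items()
        }
        return self

    def apply(self, columns):
        return {
            name: [(v - self.stats[name][0]) / self.stats[name][1] for v in col]
            for name, col in columns.items()
        }


raw = {
    "height": [1.0, 3.0, 5.0, 7.0],
    "notes": [None, None, None, "x"],  # 75% missing, dropped at threshold 0.5
}

# Pre-split stage: the nullity check sees the whole dataset.
cleaned = feature_nullity_check(raw)

# Post-split stage: fit on the training rows only, apply to both partitions.
train = {name: col[:2] for name, col in cleaned.items()}
test = {name: col[2:] for name, col in cleaned.items()}
scaler = ZNormStandardizeSketch().fit(train)
train_z, test_z = scaler.apply(train), scaler.apply(test)
```

Only the training partition ends up with a mean of 0 and a standard deviation of 1; the test partition is scaled with the training statistics, so its values fall wherever those statistics place them.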