I have implemented imputation of missing values using ML models.
The following additional options are now available in the healthcareai.common.transformers.DataFrameImputer class:
imputeStrategy : string, default='MeanMode'
Selects the technique used to impute missing values.
If imputeStrategy = 'MeanMode'
Imputation uses the column mean (numeric columns) or mode (categorical columns).
If imputeStrategy = 'RandomForest'
Imputation uses ML models (RandomForestClassifier/RandomForestRegressor).
tunedRandomForest : boolean, default=False
If set to True, the RandomForestClassifier/RandomForestRegressor used for imputation is tuned using grid search and K-fold cross validation.
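The idea behind the 'RandomForest' strategy and the tuning flag can be sketched as follows. This is a minimal illustration using scikit-learn directly, not the code in this PR: the helper name impute_with_forest, its arguments, the parameter grid and the fold count are all assumptions made for the example, and it assumes the feature columns are numeric and complete.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import GridSearchCV


def impute_with_forest(df, target, categorical=False, tuned=False):
    """Fill missing values of `target` from the other columns.

    Hypothetical helper for illustration only; not part of healthcareai.
    Assumes every other column is numeric and has no missing values.
    """
    features = [c for c in df.columns if c != target]
    known = df[target].notna()
    if known.all():
        return df.copy()

    # Classifier for categorical targets, regressor for continuous ones.
    model_cls = RandomForestClassifier if categorical else RandomForestRegressor
    model = model_cls(n_estimators=50, random_state=0)

    if tuned:
        # Grid search over a small, assumed grid; cv=3 requests 3 folds.
        param_grid = {"n_estimators": [25, 50], "max_depth": [None, 3]}
        model = GridSearchCV(model, param_grid, cv=3)

    # Train on rows where the target is observed, then predict the rest.
    model.fit(df.loc[known, features], df.loc[known, target])
    result = df.copy()
    result.loc[~known, target] = model.predict(df.loc[~known, features])
    return result


# Small synthetic example: y depends on x1 and x2, with three gaps.
rng = np.random.default_rng(0)
demo = pd.DataFrame({"x1": rng.normal(size=60), "x2": rng.normal(size=60)})
demo["y"] = 2 * demo["x1"] + demo["x2"]
demo.loc[[3, 17, 42], "y"] = np.nan

filled = impute_with_forest(demo, "y", tuned=True)
print(filled["y"].isna().sum())
```

Passing categorical=True switches to a classifier, so predictions are always one of the levels already present in the column.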
*** Bug Fix ***
The existing code had no provision for columns that are of type int/float but are categorical by nature (e.g. JobCode with levels 1, 2, 3, 4, 5, 6). Such columns were therefore imputed with the mean value (e.g. 2.8, 3.6), which is not a valid level and can be very hazardous.
I fixed this for both imputation strategies, i.e. 'MeanMode' and 'RandomForest'.
Users can now use the parameter below to explicitly list such columns.
numeric_columns_as_categorical : list of strings, default=None
List of names of columns that are numeric (int/float) in the dataframe but should be treated as categorical.
For example:
Consider a column JobCode (levels: 1, 2, 3, 4, 5, 6).
If the JobCode column has missing values, pandas converts it to type float by default.
If numeric_columns_as_categorical=None
Missing values in this column are imputed with the mean of the JobCode column.
The type of the 'JobCode' column remains float.
If numeric_columns_as_categorical=['JobCode']
Missing values in this column are imputed with the mode of the JobCode column.
The final type of the 'JobCode' column is numpy.object.
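The JobCode behaviour described above can be reproduced with plain pandas. This is only a sketch of the underlying effect, not the DataFrameImputer code itself, and the sample values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up sample: integer job codes with one missing entry.
df = pd.DataFrame({"JobCode": [1, 2, 2, np.nan, 6, 2]})

# The NaN forces pandas to store the integer codes as float64.
print(df["JobCode"].dtype)

# Mean imputation produces a value that is not an existing level.
mean_filled = df["JobCode"].fillna(df["JobCode"].mean())
print(mean_filled[3])

# Mode imputation picks the most frequent existing level instead,
# and the column is then stored with dtype object.
mode_value = df["JobCode"].mode()[0]
mode_filled = df["JobCode"].fillna(mode_value).astype(object)
print(mode_filled[3], mode_filled.dtype)
```

Here the mean of the observed codes is 13 / 5 = 2.6, which is not a valid JobCode, while the mode is 2.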
The existing approach to missing value imputation (using Mean/Mode) is preserved, with the one fix described above.