HealthCatalyst / healthcareai-py

Python tools for healthcare machine learning
http://healthcare.ai
MIT License

Imputation of missing values using ML models. (Enhancement and Bug fix opened in #477) #478

Closed vijayphugat closed 6 years ago

vijayphugat commented 6 years ago

I have implemented imputation of missing values using ML models. The following additional options are now available in the healthcareai.common.transformers.DataFrameImputer class:

imputeStrategy : string, default='MeanMode'. Selects the technique used to impute missing values.

tunedRandomForest : boolean, default=False. If set to True, the RandomForestClassifier/RandomForestRegressor used to impute missing values is tuned using grid search and K-fold cross-validation.
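To illustrate the idea behind these two options, here is a minimal sketch of model-based imputation for a categorical column, written directly against scikit-learn. It is not the code from this pull request: the helper name `rf_impute_categorical`, its arguments, the toy columns, and the small parameter grid are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV


def rf_impute_categorical(df, target, features, tuned=False):
    """Hypothetical helper: predict missing `target` values from `features`."""
    known = df[df[target].notna()]
    missing = df[df[target].isna()]
    if missing.empty:
        return df.copy()

    model = RandomForestClassifier(n_estimators=50, random_state=0)
    if tuned:
        # Small illustrative grid; 3-fold cross-validation picks the best setting.
        grid = {"n_estimators": [10, 50], "max_depth": [None, 3]}
        model = GridSearchCV(model, grid, cv=3)

    # Train on rows where the target is known, then fill in the rest.
    model.fit(known[features], known[target])
    result = df.copy()
    result.loc[missing.index, target] = model.predict(missing[features])
    return result


# Toy data (assumed): dept is "A" for ages below 40 and "B" otherwise.
ages = list(range(20, 60))
toy = pd.DataFrame({"age": ages, "dept": ["A" if a < 40 else "B" for a in ages]})
toy.loc[[5, 30], "dept"] = np.nan  # rows with age 25 and age 50

imputed = rf_impute_categorical(toy, target="dept", features=["age"], tuned=True)
print(imputed.loc[[5, 30]])
```

A numeric target would follow the same pattern with RandomForestRegressor in place of the classifier.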

*** Bug Fix *** The existing code had no provision for columns that are of type int/float but are categorical by nature (e.g. JobCode, with levels 1,2,3,4,5,6). Such columns were therefore imputed with the mean value (e.g. 2.8, 3.6), which is not a valid level and can be very hazardous.

I handled this problem as well for both imputation strategies, i.e. 'MeanMode' and 'RandomForest'. Users can now use the parameter below to explicitly list such columns.

numeric_columns_as_categorical : list of strings, default=None. Names of columns that are numeric (int/float) in the dataframe but should be treated as categorical.

For example:
Consider a column JobCode (levels: 1,2,3,4,5,6).
If there are missing values in the JobCode column, pandas will by default convert the column to type float.

If numeric_columns_as_categorical=None
    Missing values in this column are imputed with the mean value of the JobCode column.
    The type of the 'JobCode' column remains float.
If numeric_columns_as_categorical=['JobCode']
    Missing values in this column are imputed with the mode value of the JobCode column.
    The final type of the 'JobCode' column is numpy.object.
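The difference between the two cases can be reproduced with plain pandas. The snippet below is only a sketch of the behaviour described above, not the DataFrameImputer code; `impute_column` and its `as_categorical` flag are hypothetical names, and the JobCode values are made up.

```python
import numpy as np
import pandas as pd

# Toy data: JobCode is a categorical code stored as integers, with two gaps.
df = pd.DataFrame({"JobCode": [1, 2, 2, np.nan, 6, 2, np.nan, 4]})

# Because NaN is present, pandas stores the column as float64.
print(df["JobCode"].dtype)


def impute_column(series, as_categorical=False):
    """Hypothetical helper: fill NaN with the mode (categorical) or the mean."""
    if as_categorical:
        fill_value = series.mode(dropna=True).iloc[0]
        # Cast to object so the column is no longer treated as numeric.
        return series.fillna(fill_value).astype(object)
    return series.fillna(series.mean())


mean_filled = impute_column(df["JobCode"])                       # fills with 17/6
mode_filled = impute_column(df["JobCode"], as_categorical=True)  # fills with 2

print(mean_filled.tolist())
print(mode_filled.tolist())
```

With mean imputation the gaps receive 17/6 (about 2.83), which is not one of the JobCode levels; with mode imputation they receive 2, the most frequent level.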

The existing approach to missing value imputation (using Mean/Mode) is preserved, with this one fix.

vijayphugat commented 6 years ago

I raised the same pull request from a different account, so I am closing this one.