HealthCatalyst / healthcareai-py

Python tools for healthcare machine learning
http://healthcare.ai
MIT License

Imputation of missing values using ML models. #477

Open vijayphugat opened 6 years ago

vijayphugat commented 6 years ago

The current package imputes missing values using the mean and median.

I have identified an approach that applies machine learning models to impute the missing values instead.

Existing approach: Impute missing values using the mean/median.

Drawbacks:

  1. It reduces the variability in the data
  2. It does not preserve relationships between variables such as correlations.
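To make the two drawbacks concrete, here is a minimal sketch on synthetic data (not healthcareai-py code). The data-generating process, the 40% missing rate, and every variable name are assumptions chosen purely for illustration.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sx = sum((a - mx) ** 2 for a in xs) ** 0.5
    sy = sum((b - my) ** 2 for b in ys) ** 0.5
    return cov / (sx * sy)


rng = random.Random(0)
n = 2000

# Hypothetical data: y depends strongly on x.
x = [rng.gauss(0, 1) for _ in range(n)]
y = [2 * xi + rng.gauss(0, 0.5) for xi in x]

# Mark roughly 40% of the y values as missing, completely at random.
missing = [rng.random() < 0.4 for _ in range(n)]

# Mean imputation: fill every missing y with the mean of the observed y values.
fill = statistics.fmean(yi for yi, m in zip(y, missing) if not m)
y_mean_imputed = [fill if m else yi for yi, m in zip(y, missing)]

var_full = statistics.pvariance(y)
var_imputed = statistics.pvariance(y_mean_imputed)
corr_full = pearson(x, y)
corr_imputed = pearson(x, y_mean_imputed)

print(f"variance:    full={var_full:.2f}  mean-imputed={var_imputed:.2f}")
print(f"correlation: full={corr_full:.2f}  mean-imputed={corr_imputed:.2f}")
```

Because every imputed row gets the same constant, the imputed column has a smaller variance than the complete data, and its correlation with `x` is weaker.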

New Approach: Impute missing values as per below:

Advantages:

  1. It will preserve the original relationships between variables
  2. It will maintain the original variability in the data

So for datasets with a large number of missing values, this approach can improve the overall quality of the data fed to ML algorithms. The performance of existing models can therefore be improved using this imputation strategy.
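The specific models are not listed in this comment, so as a stand-in only, here is a minimal sketch of the general idea using ordinary least squares: fit a model on the rows where the value is observed, then predict it for the rows where it is missing. The synthetic data, the choice of a linear model, and all names are assumptions, not the proposed implementation.

```python
import random
import statistics

rng = random.Random(0)
n = 2000

# Hypothetical data: y depends strongly on x; about 40% of y is missing.
x = [rng.gauss(0, 1) for _ in range(n)]
y = [2 * xi + rng.gauss(0, 0.5) for xi in x]
missing = [rng.random() < 0.4 for _ in range(n)]

obs_x = [xi for xi, m in zip(x, missing) if not m]
obs_y = [yi for yi, m in zip(y, missing) if not m]

# Fit y = intercept + slope * x on the observed rows only (closed-form OLS).
mx, my = statistics.fmean(obs_x), statistics.fmean(obs_y)
slope = (sum((a - mx) * (b - my) for a, b in zip(obs_x, obs_y))
         / sum((a - mx) ** 2 for a in obs_x))
intercept = my - slope * mx

# Baseline: mean imputation. Alternative: predict each missing y from its x.
y_mean = [my if m else yi for yi, m in zip(y, missing)]
y_model = [intercept + slope * xi if m else yi
           for xi, yi, m in zip(x, y, missing)]


def rmse_on_missing(imputed):
    """Root mean squared error against the true y, on imputed rows only."""
    errs = [(a - b) ** 2 for a, b, m in zip(imputed, y, missing) if m]
    return statistics.fmean(errs) ** 0.5


var_full = statistics.pvariance(y)
var_mean = statistics.pvariance(y_mean)
var_model = statistics.pvariance(y_model)
rmse_mean = rmse_on_missing(y_mean)
rmse_model = rmse_on_missing(y_model)

print(f"variance: full={var_full:.2f} mean={var_mean:.2f} model={var_model:.2f}")
print(f"RMSE on imputed rows: mean={rmse_mean:.2f} model={rmse_model:.2f}")
```

One caveat on this sketch: deterministic regression imputation places every imputed point exactly on the fitted line, so it still understates the variance slightly and can overstate the correlation. Adding residual noise to the predictions, or using multiple imputation, is a common way to address that.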

SameerMahajan-GSLab commented 6 years ago

@levithatcher, @taylorlarsen and @mmastand, do you have any input on this? Otherwise, we will submit a PR for it soon.

mmastand commented 6 years ago

We have had good luck using the following in our other work:

What do you think about these methods? We'd be grateful if you wanted to do a PR!

vijayphugat commented 6 years ago

I would like to submit a PR for this. I am currently using the techniques below:

Both of these methods work well on linear as well as non-linear data.