SchlossLab / mikropml

User-Friendly R Package for Supervised Machine Learning Pipelines

http://www.schlosslab.org/mikropml

Add errors or functionality to deal with missing outcomes and features #150

Closed. zenalapp closed this issue 4 years ago.

zenalapp commented 4 years ago

- [x] For missing outcomes, remove those samples.
  - [x] Message to user in preprocessing function if removing samples with missing outcome.
- [x] For missing features:
  - [x] Continuous - take the median.
  - [x] Binary/categorical - make 0 (e.g. c('yes','no',NA) becomes two columns: c(1,0,0) and c(0,1,0)).
  - [x] No variation - keep no variation by setting the NA values to the feature's single observed value (maybe not the best way to do it, but you usually remove those features anyway, and the user can change them before using the function if they want).
  - [x] Message to user in preprocessing function if imputing NA values.
- [x] Check at the beginning of run_ml that there are no NA values anywhere in the input data frame.
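The checklist above can be sketched in base R roughly as follows. This is a minimal illustration under stated assumptions, not the mikropml implementation: `impute_sketch` and its `outcome_colname` argument are hypothetical names, and the indicator-column naming (`feature_level`) is an assumption made for the example.

```r
# Hypothetical helper illustrating the checklist; not the package's code.
impute_sketch <- function(dat, outcome_colname) {
  # Missing outcomes: remove those samples and tell the user.
  missing_outcome <- is.na(dat[[outcome_colname]])
  if (any(missing_outcome)) {
    message("Removed ", sum(missing_outcome), " sample(s) with a missing outcome.")
    dat <- dat[!missing_outcome, , drop = FALSE]
  }
  out <- dat[, outcome_colname, drop = FALSE]
  for (feat in setdiff(names(dat), outcome_colname)) {
    x <- dat[[feat]]
    n_missing <- sum(is.na(x))
    if (n_missing > 0) {
      message("Imputed ", n_missing, " missing value(s) in '", feat, "'.")
    }
    if (is.numeric(x)) {
      # Continuous: replace NA with the median of the observed values.
      x[is.na(x)] <- median(x, na.rm = TRUE)
      out[[feat]] <- x
    } else {
      x <- as.character(x)
      lvls <- unique(x[!is.na(x)])
      # No variation: fill NA with the single observed value.
      if (length(lvls) == 1) x[is.na(x)] <- lvls
      # Binary/categorical: one indicator column per level; NA becomes 0 in all.
      for (lvl in lvls) {
        out[[paste0(feat, "_", lvl)]] <- as.integer(!is.na(x) & x == lvl)
      }
    }
  }
  out
}

# Toy data: one missing outcome, one missing continuous value,
# and one missing categorical value.
dat <- data.frame(
  outcome = c("a", "b", NA, "a"),
  cont = c(1, NA, 3, 5),
  cat = c("yes", "no", "yes", NA)
)
res <- impute_sketch(dat, "outcome")
```

With this toy input, the third sample is dropped, `cont` becomes `c(1, 3, 5)`, and `cat` becomes the two columns `cat_yes = c(1, 0, 0)` and `cat_no = c(0, 1, 0)`, matching the example in the checklist.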

BTopcuoglu commented 4 years ago

It might be a good idea to remove the sample if that sample has a missing feature. I know it reduces the power, but it is probably better than losing a feature. What do you think, @zenalapp?

zenalapp commented 4 years ago

But what if one feature is missing in, say, 80% of the samples? Let's check to see how caret deals with missing features and then we can go from there. No matter what we do, we should have a message to the user saying what was removed so they know.
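One way to see whether a feature is missing in most samples is to compute the fraction of NA values per column. The sketch below uses base R only; the data frame and the 0.5 cutoff are made up for illustration and are not something the thread settled on.

```r
# Toy data: feat1 is missing in 4 of 5 samples.
dat <- data.frame(
  outcome = c("a", "b", "a", "b", "a"),
  feat1 = c(1, NA, NA, NA, NA),
  feat2 = c(0.2, 0.4, NA, 0.1, 0.3)
)

# Fraction of missing values in each column.
frac_missing <- colMeans(is.na(dat))

# Hypothetical cutoff: flag features missing in more than half the samples.
max_frac_missing <- 0.5
too_sparse <- names(frac_missing)[frac_missing > max_frac_missing]

# Tell the user which features were flagged.
message("Features exceeding the missingness cutoff: ",
        paste(too_sparse, collapse = ", "))
```

Dropping samples would cost four of five rows here, which is the situation the comment above is worried about.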

BTopcuoglu commented 4 years ago

Oh right - good catch!

We can ask the user to give the package non-NA data. They can decide how they want to clean it up. But you're right, maybe the models already deal with it themselves. I remember some models knowing how to deal with NAs.

BTopcuoglu commented 4 years ago

I think we might be able to get away with the train and predict functions, but the hyperparameter tuning step will require us to do something with the NAs: https://stats.stackexchange.com/questions/144922/r-caret-and-nas

zenalapp commented 4 years ago

I say we just require that the user deal with NAs before using the run_ml function. In the preprocess function, maybe we can provide two options - either remove features with missing values, or remove observations with missing values.
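A rough base R sketch of that proposal: an up-front check that fails when any NA is present, plus the two removal options. The function names `check_no_na`, `drop_na_features`, and `drop_na_observations` are hypothetical and only illustrate the idea; they are not part of mikropml.

```r
# Hypothetical up-front check: stop if the input contains any NA.
check_no_na <- function(dat) {
  if (any(is.na(dat))) {
    stop("Missing values found; please remove or impute them first.")
  }
  invisible(TRUE)
}

# Option 1: remove features (columns) that contain missing values.
drop_na_features <- function(dat) {
  has_na <- colSums(is.na(dat)) > 0
  if (any(has_na)) {
    message("Removed feature(s): ", paste(names(dat)[has_na], collapse = ", "))
  }
  dat[, !has_na, drop = FALSE]
}

# Option 2: remove observations (rows) that contain missing values.
drop_na_observations <- function(dat) {
  complete <- stats::complete.cases(dat)
  if (!all(complete)) {
    message("Removed ", sum(!complete), " observation(s) with missing values.")
  }
  dat[complete, , drop = FALSE]
}

dat <- data.frame(
  outcome = c("a", "b", "a"),
  f1 = c(1, NA, 3),
  f2 = c(4, 5, 6)
)
by_feature <- drop_na_features(dat)
by_observation <- drop_na_observations(dat)
```

Either result passes `check_no_na()`, while the original `dat` makes it stop with an error. Both options emit a message so the user knows what was removed.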

zenalapp commented 4 years ago

@BTopcuoglu I updated the way I'm thinking of dealing with missing values - let me know if you think we should do it differently! Also, do you think it makes sense to remove features where there's no variation other than that some are NA (e.g. c('a','a',NA))?

Also, how fool-proof should I make it? Should I convert character/factor features to numeric if possible?

BTopcuoglu commented 4 years ago

> @BTopcuoglu I updated the way I'm thinking of dealing with missing values - let me know if you think we should do it differently! Also, do you think it makes sense to remove features where there's no variation other than that some are NA (e.g. c('a','a',NA))?
>
> Also, how fool-proof should I make it? Should I convert character/factor features to numeric if possible?

I think what you have in there now is very good in terms of imputation and near-zero variance (NZV) features.

I think converting to numeric is a good idea too. It might reduce frustration for off-the-shelf usage.
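For illustration, converting character/factor features to numeric "if possible" could look like the base R sketch below. The helper name `to_numeric_if_possible` is hypothetical, and this is only one way to define "possible": convert when every non-missing value parses as a number, otherwise leave the column unchanged.

```r
# Hypothetical helper: convert a column to numeric only when every
# non-missing value parses as a number; otherwise return it unchanged.
to_numeric_if_possible <- function(x) {
  if (is.numeric(x)) {
    return(x)
  }
  # Go through character first: as.numeric() on a factor would return
  # the internal level codes rather than the labels.
  chr <- as.character(x)
  num <- suppressWarnings(as.numeric(chr))
  if (all(is.na(num) == is.na(chr))) num else x
}

dat <- data.frame(
  f1 = c("1", "2.5", NA),
  f2 = c("a", "b", "c"),
  f3 = factor(c("10", "20", "10")),
  stringsAsFactors = FALSE
)
# Apply column by column while keeping the data frame structure.
dat[] <- lapply(dat, to_numeric_if_possible)
```

Here `f1` and `f3` become numeric (with the NA in `f1` preserved), while `f2` stays a character column because its values are not numbers.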

zenalapp commented 4 years ago

Done!