HealthCatalyst / healthcareai-r

R tools for healthcare machine learning
https://docs.healthcare.ai

Ensemble Function added here #1205

Closed: AviralVijay-GSLab closed this pull request 6 years ago

AviralVijay-GSLab commented 6 years ago

I have found a 5% increase in accuracy using an ensemble when the outcome variable has binary classes (this applies to binary outcomes only). @michaellevy, @mmastand

michaellevy commented 6 years ago

Hi @AviralVijay-GSLab! Welcome to healthcare.ai -- it's great to see your enthusiasm for contributing to the package. We have a pretty well-developed pipeline for model training, interpretation, and prediction, so we'd want model ensembles to fit into that pipeline, i.e. be trained via the tune_models and flash_models functions.

Can you clarify for me how you are thinking about model ensembles here? As you may have noticed, the above-mentioned model-training functions train three algorithms by default: regularized regression, random forest, and xgboost. My first thought would be to leverage the strengths of ensembling by averaging predictions over those algorithms, perhaps in proportion to their performance.
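A minimal sketch of that idea on made-up numbers (the probability vectors and AUROC values below are illustrative, not package output):

```r
# Performance-weighted averaging of predicted probabilities from the three
# default algorithms. All values here are made up for illustration.
predicted_probs <- list(
  glm = c(0.61, 0.12, 0.88),
  rf  = c(0.55, 0.20, 0.91),
  xgb = c(0.64, 0.09, 0.85)
)
model_aurocs <- c(glm = 0.81, rf = 0.84, xgb = 0.86)

# Weight each model in proportion to its cross-validated performance
weights <- model_aurocs / sum(model_aurocs)

# Weighted average of the three probability vectors
ensemble_probs <- Reduce(`+`, Map(`*`, predicted_probs, weights))
ensemble_probs
```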

Also be sure to check out the contributing guidelines. It's fine with me if you want to work on the core functionality first and get approval for that before working on things like documentation and tests.

AviralVijay-GSLab commented 6 years ago

Hi @michaellevy, thanks for the inspiring words. Sure, I can clarify the design-thinking approach behind the model ensemble. It is based on the complexity of the healthcare domain: there is plenty of information that may seem irrelevant but often carries the causes of disease, such as the last place a patient visited. As technology folks we may think this is not relevant information, but from a domain perspective it can have a significant impact, for example when a disease spreads through a particular geographical region. The machine_learn function trains models using the tune_models or flash_models function, but it does not include the ignored variables in model training; a hybrid ensemble framework could help here. I have raised two separate issues related to this: #1206 and #1207. Please find the attached PDF file for more information.

Feature_Request.pdf

@SameerMahajan-GSLab can you also provide more insights if needed?

michaellevy commented 6 years ago

> The machine_learn function trains models using the tune_models or flash_models function, but it does not include the ignored variables in model training; a hybrid ensemble framework could help here.

@AviralVijay-GSLab I think there are two separate issues here: 1) ignored variables, and 2) ensemble models.

  1. prep_data and machine_learn provide the ... argument for ignored variables so that the user can pass in row-identifying columns that should not be used in model training but should be retained in prediction output for joining to other tables or writing back to a database. These columns typically have a unique value on each row (as in pima_diabetes$patient_id), so they could not provide useful information for model training. (See the usage sketch after this list.)

  2. Currently predictions are generated using the model that performed best in cross-validation during model training. I do think it would be valuable to provide an ensemble option to predict.model_list that makes predictions using all of the trained models and then combines them, averaging in proportion to their performance. Is that something you'd be interested in working on?
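For reference, a minimal sketch of the intended use of the ... argument, based on the Getting Started vignette (output details may differ by version):

```r
library(healthcareai)

# patient_id is passed through `...` as an ignored column: it is excluded
# from model training but retained in the prediction output.
models <- machine_learn(pima_diabetes, patient_id, outcome = diabetes)

# Predictions keep patient_id, so rows can be joined back to other tables
# or written to a database.
predictions <- predict(models)
head(predictions)
```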

AviralVijay-GSLab commented 6 years ago

@michaellevy I can do it that way as well, but the issue is that all the models are trained on the same prepared data with the same features, excluding the ignored variables (as expected). I want to provide a framework in which the user can choose to include or exclude the knowledge in the ignored variables in the trained models, i.e. different models built on strong or weak predictors only. In that case we need to feed different prepared data into the two best-performing models: the first would use all strong predictors and the second the weak predictors. I have drafted such a framework in the ensemble.R file in this pull request, where the user can set ensemble to true (include weak predictors along with strong ones) or false (the model is trained on strong predictors only). As you suggested, I can use the pipeline functions tune_models and flash_models in place of the neural net if required, but the prepared data also contains the ignored columns, which are then used in model training, so this needs to be handled as well, as I have shown in the feature_request.pdf file in the comment above. When it comes to writing back to the database after prediction, the prediction data can differ from the training data. Please suggest.

michaellevy commented 6 years ago

@AviralVijay-GSLab In your example in the pdf, you pass pedigree to ... to be ignored. That's not what the ignored variables are for. Why would you choose to ignore a potentially informative feature? All the algorithms we train have regularization of some sort built in, so if a feature is uninformative, it won't be used. The purpose of the ... is to hold identifying columns that should have zero predictive value; no other columns should be passed to that argument.
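As a standalone illustration of that point (using glmnet directly on simulated data, nothing to do with the package internals): regularization shrinks an uninformative feature's coefficient toward zero, so there is no need to hide it from training.

```r
library(glmnet)

set.seed(42)
n <- 200
x <- cbind(informative = rnorm(n), noise = rnorm(n))
y <- rbinom(n, 1, plogis(2 * x[, "informative"]))

# Cross-validated lasso logistic regression
fit <- cv.glmnet(x, y, family = "binomial")

# The coefficient on `noise` is typically shrunk to exactly zero
coef(fit, s = "lambda.1se")
```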

That said, I still think there is a place for ensembles in the package, but it has to be done in a way that fits the package's use design. If you haven't already, you might consider working through the Getting Started vignette to get a better sense of how the package is intended to be used.

AviralVijay-GSLab commented 6 years ago

@michaellevy I have gone through the Getting Started vignette, and I agree we should develop the ensemble functionality in a way that fits the package's use design. As you suggested, I can add one more parameter (ensemble) to the predict.model_list function that can be set to TRUE or FALSE, defaulting to TRUE, so that all of the trained models are used for the final prediction via averaging or a majority-voting ensemble technique. Please let me know if you are happy with this so that I can start working on it and deliver in a few days.
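A rough, self-contained sketch of the behavior that flag could have (the function and its arguments are hypothetical, not package API):

```r
# Hypothetical sketch of the proposed `ensemble` argument: TRUE averages
# probabilities across all trained models; FALSE keeps the current behavior
# of using only the best-performing model.
average_ensemble <- function(prob_list, ensemble = TRUE, best = 1) {
  if (!ensemble) return(prob_list[[best]])
  Reduce(`+`, prob_list) / length(prob_list)
}

# Example with made-up probabilities from three trained models
probs <- list(c(0.61, 0.12), c(0.55, 0.20), c(0.64, 0.09))
average_ensemble(probs)                    # simple average of all models
average_ensemble(probs, ensemble = FALSE)  # best single model only
```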

michaellevy commented 6 years ago

@AviralVijay-GSLab That sounds great. I appreciate the discussion and your willingness to develop in a way that fits with our established use design and style. You'll see that predict.model_list only returns probabilities for classification problems, so I don't think majority voting is an option. One thing that I think will be important to consider is how to weight the models' predictions. I've only ever built ensembles of statistical (as opposed to machine learning) models, and in that context I've weighted by WAIC. I suppose we could accomplish something similar by weighting in proportion to the user's selected performance metric, but I imagine this is something smart people have thought hard about, and if you could figure out what the established best practices are and use those, that would be great.
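For what it's worth, one established approach along these lines is stacking (stacked generalization): fit a meta-model on out-of-fold predictions so the weights are learned rather than hand-picked. A self-contained sketch on simulated data (not healthcareai code):

```r
# Stacking sketch: learn ensemble weights from out-of-fold predictions.
set.seed(7)
n <- 300
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- rbinom(n, 1, plogis(d$x1 + 0.5 * d$x2))

# Collect out-of-fold probabilities from two base models
folds <- sample(rep(1:5, length.out = n))
oof <- data.frame(p1 = numeric(n), p2 = numeric(n), y = d$y)
for (k in 1:5) {
  train <- d[folds != k, ]
  test <- folds == k
  m1 <- glm(y ~ x1 + x2, data = train, family = binomial)
  m2 <- glm(y ~ x1, data = train, family = binomial)
  oof$p1[test] <- predict(m1, d[test, ], type = "response")
  oof$p2[test] <- predict(m2, d[test, ], type = "response")
}

# The meta-model's coefficients act as learned ensemble weights
meta <- glm(y ~ p1 + p2, data = oof, family = binomial)
coef(meta)
```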

codecov[bot] commented 6 years ago

Codecov Report

Merging #1205 into master will decrease coverage by 5.9%. The diff coverage is 0%.

@@           Coverage Diff           @@
##           master   #1205    +/-   ##
=======================================
- Coverage    94.2%   88.3%   -5.9%
=======================================
  Files          37      37            
  Lines        2408    2363    -45     
=======================================
- Hits         2270    2087   -183     
- Misses        138     276   +138

michaellevy commented 6 years ago

Closing this as I believe it has been replaced by #1212