HealthCatalyst / healthcareai-r

R tools for healthcare machine learning
https://docs.healthcare.ai

Rex 694 locfimputation #1272

Closed by glenrs 6 years ago

glenrs commented 6 years ago

@mmastand Last observation carried forward imputation is implemented!

A couple of things to focus on when reviewing:

1. If the first value of a variable is NA, I fill it with the first non-missing value. I have tested this to make sure that it isn't doing anything weird.
2. For some reason, recipes converts all factor variables in newdata to character type in bake.step. I have tested as many cases as I can and do not think this is an issue; everything we want still appears to work.
3. Documentation. I reread the documentation several times. Please let me know if anything is not clear.

Thank you!
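For reviewers who want to reason about the leading-NA case in point 1, here is a minimal base R sketch of last observation carried forward. It assumes that a leading NA is filled from the first observed value. `locf_fill` is a hypothetical helper written for illustration only; it is not the code behind `step_locfimpute`.

```r
# Hypothetical helper for illustration; not the package's implementation.
locf_fill <- function(x) {
  observed <- !is.na(x)
  # With no observed values there is nothing to carry forward.
  if (!any(observed)) return(x)
  # For each position, count how many observed values have been seen so far.
  # That count indexes the most recent observed value (0 before the first one).
  last_seen <- cumsum(observed)
  # Leading NAs have no earlier observation, so use the first observed value.
  last_seen[last_seen == 0] <- 1
  x[observed][last_seen]
}

locf_fill(c(NA, 3, NA, NA, 7, NA))
#> [1] 3 3 3 3 7 7
```

Because the helper only indexes into the observed values, it should behave the same way for character and factor vectors, which may be useful when checking the type conversion described in point 2.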

Below I have provided three examples (hopefully these help when reviewing):

1. pima diabetes imputation with recipes
2. pima diabetes imputation with prep_data
3. nycflights13::flights imputation with prep_data

It appears that all missingness is removed.

I have also tested other functions to see if they still work, and I haven't been able to crash anything. Everything appears to be functioning properly.

library(healthcareai)
#> healthcareai version 2.2.0
#> Please visit https://docs.healthcare.ai for full documentation and vignettes. Join the community at https://healthcare-ai.slack.com
library(recipes)
#> Loading required package: dplyr
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
#> Loading required package: broom
#> 
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stats':
#> 
#>     step

# pima_diabetes before imputation
missingness(pima_diabetes)
#> # A tibble: 10 x 2
#>    variable       percent_missing
#>  * <chr>                    <dbl>
#>  1 patient_id               0    
#>  2 pregnancies              0    
#>  3 pedigree                 0    
#>  4 age                      0    
#>  5 diabetes                 0    
#>  6 plasma_glucose           0.651
#>  7 weight_class             1.43 
#>  8 diastolic_bp             4.56 
#>  9 skinfold                29.6  
#> 10 insulin                 48.7

# imputing with step with recipes
prepped_d <-
  recipe(formula = "~.", pima_diabetes) %>%
  step_locfimpute(all_predictors()) %>%
  prep() %>%
  bake(newdata = pima_diabetes)

missingness(prepped_d)
#> # A tibble: 10 x 2
#>    variable       percent_missing
#>  * <chr>                    <dbl>
#>  1 patient_id                   0
#>  2 pregnancies                  0
#>  3 plasma_glucose               0
#>  4 diastolic_bp                 0
#>  5 skinfold                     0
#>  6 insulin                      0
#>  7 weight_class                 0
#>  8 pedigree                     0
#>  9 age                          0
#> 10 diabetes                     0

# imputing with prep_data
prepped_d <- prep_data(pima_diabetes, outcome = diabetes, 
                       impute = list(numeric_method = "locfimpute", 
                                     nominal_method = "locfimpute"), 
                       make_dummies = FALSE)
#> Training new data prep recipe...
missingness(prepped_d)
#> # A tibble: 10 x 2
#>    variable       percent_missing
#>  * <chr>                    <dbl>
#>  1 patient_id                   0
#>  2 pregnancies                  0
#>  3 plasma_glucose               0
#>  4 diastolic_bp                 0
#>  5 skinfold                     0
#>  6 insulin                      0
#>  7 weight_class                 0
#>  8 pedigree                     0
#>  9 age                          0
#> 10 diabetes                     0

# nycflights before imputation
missingness(nycflights13::flights)
#> # A tibble: 19 x 2
#>    variable       percent_missing
#>  * <chr>                    <dbl>
#>  1 year                     0    
#>  2 month                    0    
#>  3 day                      0    
#>  4 sched_dep_time           0    
#>  5 sched_arr_time           0    
#>  6 carrier                  0    
#>  7 flight                   0    
#>  8 origin                   0    
#>  9 dest                     0    
#> 10 distance                 0    
#> 11 hour                     0    
#> 12 minute                   0    
#> 13 time_hour                0    
#> 14 tailnum                  0.746
#> 15 dep_time                 2.45 
#> 16 dep_delay                2.45 
#> 17 arr_time                 2.59 
#> 18 arr_delay                2.80 
#> 19 air_time                 2.80

# imputing nycflights with prep_data
prepped_d <- prep_data(nycflights13::flights, outcome = distance, 
                       impute = list(numeric_method = "locfimpute", 
                                     nominal_method = "locfimpute"), 
                       make_dummies = FALSE, remove_near_zero_variance = FALSE)
#> Training new data prep recipe...
missingness(prepped_d)
#> # A tibble: 25 x 2
#>    variable            percent_missing
#>  * <chr>                         <dbl>
#>  1 year                              0
#>  2 month                             0
#>  3 day                               0
#>  4 dep_time                          0
#>  5 sched_dep_time                    0
#>  6 dep_delay                         0
#>  7 arr_time                          0
#>  8 sched_arr_time                    0
#>  9 arr_delay                         0
#> 10 carrier                           0
#> 11 flight                            0
#> 12 tailnum                           0
#> 13 origin                            0
#> 14 dest                              0
#> 15 air_time                          0
#> 16 distance                          0
#> 17 hour                              0
#> 18 minute                            0
#> 19 time_hour_dow_sin                 0
#> 20 time_hour_dow_cos                 0
#> 21 time_hour_month_sin               0
#> 22 time_hour_month_cos               0
#> 23 time_hour_year                    0
#> 24 time_hour_hour_sin                0
#> 25 time_hour_hour_cos                0

Created on 2018-10-09 by the reprex package (v0.2.0).

codecov[bot] commented 6 years ago

Codecov Report

Merging #1272 into master will increase coverage by <.1%. The diff coverage is 100%.

@@           Coverage Diff            @@
##           master   #1272     +/-   ##
========================================
+ Coverage    95.3%   95.3%   +<.1%     
========================================
  Files          40      41      +1     
  Lines        3183    3224     +41     
========================================
+ Hits         3034    3075     +41     
  Misses        149     149