HealthCatalyst / healthcareai-r

R tools for healthcare machine learning
https://docs.healthcare.ai

Throw an error if machine_learn is given prepped data #1255

Closed by glenrs 6 years ago

glenrs commented 6 years ago

Currently, if the data passed to machine_learn has already been prepped with prep_data, machine_learn preps it again and replaces the original recipe stored on the prepped object. This is confusing because the recipe attached as an attribute to the returned model object is a new recipe, not the one that was used to prep the data. Instead, machine_learn should throw an error telling the user not to pass it data that is already prepped.

library(healthcareai)
#> healthcareai version 2.2.0
#> Please visit https://docs.healthcare.ai for full documentation and vignettes. Join the community at https://healthcare-ai.slack.com
library(tidyverse)

prepped_d <- prep_data(pima_diabetes, patient_id, outcome = diabetes)
#> Training new data prep recipe...
m <- machine_learn(prepped_d, patient_id, outcome = diabetes, models = "rf", 
                   tune = FALSE)
#> Training new data prep recipe...
#> Removing the following 2 near-zero variance column(s). If you don't want to remove them, call prep_data with remove_near_zero_variance as a smaller numeric or FALSE.
#>   weight_class_other and weight_class_missing
#> Variable(s) ignored in prep_data won't be used to tune models: patient_id
#> 
#> diabetes looks categorical, so training classification algorithms.
#> 
#> After data processing, models are being trained on 10 features with 768 observations.
#> Based on n_folds = 5 and hyperparameter settings, the following number of models will be trained: 5 rf's
#> Training at fixed values: Random Forest
#> 
#> *** Models successfully trained. The model object contains the training data minus ignored ID columns. ***
#> *** If there was PHI in training data, normal PHI protocols apply to the model object. ***

testthat::expect_equal(attr(m, "recipe"), attr(prepped_d, "recipe"))
#> Error: attr(m, "recipe") not equal to attr(prepped_d, "recipe").
#> Attributes: < Component "factor_levels": Names: 1 string mismatch >
#> Attributes: < Component "factor_levels": Length mismatch: comparison on first 1 components >
#> Attributes: < Component "factor_levels": Component 1: Names: 2 string mismatches >
#> Attributes: < Component "factor_levels": Component 1: Attributes: < Component "dim": Mean relative difference: 2 > >
#> Attributes: < Component "factor_levels": Component 1: Attributes: < Component "dimnames": Component "": 2 string mismatches > >
#> Attributes: < Component "factor_levels": Component 1: Numeric: lengths (2, 6) differ >
#> Attributes: < Component "missingness": Names: 8 string mismatches >
#> Attributes: < Component "missingness": Numeric: lengths (14, 10) differ >
#> Component "var_info": Different number of rows
#> ...

Created on 2018-09-06 by the reprex package (v0.2.0).
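
A minimal sketch of what such a check could look like, assuming that the "recipe" attribute seen in the reprex above is a reliable sign that data has been through prep_data. The helper name stop_if_prepped and the error wording are hypothetical, not part of healthcareai, and the stand-in objects below only mimic a prepped data frame so the sketch runs without the package installed.

```r
# Hypothetical helper (not part of healthcareai): throw an error if `d`
# carries a "recipe" attribute, which the reprex shows prep_data() attaches.
stop_if_prepped <- function(d) {
  if (!is.null(attr(d, "recipe", exact = TRUE))) {
    stop("This data appears to have already been prepped: it has a ",
         "\"recipe\" attribute. Pass unprepped data to machine_learn instead.",
         call. = FALSE)
  }
  invisible(d)
}

# Stand-in objects (assumptions for illustration only)
raw_d <- data.frame(x = 1:3)
fake_prepped_d <- structure(raw_d, recipe = list("placeholder recipe"))

# Unprepped data passes through silently
stop_if_prepped(raw_d)

# Data carrying a recipe triggers the error; capture the message here
result <- tryCatch(
  stop_if_prepped(fake_prepped_d),
  error = function(e) conditionMessage(e)
)
print(result)
```

Calling a check like this at the top of machine_learn would stop before any re-prepping happens, so the original recipe would never be silently replaced.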

glenrs commented 6 years ago

@mmastand we ran into this issue last week at HAS. It should be a simple one. Can I fix this real quick?

mmastand commented 6 years ago

Go for it!