brian-j-smith / MachineShop

MachineShop: R package of models and tools for machine learning
https://brian-j-smith.github.io/MachineShop/
62 stars 10 forks source link

prep.ModelRecipe() skips prep.recipe() #4

Closed DavisVaughan closed 2 years ago

DavisVaughan commented 3 years ago

Hi there! We've made a few changes in recipes to fully support modern tidyselect. You can see that PR here https://github.com/tidymodels/recipes/pull/595

In the process, we slightly tweaked bake() to more directly use an element of a prepped recipe called $last_term_info. This element is only added to the recipe at prep() time, and is something we now left_join() against in bake().

It seems that prep.ModelRecipe() can skip the call to prep() if the recipe looks fully trained. https://github.com/brian-j-smith/MachineShop/blob/8777d6b19d453d1d0d9e96d76ed72319ae49cec1/R/ModelRecipe.R#L76-L85

For better or worse, a recipe looks fully trained if there are 0 steps, which means that prep.recipe() won't be called in those cases - meaning that $last_term_info won't get added onto the recipe. This causes this failure in our revdeps https://github.com/tidymodels/recipes/pull/595#issuecomment-722587579

Is there any way you can just let prep.recipe() get called unconditionally in prep.ModelRecipe()?

As an FYI, we plan to submit recipes to CRAN fairly soon.

brian-j-smith commented 3 years ago

Hi @DavisVaughan,

Thanks for the heads up on the recipe changes, for explaining the change, and for identifying the affected code in MachineShop. I really appreciate you taking the time to do that and should have some time starting this Wednesday to look into the issue.

I was just about to release a new version of the package, so this is good timing, and will aim to get it working with the upcoming recipes release soon.

Regards.

brian-j-smith commented 3 years ago

I don't seem to be getting an error when using MachineShop with the latest development version (0.1.15.9000) of recipes and my R installation. Do you see any of my package versions below that are different from yours?

> library(MachineShop)
> example(ModeledInput)

MdldIn> ## Modeled model frame
MdldIn> mod_mf <- ModeledInput(sale_amount ~ ., data = ICHomes, model = GLMModel)

MdldIn> fit(mod_mf)

Call:  stats::glm(formula = formula, family = family, data = data, weights = weights, 
    control = control)

Coefficients:
                  (Intercept)                      sale_year                     sale_month  
                   -1.742e+07                      2.513e+03                     -7.918e+02  
                        built                     styleCondo      construction1 Story Brick  
                    3.029e+02                     -1.774e+04                     -1.176e+04  
    construction1 Story Condo      construction1 Story Frame      construction2 Story Brick  
                   -6.658e+04                     -6.416e+04                      1.131e+05  
    construction2 Story Condo      construction2 Story Frame  constructionSplit Foyer Frame  
                    9.514e+03                      2.323e+04                     -5.714e+04  
constructionSplit Level Frame                      base_size                       add_size  
                   -3.302e+04                      1.635e+02                      1.433e+02  
                 garage1_size                   garage2_size                       lot_size  
                    5.151e+01                      6.838e+01                      1.232e+00  
                     bedrooms                    basementYes                          acYes  
                   -1.557e+03                      2.387e+04                     -2.600e+03  
                     atticYes                            lon                            lat  
                    4.763e+04                      5.430e+03                      2.949e+05  

Degrees of Freedom: 752 Total (i.e. Null);  729 Residual
Null Deviance:      6.323e+12 
Residual Deviance: 1.76e+12     AIC: 18430

MdldIn> ## Modeled recipe
MdldIn> library(recipes)

MdldIn> rec <- recipe(sale_amount ~ ., data = ICHomes)

MdldIn> mod_rec <- ModeledInput(rec, model = GLMModel)

MdldIn> fit(mod_rec)

Call:  stats::glm(formula = formula, family = family, data = data, weights = weights, 
    control = control)

Coefficients:
                  (Intercept)                      sale_year                     sale_month  
                   -1.742e+07                      2.513e+03                     -7.918e+02  
                        built                     styleCondo      construction1 Story Brick  
                    3.029e+02                     -1.774e+04                     -1.176e+04  
    construction1 Story Condo      construction1 Story Frame      construction2 Story Brick  
                   -6.658e+04                     -6.416e+04                      1.131e+05  
    construction2 Story Condo      construction2 Story Frame  constructionSplit Foyer Frame  
                    9.514e+03                      2.323e+04                     -5.714e+04  
constructionSplit Level Frame                      base_size                       add_size  
                   -3.302e+04                      1.635e+02                      1.433e+02  
                 garage1_size                   garage2_size                       lot_size  
                    5.151e+01                      6.838e+01                      1.232e+00  
                     bedrooms                    basementYes                          acYes  
                   -1.557e+03                      2.387e+04                     -2.600e+03  
                     atticYes                            lon                            lat  
                    4.763e+04                      5.430e+03                      2.949e+05  

Degrees of Freedom: 752 Total (i.e. Null);  729 Residual
Null Deviance:      6.323e+12 
Residual Deviance: 1.76e+12     AIC: 18430
> sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] recipes_0.1.15.9000 dplyr_1.0.2         MachineShop_2.5.0  

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.5         lubridate_1.7.9    mvtnorm_1.1-1      lattice_0.20-41    tidyr_1.1.2       
 [6] listenv_0.8.0      prettyunits_1.1.1  class_7.3-17       zoo_1.8-8          assertthat_0.2.1  
[11] digest_0.6.27      ipred_0.9-9        foreach_1.5.1      parallelly_1.21.0  R6_2.5.0          
[16] stats4_4.0.3       ggplot2_3.3.2      pillar_1.4.6       rlang_0.4.8        progress_1.2.2    
[21] multcomp_1.4-14    rstudioapi_0.12    DiceDesign_1.8-1   furrr_0.2.1        kernlab_0.9-29    
[26] rpart_4.1-15       Matrix_1.2-18      splines_4.0.3      gower_0.2.2        munsell_0.5.0     
[31] compiler_4.0.3     pkgconfig_2.0.3    libcoin_1.0-6      globals_0.13.1     dials_0.0.9       
[36] nnet_7.3-14        tidyselect_1.1.0   tibble_3.0.4       prodlim_2019.11.13 coin_1.3-1        
[41] codetools_0.2-16   matrixStats_0.57.0 fansi_0.4.1        future_1.20.1      crayon_1.3.4      
[46] withr_2.3.0        MASS_7.3-53        grid_4.0.3         polspline_1.1.19   gtable_0.3.0      
[51] lifecycle_0.2.0    magrittr_1.5       scales_1.1.1       cli_2.1.0          rsample_0.0.8     
[56] party_1.3-5        strucchange_1.5-2  timeDate_3043.102  ellipsis_0.3.1     generics_0.1.0    
[61] vctrs_0.3.4        sandwich_3.0-0     TH.data_1.0-10     lava_1.6.8.1       iterators_1.0.13  
[66] tools_4.0.3        glue_1.4.2         purrr_0.3.4        hms_0.5.3          abind_1.4-5       
[71] parallel_4.0.3     survival_3.2-7     colorspace_1.4-1   modeltools_0.2-23 
DavisVaughan commented 3 years ago

I think this PR may have actually fixed the issue, since now fully_trained() will return FALSE when there are zero steps and prep() hasn't been run https://github.com/tidymodels/recipes/pull/600

Nevertheless, I think I would still encourage always running through prep() unconditionally

brian-j-smith commented 3 years ago

Thanks. I'll keep your recommendation about prep() in mind as I think about the code.

topepo commented 3 years ago

I just ran reverse dependencies for recipes (which might go to CRAN next week) and had an error with the help example in ?ModeledInput.

The error seems to occur in .fit.MLModel; when juice() is run on the data, there are no data points available since prep.ModelRecipe() calls recipes::prep(retain = FALSE).

We fixed a bug (tidymodels/recipes#652) where the training data were kept (to some extent) when retain = FALSE. I think that this is the change. I would suggest using retain = TRUE so that you can use juice() without an error.

brian-j-smith commented 3 years ago

Thanks for the heads-up on the retain change in prep(). I've uploaded a fix to the master branch of MachineShop for its affected function and will include it in the next CRAN release in sync with recipes.