BlasBenito / spatialRF

R package to fit spatial models with Random Forest
https://blasbenito.github.io/spatialRF/

the_feature_engineer() returns different data types depending on whether promising interactions are found #9

Closed mikoontz closed 2 years ago

mikoontz commented 2 years ago

I'm really enjoying using this package! Thank you so much for writing it. I hope it's okay to chime in about a few specific details/feature requests that other users might also find useful as I'm learning to use it.

One thing I've come across as I create a workflow is that the the_feature_engineer() function appears to return different data types depending on whether promising interactions are found. If promising interactions are found, a list is returned. If no promising interactions are found, NA is returned.

For my use case, anyway, it would smooth out the workflow if the returned data type were always a list, with named elements set to NULL when they are not applicable and the others filled in where possible, particularly the $data and $predictor.variable.names elements.

The tutorial (which is great) currently uses the following code block to "update" the data and predictor variable names which will be passed to the actual call to build the random forest model:

#adding interaction column to the training data
plant_richness_df <- interactions$data

#adding interaction name to predictor.variable.names
predictor.variable.names <- interactions$predictor.variable.names

But these lines won't work if the_feature_engineer() has returned NA, implying no promising interactions.

I suppose a user could always wrap the update in a check such as if(is.na(feature_engineer_returned_results)) and only update the data and predictor variable names when interactions were found. Maybe that's better and/or your intention? What do you think?
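One caveat with an is.na() guard: when the result is a list, is.na() returns one logical value per element, and an if() condition longer than one is an error from R 4.2.0 onward (a warning in earlier versions). A minimal sketch of a guard that tests is.list() instead is shown below. The objects result_found and result_none are stand-ins for the two return shapes described above, not real output from the_feature_engineer(), and apply_interactions() is a hypothetical helper, not part of spatialRF.

```r
# Stand-ins for the two return shapes described in this issue
# (assumed for illustration; not real spatialRF output).
original_data <- data.frame(y = c(1, 2, 3), a = c(0.1, 0.2, 0.3), b = c(3, 2, 1))
original_names <- c("a", "b")

result_found <- list(
  data = cbind(original_data, a..x..b = original_data$a * original_data$b),
  predictor.variable.names = c(original_names, "a..x..b")
)
result_none <- NA

# Hypothetical helper: replace data and predictor names only when a list
# came back; otherwise keep the originals untouched.
apply_interactions <- function(result, data, predictor.variable.names) {
  if (is.list(result)) {
    data <- result$data
    predictor.variable.names <- result$predictor.variable.names
  }
  list(data = data, predictor.variable.names = predictor.variable.names)
}

with_interactions <- apply_interactions(result_found, original_data, original_names)
without_interactions <- apply_interactions(result_none, original_data, original_names)
```

With this kind of guard, the same downstream code can run whether or not interactions were found, at the cost of every user having to write the check themselves.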

> sessionInfo()
R version 4.1.2 (2021-11-01)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] vegan_2.5-7       lattice_0.20-45   permute_0.9-7     spatialRF_1.1.3   terra_1.4-20      tidyr_1.1.4       data.table_1.14.2 sf_1.0-4          ggplot2_3.3.5    
[10] dplyr_1.0.7      

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.7          lubridate_1.8.0     class_7.3-19        assertthat_0.2.1    digest_0.6.28       foreach_1.5.1       utf8_1.2.2          ranger_0.13.1      
 [9] R6_2.5.1            backports_1.3.0     USAboundaries_0.4.0 evaluate_0.14       e1071_1.7-9         pillar_1.6.4        rlang_0.4.12        rstudioapi_0.13    
[17] Matrix_1.3-4        rmarkdown_2.11      labeling_0.4.2      splines_4.1.2       readr_2.1.0         stringr_1.4.0       munsell_0.5.0       proxy_0.4-26       
[25] broom_0.7.10        compiler_4.1.2      xfun_0.28           pkgconfig_2.0.3     mgcv_1.8-38         htmltools_0.5.2     tidyselect_1.1.1    tibble_3.1.6       
[33] gridExtra_2.3       codetools_0.2-18    viridisLite_0.4.0   fansi_0.5.0         tzdb_0.2.0          withr_2.4.2         crayon_1.4.2        MASS_7.3-54        
[41] grid_4.1.2          nlme_3.1-153        gtable_0.3.0        lifecycle_1.0.1     DBI_1.1.1           huxtable_5.4.0      magrittr_2.0.1      units_0.7-2        
[49] scales_1.1.1        KernSmooth_2.23-20  stringi_1.7.5       farver_2.1.0        viridis_0.6.2       doParallel_1.0.16   ellipsis_0.3.2      generics_0.1.1     
[57] vctrs_0.3.8         iterators_1.0.13    tools_4.1.2         glue_1.5.0          purrr_0.3.4         hms_1.1.1           parallel_4.1.2      fastmap_1.1.0      
[65] yaml_2.2.1          colorspace_2.0-2    cluster_2.1.2       classInt_0.4-3      knitr_1.36          patchwork_1.1.1

BlasBenito commented 2 years ago

Hi Michael, thank you for your interest in the package and for your thoughtful, well-developed comment. I truly appreciate it! I think you are right. On failure, the_feature_engineer() should yield a list with named objects, as you suggest. I will fix that in the development version ASAP.

Do you think it'd be alright if the output $data slot carried the original data if the function cannot find any meaningful interactions? I think that'd facilitate running automated workflows, but I'd be happy to hear what you think about it.

Cheers, Blas
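For illustration, the return shape proposed here could look roughly like the sketch below. It is not the package's actual code: empty_interactions_result() is a hypothetical constructor, and the selected element is an assumed name used only to show a NULL slot. Only $data and $predictor.variable.names are named in this thread.

```r
# Hypothetical constructor for a "no interactions found" result.
# Not spatialRF code; it only shows a list whose names stay constant.
empty_interactions_result <- function(data, predictor.variable.names) {
  list(
    selected = NULL,  # assumed element name: nothing was selected
    data = data,      # original data, returned unchanged
    predictor.variable.names = predictor.variable.names
  )
}

# Toy data frame standing in for the tutorial data.
plant_df <- data.frame(richness = c(10, 20), climate = c(1.5, 2.5))
interactions <- empty_interactions_result(plant_df, "climate")

# The tutorial's update lines now run unconditionally.
plant_df <- interactions$data
predictor.variable.names <- interactions$predictor.variable.names
```

Because list() keeps elements that are explicitly set to NULL, the names of the result are the same whether or not interactions were found, so automated workflows do not need a type check.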

BlasBenito commented 2 years ago

I updated the function in the development branch of the repo. Please, try it when you can, and let me know if it works as you'd like!

mikoontz commented 2 years ago

I'll give it a try when I can! And to your question, my naive thinking would be to expect the $data element to just hold a copy of the original data if there are no new interactions to add, so I think your approach sounds like the right one!