corels / rcppcorels

R Bindings to the Certifiably Optimal Rule Lists (Corels) Learner

Using Corels with tidy packages #6

Closed billster45 closed 4 years ago

billster45 commented 4 years ago

Hi, thanks for making Corels available in R. I've enjoyed playing with Corels using tidy packages. This is a bit hacky, but I've put the functions I was using into a package (as much to learn how to build packages as anything). I'm pleased with how well alluvial plots visualise Corels rules: https://github.com/billster45/tidycorels

eddelbuettel commented 4 years ago

That looks nice and promising from a first glance at the (very long, maybe split into vignette[s]?) README.

One thing I am somewhat concerned about is the tail of dependencies. For package development, many experienced R programmers prefer to keep it a little lighter than this:

R> db <- available.packages()   # assumed earlier step: CRAN metadata for package_dependencies()
R> pkgs <- c("corels", "dplyr", "ggplot2", "tidyselect", "magrittr", "stringr", "easyalluvial", "recipes")
R> deps <- tools::package_dependencies(pkgs, db, recursive=TRUE)  # returns deps per pkg in a list
R> unique(sort(unlist(unname(deps))))
 [1] "assertthat"   "backports"    "callr"        "caret"        "class"        "cli"         
 [7] "codetools"    "colorspace"   "crayon"       "data.table"   "desc"         "digest"      
[13] "dplyr"        "e1071"        "ellipsis"     "evaluate"     "fansi"        "farver"      
[19] "forcats"      "foreach"      "generics"     "ggalluvial"   "ggplot2"      "ggridges"    
[25] "glue"         "gower"        "graphics"     "grDevices"    "grid"         "gridExtra"   
[31] "gtable"       "hms"          "ipred"        "isoband"      "iterators"    "KernSmooth"  
[37] "labeling"     "lattice"      "lava"         "lazyeval"     "lifecycle"    "lubridate"   
[43] "magrittr"     "MASS"         "Matrix"       "methods"      "mgcv"         "ModelMetrics"
[49] "munsell"      "nlme"         "nnet"         "numDeriv"     "pillar"       "pkgbuild"    
[55] "pkgconfig"    "pkgload"      "plyr"         "praise"       "prettyunits"  "pROC"        
[61] "processx"     "prodlim"      "progress"     "ps"           "purrr"        "R6"          
[67] "randomForest" "RColorBrewer" "Rcpp"         "recipes"      "reshape2"     "rlang"       
[73] "rpart"        "rprojroot"    "rstudioapi"   "scales"       "splines"      "SQUAREM"     
[79] "stats"        "stats4"       "stringi"      "stringr"      "survival"     "testthat"    
[85] "tibble"       "tidyr"        "tidyselect"   "timeDate"     "tools"        "utf8"        
[91] "utils"        "vctrs"        "viridisLite"  "withr"       
R> 

Even if we take the two top-level plotting packages out and focus just on processing, it is a wee bit long:

R> pkgs1 <- c("corels", "dplyr", "tidyselect", "magrittr", "stringr", "recipes")
R> deps1 <- tools::package_dependencies(pkgs1, db, recursive=TRUE)
R> unique(sort(unlist(unname(deps1))))
 [1] "assertthat" "class"      "cli"        "crayon"     "digest"     "dplyr"      "ellipsis"  
 [8] "fansi"      "generics"   "glue"       "gower"      "graphics"   "grDevices"  "grid"      
[15] "ipred"      "KernSmooth" "lattice"    "lava"       "lifecycle"  "lubridate"  "magrittr"  
[22] "MASS"       "Matrix"     "methods"    "nnet"       "numDeriv"   "pillar"     "pkgconfig" 
[29] "prodlim"    "purrr"      "R6"         "Rcpp"       "rlang"      "rpart"      "splines"   
[36] "SQUAREM"    "stats"      "stringi"    "survival"   "tibble"     "tidyr"      "tidyselect"
[43] "timeDate"   "tools"      "utf8"       "utils"      "vctrs"      "withr"     
R> 

We are also looking into some possible code aggregation / re-organisation of the R package along with the Python package and C++ backend. But as you were able to work off the basic corels interface, hopefully this should not affect you. We hope to "eventually" make this a little nicer and easier to use.

billster45 commented 4 years ago

Thanks for the feedback on dependencies and how to pull them out. Now that I know the basic mechanics of package development, I can pay attention to that part. Looking forward to later versions as time allows!

eddelbuettel commented 4 years ago

Oh, no need to close it :)

I'd be interested in some downstream processing. I happen to really enjoy data.table, which is recognised as being both the fastest at many tasks and also reliably and conservatively managed with zero Depends (because to some of us lightweight is the right weight). It may even make a nice side-by-side case study: "here is how you aggregate with paradigm A" and "here is how you do it with paradigm B".
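
For illustration, a minimal sketch of that side-by-side idea on the built-in mtcars data (the grouping and summary are arbitrary examples, nothing from tidycorels):

## paradigm A: dplyr
library(dplyr)
mtcars %>%
    group_by(cyl) %>%
    summarise(mean_mpg = mean(mpg), n = n())

## paradigm B: data.table, zero recursive dependencies
library(data.table)
dt <- as.data.table(mtcars)
dt[, .(mean_mpg = mean(mpg), n = .N), by = cyl]

Both return the same per-cylinder aggregation; the difference is the dependency tail each pulls in.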

billster45 commented 4 years ago

Thanks, that's great advice re data.table and an excuse to build skills in it. I have used it before. And as you say, it's a way to build packages with fewer dependencies. And many thanks for your tinyverse link to all your writings on this that explain the less-is-more view beautifully. I sure picked the wrong person to share a first attempt at a bloated package with!

Re Corels itself, it amazes me how the simple rules are more accurate than Rebecca's walkthrough on the same test data. Though granted, I expect her aim is explanation rather than accuracy on the test set.

I have also compared Corels to another tidymodels example. It uses open German credit data. The Corels performance on the same test data was pretty close to the best-performing XGBoost model in that repo. And it could be argued that the small loss of accuracy is more than made up for by transparent rules vs. a black-box algorithm, particularly for credit risk, where transparency is so desirable and a part of regulators' checks.

Thanks for taking the time to look at this and share advice.

eddelbuettel commented 4 years ago

I sure picked the wrong person to share a first attempt at a bloated package with!

:grinning: Sadly this has become a very contested and "political" topic. Not too many people these days come at it with fresh eyes aiming to evaluate technical design decisions on technical merits. Oh well. BTW re the comparison, one thing I once enjoyed (and then got good feedback on) was this writeup, which is a "tinier" version of the original.

Very interesting what you write about the predictive performance here. We should kick that can a little further down the road!

billster45 commented 4 years ago

I've reduced dependencies to 3!

R> packrat:::recursivePackageDependencies("tidycorels", lib.loc = .libPaths()[1])
[1] "Rcpp"       "corels"     "data.table"

New version here displayed with pkgdown: https://billster45.github.io/tidycorels/

Thank you. Using your feedback I have both learnt a lot and enjoyed improving the package. Quick thoughts:

  1. data.table, as well as being faster, led me to a much simpler method of just reshaping the data frame before converting to text for corels (see the sketch after this list).
  2. I was surprised how using base functions to manipulate text instead of stringr also led to cleaner code. I'll reach for them more often now.
  3. When adjusting corels arguments to improve performance on training data, the argument with the most impact on the rules returned is regularization.
  4. tidycorels is still a hacky package helping me build skills (though a lot cleaner with your feedback). It would be great if corels was added to parsnip. I expect users coming through parsnip would primarily want to tune regularization.
  5. curiosity_policy, when set to 4 (depth-first search), never seems to complete for me.
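
A minimal sketch of the reshape in point 1, with made-up column names and paths. The text layout ("{rule} 0 1 ..." lines plus a two-line labels file) follows the format the upstream corels tool documents, so treat this as an illustration rather than tidycorels' exact internals:

library(data.table)

## toy input (made-up columns): a 0/1 label plus one categorical predictor
dt <- data.table(id    = 1:4,
                 label = c(1, 0, 1, 0),
                 age   = c("young", "old", "young", "young"))

## step 1: one-hot encode with dcast() -- one 0/1 column per level
wide <- dcast(dt, id + label ~ age, fun.aggregate = length, value.var = "age")

## step 2: write each 0/1 column as a corels text line: "{name} v1 v2 ... vn"
as_corels_line <- function(name, values)
    paste0("{", name, "} ", paste(values, collapse = " "))

writeLines(c(as_corels_line("age=young", wide$young),
             as_corels_line("age=old",   wide$old)),
           file.path(tempdir(), "train.out"))
writeLines(c(as_corels_line("label=0", 1 - wide$label),
             as_corels_line("label=1", wide$label)),
           file.path(tempdir(), "train.label"))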
billster45 commented 4 years ago

Hi, I realised the alluvial plot was not quite right. And actually it was doing Corels a disservice by not properly showing how little information the rules need! I have corrected it so that only the information the rules use to classify is in the alluvial_df dataframe that tidycorels returns.

And I have altered tidycorels to return two node and edge dataframes that let you create a D3 sankey network diagram of the rules. With further manipulation of the node and edge dataframes I have also shown how you can create a hierarchical network diagram of the rules too. I was trying to create something like this with the visNetwork package but it's not quite right yet.
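
For anyone following along, a minimal sketch of feeding node and edge data frames like those into networkD3::sankeyNetwork() (the data frames here are invented placeholders, not tidycorels' actual output):

library(networkD3)

## placeholder node and edge data frames in the shape described above
nodes <- data.frame(name = c("All cases", "rule 1 fires", "rule 2 fires"))
edges <- data.frame(source = c(0, 0),    # zero-indexed into nodes
                    target = c(1, 2),
                    value  = c(30, 70))  # e.g. cases routed down each branch

sankeyNetwork(Links = edges, Nodes = nodes,
              Source = "source", Target = "target",
              Value = "value", NodeID = "name")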

I was also interested in the accuracy of each rule in the order in which they fire, so tidycorels now returns another data frame with the performance of each rule. It is interesting to then compare performance between train and test data and see which rule(s) do well (or not so well) in test.
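
As a rough sketch of the per-rule idea (the column names are hypothetical, not tidycorels' actual output): given one row per case recording which rule fired and whether the prediction was correct, the per-rule accuracy in firing order is a simple grouped mean:

library(data.table)

## hypothetical per-case results: which rule fired, and was it correct?
res <- data.table(rule_order = c(1, 1, 2, 2, 3, 3),
                  correct    = c(TRUE, TRUE, TRUE, FALSE, FALSE, TRUE))

## accuracy of each rule, in the order the rules fire
res[, .(n = .N, accuracy = mean(correct)), keyby = rule_order]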

You can see all these updates demonstrated from this point downwards in the diabetes data example.