SimonDedman / gbm.auto

Machine-learning Boosted Regression Tree software suite for species distribution modelling in R
https://doi.org/10.1371/journal.pone.0188955

Tune/Auto/step: Optimise parameters #22

Open SimonDedman opened 6 years ago

SimonDedman commented 6 years ago

Done? See Erik Franklin's code.

Trial & error iterative approach to determine the optimal lr for a data set? How? Stop at whole percentages? Try the `optim` function; see http://r.789695.n4.nabble.com/Optimization-in-R-similar-to-MS-Excel-Solver-td4646759.html. Possibly have an option to start with this in the R function. Separate function? Maybe do it as a separate function then feed into this, so the outputs are jkl? Or make one uber-function but keep all 3 usable separately. The uber-function doesn't need the loops?

For `optim`: use method "L-BFGS-B"; require(optimx). See https://stats.stackexchange.com/questions/103495/how-to-find-optimal-values-for-the-tuning-parameters-in-boosting-trees/105653#105653 :

> The caret package in R is tailor made for this. Its train function takes a grid of parameter values and evaluates the performance using various flavors of cross-validation or the bootstrap. The package author has written a book, Applied Predictive Modeling, which is highly recommended. 5 repeats of 10-fold cross-validation is used throughout the book. For choosing the tree depth, I would first go for subject matter knowledge about the problem, i.e. if you do not expect any interactions, restrict the depth to 1 or go for a flexible parametric model (which is much easier to understand and interpret). That being said, I often find myself tuning the tree depth as subject matter knowledge is often very limited. I think the gbm package tunes the number of trees for fixed values of the tree depth and shrinkage.

https://www.youtube.com/watch?v=7Jbb2ItbTC4

See gbm.fixed in BRT_ALL.R — having processed the optimal BRT, might as well just use those details going forward rather than re-running the best one again.
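
A hedged sketch of the caret route from that answer (not gbm.auto code; `mydata` and the response column `presence` are placeholder names), gridding lr / tc / n.trees with 5 repeats of 10-fold CV:

```r
# Sketch only: caret grid search over gbm hyperparameters with repeated CV
library(caret)
library(gbm)

grid <- expand.grid(
  n.trees = c(1000, 2000, 5000),     # number of trees
  interaction.depth = c(1, 3, 5),    # tree complexity (tc)
  shrinkage = c(0.1, 0.01, 0.001),   # learning rate (lr)
  n.minobsinnode = 10
)

ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)

fit <- train(
  presence ~ .,          # placeholder response / predictors
  data = mydata,         # placeholder data frame
  method = "gbm",
  tuneGrid = grid,
  trControl = ctrl,
  verbose = FALSE
)

fit$bestTune  # best lr / tc / n.trees combination by CV
```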

SimonDedman commented 6 years ago

See this: https://www.r-bloggers.com/error-handling-in-r/. Currently there's no way to continue when gbm.loop/auto/step crashes. Can use this to fail fast: try from the quickest, highest LR values down to lower LR values (divide by 10 each step from 0.1?) until it runs, which constrains the possibility space. Would need to get a feel for how BF & LR (& TC?) combine in practice to produce the final CV score. If they're hierarchical (LR > BF > TC) then could optimise in order, setting the possibility space backwards: TC is defined by the number of explanatory variables, BF by gbm.bfcheck. Optimise CV on LR alone, then optimise BF, then TC.
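
A minimal sketch of that fail-fast loop with dismo::gbm.step (untested; `samples`, `expvar`, `resvar` stand in for the usual gbm.auto inputs):

```r
# Try lr from largest to smallest; keep the first value that fits successfully
library(dismo)

lrs <- c(0.1, 0.01, 0.001, 0.0001)
best_lr <- NULL

for (lr in lrs) {
  m <- tryCatch(
    gbm.step(data = samples,
             gbm.x = expvar, gbm.y = resvar,
             family = "bernoulli",
             tree.complexity = 2,
             learning.rate = lr,
             bag.fraction = 0.5),
    error = function(e) NULL  # treat a crash as "this lr doesn't run"
  )
  if (!is.null(m)) {
    best_lr <- lr
    break  # stop at the largest lr that successfully fits
  }
}
best_lr
```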

SimonDedman commented 6 years ago

https://stat.ethz.ch/R-manual/R-devel/library/base/html/options.html — options(error) calls stop by default; could potentially save getwd() at the start of loop/auto/step then have options(error) restore it: setwd(initialwd) then stop.
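
Something like this, as a rough sketch (on.exit() inside the functions is an alternative that also handles normal exits):

```r
# Record the working directory on entry, restore it if anything fails
initialwd <- getwd()

## inside gbm.loop/auto/step this would be the usual idiom:
# on.exit(setwd(initialwd), add = TRUE)

## or globally, as suggested above:
options(error = function() setwd(initialwd))
```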

SimonDedman commented 5 years ago

Could parallelise param combinations and run loads of gbm.autos in a foreach loop (see https://github.com/SimonDedman/gbm.auto/issues/21), then compare processing time and CV score. This also gives the option to check for the absence of... report.csv? If the gbm.auto run fails then report.csv will be absent.
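
A speculative sketch of that parallel grid (argument names follow gbm.auto, but the grid values, per-run directories, and report-file check are illustrative assumptions):

```r
# Run gbm.auto over a grid of lr/bf/tc combinations in parallel, flag successes
library(foreach)
library(doParallel)
library(gbm.auto)

grid <- expand.grid(lr = c(0.1, 0.01, 0.005),
                    bf = c(0.5, 0.75, 0.9),
                    tc = c(2, 5))

registerDoParallel(cores = parallel::detectCores() - 1)

success <- foreach(i = seq_len(nrow(grid)), .combine = c,
                   .packages = "gbm.auto") %dopar% {
  rundir <- paste0("run_", i)
  dir.create(rundir, showWarnings = FALSE)
  setwd(rundir)
  try(gbm.auto(grids = grids, samples = samples,
               expvar = expvar, resvar = resvar,
               lr = grid$lr[i], bf = grid$bf[i], tc = grid$tc[i]),
      silent = TRUE)
  setwd("..")
  # report csv is only written if the run completed (per the note above)
  length(list.files(rundir, pattern = "Report", recursive = TRUE)) > 0
}

cbind(grid, success)
```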

SimonDedman commented 5 years ago

Probably there's a relationship between sample size (/ positive sample size) & variance, and optimal bfs & lrs. Certainly gbm.bfcheck gives you a range of bfs. If I can find this relationship, that probably obviates much (all?) of the work of optimising.

Could do this using BRTs!! What's the influence & relationship shape of tc, lr, bf on the CV score, & how do they interact? 3D surface output potentially?

What's the relationship between gbm.bfcheck results and what will actually run?

SimonDedman commented 5 years ago

Optimise section as a wrapper around gbm.step for bin, and separately for gaus. Once it's optimised and run, the values will already be saved in the report csv, and can then be re-used as list(bin, gaus) params if re-running in future. For optimising, ideally use the largest LR, the smallest BF >= 0.5, and TC related to the number of explanatory variables.
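
A hypothetical shape for that wrapper, building on the fail-fast loop above (illustrative names only; not the actual gbm.auto internals):

```r
# Optimise one family (bin or gaus) by walking lr down from the largest value,
# with bf fixed at the smallest acceptable value >= 0.5 and tc tied to n expvars
library(dismo)

optimise_family <- function(samples, expvar, resvar, family,
                            lrs = c(0.1, 0.05, 0.01, 0.005, 0.001),
                            bf = 0.5) {
  tc <- min(length(expvar), 5)  # crude tc ~ number of explanatory variables
  for (lr in lrs) {
    m <- tryCatch(
      gbm.step(data = samples, gbm.x = expvar, gbm.y = resvar,
               family = family, tree.complexity = tc,
               learning.rate = lr, bag.fraction = bf),
      error = function(e) NULL
    )
    if (!is.null(m)) return(list(lr = lr, bf = bf, tc = tc, model = m))
  }
  NULL
}

## bin and gaus optimised separately, then fed back as list(bin, gaus) params:
# binpars  <- optimise_family(samples, expvar, resvarbin,  family = "bernoulli")
# gauspars <- optimise_family(samples, expvar, resvargaus, family = "gaussian")
```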

SimonDedman commented 3 years ago

https://www.tidymodels.org/learn/work/tune-svm/ — could do this within the tidymodels framework. Could conceptually rewrite the entirety of gbm.auto within that framework... See also https://dials.tidymodels.org/articles/Basics.html and https://tune.tidymodels.org/articles/getting_started.html
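
Conceptually, the tidymodels version might look like this (sketch only; `mydata`/`presence` are placeholders, and the xgboost engine is standing in for gbm):

```r
# Declare tunable BRT parameters with parsnip, search them with tune_grid()
library(tidymodels)

spec <- boost_tree(trees = 2000,
                   tree_depth = tune(),      # ~ tree complexity
                   learn_rate = tune(),      # ~ lr
                   sample_size = tune()) |>  # ~ bag fraction
  set_engine("xgboost") |>
  set_mode("classification")

wf <- workflow() |>
  add_model(spec) |>
  add_formula(presence ~ .)

folds <- vfold_cv(mydata, v = 10)

res <- tune_grid(wf, resamples = folds, grid = 20)  # auto space-filling grid
show_best(res, metric = "roc_auc")
```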

SimonDedman commented 2 years ago

This is already solved in Python: https://scikit-optimize.github.io/stable/auto_examples/sklearn-gridsearchcv-replacement.html https://scikit-optimize.github.io/stable/ — see Hulbert.etal.2020.Exponential build seismic energy Cascadia.pdf:

> We rely on the XGBoost library for the gradient boosted trees' regression, shown in Fig. 2 of the paper (and for results presented below). The problem is posed in a regression setting. Model hyperparameters are set by five-fold cross-validation, using Bayesian optimization (skopt library).

Anything comparable in R? https://softwarerecs.stackexchange.com/questions/25728/scikit-learn-for-r Could do it via reticulate: https://rstudio.github.io/reticulate/ ?
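
An untested sketch of the reticulate route, mirroring the linked BayesSearchCV example (assumes scikit-learn and scikit-optimize are installed in the active Python; `X`/`y` are placeholders):

```r
# Drive scikit-optimize's BayesSearchCV from R via reticulate
library(reticulate)

skopt       <- import("skopt")
skopt_space <- import("skopt.space")
ensemble    <- import("sklearn.ensemble")

opt <- skopt$BayesSearchCV(
  estimator = ensemble$GradientBoostingClassifier(),
  search_spaces = list(
    learning_rate = skopt_space$Real(1e-3, 0.1, prior = "log-uniform"),  # lr
    max_depth     = skopt_space$Integer(1L, 5L),                         # ~ tc
    subsample     = skopt_space$Real(0.5, 0.9)                           # ~ bf
  ),
  n_iter = 30L,
  cv = 5L
)

opt$fit(X, y)      # placeholder predictor matrix / response vector
opt$best_params_
```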

SimonDedman commented 2 years ago

See https://community.rstudio.com/t/hyperparameters-optimisation-frameworks-for-r-such-as-optuna-or-hyperopt-in-python/58457/2

SimonDedman commented 2 years ago

https://github.com/AnotherSamWilson/ParBayesianOptimization
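
Its bayesOpt() could wrap the gbm.step CV score directly; a rough, untested sketch (placeholder data objects, with CV deviance assumed as the score to minimise):

```r
# Bayesian optimisation of lr/bf/tc against the gbm.step CV deviance
library(ParBayesianOptimization)
library(dismo)

score_fun <- function(lr, bf, tc) {
  m <- tryCatch(
    gbm.step(data = samples, gbm.x = expvar, gbm.y = resvar,
             family = "bernoulli",
             learning.rate = lr, bag.fraction = bf,
             tree.complexity = round(tc)),
    error = function(e) NULL)
  if (is.null(m)) return(list(Score = -1e6))    # heavy penalty for failed fits
  list(Score = -m$cv.statistics$deviance.mean)  # maximise = minimise CV deviance
}

opt <- bayesOpt(
  FUN = score_fun,
  bounds = list(lr = c(0.0005, 0.1), bf = c(0.5, 0.9), tc = c(1L, 6L)),
  initPoints = 8,
  iters.n = 20
)

getBestPars(opt)
```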

SimonDedman commented 2 years ago

https://www.rdocumentation.org/packages/mlr/versions/2.19.0/topics/tuneParamsMultiCrit possible final solution
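
A possible shape for that (sketch only; trades predictive performance against training time over lr/tc/bf, with placeholder data and search budget):

```r
# mlr multi-criteria tuning of a gbm learner: AUC vs training time
library(mlr)

task <- makeClassifTask(data = mydata, target = "presence")  # placeholders
lrn  <- makeLearner("classif.gbm", predict.type = "prob",
                    par.vals = list(n.trees = 2000))

ps <- makeParamSet(
  makeNumericParam("shrinkage", lower = 0.0005, upper = 0.1),   # lr
  makeIntegerParam("interaction.depth", lower = 1, upper = 5),  # tc
  makeNumericParam("bag.fraction", lower = 0.5, upper = 0.9)    # bf
)

res <- tuneParamsMultiCrit(
  lrn, task,
  resampling = makeResampleDesc("CV", iters = 5),
  measures = list(auc, timetrain),
  par.set = ps,
  control = makeTuneMultiCritControlRandom(maxit = 30)
)

res$x  # Pareto-optimal parameter settings
```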

SimonDedman commented 1 year ago

see gbm.auto.extras folder, tryCatchTest.R

SimonDedman commented 1 year ago

https://proceedings.neurips.cc/paper_files/paper/2011/file/86e8f7ab32cfd12577bc2619bc635690-Paper.pdf — see the TPE (Tree-structured Parzen Estimator) section.