SimonDedman / gbm.auto

Machine-learning Boosted Regression Tree software suite for species distribution modelling in R
https://doi.org/10.1371/journal.pone.0188955

Low-N Sensitivity analysis, bootstrapping, optimising #18

Open SimonDedman opened 6 years ago

SimonDedman commented 6 years ago

Chuck stuff: It does seem that the Gaussian models stop working reliably (I got individual runs to work for bull and sandbar sharks, but could never get the same parameters to work more than once) somewhere between 44 and 33 “positive” sets. I wonder if it might be worth a separate paper doing some kind of sensitivity analysis to figure out where that line actually is? [chuck]
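A minimal sketch of what that sensitivity analysis might look like: subsample the positive sets down to progressively smaller counts and record how often a Gaussian BRT still fits. This assumes a data.frame `dat` with a response column `cpue` (0 = absence) and predictor names in `preds`; `gbm.step()` is from the dismo package and returns NULL when it cannot fit a model. The count grid, repeat number, and BRT settings are placeholders, not tested values.

```r
# Hypothetical sketch: at how few positive sets does the Gaussian BRT stop
# fitting reliably? Assumes `dat`, `cpue`, and `preds` as described above.
library(dismo)

pos     <- dat[dat$cpue > 0, ]        # positive sets only (Gaussian stage)
counts  <- seq(50, 20, by = -5)       # candidate numbers of positive sets
n_reps  <- 10                         # repeats per count, to test reliability

success <- sapply(counts, function(n) {
  mean(replicate(n_reps, {
    sub <- pos[sample(nrow(pos), n), ]
    m <- try(gbm.step(data = sub, gbm.x = preds, gbm.y = "cpue",
                      family = "gaussian", tree.complexity = 2,
                      learning.rate = 0.005, bag.fraction = 0.5,
                      silent = TRUE), silent = TRUE)
    !is.null(m) && !inherits(m, "try-error")   # TRUE if the model converged
  }))
})
data.frame(n_positive = counts, success_rate = success)
```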

A bootstrapping function: essentially loop the same params, but remove random single/multiple rows/columns of data, to test for e.g. a time-series effect even when single-year splits aren't powerful enough to run a BRT on their own because of insufficient data. See library(boot), boot(), and https://www.r-bloggers.com/the-wonders-of-foreach/. Maybe this kind of analysis could fit into the coding for one of these, or all 3 together; they're all clearly related: repeating runs, sometimes taking stuff out, and collating answers at the end. SD: I'm just bouncing an idea around my head whereby the code could (see the sketch after this list):

  1. run lower and lower (individual bin & gaus) lr/bf combos until they fail
  2. repeat the last working one a few times to test for resilience
  3. Creep down a LITTLE bit to see if it can go a bit lower reliably (settable aggression parameter)
  4. Essentially iterate until it's found its lowest reliable number
  5. Describe the curve of lr/bf combo and (reliable) success rate, noting run time.
  6. Do this for a number of species
  7. Bootstrap to make the data poorer and poorer (manual after this point?)
  8. Throw all the results together to see if we have something that looks to reveal an underlying relationship, i.e. data strength vs gbm success & processing time
  9. Describe that relationship for various species.
  10. Are there commonalities?
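A rough sketch of how steps 1-4 might be wired up, again assuming a data.frame `dat`, predictor names `preds`, and `gbm.step()` from dismo as above; the learning-rate grid, bag fraction, repeat count, and the 0.75 "creep" factor are placeholders standing in for a settable aggression parameter.

```r
# Hypothetical sketch of steps 1-4: walk the learning rate down until the
# Gaussian BRT stops fitting, then confirm the last working value is reliable.
library(dismo)

fits_ok <- function(lr, bf = 0.5, reps = 1) {
  all(replicate(reps, {
    m <- try(gbm.step(data = dat, gbm.x = preds, gbm.y = "cpue",
                      family = "gaussian", tree.complexity = 2,
                      learning.rate = lr, bag.fraction = bf,
                      silent = TRUE), silent = TRUE)
    !is.null(m) && !inherits(m, "try-error")
  }))
}

lr_grid <- c(0.01, 0.005, 0.002, 0.001, 0.0005)  # step 1: lower and lower lr
last_ok <- NA
for (lr in lr_grid) {
  if (fits_ok(lr)) last_ok <- lr else break
}

# step 2: repeat the last working lr a few times to test resilience
reliable <- !is.na(last_ok) && fits_ok(last_ok, reps = 5)

# step 3: creep down a little to see if it can reliably go lower
if (reliable && fits_ok(last_ok * 0.75, reps = 5)) last_ok <- last_ok * 0.75

last_ok  # step 4: the lowest reliable learning rate found
```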
SimonDedman commented 5 years ago

Much of this concept is subsumed within the gbm.tune plans; once gbm.tune is complete, steps 1-4 are done. Could then potentially add run-time code (easy), but is that important? Will people want to trade less run time for a worse CV score? I haven't yet worked with really big data; maybe this is a thing?
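If run time does turn out to matter, recording it per parameter combo is trivial; a minimal sketch, assuming the same hypothetical `dat`, `preds`, and dismo `gbm.step()` setup as above:

```r
# Record run time alongside model success for one lr/bf combo (placeholders).
timing <- system.time(
  m <- try(gbm.step(data = dat, gbm.x = preds, gbm.y = "cpue",
                    family = "gaussian", learning.rate = 0.005,
                    bag.fraction = 0.5, silent = TRUE), silent = TRUE)
)
data.frame(ok        = !is.null(m) && !inherits(m, "try-error"),
           elapsed_s = unname(timing["elapsed"]))
```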

Making-data-poorer bootstrapping: how much value is gained by this? The end point would be a rule-of-thumb sense of what's likely to work, but in terms of what? N, positive N, variance, number of expvars? Could bundle this into bfcheck, and/or update bfcheck & gbm.tune to be a one-stop shop for pre-run testing? gbm.tune() params to be identical to gbm.auto's (loop?). A sketch of that kind of pre-run summary is below.
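A hypothetical helper illustrating the kind of pre-run "data strength" summary such a one-stop shop could report; the metric set is just the ones floated above, and `data_strength()` is an invented name, not an existing gbm.auto or bfcheck function.

```r
# Hypothetical pre-run summary of data strength, assuming a data.frame `dat`,
# a response column name `resp`, and a vector of predictor names `expvars`.
data_strength <- function(dat, resp, expvars) {
  y <- dat[[resp]]
  data.frame(
    n          = nrow(dat),        # total sets
    positive_n = sum(y > 0),       # non-zero (positive) sets
    variance   = var(y[y > 0]),    # variance of the positive responses
    n_expvars  = length(expvars)   # number of explanatory variables
  )
}
# e.g. data_strength(dat, "cpue", preds)
```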