andybega / forecaster2

Coup forecasts
https://www.predictiveheuristics.com/forecasts
MIT License

Try out a skinny forest HP strategy for the RF models #9

Open andybega opened 4 years ago

andybega commented 4 years ago

Instead of a relatively small number of decision trees that are each fairly deep and grown on a lot of data, try out an alternative strategy using a large number of trees, where each tree is relatively shallow and grown on only a relatively small data sample. A variation of this is to also use stratified sampling that downsamples the negative cases.
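As a minimal sketch of what a single "skinny" forest fit could look like with ranger() directly, where the data frame dat and the binary factor outcome y are hypothetical placeholders and the specific values are illustrative rather than tuned:

library(ranger)

fit <- ranger(
  y ~ ., data = dat,
  num.trees = 5000,       # many trees ...
  sample.fraction = 0.1,  # ... each grown on ~10% of the rows
  max.depth = 6,          # keep trees shallow (needs a recent ranger version)
  probability = TRUE      # probability forest, for predicted probabilities
)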

mlr3 uses the following defaults for ranger():

library(mlr3learners)  # needed so the "classif.ranger" learner is registered

learner = mlr3::lrn("classif.ranger")
learner$param_set$default  # named list of the ranger() defaults
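Individual hyperparameters can be overridden when constructing the learner; the values here are placeholders, not recommendations:

learner = mlr3::lrn("classif.ranger", num.trees = 2000, min.node.size = 5)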

The "sample.fraction" argument can be a vector giving the number of cases (relative to the total number of cases) to sample from each outcome factor class. See the bottom answer at https://stats.stackexchange.com/questions/171380/implementing-balanced-random-forest-brf-in-r-using-randomforests, and the linked ranger issues.

So, for example, sample.fraction = c(0.1, 0.9) should give a resampled dataset with 10% positive cases and the same number of rows as the original data, assuming the positive class is the first factor level (the fractions follow the order of the outcome's factor levels).
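A minimal sketch with ranger() directly, again with a hypothetical dat and y; replace = TRUE keeps the class-wise draws valid even when a requested fraction exceeds a class's share of the data:

library(ranger)

# assumes levels(dat$y) puts the positive class first
fit <- ranger(
  y ~ ., data = dat,
  num.trees = 1000,
  sample.fraction = c(0.1, 0.9),  # 10% of N from class 1, 90% of N from class 2
  replace = TRUE
)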

Things to vary:

Chen, Liaw, and Breiman in the balanced random forest paper recommend drawing the same number of cases for both classes, i.e. a 1:1 proportion, e.g. sample.fraction = c(0.5, 0.5). Maybe that's a good starting point.

So in total, three tuning strategies (a combined sketch follows the list):

  1. Default RF with sample.fraction = 1, optimizing over mtry and min.node.size; this I already have
  2. Balanced RF with sample.fraction = c(0.5, 0.5)
  3. Skinny RF with a much larger number of trees but smaller sample fractions, e.g. c(0.1, 0.1)
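As a rough sketch, the three configurations as direct ranger() calls; dat and y are hypothetical, and all values are starting points to tune over rather than final settings:

library(ranger)

# 1. default RF: full bootstrap samples; tune mtry and min.node.size
rf_default  <- ranger(y ~ ., data = dat, num.trees = 1000,
                      sample.fraction = 1, probability = TRUE)

# 2. balanced RF: equal-size draws from each class
rf_balanced <- ranger(y ~ ., data = dat, num.trees = 1000,
                      sample.fraction = c(0.5, 0.5), replace = TRUE,
                      probability = TRUE)

# 3. skinny RF: many more trees, each grown on a small per-class sample
rf_skinny   <- ranger(y ~ ., data = dat, num.trees = 5000,
                      sample.fraction = c(0.1, 0.1), replace = TRUE,
                      probability = TRUE)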