PKU-DAIR / open-box

Generalized and Efficient Blackbox Optimization System
https://open-box.readthedocs.io

Question about random forest surrogate and high dimensional data #64

Open wushanyun64 opened 1 year ago

wushanyun64 commented 1 year ago

Hi Openbox,

I have a few quick questions, and I would really appreciate it if you could help me better understand them.

  1. For the random forest surrogate model, what method does OpenBox use to estimate the variance of a prediction? From the source code I found, it seems to bag random forest models with different random states and take the variance across the bag. Is this a statistically valid method, and if so, are there any references discussing this approach?
  2. I see some other packages use a jackknife to estimate the variance of a random forest. Are you considering implementing this?
  3. Can OpenBox handle high-dimensional search spaces?

Thanks in advance,

jhj0411jhj commented 1 year ago

For your questions,

  1. The probabilistic random forest (PRF) was proposed in SMAC [1] (source code: https://github.com/automl/random_forest_run). The PRF model computes the predicted mean/variance from its inner regression trees, not the mean/variance of a bag of RFs. (The source code you saw might be a deprecated version. The current one is here: https://github.com/PKU-DAIR/open-box/blob/master/openbox/surrogate/base/rf_with_instances.py)

     [1] Hutter, Frank, Holger H. Hoos, and Kevin Leyton-Brown. "Sequential model-based optimization for general algorithm configuration." Learning and Intelligent Optimization: 5th International Conference, LION 5, Rome, Italy, January 17-21, 2011. Selected Papers 5. Springer Berlin Heidelberg, 2011.

  2. Do you mean sampling multiple sets of training data and training different RFs? Could you provide the package names or papers? We will look into it, but we may not implement it in OpenBox since the PRF already works well in practice.

  3. Vanilla BO suffers from the curse of dimensionality. Several newer methods (e.g. [2][3]) have been proposed to enhance BO on high-dimensional problems. However, algorithms for high-dimensional optimization are still under development in OpenBox.

     [2] Eriksson, David, et al. "Scalable global optimization via local Bayesian optimization." Advances in Neural Information Processing Systems 32 (2019).

     [3] Wang, Linnan, Rodrigo Fonseca, and Yuandong Tian. "Learning search space partition for black-box optimization using Monte Carlo tree search." Advances in Neural Information Processing Systems 33 (2020): 19511-19522.
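To illustrate the aggregation described in point 1: below is a minimal sketch of computing a predictive mean/variance across the individual regression trees of one forest. It assumes per-tree predictions are already available and is not OpenBox's actual `rf_with_instances.py` implementation, which also handles per-tree leaf statistics.

```python
import numpy as np

def prf_mean_var(tree_preds):
    """PRF-style uncertainty: aggregate per-tree predictions into a
    predictive mean and variance (a sketch of the idea from SMAC [1])."""
    tree_preds = np.asarray(tree_preds, dtype=float)  # (n_trees, n_points)
    mean = tree_preds.mean(axis=0)   # empirical mean over trees
    var = tree_preds.var(axis=0)     # empirical variance over trees
    return mean, var

# hypothetical per-tree predictions at two test points
preds = [[1.0, 3.0], [1.2, 2.6], [0.8, 3.4]]
mean, var = prf_mean_var(preds)
```

The key point is that the spread comes from disagreement among trees of a single forest, not from retraining multiple forests with different random states.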

wushanyun64 commented 1 year ago

Thank you for the explanation. Here is the paper I was referring to for the second point: https://jmlr.org/papers/volume15/wager14a/wager14a.pdf
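For context, the jackknife-after-bootstrap estimator from that paper (Wager, Hastie & Efron 2014) can be sketched as follows. The function name is illustrative, and the paper's Monte Carlo bias correction is omitted for brevity; it needs the bootstrap in-bag counts, which a bagged ensemble records at training time.

```python
import numpy as np

def jackknife_after_bootstrap_var(tree_preds, inbag):
    """Jackknife-after-bootstrap variance estimate for a bagged predictor
    (a sketch; bias correction from the paper omitted).

    tree_preds: (n_trees,) per-tree predictions at one test point.
    inbag: (n_trees, n_train) bootstrap counts of each training point."""
    tree_preds = np.asarray(tree_preds, dtype=float)
    inbag = np.asarray(inbag)
    n_trees, n = inbag.shape
    overall = tree_preds.mean()
    var = 0.0
    for i in range(n):
        mask = inbag[:, i] == 0  # trees that never saw training point i
        if mask.any():
            # squared shift of the leave-one-out ensemble mean
            var += (tree_preds[mask].mean() - overall) ** 2
    return (n - 1) / n * var

# toy example: 3 trees, 2 training points
v = jackknife_after_bootstrap_var([1.0, 2.0, 3.0], [[1, 1], [2, 0], [0, 2]])
```

Unlike the per-tree variance of the PRF, this estimates the sampling variance of the bagged mean itself.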

About the third question: my understanding is that using an RF instead of a GP partially addresses the high-dimension problem, since GPs are typically not well suited to high-dimensional data. If we choose an RF as the surrogate, what other problems will we encounter when dealing with high-dimensional data?

jhj0411jhj commented 11 months ago

Hi @wushanyun64, RF is better than GP on high-dimensional problems. However, since data points are scarce relative to the dimensionality of the space, traditional BO methods tend to over-explore and may behave like random search. Some high-dimensional methods tackle this problem by restricting the search to a local region around existing points.
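The local-restriction idea (as in the trust-region approach of [2]) can be sketched as sampling acquisition candidates only inside a box around the incumbent. All names below are illustrative, not OpenBox API; the assumed setup is a continuous search space with per-dimension bounds.

```python
import random

def trust_region_candidates(best_x, length, bounds, n=5, seed=0):
    """Sample candidate points inside a box of side `length` centered on
    the incumbent `best_x`, clipped to the global `bounds` -- a sketch of
    the local-search restriction used by TuRBO-style high-dim BO."""
    rng = random.Random(seed)
    cands = []
    for _ in range(n):
        x = []
        for xi, (lo, hi) in zip(best_x, bounds):
            l = max(lo, xi - length / 2)  # clip box to the search space
            u = min(hi, xi + length / 2)
            x.append(rng.uniform(l, u))
        cands.append(x)
    return cands

best = [0.5] * 10  # incumbent in a 10-d unit cube
cands = trust_region_candidates(best, 0.2, [(0.0, 1.0)] * 10)
```

In the full method the box side length shrinks after failures and grows after successes, which is what keeps the search from degenerating into random exploration of the whole space.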