HunterMcGushion / hyperparameter_hunter

Easy hyperparameter optimization and automatic result saving across machine learning algorithms and libraries
MIT License

Q: what is RandomForestOptPro exactly? #194

Closed chwang1991 closed 4 years ago

chwang1991 commented 5 years ago

Hi, I'm an ML rookie, and I'm wondering what RandomForestOptPro (and others like GradientBoostedRegressionTreeOptPro) is and how it works.

I know basic parameter-tuning techniques like Bayesian optimization, but I've never heard of "RandomForest Opt" before. Is there a paper that explains it, or should I check somewhere in your documentation? (I tried but failed :3)

And a possibly stupid question: how do I extract the best estimator from an Experiment? So far I'm guessing the only way is to check the leaderboard, get the best estimator's key, and then check the JSON file? Could it be more straightforward?

BTW thanks for developing such a GREAT tool!

chwang1991 commented 5 years ago

OK, I see there is a "hyperparameter_key" column in the leaderboard file, so can I use that key to directly get the hyperparameter combo?

HunterMcGushion commented 5 years ago

@chwang1991,

Thanks for opening this issue! Let me start by saying that your questions are definitely not “stupid”! These are fantastic questions, and I’m sure many others have them as well. So thank you for bringing them to my attention, because it means that I probably need to improve the documentation.

Regarding RandomForestOptPro and GradientBoostedRegressionTreeOptPro, you may already be clear on this, but it helps me to remember that they’re just normal ML algorithms. Hyperparameter optimization is just ML on top of ML. These two algorithms (I call them “OptPros” or “Optimization Protocols” because they’re designed for hyperparameter optimization) are built on top of SKLearn’s RandomForestRegressor and GradientBoostingRegressor, respectively. Since you’re familiar with Bayesian optimization, it could help to remember that our BayesianOptPro is similarly built on top of SKLearn’s GaussianProcessRegressor.

Each of the Optimization Protocols is just a wrapper around some existing ML algorithm that lets us use it for hyperparameter optimization. So in order to understand the two OptPros you mentioned, I would recommend reading the documentation for the SKLearn classes they use internally: RandomForestRegressor and GradientBoostingRegressor. These classes serve as the base_estimator for each OptPro, and we also make use of acquisition functions to estimate the utility of each proposed set of hyperparameters, just as in standard Bayesian optimization.
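If it helps to see the idea in isolation, here is a rough, self-contained sketch of surrogate-model optimization. This is a toy illustration rather than HH's actual internals: the objective function and the search bounds are made up, and a real OptPro would use an acquisition function instead of greedily taking the surrogate's predicted maximum.

```python
# Toy sketch of surrogate-model hyperparameter optimization: a
# RandomForestRegressor learns to predict the score of hyperparameter
# candidates, then proposes the next set to evaluate.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

def objective(max_depth, learning_rate):
    """Stand-in for training a real model and returning its CV score."""
    return -((max_depth - 6) ** 2) - 10 * (learning_rate - 0.3) ** 2

# Hyperparameter sets already evaluated, with their observed scores
observed = rng.uniform([2, 0.01], [12, 1.0], size=(10, 2))
scores = np.array([objective(d, lr) for d, lr in observed])

# Fit the surrogate on (hyperparameters -> score)
surrogate = RandomForestRegressor(n_estimators=100, random_state=0)
surrogate.fit(observed, scores)

# Propose candidates and pick the one the surrogate predicts will score best.
# (A real OptPro balances exploration/exploitation with an acquisition function.)
candidates = rng.uniform([2, 0.01], [12, 1.0], size=(1000, 2))
next_to_try = candidates[np.argmax(surrogate.predict(candidates))]
print("Next hyperparameters to evaluate:", next_to_try)
```

Swap RandomForestRegressor for GaussianProcessRegressor or GradientBoostingRegressor in the sketch above and you have the essence of the other OptPros.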

In the end, the most significant difference between the BayesianOptPro with which you’re familiar and RandomForestOptPro, for example, is that BayesianOptPro uses GaussianProcessRegressor as its base_estimator, whereas RandomForestOptPro uses RandomForestRegressor.

I know my response hasn’t really touched on the technical differences between the OptPros, but like I said, to get a better idea of their specific behaviors, you can read the documentation of the base_estimator classes.

For HH, it makes sense to offer a wider range of OptPros because it’s so easy to switch between them: HH automatically records and reuses Experiments during Optimization. Sometimes you may find that BayesianOptPro simply isn’t working as well as you want. With HH, you can stop your current OptPro and pick up learning with a different OptPro to see how things go, without losing anything your previous OptPro learned. In my own problems, I find it often helps to get the different perspective of another OptPro. In fact, I’ll often cycle through all of them, giving each 10 iterations before moving on to the next. I don’t know of any other library that enables this sort of diverse exploration of the problem space while still retaining all of the data collected by previous optimization rounds. The sketch below shows roughly what that switch looks like in practice.
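This is only a rough sketch following the pattern in the README, not a drop-in recipe: parameter names have changed across versions (`forge_experiment` replaced `set_experiment_guidelines`, and `metrics` was previously `metrics_map`), so adjust for the HH version you have installed, and the breast cancer dataset here is just stand-in data.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from xgboost import XGBClassifier

from hyperparameter_hunter import Environment, BayesianOptPro, RandomForestOptPro
from hyperparameter_hunter import Integer, Real

# Stand-in data: any DataFrame with a target column works the same way
data = load_breast_cancer()
train_df = pd.DataFrame(data.data, columns=data.feature_names)
train_df["target"] = data.target

# One Environment: every Experiment run under it is saved to results_path
env = Environment(
    train_dataset=train_df,
    results_path="HyperparameterHunterAssets",
    target_column="target",
    metrics=["roc_auc_score"],
    cv_type="StratifiedKFold",
    cv_params=dict(n_splits=5, shuffle=True, random_state=32),
)

search_space = dict(
    max_depth=Integer(2, 10),
    learning_rate=Real(0.01, 0.5),
    n_estimators=Integer(50, 300),
    subsample=0.5,  # fixed value, not optimized
)

# Round 1: Bayesian optimization (GaussianProcessRegressor surrogate)
opt_1 = BayesianOptPro(iterations=10)
opt_1.forge_experiment(model_initializer=XGBClassifier, model_init_params=search_space)
opt_1.go()

# Round 2: same search space, RandomForestRegressor surrogate. The saved
# Experiments from round 1 are found automatically and used to warm-start it.
opt_2 = RandomForestOptPro(iterations=10)
opt_2.forge_experiment(model_initializer=XGBClassifier, model_init_params=search_space)
opt_2.go()
```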

Turning to your question about finding the best Experiment, you are correct that the leaderboard file is the way to go. This could definitely be better documented, so I apologize. What you’ll want to do is check the Leaderboard for the experiment_id corresponding to the score you want, then go to the “Experiments/Descriptions” directory. Each of the JSON files there is named after an experiment_id in the Leaderboard. So open up the JSON file for the Experiment you want, and you’ll find all of the important information about the Experiment: scores, execution times, the algorithm used, and of course the hyperparameters and feature engineering steps used. Under the “hyperparameters” key you can find all the hyperparameters used for the Experiment (even the ones you didn’t explicitly declare), so you can thoroughly recreate it.
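If you'd rather do that lookup in code, a quick sketch is below. Double-check the leaderboard filename in your own "Leaderboards" directory and which way it is sorted; both are assumed here.

```python
# Sketch: pull the best Experiment's hyperparameters back out of the saved assets.
import json
from pathlib import Path

import pandas as pd

assets = Path("HyperparameterHunterAssets")

# Adjust the filename to match your "Leaderboards" directory
leaderboard = pd.read_csv(assets / "Leaderboards" / "GlobalLeaderboard.csv")

# Assuming the leaderboard is sorted best-first; otherwise sort by your metric column
best_experiment_id = leaderboard.iloc[0]["experiment_id"]

# Description files are named after the experiment_id
description_path = assets / "Experiments" / "Descriptions" / f"{best_experiment_id}.json"
with open(description_path) as f:
    description = json.load(f)

# Everything needed to recreate the Experiment, including defaulted hyperparameters
print(description["hyperparameters"])
```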

Both of these topics could definitely be documented better, so thank you very much for asking these fantastic questions! Please let me know if you have any further questions or feedback, and thank you for your support! I really appreciate it!

chwang1991 commented 4 years ago

I'm surprised there are still open-source maintainers today willing to explain things in as much detail as you do....

You not only answered my question but also helped extend my knowledge -- you're right, hyperparameter-tuning techniques are just ML on top of ML! Thank you so much!