distillpub / post--bayesian-optimization

Exploring Bayesian Optimization
https://distill.pub/2020/bayesian-optimization/

Review #3 #11

Open distillpub-reviewers opened 4 years ago

distillpub-reviewers commented 4 years ago

The following peer review was solicited as part of the Distill review process.

The reviewer chose to waive anonymity. Distill offers reviewers a choice between anonymous review and offering reviews under their name. Non-anonymous review allows reviewers to get credit for the service they offer to the community.

Distill is grateful to Jasper Snoek for taking the time to review this article.


Distill employs a reviewer worksheet as a help for reviewers.

The first three parts of this worksheet ask reviewers to rate a submission along certain dimensions on a scale from 1 to 5. While the scale meaning is consistently "higher is better", please read the explanations for our expectations for each score—we do not expect even exceptionally good papers to receive a perfect score in every category, and expect most papers to be around a 3 in most categories.

Any concerns or conflicts of interest that you are aware of?: No known conflicts of interest

What type of contributions does this article make?: This article gives an intuitive explanation of the basics of Bayesian optimization, and it is an enjoyable and interesting read. It does a nice job of visually demonstrating the impact and behavior of a variety of choices made within Bayesian optimization, and a great job of visualizing the various acquisition functions with the help of nice interactive plots and animations.

Advancing the Dialogue Score
How significant are these contributions? 4/5

Comments

I think this is a really clean and concise introduction to Bayesian optimization and some of the nuances of the underlying strategy followed. Bayesian optimization is certainly dear to me and I appreciate having someone take the time to produce nice visualizations so I feel inclined to accept. There is a lot that could be added to this post, but I suppose in the spirit of the journal (i.e. short and crisp) this might be just right? In any case there are some underlying modeling issues that I would like corrected before this is accepted.

Outstanding Communication Score
Article Structure 4/5
Writing Style 5/5
Diagram & Interface Style 4/5
Impact of diagrams / interfaces / tools for thought? 3/5
Readability 4/5

Comments

Scientific Correctness & Integrity Score
Are claims in the article well supported? 4/5
Does the article critically evaluate its limitations? How easily would a lay person understand them? 4/5
How easy would it be to replicate (or falsify) the results? 5/5
Does the article cite relevant work? 4/5
Does the article exhibit strong intellectual honesty and scientific hygiene? 4/5

Comments

Detailed comments

Intro: Note the word “hyperparameter” can actually be contentious. The traditional definition of a hyperparameter is as a higher level model parameter influencing the parameters of the model. Under that definition things like learning rate, optimization parameters, etc. don’t really apply. So I always say Bayesian optimization is used to tune hyperparameters, optimization parameters and other model parameters.

Mining Gold: There’s a neat historical precedent to this that’s worth mentioning. The first use of Gaussian processes was actually to model ore density in South Africa. Applying Gaussian processes was initially called “Kriging” after Danie Krige (https://en.wikipedia.org/wiki/Danie_G._Krige) who used GPs to model the spatial density of ore deposits (and figure out where to drill).

“Active Learning”: “We cannot estimate the gold estimate” sounds awkward. Maybe rephrase.

“Gaussian processes”: “Gaussian processes regression” -> Gaussian process regression “(Smoothess) Such” -> “(Smoothness). Such”

“Prior model”: The Matern kernel is a specific choice of prior that is worth spending some time rationalizing. It’s probably a good idea to introduce the concept of a kernel and how the choice of kernel corresponds to a prior over functions. Then describe how the Matern lets you determine the smoothness of the prior (i.e. a Matern 5/2 means twice differentiable). How does that correspond to your assumptions about gold smoothness?
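(To make the kernel suggestion concrete: here is a minimal numpy sketch, not taken from the article's code, of the Matern 5/2 kernel and of drawing smooth prior samples from the resulting GP. The lengthscale, variance, and grid are illustrative.)

```python
import numpy as np

def matern52(x1, x2, lengthscale=1.0, variance=1.0):
    """Matern 5/2 kernel: samples from the induced GP prior are twice differentiable."""
    r = np.abs(x1[:, None] - x2[None, :]) / lengthscale
    return variance * (1 + np.sqrt(5) * r + 5 * r**2 / 3) * np.exp(-np.sqrt(5) * r)

x = np.linspace(0, 6, 200)
K = matern52(x, x)

# Draw a few smooth prior samples over possible "gold content" functions
# (small jitter on the diagonal keeps the covariance numerically PSD).
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(np.zeros(len(x)), K + 1e-8 * np.eye(len(x)), size=3)
```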

“UCB”: “This is because while the variance or uncertainty is high for such points, the posterior mean is low.” The posterior mean is low because the prior mean is set to 0. You could subtract the mean from the observed data to make it 0 mean (then in your case the mean would go to ~5 instead of 0 and the acquisition function would be higher further from the data). It would be better to make the mean a hyperparameter of the GP and optimize it or integrate it out.
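(The mean-centering suggestion can be illustrated with a small numpy sketch, again not the article's code; an RBF kernel and made-up observations around 5 are used for brevity. Far from the data, the posterior mean reverts to the prior mean, so centering on the data mean changes the behavior substantially.)

```python
import numpy as np

def gp_posterior(x_train, y_train, x_test, kernel, noise=1e-6, prior_mean=0.0):
    """GP posterior mean/std; far from data the mean reverts to `prior_mean`."""
    y = y_train - prior_mean
    K = kernel(x_train, x_train) + noise * np.eye(len(x_train))
    Ks = kernel(x_train, x_test)
    Kss = kernel(x_test, x_test)
    alpha = np.linalg.solve(K, y)
    mu = prior_mean + Ks.T @ alpha
    cov = Kss - Ks.T @ np.linalg.solve(K, Ks)
    return mu, np.sqrt(np.clip(np.diag(cov), 0.0, None))

def rbf(a, b, ell=0.5):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell**2)

x_train = np.array([1.0, 2.0, 3.0])
y_train = np.array([4.8, 5.3, 5.1])   # observations around ~5
x_far = np.array([10.0])              # a point far from all data

# Zero prior mean: the posterior mean reverts to 0 far from the data.
mu_zero, _ = gp_posterior(x_train, y_train, x_far, rbf, prior_mean=0.0)
# Data-mean prior: the posterior mean reverts to ~5 instead.
mu_centered, _ = gp_posterior(x_train, y_train, x_far, rbf, prior_mean=y_train.mean())
```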

“Probability of Improvement”: “given same exploitability” -> “given the same...”

The “hero plot” is neat. Though I’m not sure I understand why it’s called a hero plot.

“(can be identified by the grey translucent area” is missing closing parens.

“Expected Improvement”: These plots are neat and convey the intuition really nicely. However, the EI values are tiny and the behavior does not follow what I have seen for EI. I suspect the optimization routine is under exploring because of the zero-mean issue I brought up above. Specifically, for a stationary kernel, the GP posterior will return to the mean when moving away from the data. In this case, it’s returning to 0, which is silly since 0 is not the mean of the observations you have seen. I suspect the routine will be much better behaved if you subtract out the mean of the data, fit the model, and then add the mean back in.
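(For reference, the standard closed form of EI under a Gaussian posterior can be sketched as below; this is a generic scipy implementation, not the article's code, and the posterior values, incumbent `f_best`, and jitter `xi` are illustrative.)

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best, xi=0.01):
    """EI for maximization: E[max(f - f_best - xi, 0)] under a Gaussian posterior."""
    sigma = np.maximum(sigma, 1e-12)  # guard against zero posterior std
    z = (mu - f_best - xi) / sigma
    return (mu - f_best - xi) * norm.cdf(z) + sigma * norm.pdf(z)

# Illustrative posterior means/stds at three candidate points.
mu = np.array([4.9, 5.4, 5.0])
sigma = np.array([0.05, 0.3, 0.6])
ei = expected_improvement(mu, sigma, f_best=5.1)
```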

“Gaussian Process Upper Confidence Bound (GP-UCB)”: I like the discussion of regret. However, I don’t think Srinivas et al. introduced GP-UCB and other acquisition functions also minimize regret. Instead they derived some elegant bounds on regret under the GP-UCB acquisition function. I think I would rephrase this to say something like: “Srinivas et al. developed a schedule for \beta that they theoretically demonstrated minimizes cumulative regret. The schedule is T_t … “
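(The GP-UCB acquisition itself is simple to sketch; the \beta schedule below follows the general shape of the Srinivas et al. result for compact domains, but the constants are illustrative rather than the paper's exact values.)

```python
import numpy as np

def gp_ucb(mu, sigma, beta):
    """GP-UCB acquisition: optimism in the face of uncertainty."""
    return mu + np.sqrt(beta) * sigma

def beta_schedule(t, d=1, delta=0.1):
    # Grows logarithmically in t so exploration never fully stops
    # (illustrative constants; the paper gives several variants).
    return 2 * np.log(t**2 * np.pi**2 / (3 * delta)) + 2 * d * np.log(t**2)

# Illustrative posterior means/stds at three candidate points, at iteration t=5.
mu = np.array([4.9, 5.4, 5.0])
sigma = np.array([0.05, 0.3, 0.6])
acq = gp_ucb(mu, sigma, beta_schedule(t=5))
```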

distillpub-reviewers commented 4 years ago

Editor's note: because the reviewer raised a technical issue with modeling, we will send the updated version to him to verify that it has been fixed prior to making an editorial decision.

apoorvagnihotri commented 4 years ago

We want to thank the reviewer for reviewing the article in such depth and providing actionable improvements. We will address the specific comments below.

Intro: Note the word “hyperparameter” can actually be contentious. The traditional definition of a hyperparameter is as a higher level model parameter influencing the parameters of the model. Under that definition things like learning rate, optimization parameters, etc. don’t really apply. So I always say Bayesian optimization is used to tune hyperparameters, optimization parameters and other model parameters.

We have addressed this issue by making modifications to our abstract. Please see the following for the exact modifications.

FROM: ABSTRACT We are increasingly getting used to deep(er) neural networks! These networks often come with a huge number of hyperparameters: like the number of layers, the dropout rate, the learning rate, among others. How do we efficiently tune these hyperparameters to optimize our machine learning model? In this article, we will talk about Bayesian Optimization (BO) - which is an effective suite of techniques often used to efficiently tune the hyperparameters. Besides being used in tuning hyperparameters, BO is a general suite of techniques for optimizing a black-box function. Before we talk in depth about Bayesian Optimization and its applicability in hyperparameter tuning, we will look into maximizing (optimizing) a black box function!

TO: ABSTRACT We are increasingly getting used to machine learning algorithms with a large number of hyperparameters along with numerous optimization parameters. For example, neural networks come with hyperparameters like the number of layers and optimization parameters like the learning rate and the dropout rate, among others; random forests come with hyperparameters such as the number of trees and the maximum depth of an individual tree. How do we efficiently tune these to optimize our machine learning model? In this article, we will talk about Bayesian Optimization (BO) - which is an effective suite of techniques often used to efficiently tune the hyperparameters, optimization parameters, and other model parameters. Besides being used in the preceding scenarios, BO is a general suite of techniques for optimizing any black-box function. Before we talk in depth about Bayesian Optimization and its applicability in tuning model parameters, we will look into maximizing (optimizing) a black box function!

Mining Gold: There’s a neat historical precedent to this that’s worth mentioning. The first use of Gaussian processes was actually to model ore density in South Africa. Applying Gaussian processes was initially called “Kriging” after Danie Krige (https://en.wikipedia.org/wiki/Danie_G._Krige) who used GPs to model the spatial density of ore deposits (and figure out where to drill).

We have added a few sentences to our article highlighting this interesting fact.

FROM: MINING GOLD Let us start the discussion with the example of mining for gold. Our goal is to mine for gold in a new, unknown land. For now, let us make a simplifying assumption, the gold content lies in a one-dimensional space, i.e., we are talking gold distribution only about a line. Our aim is to find the location along this line where we would get the maximum return and drill at that location.

TO: MINING GOLD Let us start the discussion with the example of gold mining. Our goal is to mine for gold in a new, unknown land. Our example below is heavily inspired by the historical fact that one of the first uses of Gaussian Processes (GPs) was to model ore density in South Africa: Prof. Danie Krige used GPs, at the time called "Kriging", to model the spatial density of ore deposits. For now, let us make a simplifying assumption: the gold content lies in a one-dimensional space, i.e., we are talking about gold distribution only along a line. Our aim is to find the location along this line where we would get the maximum gold with as few drillings as possible (as drilling is expensive).

“Active Learning”: “We cannot estimate the gold estimate” sounds awkward. Maybe rephrase.

FROM: In machine learning problems, often, unlabelled data is very easily available, but labeling could be an expensive task ... But, without drilling, we can not estimate the gold estimate. Active learning would solve this problem by posing a smart strategy to choose the next drilling site. While there are various methods and techniques in the active learning literature, for the sake of brevity, we will look only at uncertainty reduction, which chooses the next query point as the one our model is the most uncertain about. One of the ways we can reduce the uncertainty is by choosing the point at which we have the maximum variance (we are most uncertain). We will now look into Gaussian Processes, which not only give us predictions, but also uncertainty estimates, which will be useful for active learning.

TO: In machine learning problems, often, unlabelled data is easily available, but labeling could be an expensive task ... But, without drilling, we cannot estimate the gold concentration. Active learning would solve this problem by posing a smart strategy to choose the next drilling site. While there are various methods and techniques in the active learning literature, for the sake of brevity, we will look only at uncertainty reduction. This method chooses the next query point as the one our model is the most uncertain about. One of the ways we can reduce the uncertainty is by choosing the point at which we have the maximum variance (we are most uncertain).
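(The uncertainty-reduction rule described in the revision above amounts to an argmax over the posterior variance; a minimal illustrative snippet, with made-up posterior standard deviations rather than the article's values:)

```python
import numpy as np

# Hypothetical candidate drilling sites and the surrogate's posterior
# standard deviation at each (illustrative numbers only).
x_candidates = np.linspace(0, 6, 7)
posterior_std = np.array([0.1, 0.8, 0.3, 1.2, 0.5, 0.9, 0.2])

# Uncertainty reduction: query where the model is least certain.
next_site = x_candidates[np.argmax(posterior_std)]
# next_site == 3.0, the site with the largest posterior variance
```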

“Gaussian processes”: “Gaussian processes regression” -> Gaussian process regression “(Smoothess) Such” -> “(Smoothness). Such”

We have made the following changes to the article to address the above comments. Further, we removed the line, "By using Gaussian processes, we assume that the gold distribution of nearby points is similar (smoothness) Such an assumption is usually valid.", because technically we can use a kernel k(x1, x2) = 0 which essentially models white noise (which isn't smooth).

FROM: One might want to look at this excellent distillpub article on Gaussian Processes. We will be using Gaussian Processes regression to model the gold distribution along the one-dimensional line. By using Gaussian processes, we assume that the gold distribution of nearby points is similar (smoothness) Such an assumption is usually valid.

TO: One might want to look at this excellent distillpub article on Gaussian Processes. We use Gaussian Process regression to model the gold distribution along the one-dimensional line.

“Prior model”: The Matern kernel is a specific choice of prior that is worth spending some time rationalizing. It’s probably a good idea to introduce the concept of a kernel and how the choice of kernel corresponds to a prior over functions. Then describe how the Matern lets you determine the smoothness of the prior (i.e. a Matern 5/2 means twice differentiable). How does that correspond to your assumptions about gold smoothness?

We have added a few sentences that validate our choice for keeping prior to be a Matern 5/2 kernel.

FROM: We will choose a simple prior to the gold content along a one-dimensional space. Our prior assumes a smooth relationship between points via a Matern kernel. The black line in the graph below denotes the knowledge we have about the gold content without drilling even at a single location.

TO: The choice of prior heavily depends on the initial belief about the properties of the black-box function. These properties include periodicity, smoothness, etc. In our case, we consider the gold distribution to be smooth, i.e. two points close in space will have similar gold content. Given our surrogate is a GP, we can use kernels to set a prior over functions that favor this property. Our prior uses the Matern 5/2 kernel; this choice can be attributed to its property of favoring doubly differentiable functions. The black line in the graph below denotes the knowledge we have about the gold content without drilling even at a single location.

In contrast, Matern 3/2 favors singly differentiable functions. What this essentially implies is that the prior of our GP would favor doubly differentiable functions, which we assume the gold distribution to follow.

> "UCB": "This is because while the variance or uncertainty is high for such points, the posterior mean is low." The posterior mean is low because the prior mean is set to 0. You could subtract the mean from the observed data to make it 0 mean (then in your case the mean would go to ~5 instead of 0 and the acquisition function would be higher further from the data). It would be better to make the mean a hyperparameter of the GP and optimize it or integrate it out.

> "Expected Improvement": These plots are neat and convey the intuition really nicely. However, the EI values are tiny and the behavior does not follow what I have seen for EI. I suspect the optimization routine is under exploring because of the zero-mean issue I brought up above. Specifically, for a stationary kernel, the GP posterior will return to the mean when moving away from the data. In this case, it's returning to 0, which is silly since 0 is not the mean of the observations you have seen. I suspect the routine will be much better behaved if you subtract out the mean of the data, fit the model, and then add the mean back in.

The first comment raises the issue of incorrectly modeling the GP and its consequences for the values of the posterior uncertainty and mean. We have updated the article with the suggestion in the comment, which improves the BO framework significantly, and we no longer face the first issue. We would also like to thank the reviewer for the second suggestion; we have incorporated the suggested modeling technique in the newer version of the article, where we set our prior to have a mean of five (approximately the dataset mean) instead of zero.

The newer setting completely avoids the pathological behavior of the earlier setting that the comment describes. Incorporating this suggestion led to a significant improvement in the whole BO framework: as the plots below show, we are able to better optimize our black-box function in fewer iterations.

![](https://i.imgur.com/cgxZKcj.png)
![](https://i.imgur.com/MOe9k4X.png)

Below are the gifs that show typical runs under the two modelling techniques. The updated version of the article uses the modelling technique that does not show the pathological behaviour mentioned in the comment.

![](https://i.imgur.com/uJYZ7qq.gif)
![](https://i.imgur.com/n7lBbaO.gif)

> "given same exploitability" -> "given the same..."

Based on the initial review provided by reviewer 2, we condensed the discussion; therefore, we no longer have the section where this typo occurred.

> The "hero plot" is neat. Though I'm not sure I understand why it's called a hero plot.

We no longer call the interactive plot a "hero plot". The notion of a hero plot was borrowed from other Distill articles.

> "(can be identified by the grey translucent area" is missing closing parens.

We have added the missing closing parenthesis.

> "Gaussian Process Upper Confidence Bound (GP-UCB)": I like the discussion of regret. However, I don't think Srinivas et al. introduced GP-UCB and other acquisition functions also minimize regret. Instead they derived some elegant bounds on regret under the GP-UCB acquisition function. I think I would rephrase this to say something like: "Srinivas et al. developed a schedule for \beta that they theoretically demonstrated minimizes cumulative regret. The schedule is T_t ..."

We have modified the discussion: we now mention that Srinivas et al. developed a schedule for \beta, rather than saying they introduced GP-UCB.

FROM: ![](https://i.imgur.com/eSeWtS4.png)

TO: ![](https://i.imgur.com/ems1Y5t.png)