Review #2 - Githubissues

The following peer review was solicited as part of the Distill review process.

The reviewer chose to remain anonymous. Distill offers reviewers a choice between anonymous review and offering reviews under their name. Non-anonymous review allows reviewers to get credit for the service they offer to the community.

Distill employs a reviewer worksheet as a help for reviewers.

The first three parts of this worksheet ask reviewers to rate a submission along certain dimensions on a scale from 1 to 5. While the scale meaning is consistently "higher is better", please read the explanations for our expectations for each score—we do not expect even exceptionally good papers to receive a perfect score in every category, and expect most papers to be around a 3 in most categories.

Any concerns or conflicts of interest that you are aware of?: No known conflicts of interest

What type of contributions does this article make?: The article provides an exposition of Bayesian optimization methods. It motivates the use of Bayesian Optimization (BO), and gives several examples of applying BO for different objectives.

Advancing the Dialogue	Score
How significant are these contributions?	2/5

Comments

I think the main contribution of the current article is in the simulations, which illustrate BO in practice. However, I believe the article does not do a great job of explaining the setup and foundations of BO, and of unifying the various examples under a common framework. In this sense, I don't believe its exposition is a significant contribution.

For example, I think the following short note (which the authors cite) does an excellent job of briefly introducing the BO formalism, and presenting different instantiations of BO (for different objective functions) under the same underlying framework: https://www.cse.wustl.edu/~garnett/cse515t/spring_2015/files/lecture_notes/12.pdf

Outstanding Communication	Score
Article Structure	3/5
Writing Style	3/5
Diagram & Interface Style	4/5
Impact of diagrams / interfaces / tools for thought?	3/5
Readability	2/5

Comments

The article is fairly long for the core content, and it's easy to get lost in the details of the various examples, and lose track of the main points.
It does not use jargon, but the writing is a bit verbose and could be condensed.
I would omit the interactive figure -- there are too many moving parts, and it confuses more than clarifies. The non-interactive simulations are good, although there are perhaps too many of them -- just a few simulations would convey the point (that different objective functions result in different optimization procedures).
The diagram format is standard (as simulations of GP), although there is value in doing and showing these simulations in the context of Bayesian optimization.
Assuming knowledge of Gaussian Processes, this topic (BO with GP prior) is not very difficult. In particular, it can be described simply as:
(1) Assume a Gaussian Process prior on the ground-truth function F.
(2) Formalize your objective (e.g. sampling a point 'x' with maximum expected value of F(x), or maximizing the probability that F(x) > F(x_j) for all previously-sampled points x_j).
(3) Use the existing samples {(x, F(x))} to compute the posterior of F given the samples (under the GP prior), and maximize your objective function under the posterior. This yields a choice of new point to sample.
(Different "acquisition functions" simply correspond to different objectives in step (2).)
The current article is fairly long for conveying the above point, and it includes many details which can be distracting (eg, equations for the exact form of the maximization in (3), which does not add much conceptually).
Concretely, I suggest cutting a lot of the discussion about details of various acquisition functions, and just presenting a few examples to convey the point that different objectives (Step 2) yield different optimization procedures.
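
The recipe above (GP prior, objective, posterior maximization) can be sketched in a few lines. This is only a minimal illustration, not the article's implementation: the RBF kernel and its length scale, the UCB objective with beta = 2, the grid search over candidates, the two starting points, and the `toy_objective` function are all assumptions chosen for brevity.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=0.2):
    """Squared-exponential kernel between two 1-D arrays of inputs."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)


def gp_posterior(x_obs, y_obs, x_query, noise=1e-6):
    """Step 1 + 3a: posterior mean and std of a zero-mean GP at x_query."""
    k_oo = rbf_kernel(x_obs, x_obs) + noise * np.eye(len(x_obs))
    k_oq = rbf_kernel(x_obs, x_query)
    solved = np.linalg.solve(k_oo, k_oq)        # K_oo^{-1} K_oq
    mean = solved.T @ y_obs
    var = 1.0 - np.sum(k_oq * solved, axis=0)   # prior variance is 1 for this kernel
    return mean, np.sqrt(np.clip(var, 0.0, None))


def ucb(mean, std, beta=2.0):
    """Step 2: one possible objective (upper confidence bound)."""
    return mean + beta * std


def bayes_opt(f, grid, n_iter=10, acquisition=ucb):
    """Step 3, repeated: fit the posterior, maximize the objective, sample."""
    x_obs = np.array([grid[0], grid[-1]])       # two arbitrary starting points
    y_obs = f(x_obs)
    for _ in range(n_iter):
        mean, std = gp_posterior(x_obs, y_obs, grid)
        x_next = grid[np.argmax(acquisition(mean, std))]
        x_obs = np.append(x_obs, x_next)
        y_obs = np.append(y_obs, f(x_next))
    return x_obs, y_obs


def toy_objective(x):
    """Hypothetical black-box function, used only for this illustration."""
    return x * np.sin(3.0 * x)


grid = np.linspace(0.0, 2.0, 201)
xs, ys = bayes_opt(toy_objective, grid)
print("best sampled x:", xs[np.argmax(ys)], "value:", ys.max())
```

Swapping `ucb` for another function of `mean` and `std` changes only step (2); the rest of the loop is unchanged, which is the sense in which different acquisition functions are different objectives within one framework.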

Scientific Correctness & Integrity	Score
Are claims in the article well supported?	3/5
Does the article critically evaluate its limitations? How easily would a lay person understand them?	4/5
How easy would it be to replicate (or falsify) the results?	5/5
Does the article cite relevant work?	5/5
Does the article exhibit strong intellectual honesty and scientific hygiene?	5/5

Comments

Minor points:

For comparing vs. a random strategy, I would just compare against a truly random strategy, instead of using a "random acquisition function" within the BO framework as a proxy for this.
In the "comparison" plots, the random strategy appears to do about as well as Bayesian Optimization -- which means this is not a setting that convinces me that BO is powerful.
In the "comparison" plot, different acquisition functions correspond to different objectives. However, we are evaluating them all under the same objective, which is somewhat unfair. In particular, if the objective is well-specified and the ground-truth is actually drawn from a GP prior, then BO should exactly maximize the expected objective value (ie, it should be the optimal thing to do, if the assumptions hold).

Major points:

With respect to the scientific content, my main issue is that there is no clear distinction made between:

(A) Bayesian optimization as a formal framework, with provable optimality guarantees.
(B) Bayesian optimization as it's used in practice (e.g. even if the true ground-truth is not drawn from a Gaussian process, we can still apply BO methods and hope to get something reasonable, though not provably so).

These two viewpoints are conflated throughout the article. For example, in the section "Formalizing Bayesian Optimization", the points described are actually heuristics about setting (B), not formalisms in the sense of (A).

This confusion also makes it difficult to see how different acquisition functions relate to each other, and what our actual objective is in choosing between different acquisition functions.

We want to thank the reviewer for reviewing the article in such depth. We went through every sentence and have addressed the points raised in our updated article.

We would like to point out to the editors that we had received an unofficial review from the reviewer before the official reviews. We addressed the issues raised in the unofficial review, but the updated article could not be made available to the official reviewer due to a technical issue. All official reviews posted are therefore based on an earlier version of the article that does not incorporate that initial feedback.

We will now address the specific comments, grouped into a few main categories of issues.

Communication

Verbose and Lengthy Article

The article is fairly long for the core content, and it's easy to get lost in the details of the various examples and lose track of the main points.

It does not use jargon, but the writing is a bit verbose and could be condensed.

The current article is fairly long for conveying the above point, and it includes many details which can be distracting (eg, equations for the exact form of the maximization in (3), which does not add much conceptually).

Concretely, I suggest cutting a lot of the discussion about details of various acquisition functions, and just presenting a few examples to convey the point that different objectives (Step 2) yield different optimization procedures.

When we received similar suggestions earlier, we agreed with the points raised and updated our article. The updated article is significantly condensed and uses collapsible sections wherever we felt we were losing track of the article's main point. The difference is best seen by comparing the two versions of the article. Please see the older and newer articles here and here, respectively. Concretely:

FROM: Acquisition Functions

Initially, we had the following sections on different acquisition functions.

Upper Confidence Bound (UCB)
Probability of Improvement (PI)
Expected Improvement (EI)
PI VS. EI
Gaussian Process Upper Confidence Bound (GP-UCB)
Probability of Improvement + λ × Expected Improvement (EI-PI)

TO: Acquisition Functions

We compressed the following sections:

Probability of Improvement (PI)
Expected Improvement (EI)

We compressed the following sections and made them collapsible:

Upper Confidence Bound (UCB)
PI VS. EI
Gaussian Process Upper Confidence Bound (GP-UCB)
Probability of Improvement + λ × Expected Improvement (EI-PI)

FROM: Examples

Furthermore, we had three real-life examples where we showed the BO framework being used for hyperparameter optimization.

Support Vector Machine
Random Forests
Convolutional Neural Network

TO: Examples

We made the following sections collapsible:

Random Forests
Convolutional Neural Networks

Below we see a collapsible in action.

Interactive figure without context

I would omit the interactive figure -- there are too many moving parts, and it confuses more than clarifies.

This is an excellent point. It refers to the figure that tries to explain the effect of ϵ on a particular acquisition function (Probability of Improvement). The main issue, which the initial review also pointed out, was that the reader could not tell what the figure was trying to show. We have moved the figure to follow the section where we explain the Probability of Improvement, which gives the reader better context for the figure. Concretely:

FROM: “Hero” (interactive) plot being at the top of the article.

TO: “Hero” (interactive) plot now after the relevant section “Probability of Improvement”.
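
As a rough illustration of what ϵ does, the sketch below assumes the common textbook form of Probability of Improvement under a Gaussian posterior, PI(x) = Φ((μ(x) − f(x⁺) − ϵ) / σ(x)); the article's exact formulation may differ. The two candidate points and their posterior means and standard deviations are hypothetical numbers picked for the example.

```python
import math


def normal_cdf(z):
    """Standard normal CDF, written with math.erf to avoid extra dependencies."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def probability_of_improvement(mean, std, best_so_far, epsilon):
    """P(f(x) > best_so_far + epsilon) when f(x) ~ N(mean, std^2)."""
    if std == 0.0:
        # Degenerate posterior: the value is known exactly.
        return 1.0 if mean > best_so_far + epsilon else 0.0
    return normal_cdf((mean - best_so_far - epsilon) / std)


best = 1.0  # hypothetical best value observed so far
candidates = {
    "near the incumbent (low uncertainty)": (1.05, 0.05),
    "unexplored region (high uncertainty)": (0.80, 0.50),
}

for eps in (0.0, 0.3):
    print(f"epsilon = {eps}")
    for name, (mean, std) in candidates.items():
        pi = probability_of_improvement(mean, std, best, eps)
        print(f"  {name}: PI = {pi:.4f}")
```

With ϵ = 0 the low-uncertainty point next to the current best scores higher (exploitation); with the larger ϵ the high-uncertainty point scores higher (exploration), which is the trade-off the figure is meant to convey.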

Too many non-interactive simulations

The non-interactive simulations are good, although there are perhaps too many of them -- just a few simulations would convey the point (that different objective functions result in different optimization procedures).

This point is similar to the one raised in the section “Verbose and Lengthy Article”. We have addressed it by removing some sections and making others collapsible, as mentioned above. We also introduced a slider to give the reader better control: the non-interactive figures were originally GIFs, and the reader can now step through them.

Bird's eye view missing

Assuming knowledge of Gaussian Processes, this topic (BO with GP prior) is not very difficult. In particular, it can be described simply as:

Assume a Gaussian Process prior on the ground-truth function F.

Formalize your objective (eg. sampling a point 'x' with maximum expected value of F(x), or maximizing the probability that F(x) > F(x_j) for all previously-sampled points x_j)

Use the existing samples {(x, F(x))} to compute the posterior of F given the samples (under the GP prior), and maximize your objective function under the posterior. This yields a choice of new point to sample.

(Different "acquisition functions" simply correspond to different objectives in step (2)).

This is a great summary of BO, and we have addressed this point by incorporating a Bayesian Optimization primer at the end of our updated article.

FROM: “Hero” (interactive) plot being at the top of the article.

TO: We now have a slide deck that explains the overall steps in the BO framework.

Scientific Correctness & Integrity.

Minor Points

For comparing vs. a random strategy, I would just compare against a truly random strategy, instead of using a "random acquisition function" within the BO framework as a proxy for this.

We believe that a truly random strategy is the same as the random acquisition function.
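
One way to see why the two can coincide is sketched below. It assumes that a "random acquisition function" assigns independent, identically distributed scores to a finite set of candidate points and ignores the GP posterior entirely; whether the article's implementation matches this assumption is not stated here. Under that assumption, taking the arg-max of the scores is, by symmetry, a uniform draw from the candidates.

```python
import random


def random_acquisition_pick(candidates, rng):
    """Score every candidate with i.i.d. noise and return the arg-max."""
    scores = [rng.random() for _ in candidates]
    best_index = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best_index]


def uniform_pick(candidates, rng):
    """A 'truly random' strategy: draw one candidate uniformly."""
    return rng.choice(candidates)


rng = random.Random(0)
candidates = [0.1, 0.5, 0.9]  # hypothetical candidate points
counts = {c: 0 for c in candidates}
for _ in range(6000):
    counts[random_acquisition_pick(candidates, rng)] += 1

# Each candidate should be chosen roughly one third of the time.
print(counts)
```

The equivalence would not hold if the random scores were combined with the posterior mean or variance, or if the candidate set itself were chosen in a posterior-dependent way.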

In the "comparison" plots, the random strategy appears to do about as well the Bayesian Optimization -- which means this is not a setting that convinces me that BO is powerful.

After incorporating input from Jasper Snoek, the BO framework performs much better than in the earlier version of the article. We therefore feel this point is no longer an issue. For comparison, please see the difference in performance in the plot below.

In the "comparison" plot, different acquisition functions correspond to different objectives. However, we are evaluating them all under the same objective, which is somewhat unfair.

We believe the comparison is fair, given that our task is to compare the different acquisition functions at optimizing a black-box function in the fewest iterations.

Major Points

With respect to the scientific content, my main issue is that there is no clear distinction made...

This point concerns the lack of a distinction between the theoretical exposition of BO and practical tips for using BO in real life, where provable theoretical results are not available.

Having received similar feedback from the reviewer earlier, we updated the article to focus explicitly on the practical use of BO.

We again want to thank the reviewer for their valuable suggestions. The article improved significantly as a result of incorporating them.

distillpub / post--bayesian-optimization

Review #2 #10
