distillpub / post--bayesian-optimization

Exploring Bayesian Optimization
https://distill.pub/2020/bayesian-optimization/

Review #1 #9


distillpub-reviewers commented 4 years ago

The following peer review was solicited as part of the Distill review process.

The reviewer chose to waive anonymity. Distill offers reviewers a choice between anonymous review and offering reviews under their name. Non-anonymous review allows reviewers to get credit for the service they offer to the community.

Distill is grateful to Austin Huang for taking the time to review this article.


General Comments

Missing Tools for Reasoning

Acquisition functions are introduced from a definitional standpoint and their behavior is illustrated for a relatively artificial example. Sometimes the methods are shown to work, sometimes they don't. How does one think about implementation alternatives when working on a new problem? The article provides few conceptual tools for the reader to apply these methods successfully.

There are also serious issues with model misspecification beneath the surface of these implementations (see, for example, the Thompson Sampling discussion). However, the article doesn't even raise the topic - the discussion starts from a fixed model specification and anecdotally shows methods either working or not under a narrow example.

Relatedly, there's a section entitled "Why is it easier to optimize the acquisition function?" This framing may be misleading since "easiness" isn't the goal. The real question seems to be "Why is it beneficial to optimize the acquisition function?" or perhaps "Is it even beneficial to optimize with respect to an acquisition function?"

Does the Hero Plot Illustrate a Central Aspect of the Discussion?

An interactive visualization communicates the response of some quantity to the variables the reader can manipulate. In the hero plot, this corresponds to the response of the acquisition function to the epsilon hyperparameter in a PI acquisition function for fixed data and ground truth. It also shows the CDF for two slices of X (1.0 and 5.0), which are intermediate computations used by the acquisition function.

Is that particular relationship sufficiently central to the article to be front and center? There are other relationships that seem more central to the topic that could have been highlighted (how choices of acquisition function compare, how the acquisition function changes with data). The plot is nice to interact with for thinking about exploration/exploitation in PI, but it doesn't seem to be an obvious choice as the hero plot.
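For reference, the criterion the hero plot exposes is the standard PI acquisition (textbook form, assuming the article's Gaussian surrogate with posterior mean $\mu_t$ and standard deviation $\sigma_t$):

$$\mathrm{PI}(x) = P\big(f(x) \ge f(x^+) + \epsilon\big) = \Phi\!\left(\frac{\mu_t(x) - f(x^+) - \epsilon}{\sigma_t(x)}\right),$$

where $x^+$ is the best point observed so far and $\Phi$ is the standard normal CDF; the two CDF slices in the plot are evaluations of this quantity at $x = 1.0$ and $x = 5.0$.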

Minor visual issue - the vertical labels look buggy, with 0.00e+0 cutting through the axis line.

Grey backgrounds don't fit Distill's Template

The grey rectangular background patch behind each figure doesn't fit the aesthetic of the Distill template. The convention in other articles seems to be white-on-white with no boundary, or occasionally a horizontal ribbon that runs the width of the page for visualizations with lots of margin content.

Animations are Overused

Note that in other Distill articles, animations are used sparingly, and usually only in the top or concluding figure.

Looping animations are overused here, and ultimately they are not a good way to illustrate a dependency relationship compared to a visual with a control.

Even if the content in those figures is kept as is, replacing the loops with a slider control (cf. http://worrydream.com/LadderOfAbstraction/) would be an improvement: it is less distracting and allows the reader to examine relationships between iterations more carefully.

Introduction to EI is Confusing

Perhaps the framing using the unknown ground truth was the original motivation, but here it just makes the reasoning convoluted without adding much insight. I don't see any reason not to jump straight to the definition described by the name - expected improvement (i.e., the second equation).

Thompson Sampling

""It has a low overhead of setting up."" - not sure why this is specifically pointed out in the case of TS, is overhead any lower to set up than the other acquisition functions?

The statement that "This will ensure an exploratory behaviour." is contradicted by the animation demonstration that follows. From that demo's figures, it would actually seem nearly impossible to reach the global minimum without refining the underlying GP model - there's not enough noise in the function distribution to adequately explore. However, the example is simply left without further comment.

Hyperparameter Tuning - Axis Labels

Using the horizontal label "# of Hyper-Parameters Tested" is confusing, since it doesn't really refer to the number of hyper-parameters tested, but rather to the number of values that have been evaluated.

Hyperparameter Tuning - Changing colormap scale makes it impossible to track the function evolution

The colormaps should probably not rescale with each iteration - it makes it very difficult to track the evolution of the acquisition function between frames.

As mentioned above, replacing all or most animations with a slider control would also improve the legibility of the figure.

Legend tweaks

"# Minor Writing Improvements

Concluding Comments

Bayesian optimization and active learning aren't particularly popular to write about currently, yet I suspect there's quite a bit of interest in the topic, particularly in industry and applied machine learning contexts.

Given that, this article does contribute to a notable gap in the research distillation space. However, I think more work needs to be put into this manuscript to raise the quality of communication to be comparable to other Distill articles.


Distill employs a reviewer worksheet to help reviewers.

The first three parts of this worksheet ask reviewers to rate a submission along certain dimensions on a scale from 1 to 5. While the scale meaning is consistently "higher is better", please read the explanations for our expectations for each score—we do not expect even exceptionally good papers to receive a perfect score in every category, and expect most papers to be around a 3 in most categories.

Any concerns or conflicts of interest that you are aware of?: No known conflicts of interest
What type of contributions does this article make?: Explanation of existing results

Advancing the Dialogue Score
How significant are these contributions? 3/5
Outstanding Communication Score
Article Structure 2/5
Writing Style 3/5
Diagram & Interface Style 3/5
Impact of diagrams / interfaces / tools for thought? 3/5
Readability 3/5
Scientific Correctness & Integrity Score
Are claims in the article well supported? 3/5
Does the article critically evaluate its limitations? How easily would a lay person understand them? 3/5
How easy would it be to replicate (or falsify) the results? 4/5
Does the article cite relevant work? 4/5
Does the article exhibit strong intellectual honesty and scientific hygiene? 3/5
apoorvagnihotri commented 4 years ago

We want to thank the reviewer for reviewing the article in such depth and providing actionable improvements. We will address the specific comments below.

Missing Tools for Reasoning

Acquisition functions are introduced from a definitional standpoint and their behavior is illustrated for a relatively artificial example. Sometimes the methods are shown to work, sometimes they don't. How does one think about implementation alternatives when working on a new problem? The article provides few conceptual tools for the reader to apply these methods successfully.

Our article earlier had a modeling error that was pointed out in the review by Jasper Snoek. Following the suggestion in that review, the updated article's BO framework performs much better.

We have further expanded the section where we introduce the concept of acquisition functions, trying to give the reader the core ideas behind acquisition functions before introducing specific ones.

There are also serious issues with model misspecification beneath the surface of these implementations (see, for example, the Thompson Sampling discussion). However, the article doesn't even raise the topic - the discussion starts from a fixed model specification and anecdotally shows methods either working or not under a narrow example.

We were not able to understand the issue that the above comment refers to, and would ask the reviewer to clarify the point raised.

Relatedly, there's a section entitled "Why is it easier to optimize the acquisition function?" This framing may be misleading since "easiness" isn't the goal. The real question seems to be "Why is it beneficial to optimize the acquisition function?" or perhaps "Is it even beneficial to optimize with respect to an acquisition function?"

We would like to thank the reviewer for noticing this. Yes, we want to convey why it is beneficial to optimize the acquisition function, not why it is easier. We have made the recommended correction to the title of the question in the article.

FROM: Why is it easier to optimize the acquisition function?

TO: Why is it beneficial to optimize the acquisition function?

Does the Hero Plot Illustrate a Central Aspect of the Discussion?

An interactive visualization communicates the response of some quantity to the variables the reader can manipulate. In the hero plot, this corresponds to the response of the acquisition function to the epsilon hyperparameter in a PI acquisition function for fixed data and ground truth. It also shows the CDF for two slices of X (1.0 and 5.0), which are intermediate computations used by the acquisition function. Is that particular relationship sufficiently central to the article to be front and center? There are other relationships that seem more central to the topic that could have been highlighted (how choices of acquisition function compare, how the acquisition function changes with data). The plot is nice to interact with for thinking about exploration/exploitation in PI, but it doesn't seem to be an obvious choice as the hero plot.

Following a similar discussion with one of our reviewers, we introduced a slide deck at the end of the article that summarises BO in a few slides. We have further moved the interactive plot in question below the section where we introduce PI.

FROM: “Hero” (interactive) plot being at the top of the article.

TO: “Hero” (interactive) plot now after the relevant section “Probability of Improvement”.

Minor visual issue - the vertical labels look buggy, with 0.00e+0 cutting through the axis line.

We have updated the article to no longer have buggy labels.

Grey backgrounds don't fit Distill's Template

We have updated the plots to match Distill's template.


Animations are Overused - Even if the content in those figures is kept as is, replacing the loops with a slider would be an improvement by not being distracting.

We have reduced the number of animations and added a slider to each of these animations for better control.

Introduction to EI is Confusing

We have re-framed the introduction in the newer version of the article.

FROM: Probability of improvement only looked at how likely is an improvement, but, shouldn't we be looking into how much we can improve? The next criterion called, Expected Improvement, (EI) does exactly that!

In this acquisition function, the $(t+1)^{th}$ query point, $x_{t+1}$, is selected according to the equation below.

TO: Probability of improvement only looked at how likely is an improvement, but, shouldn't we be looking into how much we can improve? The next criterion, called Expected Improvement (EI), does exactly that! The idea is fairly simple - choose the next point as the one which has the highest expected improvement over the current max $f(x^+)$, where $x^+ = \operatorname{argmax}_{x_i \in x_{1:t}} f(x_i)$ and $x_i$ is the location queried at the $i^{th}$ time step.
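For concreteness, here is a minimal sketch of the closed-form EI computation under a Gaussian posterior; the helper and its argument names are illustrative, not the article's code:

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best, eps=0.0):
    """EI(x) = E[max(f(x) - f(x+) - eps, 0)] for a Gaussian posterior
    with mean mu and standard deviation sigma at the candidate points."""
    sigma = np.maximum(sigma, 1e-9)  # guard against zero predictive variance
    z = (mu - f_best - eps) / sigma
    # Closed form: (mu - f_best - eps) * Phi(z) + sigma * phi(z)
    return (mu - f_best - eps) * norm.cdf(z) + sigma * norm.pdf(z)
```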

Thompson Sampling

""It has a low overhead of setting up."" - not sure why this is specifically pointed out in the case of TS, is overhead any lower to set up than the other acquisition functions?

We have added an explanation of our claim "It has a low overhead of setting up" in the newer version of the article.

FROM: One more acquisition function that is quite common is Thompson Sampling. It has a low overhead of setting up.

TO: One more commonly used acquisition function is Thompson Sampling. It has a low overhead of setting up as one only needs to sample from the model and one doesn’t need to find values for the tail probability, or the expected improvement.
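As a rough illustration of that low overhead, here is a sketch of Thompson sampling over a grid of candidates, assuming a fitted scikit-learn GaussianProcessRegressor `gp` (our choice of library for illustration, not the article's implementation):

```python
import numpy as np

def thompson_sampling_next(gp, X_candidates, seed=0):
    # Draw one plausible objective from the GP posterior and query its argmax;
    # no tail probabilities or expectations need to be computed.
    f_sample = gp.sample_y(X_candidates, n_samples=1, random_state=seed)
    return X_candidates[int(np.argmax(f_sample))]
```

Calling this with a fresh seed at each iteration draws a different plausible objective, which is where the exploratory behaviour comes from.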

The statement that "This will ensure an exploratory behaviour." is contradicted by the animation demonstration that follows. From that demo's figures, it would actually seem nearly impossible to reach the global minimum without refining the underlying GP model - there's not enough noise in the function distribution to adequately explore. However, the example is simply left without further comment.

Our GP surrogate was not able to model the ground truth effectively for the reason pointed out in the review by Jasper Snoek. After updating the article with his modeling suggestions, we no longer have the above issue, and with the Thompson sampling acquisition function we are able to reach the global maximum with relative ease.

Hyperparameter Tuning - Axis Labels

Using the horizontal label "# of Hyper-Parameters Tested" is confusing, since it doesn't really refer to the number of hyper-parameters tested, but rather to the number of values that have been evaluated.

We have made the suggested change in the updated article.

Hyperparameter Tuning - Changing colormap scale makes it impossible to track the function evolution

The colormaps should probably not rescale with each iteration - it makes it very difficult to track the evolution of the acquisition function between frames.

We no longer have changing colormaps for each iteration.

As mentioned above, replacing all or most animations with a slider control would also improve the legibility of the figure.

As mentioned above, we have reduced the number of animations and added a slider to each of these animations for better control.

Legend tweaks

  • The legend positioning for the top "hero" plot looks buggy. "GT", "GP" and "\epsilon" are glued to the point without any spacing. The alignment looks very off.
  • Not sure why "GT" is abbreviated when longer captions like "Acquisition function" are not.
  • "Train points" -> "Training points"
  • Given the legends are already really busy, "(Tie randomly broken)" would be better as a linked footnote.

We thank you for taking the time to notice these minute details. We have updated the legends accordingly.

Minor Writing Improvements

  • ""Older problem - Earlier in the active learning problem ... "" can remove the preface and start with ""In the active learning problem ...""

We have modified the description which now directly gets to the main point we want to highlight. Below are the exact changes made.

FROM: Problem 2 requires us to find the location where the gold content is maximum. Even though the problem setting may be similar, the objective is quite different than Problem 1. In other words, we just want the location where we can drill to get the most gold.

Older problem - Earlier in the active learning problem, our motivation for drilling at locations was to predict the distribution of the gold content over all the locations in the one-dimensional line. We, therefore, had chosen the next location to drill where we had maximum uncertainty about our estimate.

In this problem, we are instead interested to know the location at which we find the maximum gold. To get the location of maximum gold content, we might want to drill at the location where predicted mean is the highest, i.e. to exploit. But unfortunately our mean is not always accurate, so we need to correct our mean which can be done by reducing variance or exploration. Bayesian Optimization looks at both exploitation and exploration, whereas in the case of Active Learning Problem, we only cared about exploration.

TO: Problem 2 requires us to find the location where the gold content is maximum. Even though the problem setting may be similar, the objective is quite different than Problem 1. The present problem deals with finding the location where our black-box function reaches the maximum. In contrast, the earlier problem focuses on getting a good estimate of the black-box function.

Given the fact that we are only interested in knowing the location where the maximum occurs, it might be a good idea to evaluate at locations where our surrogate model's predicted mean is the highest, i.e. to exploit. But unfortunately, our model mean is not always accurate (since we have limited observations), so we need to correct our model, which can be done by reducing variance or exploration. BO looks at both exploitation and exploration, whereas in the case of active learning, we only cared about exploration.

  • ""We can write a general form of an acquisition function ..."" this sentence could be more weight and made more explicit about stating that mu(x) models exploitation and sigma(x) represents the value of exploration. It's implied by the phrasing, but could be clearer.

We no longer mention that acquisition functions are a function of mean and variance, as done in the earlier article. That description limits the space of acquisition functions and forces any acquisition function to be of the form g(mean(x), uncertainty(x)), which isn't entirely true.

We now have a discussion where we point out that BO proceeds via a sequence of inexpensive proxy optimizations of an acquisition function, and that acquisition functions share three core ideas: i) they are a function of the surrogate posterior; ii) they combine exploration and exploitation; and iii) they are inexpensive to evaluate. A schematic sketch of this loop follows below.
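To make that sequence of inexpensive proxy optimizations concrete, here is a schematic loop (hypothetical helper names, not the article's implementation; any acquisition with the signature `acquisition(mu, sigma, f_best)`, such as the EI sketch above, would plug in):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def bo_loop(objective, X_candidates, acquisition, X_init, y_init, n_iters=10):
    X, y = list(X_init), list(y_init)
    for _ in range(n_iters):
        # i) + iii): the acquisition uses only the cheap surrogate posterior
        gp = GaussianProcessRegressor().fit(np.array(X), np.array(y))
        mu, sigma = gp.predict(X_candidates, return_std=True)
        x_next = X_candidates[int(np.argmax(acquisition(mu, sigma, max(y))))]
        # ii) the exploration/exploitation trade-off lives inside `acquisition`
        X.append(x_next)
        y.append(objective(x_next))  # the single expensive query per iteration
    return X[int(np.argmax(y))]
```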

  • Don't nest parentheses in parentheses: "(of function values (gold in our case))"
  • "first vanilla acquisition function" - reference UCB directly instead of referring to it as the "first vanilla acquisition function"
  • "(try to find the global maxima that might be near this 'best' location)" - this parenthetical remark is confusing and doesn't add to the statement.

Based on the feedback received from one of the other reviews, we condensed some sections and the above issues are no longer present in the article.

We would like to thank the reviewer for the above comment regarding the terminology. We looked up various sources and understood that values of the CDF of a Gaussian distribution are calculated using pre-computed values of the error function, which are themselves calculated via a Taylor expansion. The Taylor expansion (an infinite sum) converges everywhere for the error function. Since convergent power series define analytic expressions, we have updated the article to reflect this.
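Concretely, the standard identities behind this (textbook results, not specific to the article) are

$$\Phi(z) = \frac{1}{2}\left[1 + \operatorname{erf}\!\left(\frac{z}{\sqrt{2}}\right)\right], \qquad \operatorname{erf}(z) = \frac{2}{\sqrt{\pi}} \sum_{n=0}^{\infty} \frac{(-1)^n\, z^{2n+1}}{n!\,(2n+1)},$$

and the series converges for every real $z$, which is why the Gaussian CDF counts as analytic even though it has no closed form in elementary functions.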

  • "" h_{t+1} is our GP posterior of the ground truth"" - guessing intends to refer to the ""posterior mean"" since it needs to be a function

Yes, we did mean to refer to the "posterior mean" rather than the phrase used above. We have updated the article based on this suggestion.

  • ""easily"" is used a lot throughout the article and in almost all cases the sentence improves by the omission of this unnecessary subjective qualifier. ""equation can be easily converted..."", ""One can easily change ..."", ""We can easily apply the BO for more dimensions"", ""... can easily be incorporated into BO."" (2 times in the same sentence in the last example)

Thanks a lot for pointing this out. We certainly seem to have used this qualifier too many times, and we have removed it from almost all the places where it wasn't necessary.

We would like to thank the reviewer for the detailed and actionable comments above. The article has improved significantly as a result of these suggestions.