bayesoptbook / bayesoptbook.github.io

Companion webpage for the book "Bayesian Optimization" by Roman Garnett

Missing hyperparameter in model assessment example? #44

Open dothit opened 7 months ago

dothit commented 7 months ago

In chapter 4 on model assessment, you present an example where you condense the prior constant mean function into the covariance function via the parameter b in Eq. (4.5), and then list the hyperparameters as sigma_n, lambda, and ell. Should b not be part of the hyperparameters? It can be freely tuned and is an unknown. Or am I missing something? (You do mention somewhere that there are cases where assessment is not tractable because the distribution would no longer be Gaussian.) And you do specify that you assume independent noise, so b can't simply be "slurped up" by sigma_n.

Version and Location: no date in the running footer; p. 69 (bottom).


bayesoptbook commented 7 months ago

Thanks for the comment, and this is indeed a bit unclear as written. Let me attempt to clarify.

First, as a parameter of a prior on a hyperparameter, I would describe b as a hyperhyperparameter (yikes). As a handwavy rule, the higher up the chain you go with (hyper...)hyperparameters, the less influence you should expect them to have -- probability has a tendency to sort of "smear out" as you travel along the hierarchy.

To speak about the role of b here in general: we are placing a zero-mean Gaussian prior, with standard deviation b, on the unknown value of the constant prior mean function. In this setup, as long as the desired value of that constant is well within the support of the prior, you'll probably be fine. I don't think you need to worry too much about setting b "correctly" beyond the rough advice that it's probably better to be too big (so that the desired value is comfortably in the support of the prior) than too small (such that the desired value is way out in the rapidly shrinking tails of the prior). The only real danger of setting it too big is a bit of residual predictive uncertainty that you might not have needed if it were perfectly calibrated.
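
To make the mechanics concrete, here is a minimal sketch (not the code used for the book's figures) of what "condensing the mean into the covariance" amounts to. The squared-exponential base kernel and the default hyperparameter values are placeholders; only the role of b matches the discussion above.

import numpy as np

def k_se(x1, x2, ell=1.0, sigma_f=1.0):
    # squared-exponential base kernel; ell and sigma_f are placeholder values
    return sigma_f**2 * np.exp(-0.5 * (x1[:, None] - x2[None, :])**2 / ell**2)

def k_with_mean_prior(x1, x2, b=1.0, **kwargs):
    # marginalizing a constant mean c ~ N(0, b^2) out of f = g + c, with
    # g ~ GP(0, k_se), leaves a zero-mean GP with covariance k_se + b^2
    return k_se(x1, x2, **kwargs) + b**2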

The actual data used for this simulation is

x = [ 0.9362,  1.9780,  2.4643,  5.5622,  7.0311,  8.9378, 10.0890, 10.7920, 
     13.0910, 14.1660, 14.4150, 15.5110, 16.3880, 19.5600, 29.3570]

y = [0.5521, 0.0507, -0.2944, 0.8063,  0.8084, -0.0225,  0.5605, 
     2.4103, 1.7455,  1.9096, 1.6252, -0.1103, -1.5068, -1.6553, -0.5397]

The observed values range from roughly -1.7 to roughly 2.4. The desired mean is presumably somewhere in that range, and likely somewhere in the middle. For the simulation, I simply used a standard normal prior for the mean (that is, b = 1), which I thought was reasonably well specified.
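
For anyone who wants to poke at this, here is a minimal sketch of the log marginal likelihood under that setup, using k_with_mean_prior from the sketch above and assuming independent Gaussian noise. The values of ell, sigma_f, and sigma_n in the example call are placeholders, not the values used for the book's figure.

x = np.asarray(x); y = np.asarray(y)  # the arrays listed above

def log_marginal_likelihood(x, y, ell, sigma_f, sigma_n, b):
    # log N(y; 0, K) with K = k_se + b^2 + sigma_n^2 I, via a Cholesky factor
    K = k_with_mean_prior(x, x, b=b, ell=ell, sigma_f=sigma_f)
    K += sigma_n**2 * np.eye(len(x))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))
            - 0.5 * len(x) * np.log(2 * np.pi))

print(log_marginal_likelihood(x, y, ell=1.0, sigma_f=1.0, sigma_n=0.1, b=1.0))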

Reading this page, I also noticed that I chose the rather unfortunate clashing notation X = [a, b] for the interval. I should perhaps update that and add a footnote summarizing the above.

dothit commented 7 months ago

Thank you for clarifying. So on the one hand, it is perfectly reasonable to make an educated (and generous) guess for b. On the other hand, you can include b in the set of hyperparameters; that is to say, there is no technical reason why that wouldn't work. It just won't really give you much "bang for the buck," for the reasons you have stated. Did I get that right?

bayesoptbook commented 7 months ago

Yes, I think you got that correct. To elaborate a bit more, you could include b in your set of (hyper)hyperparameters and try to learn it. There would be nothing wrong mathematically with doing so. However, I would find that choice to be somewhat odd from a modeling point of view. To illustrate what can go "wrong," imagine we learned b in the example above via maximizing the marginal likelihood and it came back as 10⁻⁶. Or 10⁶. In my opinion, either would be completely nonsensical, but the marginal likelihood can't "know" that.
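
Mechanically, "learning b" would just mean adding it to the vector of quantities being optimized, e.g. (a sketch with arbitrary starting values, building on the log_marginal_likelihood function above; not something I'd actually recommend):

from scipy.optimize import minimize

def neg_lml(log_theta):
    # optimize all four quantities in log space to keep them positive
    ell, sigma_f, sigma_n, b = np.exp(log_theta)
    return -log_marginal_likelihood(x, y, ell, sigma_f, sigma_n, b)

result = minimize(neg_lml, x0=np.log([1.0, 1.0, 0.1, 1.0]), method='L-BFGS-B')
ell_hat, sigma_f_hat, sigma_n_hat, b_hat = np.exp(result.x)
# nothing here stops b_hat from coming back absurdly small or large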

In the case of this particular (hyper)hyperparameter, if you want a data-driven approach, I think something like

b = [range of observed data] / 4

would be quite reasonable in any situation, ensuring that all plausible values for a constant mean function are in the support of the prior. (Note that you can also introduce a prior mean on the value of that constant, called a in the earlier discussion, which you could reasonably set using a similar rule of thumb, perhaps jointly with b.)
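
In code, that rule of thumb (together with one plausible way to set a, here the midpoint of the data; my own illustration rather than anything from the book) would simply be:

a = (y.max() + y.min()) / 2   # prior mean for the constant: midpoint of the observed values
b = (y.max() - y.min()) / 4   # prior standard deviation: a quarter of the observed range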