bayesoptbook / bayesoptbook.github.io

Companion webpage for the book "Bayesian Optimization" by Roman Garnett
MIT License
876 stars 42 forks source link

Define Holder continuity for covariance functions #47

Open chrisyeh96 opened 8 months ago

chrisyeh96 commented 8 months ago

The theorem on page 30 about sample continuity of a Gaussian process uses Holder continuity, but Holder continuity is never defined in the book. It would be very helpful to see a definition, especially since the book states in the subsequent paragraph that

the squared exponential covariance function (2.4) is Hölder continuous

without providing additional justification.

Furthermore, Hölder continuity is usually defined for functions that take a single input. According to Wikipedia, a function $f: \mathcal{X} \to \mathbb{R}$ is $\alpha$-Hölder continuous on the metric space $(\mathcal{X},d)$ if there exists a constant $0 < C < \infty$ such that

$$\forall x,y \in \mathcal{X}:\quad |f(x)-f(y)| \leq C \, d(x,y)^\alpha.$$

However, a covariance/kernel function $k(x,y)$ takes two inputs. How is Hölder continuity defined for covariance functions, especially non-stationary covariance functions?

bayesoptbook commented 8 months ago

This is a great question, and I apologize for the confusion. I agree with you that I should probably have expanded footnote 26 in chapter 2 to include an explicit definition of Hölder continuity rather than the hand-wavy description I gave. I think I was trying to avoid as much theory in chapter 2 as possible.

Regarding checking the Hölder continuity of a covariance function K(x, x'), the short answer is that you can simply check the Hölder continuity of the functions K(x, ·) (and x can be any arbitrary point in the domain). Note that:

However, this answer isn't terribly insightful.

A nicer way to connect the smoothness of the covariance function to the smoothness of sample paths (which is what we're after here after all) is to assume Hölder continuity of f and work backwards from there. I sketch this reasoning below without actually proving anything, but hopefully it's enough to build intuition.

Let f ~ GP(μ, K), and here let us assume that the GP is centered so that μ ≡ 0. (We can deal with nonzero μ later [*].) Let x and x' be points in the domain of the GP, and let ϕ = f(x) and ϕ' = f(x'). Hölder continuity of f at x (we can use your quoted definition here) is a condition on |ϕ - ϕ'|, but this is actually a random variable.

Instead, we can work with a suitable statistical replacement. It turns out the right notion here is the quantity

√E|ϕ - ϕ'|².   (1)

(Note that this equals |ϕ - ϕ'| in the deterministic case, as expected.) We want to control the rate this term vanishes as |x - x'| → 0. If you can accept this definition, we can now define a probabilistic version of the Hölder condition:

√E|ϕ - ϕ'|² ≤ C|x - x'|ᵃ.

equivalent to (squaring both sides)

E|ϕ - ϕ'|² ≤ C|x - x'|²ᵃ.

Note that we can rewrite the lhs in terms of the covariance function:

E|ϕ - ϕ'|² = E[ϕ² - 2ϕϕ' + ϕ'²]
           = K(x, x) + K(x', x') - 2K(x, x').

(Here we relied on the GP being centered.) So what we really want for f to be α-Hölder continuous is for K(x, ·) to be 2α-Hölder continuous. Note that this still comes down to checking that K(x, x) - K(x, x') is well behaved as |x - x'| → 0, like before.

Note also that everything above is very similar in spirit to the continuity in mean square discussion in the book, just with a bit more care put into the continuity check.

As an aside, the term (1) actually defines a pseudometric over the domain and appears often in the GP literature as the so-called canonical metric:

d(x, x') = √E|ϕ - ϕ'|².

Hope this helps!

[*] To deal with nonzero μ just assume it's Hölder continuous and add it back in.

bayesoptbook commented 7 months ago

Did this help @chrisyeh96?