This is a place to park my thoughts on connecting this to multi-armed bandit literature.
Regret
Making regret plots would characterize how good the sampling policy is and would be interesting to see in the paper. Suppose $f: \mathcal X \rightarrow \mathbb R$ is some metric where the predicate $p: \mathcal X \rightarrow \{ \text{true}, \text{false} \}$ holds when $f(x) \ge 0$. Let $\mathbf x^* = \arg\max_{\mathbf x \in \mathcal X} f(\mathbf x)$. For our sample choice $\mathbf x_n$ at iteration $n$, the instantaneous regret is $r_n = f(\mathbf x^*) - f(\mathbf x_n)$. The cumulative regret after $T$ rounds is
$$R_T = \sum_{n=1}^{T} r_n.$$
A desirable property is no-regret convergence $\lim_{T \rightarrow \infty} R_T / T = 0$.
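The quantities above are straightforward to compute once a policy's sample sequence is logged. A minimal sketch, using a hypothetical quadratic metric $f$ and uniform-random sampling as a stand-in for the real policy (both are assumptions, not anything from the method itself):

```python
import numpy as np

# Hypothetical metric f on a 1-D design space [0, 1]; its maximizer x* is 0.3.
def f(x):
    return -(x - 0.3) ** 2

x_star = 0.3

rng = np.random.default_rng(0)
T = 200
samples = rng.uniform(0.0, 1.0, size=T)       # stand-in for the policy's choices x_n

inst_regret = f(x_star) - f(samples)          # r_n = f(x*) - f(x_n), always >= 0
cum_regret = np.cumsum(inst_regret)           # R_T for each prefix length T
avg_regret = cum_regret / np.arange(1, T + 1) # R_T / T; a no-regret policy drives this to 0
```

Plotting `avg_regret` against the round index is the regret plot in question; uniform sampling will plateau at a positive constant, while a no-regret policy's curve decays toward zero.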
Sampling Policies
This is speculation.
$\epsilon$-greedy: with probability $\epsilon$, choose a random point (explore); otherwise, choose the point with the highest estimated reward (exploit).
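A minimal $\epsilon$-greedy sketch on a discrete (multi-armed) version of the problem, assuming Bernoulli rewards with hypothetical true means; this is illustrative only, not the paper's setup:

```python
import numpy as np

def eps_greedy(true_means, n_rounds=1000, eps=0.1, seed=0):
    """Simulate epsilon-greedy on Bernoulli arms with the given true means."""
    rng = np.random.default_rng(seed)
    k = len(true_means)
    counts = np.zeros(k)   # pulls per arm
    sums = np.zeros(k)     # total reward per arm
    for _ in range(n_rounds):
        if rng.random() < eps:
            arm = int(rng.integers(k))                 # explore: uniform random arm
        else:
            means = sums / np.maximum(counts, 1)       # empirical means (0 for unpulled arms)
            arm = int(np.argmax(means))                # exploit: best empirical mean
        reward = rng.random() < true_means[arm]        # Bernoulli draw
        counts[arm] += 1
        sums[arm] += reward
    return counts, sums
```

With a large gap between arms (e.g. means 0.2 vs 0.8), the better arm quickly dominates the pull counts, but the constant $\epsilon$ exploration rate means average regret plateaus at roughly $\epsilon$ times the mean gap rather than converging to zero.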
UCB: Upper Confidence Bound — choose the point maximizing the empirical mean plus a confidence-width bonus, so poorly-sampled points get an optimism boost that drives exploration.
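A sketch of the classic UCB1 rule in the same hypothetical Bernoulli-arms setting as above (the setting is an assumption for illustration):

```python
import numpy as np

def ucb1(true_means, n_rounds=1000, seed=0):
    """Simulate UCB1: pull the arm maximizing mean_i + sqrt(2 ln t / n_i)."""
    rng = np.random.default_rng(seed)
    k = len(true_means)
    counts = np.zeros(k)
    sums = np.zeros(k)
    for t in range(1, n_rounds + 1):
        if t <= k:
            arm = t - 1                                    # pull each arm once to initialize
        else:
            means = sums / counts
            bonus = np.sqrt(2.0 * np.log(t) / counts)      # confidence width shrinks with pulls
            arm = int(np.argmax(means + bonus))
        reward = rng.random() < true_means[arm]            # Bernoulli draw
        counts[arm] += 1
        sums[arm] += reward
    return counts, sums
```

Unlike $\epsilon$-greedy, UCB1's exploration decays on its own: the bonus term shrinks as an arm accumulates pulls, which is what yields the no-regret property ($R_T / T \rightarrow 0$) with logarithmic cumulative regret.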