two issues regarding sensitivity estimation

The attention weights are supposed to be something like expected KL after doing some number of rejuvenation moves. Given this, I have two fundamental insights about what we may be doing wrong in estimating expected KL.

1. Expected KL cannot be higher than maximum KL within the proposal density.

At the moment we may run into the case when expected KL is higher than the maximum KL from the proposal distribution:

In this example we have the seed and we sample from the proposal distribution (ancestral sampling in the latest MOT case). We compute the logscore of the sample (10.0 for argument's sake) as well as KL from the seed (e.g. 0.5 and, crucially, let's assume that this is the max KL possible from the proposal density). Let's say we have only this sample and we're trying to get to expected KL from this one sample. Now what we have been doing is either (1) we multiply by the exponentiated logscores and then log it or (2) we normalize the logscores among samples within the latent, multiply normalized expnonentiated logscores and then we add the logsumexp of the unnormalized logscores. But both of these would result in an estimate of expected KL higher than the max KL!

What makes sense instead is that with each sample, the KL of that sample has to be bounded by the maximum possible KL within the proposal distribution. Let me address the second issue now and then I'll suggest one solution that will solve both issues.

2. The sensitivity is extremely dependent on the seed quality.

The second issue may be even more important. The insight here is that sensitivity at the moment is very dependent on the initial seed quality in the sense that if one of the trackers is in a low posterior region within that seed, but can be corrected by, say, only one rejuvenation move, this is not reflected in sensitivity. In other words that tracker will be screaming "attend to me!" even though it can be easily corrected while there may be other trackers that initially have very low influence on KL, but if we move them to another region they may have a big influence. Take the extreme scenario below:

Image from iOS (11)

In this case, current sensitivity estimation may suggest high expected KL because all of the samples are higher probability (let's assume they all lead to some moderate KL). But note that in the case of actual rejuvenation, we just need to do the move once to get out of the low probability region. After that it may be that we'll not accept many samples or that these samples will not have high KL.

Proposal: Simulating the MH walk

I propose a simple procedure that would actually get at expected KL after doing several rejuvenation moves. Let's say we determine that 5 rejuvenation moves per tracker will give us an appropriate estimate of the expected KL. We initialize sum of KL at 0. Then we simply do MH moves. If we accept, we add the KL of that move to the sum of KL, if we reject we do nothing (add 0 to the sum of KL). At the end we divide the sum of KL by 5 and we get our expected KL.

NB: we could even not "waste" our rejuvenation moves for KL estimation by actually doing rejuvenation moves on the belief.

a few quick comments, i'll respond in full tomorrow.

I'm interested in the concept underlying the proposal although I don't think this particular sketch would work. Every accepted proposal changes the sensitivity landscape and thus KL. One possibility is have a march where you compare proposals for each tracker against each other and pick one winner but that is just a less efficient process compared to what we have.

i think concern two is valid in a way but not in the sense of your drawing. The posterior in our space is unintuitively complex. There are many local maxima that the particles jump to. Rather than having the seed in a chasm, it is in a local maxima and needs a push over some chasms to get to a better spot. This is why our current approach is to use one-at-a-time based off the MAP.

I think its also crucial to note that sensitivity is only of interest from the perspective of approximation. Sensitivity is not defined for the true posterior. In other words, the quality of the seed is irrelevant given that the ideal posterior is able to explain the observation well. Its impossible for sensitivity to give us a "bad" path to explore because the "badness" of those paths is a consequence of the generative model

concern one isn't really strong. We never really computed sensitivity that way. We only used log score to approximate the expectation independently for each latent and I don't see anything wrong with that (sensitivity based on jensen-shannon divergence is self-normalized).

i will also post these tomorrow but in the isr dataset there are very precise moments when attention kicks in due to sensitivity

You make some good points. My proposal is probably not going to work in practice. It's true that posterior is complex and it's very unlikely to be in some minimum like the one I drew.

Perhaps what I'm saying is that the I'm not completely sure whether our current approximation is correct. There are two main points here. We're approximating expected JS (jensen-shannon) divergence for one rejuvenation move, but at the end of the day we do not just one, but several sweeps per tracker (at least some of the time). So perhaps we want to take into account the dynamics of MH when doing several moves. "Every accepted proposal changes the sensitivity landscape and thus KL." -- that's precisely my point: the landscape is constantly shifting with every sweep, but we're only illuminating the landscape for the very first step of the journey. But I guess you're right that in practice this may not be that important. As in, it may be enough to focus on estimating JS for one rejuvenation move as this will be a good guess as to how much JS is expected upon rejuvenating the tracker in general.

The second, more important point is about estimating the expected JS even for one rejuvenation move. I was always a bit confused about how to use the weight of the proposed sample returned by regenerate.. Like it made sense to somehow take the weights into account when estimating the expected JS, but all of the options had something weird about them. The good news is that I think I figured this out now with a toy example below (also, the math below works out that expected JS will be bounded by max JS, in regards to issue 1):

Image from iOS (12)

The most important part is this: to calculate expected JS after one rejuvenation move you need to take into account

the probability of that sample under the proposal distribution (note if you're sampling from the prior, then the proposal distribution is symmetric and so the proposal distribution disappears from the acceptance ratio)
the acceptance ratio of that sample (this evaluates the sample with respect to the posterior)
JS of that sample

Say you have 3 samples. To calculate expected JS, for each sample first multiply the acceptance ratio (which is bounded by 1) of that sample with the JS of that sample. You get 3 expected JSs given the specific proposal samples. Then finally, you need to weigh each expected JS given the sample by the normalized probability of the sample under the proposal distribution.

In our case, the proposal distribution is the prior. So we need to assess the probability of drawing that sample under the prior (note that the weight from regenerate has nothing to say about that -- it is only used in calculating the acceptance ratio a = min(1, exp(weight))).

Not sure how much this will help with approximation itself. This is more of a theoretical insight, which seems right (unless I'm missing something). Would be interested to hear your thoughts.

I think you were right that it's fine to normalize, but the normalization is not with respect to the acceptance weight (which is the weight under the posterior), but rather with the weight under the prior. The acceptance weight does come in in the form of how much JS you can expect after sampling the point from the proposal distribution.

I have a proposal for dynamic attention using sensitivity. Here me out, I think this could address a couple of challenges at the same time!

The proposal is to do something like early stopping with JS. The basic idea is instead of estimating expected JS, we attend everywhere and adapt our attention based on past JS within that timestep. One particular implementation: we start by attending to all trackers, say, for 3 rejuvenation steps. Then after the 3rd step we start computing a moving average (say, over 3 moves) for JS statistics. If the moving average is above some threshold, then we continue attending on that tracker (until we max out on the sweeps). If it's below, then we stop paying attention to that tracker.

Issues addressed:

Lost compute in expectation estimation (this was one of the things raised during the CogSci workshop). With this approach we don't need to simulate attention to then apply attention, we can just do attention directly and dynamically adapt within timestep.
Multi-step nature of attention in our model. Instead of computing expected JS for the first step, this procedure would allow us to take into account the changing landscape after each rejuvenation step.

CNCLgithub / mot