CNCLgithub / mot

Model implementation for "Adaptive computation as a new mechanism of human attention"

our framework captures bottom-up attention/salience #45

Closed (eivinasbutkus closed this issue 3 years ago)

eivinasbutkus commented 3 years ago

I think our attention framework can capture what one may call "bottom-up attention". I want to present the idea again and see what you think.

My reasoning:

  1. There is always a basic task that the perceptual system has to do, namely to explain the incoming sensory data in some general fashion. This is done automatically irrespective of the current higher-level task like MOT. You can think of this as the "grand task" that perception has learnt to do through evolution, making sure that the agent doesn't get eaten, doesn't fall off a cliff, etc.
  2. The perceptual system explains the incoming signal by searching for the simplest hypothesis/model that accounts for the sensory data. We could formalize this as searching for a model whose evidence reaches some threshold.
  3. Distinctive things draw attention because they require accommodations within this hypothesis. For instance, in the image below, the first hypothesis explored may be "red dots on the left, blue dots on the right", but the green dot on the right does not fit this simple hypothesis. That forces the search to continue and a separate representation to be instantiated for that dot -- that's why/how the green dot attracts attention within our framework (a toy sketch of this search follows the image).

[Image: red dots on the left, blue dots on the right, with a single green dot among the blue dots]
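To make the threshold idea in (2)-(3) concrete, here is a minimal Python sketch. Everything in it is an illustrative assumption (the toy scene, the evidence score, the complexity penalty, and the threshold value), not anything from our actual implementation:

```python
# Hypothetical sketch: search hypotheses from simplest to more complex and
# stop once the model evidence clears a threshold. The green dot forces the
# richer hypothesis that individuates it -- that extra accommodation is the
# attention signal.

# toy scene: (x position, color); one green dot sits among the blue dots
dots = [(-1.0, "red"), (-0.8, "red"), (0.9, "blue"), (1.1, "blue"), (0.7, "green")]

def log_evidence(hypothesis, dots):
    """Crude evidence score: data fit minus a complexity penalty."""
    fit = sum(0.0 if hypothesis(x, c) else -5.0 for x, c in dots)
    return fit - 1.0 * hypothesis.n_params       # more structure costs more

# Hypothesis 1: two group summaries ("red on the left, blue on the right")
groups = lambda x, c: (c == "red") if x < 0 else (c == "blue")
groups.n_params = 2

# Hypothesis 2: the same groups plus one individuated object (the green dot)
groups_plus_outlier = lambda x, c: (c == "red") if x < 0 else (c in ("blue", "green"))
groups_plus_outlier.n_params = 3

THRESHOLD = -4.0                                  # assumed acceptance level
for hyp in (groups, groups_plus_outlier):
    score = log_evidence(hyp, dots)
    print(hyp.n_params, score)                    # -> "2 -7.0", then "3 -3.0"
    if score >= THRESHOLD:
        break                                     # simplest adequate hypothesis wins
```

The point is only that the coarse hypothesis fails on the green dot, so the search has to pay for one extra individuated representation before the evidence clears the threshold.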

eivinasbutkus commented 3 years ago

Just wanted to share a couple of thoughts as I've been preparing for interviews:

  1. I think we can have another linking hypothesis between attention and probe detection. Note that, given a particular hypothesis about the state, the receptive fields are independent. So if we jitter some local part of the state (e.g. just the green dot above), we only need to reevaluate a local region in image space (i.e. the receptive field that the green dot maps to). This may explain why flashing a probe in the green-dot region is more salient (even if the eyes are fixated at the center): when a new hypothesis is proposed for the green dot, only the likelihood of its local receptive field needs to be reevaluated, and if a probe is flashed during that reevaluation, the resulting likelihood collapse has to be addressed and the probe gets instantiated in the belief space. (I got the idea of efficient likelihood reevaluation from Bob Rehder and the Markov blanket discussion, page 6, https://onlinelibrary.wiley.com/doi/pdf/10.1111/cogs.12839.) With receptive fields, we have a direct link between attention in the state and particular regions of the image; a toy sketch of this factorization follows below.
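Here is a small Python sketch of that factorization. The field layout, the Gaussian per-field term, and the caching scheme are assumptions for illustration, not our actual likelihood:

```python
# Hypothetical sketch: receptive fields are conditionally independent given
# the state, so perturbing one object's latent position only requires
# rescoring the receptive field(s) it projects to; all other terms are reused.
import numpy as np

N_FIELDS = 4                                  # image carved into 4 receptive fields

def field_of(position):
    """Map a 1-D object position in [0, 1) to a receptive field index."""
    return int(position * N_FIELDS)

def field_loglike(field_idx, positions, image):
    """Log-likelihood of one receptive field given the objects falling in it."""
    expected = sum(1.0 for p in positions if field_of(p) == field_idx)
    return -0.5 * (image[field_idx] - expected) ** 2   # toy Gaussian term

def total_loglike(positions, image, cache=None, dirty=None):
    """Sum the per-field terms, recomputing only the fields marked dirty."""
    if cache is None:
        cache = [field_loglike(i, positions, image) for i in range(N_FIELDS)]
    else:
        for i in dirty:
            cache[i] = field_loglike(i, positions, image)
    return sum(cache), cache

image = np.array([1.0, 0.0, 2.0, 1.0])        # toy per-field observations
positions = [0.1, 0.6, 0.65, 0.9]             # latent object positions

ll, cache = total_loglike(positions, image)   # full evaluation once

# jitter only one object: just its old and new receptive fields are dirty
old_field, positions[1] = field_of(positions[1]), 0.72
dirty = {old_field, field_of(positions[1])}
ll_new, cache = total_loglike(positions, image, cache, dirty)
print(ll, ll_new, dirty)
```

In this picture, the fields currently being rescored are exactly the image regions where a flashed probe would change the likelihood the model is looking at, which is the linking hypothesis above.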

And while skimming over this paper (http://www.cnbc.cmu.edu/~tai/readings/tom/itti_attention.pdf), I had these thoughts:

  1. As said above in the original post, I think bottom-up attention falls out of our model, given the general task of explaining the sensory signal. The potentially cool part is that, now that we have receptive fields, we can also create a saliency map. Given the connection between state and image regions described in (1), and given a generic task like maximizing model evidence, we can create saliency maps by simply aggregating the number of times the likelihood is reevaluated with respect to each receptive field (see the sketch after this list).
  2. Also, their "inhibition of return" naturally falls out of our model. There is no need to remember which high-attention places have already been visited -- those places are simply not interesting with respect to sensitivity.
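Here is a toy Python sketch of the saliency-map idea: count how often each receptive field's likelihood gets reevaluated during inference and normalize. The proposal scheme and the attention weights are made-up assumptions for illustration:

```python
# Hypothetical sketch: a saliency map as normalized counts of how often each
# receptive field's likelihood is reevaluated while jittering object states.
import numpy as np

N_FIELDS = 4
reeval_counts = np.zeros(N_FIELDS)            # one counter per receptive field

def field_of(position):
    """Map a 1-D object position in [0, 1) to a receptive field index."""
    return int(position * N_FIELDS)

def reevaluate(field_idx):
    """Stand-in for recomputing one receptive field's likelihood."""
    reeval_counts[field_idx] += 1

rng = np.random.default_rng(0)
positions = [0.1, 0.6, 0.9]                   # latent object positions
attention = np.array([0.1, 0.8, 0.1])         # e.g. the "green dot" draws most proposals

for _ in range(1000):
    k = rng.choice(len(positions), p=attention)            # pick an object to jitter
    proposal = float(np.clip(positions[k] + rng.normal(0, 0.05), 0.0, 0.999))
    reevaluate(field_of(positions[k]))        # old field must be rescored
    reevaluate(field_of(proposal))            # and the field the proposal lands in

saliency_map = reeval_counts / reeval_counts.sum()
print(saliency_map)                           # mass concentrates on the attended region
```

Inhibition of return would then come for free: once a region's hypotheses stop mattering for the evidence, proposals (and hence reevaluations) stop being directed there, so its saliency decays without any explicit memory of visited locations.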

I wonder what you think :)