NorbertZheng / read-papers

My paper reading notes.
MIT License

NeurIPS '22 | Adaptation Accelerating Sampling-based Bayesian Inference in Attractor Neural Networks. #37

Closed NorbertZheng closed 1 year ago

NorbertZheng commented 1 year ago

Dong X, Ji Z, Chu T, et al. Adaptation Accelerating Sampling-based Bayesian Inference in Attractor Neural Networks.

NorbertZheng commented 1 year ago

Related Reference

NorbertZheng commented 1 year ago

Use generative model to understand world

In perceiving the world, an organism's perceptual system collects observations drawn from the real-world data distribution

$$ O \sim p_{d}(O), $$

and everything we can learn about the real world is contained in it. The generative-model view holds that the core task of the agent is to fit a parameterized model to this distribution,

$$ p_{\theta}(O) \to p_{d}(O). $$

In the process of reconstruction, resource constraints force the model to learn the laws behind the data distribution. These laws are encoded by the latent variables $s$,

and the data distribution of the generative model can be written as

$$ p_{\theta}(O)=\int p_{\theta}(O|s)p_{\theta}(s)ds. $$

After the generative model learns this data distribution, every observation $O$ it encounters will form a conjecture $p_{\theta}(s|O)$ about the hidden variables behind it, thus forming our perception of the world.
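The marginalization above can be checked numerically. The sketch below uses a hypothetical 1D linear-Gaussian model (an illustrative stand-in for $p_{\theta}$, not the paper's model): $s \sim \mathcal{N}(0,1)$ and $O|s \sim \mathcal{N}(s, 0.5^{2})$, for which the marginal $p(O)$ has a closed form to compare against.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1D linear-Gaussian generative model (illustrative stand-in):
#   s ~ p(s) = N(0, 1)            latent variable
#   O | s ~ p(O|s) = N(s, 0.5^2)  observation
mu_s, sig_s, sig_o = 0.0, 1.0, 0.5

def p_obs(O, n_samples=200_000):
    """Monte-Carlo estimate of p(O) = integral of p(O|s) p(s) ds."""
    s = rng.normal(mu_s, sig_s, n_samples)
    lik = np.exp(-(O - s) ** 2 / (2 * sig_o ** 2)) / np.sqrt(2 * np.pi * sig_o ** 2)
    return lik.mean()

# For a linear-Gaussian model the marginal is N(mu_s, sig_s^2 + sig_o^2):
O = 0.7
var = sig_s ** 2 + sig_o ** 2
exact = np.exp(-(O - mu_s) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
print(p_obs(O), exact)  # Monte-Carlo estimate matches the closed form
```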

NorbertZheng commented 1 year ago

To give a simple example, when the visual system observes the picture in Fig. 1.A (middle), the information of interest is the orientation of the ruler in the picture (Fig. 1.A left); on what determines which information we attend to, see the earlier note #38, which will also be the focus of our follow-up work. Our nervous system then needs to compute the posterior distribution $p_{\theta}(s|O)$ to form a percept of the ruler's orientation. To describe this computation, we first consider how the neural system represents probability distributions.

Figure 1. (A) Schematic diagram of the generative model. (B) Schematic diagram of a continuous attractor neural network (CANN).

NorbertZheng commented 1 year ago

How the neural system represents probability distributions

When the visual system receives a stimulus $O$ (Fig. 1.A middle), neurons in the primary visual cortex fire accordingly, producing activity $I$ (Fig. 1.A right). This step is considered to be carried out by a feature extractor formed by the retina-LGN-V1 pathway. $I(x)$ denotes the firing rate at position $x$ in the space of the latent variable $s$, and $I$ is a function of $O$, i.e. $I=g(O)$; we then have

$$ p(O|s)=\left|\frac{\partial g}{\partial O}\right|p(I|s). $$

In experiments, we can calculate

$$ \begin{aligned} p(I(x)|s)&=\mathrm{Poisson}(\lambda(s)),\\ \lambda(s)&=\exp\left[-\frac{(x-s)^{2}}{2a^{2}}\right]. \end{aligned} $$

Therefore, we can read off the neural representation of the likelihood $p(O|s)$ from $I$, i.e.

$$ \begin{aligned} p(O|s) \propto p(I|s) &= \mathcal{N}(s|s^{\circ},\Lambda^{-1}),\\ s^{\circ}&=\frac{\int xI\,dx}{\int I\,dx},\\ \Lambda&=a^{-2}\int I\,dx. \end{aligned} $$

It can be seen that $p(O|s)$ is a Gaussian distribution over $s$, with the information about its mean and variance contained in the firing rates $I$; this is also called a probabilistic population code [2].
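The population-code readout above can be sketched numerically: generate Poisson spikes from Gaussian tuning curves, then recover the likelihood mean $s^{\circ}$ and precision $\Lambda$ from the activity. The amplitude $A$, width $a$, and grid below are illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Gaussian tuning curves lambda(s) = A * exp(-(x - s)^2 / (2 a^2)) over
# preferred positions x, with Poisson spiking (A, a, grid are illustrative).
x = np.linspace(-5.0, 5.0, 201)
dx = x[1] - x[0]
a, A, s_true = 1.0, 50.0, 0.8

lam = A * np.exp(-(x - s_true) ** 2 / (2 * a ** 2))
I = rng.poisson(lam).astype(float)      # observed population activity I(x)

# Read out the likelihood parameters directly from the activity:
#   s° = ∫ x I dx / ∫ I dx   (mean),   Lambda = a^-2 ∫ I dx   (precision)
s_hat = np.sum(x * I) / np.sum(I)
Lam = np.sum(I) * dx / a ** 2
print(s_hat, Lam)   # s_hat lies close to s_true; Lam scales with spike count
```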

NorbertZheng commented 1 year ago

On the other hand, we know that the prior probability $p(s)$ is independent of $O$, so its information needs to be stored in the nervous system, namely in the connections between neurons. The neural system needs to integrate the connection information containing the prior with the likelihood activity $I$ to obtain the posterior distribution [9]. When $p(s)=\mathcal{N}(s|\mu,L^{-1})$, the posterior distribution we finally need to compute is

$$ \begin{aligned} p(s|O) \propto p(O|s)p(s) &= \mathcal{N}(s|\kappa,\Omega^{-1}),\\ \kappa&=\Omega^{-1}(\Lambda s^{\circ}+L\mu),\\ \Omega&=\Lambda+L. \end{aligned} $$
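As a quick numerical check of the precision-weighted update above (the numbers are illustrative):

```python
import numpy as np

# Precision-weighted combination of likelihood N(s°, Lambda^-1) and prior
# N(mu, L^-1); symbols follow the text, numbers are illustrative.
s_obs, Lam = 0.9, 4.0   # likelihood mean s° and precision Lambda
mu, L = 0.0, 1.0        # prior mean and precision

Omega = Lam + L                          # posterior precision
kappa = (Lam * s_obs + L * mu) / Omega   # posterior mean
print(kappa, Omega)  # kappa = 0.72, Omega = 5.0
```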

Integrating the neuronal connections with the neuronal activity requires the dynamics of the neural system, which implement a sampling process for the posterior $p(s|O)$. Specifically, the following dynamical system performs Hamiltonian sampling (HDF [3]) of $p(s|O)$:

$$ \begin{aligned} \tau_{s}\frac{ds}{dt}&=\alpha^{-1}y,\\ \tau_{z}\frac{dy}{dt}&=-\beta\alpha^{-1}y+\Omega(\kappa-s)+\sqrt{\tau_{z}}\sigma_{y}\xi, \end{aligned} $$

and its steady state distribution is $\tilde{p}(s)=p(s|O)$. That is, a probability distribution is represented by the steady-state distribution of a dynamical system.

This representation approximates the target distribution with the points produced by the sampling dynamics, so the target distribution cannot be expressed instantaneously; it must be accumulated over a period of time. Sampling speed is therefore critical: the faster the sampling, the sooner the target distribution is expressed, and the better the animal can survive harsh natural competition. Compared with first-order sampling algorithms such as Langevin dynamics (FLD), Hamiltonian sampling is very fast.
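The momentum-based dynamics can be sketched with a minimal underdamped-Langevin simulation in the spirit of Eqs. (3)-(4), targeting the posterior $\mathcal{N}(\kappa,\Omega^{-1})$. Setting $\tau_{s}=\tau_{z}=1$ and choosing the noise scale $\sqrt{2\gamma}$ (unit temperature) are simplifying assumptions of this sketch, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

# Underdamped-Langevin sketch of Hamiltonian sampling: the target is the
# posterior N(kappa, Omega^-1).  tau_s = tau_z = 1 and the noise scale
# sqrt(2*gamma) (unit temperature) are assumptions of this sketch.
kappa, Omega, gamma, dt = 0.72, 5.0, 1.0, 1e-3
n_steps, burn_in = 400_000, 50_000
noise = np.sqrt(2 * gamma * dt) * rng.normal(size=n_steps)

s, y = 0.0, 0.0
samples = np.empty(n_steps - burn_in)
for t in range(n_steps):
    s += y * dt                                             # position driven by momentum
    y += (Omega * (kappa - s) - gamma * y) * dt + noise[t]  # restoring force + friction + noise
    if t >= burn_in:
        samples[t - burn_in] = s

print(samples.mean(), samples.var())  # approx kappa = 0.72 and 1/Omega = 0.2
```

The momentum $y$ lets the sampler traverse the distribution ballistically rather than diffusively, which is what makes Hamiltonian sampling faster than first-order Langevin dynamics.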

Figure 2. (A) Schematic diagram of Hamiltonian sampling. The bump formed by the network moves over time, thus forming the samples. (B) Phase diagram of Hamiltonian sampling behavior.

NorbertZheng commented 1 year ago

Implementing Hamiltonian sampling method with Continuous Attractor Neural Network (CANN)

The Continuous Attractor Neural Network (CANN) is widely used to describe how the neural system encodes continuous variables [4]. For a variable $s \in \mathbb{R}$ that needs to be encoded, a group of neurons with different preferences for $s$ are connected to each other (Fig. 1.B). With $r(x,t)$ and $U(x,t)$ denoting, respectively, the firing rate and recurrent input strength of the neuron with preferred position $x$ at time $t$, the network dynamics are

$$ \tau_{s}\frac{\partial U(x,t)}{\partial t}=-U(x,t)+\rho \int_{x'}W(x,x')r(x',t)dx'+\gamma I^{ext}(x)-V(x,t), $$

where $W(x,x')$ is the connection weight between neurons, $I^{ext}$ is the input from the feedforward network encoding the likelihood $p(O|s)$, i.e. $I^{ext}=g(O)$, and $V(x,t)$ is the negative-feedback modulation, realized by the dynamics of calcium ion channels,

$$ \tau_{z}\frac{\partial V(x,t)}{\partial t}=-V(x,t)+mU(x,t)+\sigma_{V}\sqrt{\tau_{z}U(x,t)}\xi(x,t), $$

where $m$ is the strength of negative feedback. The connection weights between neurons are given by Hebbian learning [5]:

$$ \tau_{w}\frac{dW(x,x')}{dt}=-\eta_{1}(W(x,x')-\tilde{W}(x,x'))+\eta_{2}U(x,t)U(x',t), $$

where $\tilde{W}(x,x')=\exp\left[-\frac{(x-x')^{2}}{2a^{2}}\right]$.

Setting $I^{ext}$ to 0, the steady state of $U(x,t)$ induced by $W(x,x')$ is exactly the prior distribution $p(s)$; i.e., $W(x,x')$ stores the prior distribution, and $W(x,x')$ defines the sampling policy.
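A minimal bump-formation sketch of the CANN dynamics (Eq. (5)-like, with adaptation switched off, $m=0$): the text does not spell out the firing-rate function $r(U)$, so the standard CANN choice of divisive normalization, $r = U_{+}^{2} / (1 + k\rho\int U_{+}^{2}dx)$, is assumed here; all parameter values are illustrative.

```python
import numpy as np

# Minimal CANN bump sketch.  The divisive-normalization firing rate is the
# standard CANN choice, assumed here since the text does not give r(U);
# adaptation V is switched off (m = 0).  All parameters are illustrative.
N = 128
x = np.linspace(-np.pi, np.pi, N, endpoint=False)
dx = x[1] - x[0]
a, rho, k, tau_s, dt = 0.5, 1.0 / dx, 5e-4, 1.0, 0.05

# Gaussian recurrent weights W(x, x')
diff = x[:, None] - x[None, :]
W = np.exp(-diff ** 2 / (2 * a ** 2)) / (np.sqrt(2 * np.pi) * a)

I_ext = 0.5 * np.exp(-(x - 0.3) ** 2 / (4 * a ** 2))   # weak stimulus at s = 0.3

U = np.zeros(N)
for _ in range(2000):
    Up = np.maximum(U, 0.0)
    r = Up ** 2 / (1.0 + k * rho * np.sum(Up ** 2) * dx)   # divisive normalization
    U += dt / tau_s * (-U + rho * (W @ r) * dx + I_ext)

# Center of mass of the bump tracks the stimulus position
s_hat = np.sum(x * np.maximum(U, 0)) / np.sum(np.maximum(U, 0))
print(s_hat)  # bump centers near the stimulus position 0.3
```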

NorbertZheng commented 1 year ago

Perhaps cortical areas encode the priors of different sensory observations; since these are orthogonal, each is equivalent to a likelihood given any of the other priors. They would then be computing joint probabilities rather than posterior probabilities. Just like the integration in multi-head attention?

Replace $s$ with $p$, which represents different memory items. Then, we have

$$ p(p|O) \propto p(O|p)p(p), $$

$p(p)$ represents the prior over memory items, which can be non-uniform even without action. And $p(p)$ should be the steady state of some sampling policy (MCMC or Hamiltonian sampling); e.g. a uniform $p(p)$ corresponds to a diffusion sampling policy. $O$ is the joint sensory observation, which may come from many sources. It seems that the queried weights are exactly $p(p|O)$. The hippocampus would then just be a diffusion sampler.

TEM (and TEM-t) does not use $p(p)$, because there are no dynamics in the hippocampus at all. Unlike the original Hebbian learning, TEM reduces the effect on retrieval of memories that have been experienced too often. No $p(p)$ is used in TEM at all; i.e., TEM directly uses $p(O|p)$ to estimate $p(p|O)$, or equivalently sets $p(p)$ to a uniform distribution.

NorbertZheng commented 1 year ago

Previous work [6] proved that the state of a continuous attractor network can only move along a one-dimensional manifold of bump states, which constitutes the attractor space

$$ \tilde{U}_{s}(x)=u_{0}\exp\left[-\frac{(x-s)^{2}}{4a^{2}}\right], $$

where $u_{0}$ is the height of the bump. Our work further demonstrates that the transitions of the network state along this manifold implement Hamiltonian sampling of the latent variable (Fig. 2.A). Specifically, formulas $(5)$-$(7)$ can be simplified to

$$ \begin{aligned} \tau_{s}\frac{ds}{dt}&=-\frac{L}{\alpha}(s-\mu)-\frac{\Lambda}{\alpha}(s-s^{\circ})+mz,\\ \tau_{z}\frac{dz}{dt}&=-z+\tau_{z}\frac{ds}{dt}+\sigma_{z}\sqrt{\tau_{z}}\xi. \end{aligned} $$

It can be proved that formulas $(8)$-$(9)$ above are equivalent to the Hamiltonian sampling formulas $(3)$-$(4)$. In this Hamiltonian system, the adaptation variable $z$ plays the role of the momentum $y$.
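The reduced dynamics can be integrated directly with Euler-Maruyama to check that the time-averaged state tracks the posterior mean $\kappa=\Omega^{-1}(\Lambda s^{\circ}+L\mu)$. All parameter values below are illustrative, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(3)

# Euler-Maruyama integration of the reduced dynamics (8)-(9); parameter
# values (L, Lambda, alpha, m, sigma_z, ...) are illustrative.
L_, Lam, alpha, m = 1.0, 4.0, 1.0, 0.5
mu, s_obs = 0.0, 0.9                       # prior mean, likelihood mean s°
tau_s, tau_z, sig_z, dt = 1.0, 1.0, 0.3, 1e-3
n_steps, burn_in = 500_000, 100_000
noise = sig_z * np.sqrt(dt / tau_z) * rng.normal(size=n_steps)

kappa = (Lam * s_obs + L_ * mu) / (Lam + L_)   # posterior mean the sampler should track

s, z, acc = 0.0, 0.0, 0.0
for t in range(n_steps):
    dsdt = (-(L_ / alpha) * (s - mu) - (Lam / alpha) * (s - s_obs) + m * z) / tau_s
    z += dt / tau_z * (-z + tau_z * dsdt) + noise[t]   # adaptation acts as momentum
    s += dsdt * dt
    if t >= burn_in:
        acc += s

mean_s = acc / (n_steps - burn_in)
print(mean_s, kappa)  # time-averaged s approaches kappa = 0.72
```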

NorbertZheng commented 1 year ago

We also analyzed in detail how the network parameters influence the sampling rate, obtaining the parameter range that yields the fastest sampling (Fig. 2.B). Interestingly, when $I^{ext}=0$, i.e. when the neural system receives no external input, the fastest-sampling parameter range corresponds to the Lévy flight, the optimal random-search strategy widely observed in animal trajectories in nature (the sampled variable here corresponds to the real spatial position, generally considered to be encoded in the hippocampus). This connects to our work at NeurIPS last year [8].

For GCNs, GraphSAINT-RW is exactly a Lévy flight. This is not surprising: if all nodes are accessible from the start (which limits the flexibility of the representation), small-world connectivity plus long-range links must be optimal, and that is a Lévy flight. After all, diffusion tends to oversample the surrounding nodes early on, which is not very sample-efficient.

NorbertZheng commented 1 year ago

Then we generalize the conclusion to a high-dimensional latent variable $s \in \mathbb{R}^{M}$. This requires that each attractor network $U_{i}(x,t)$ encode the corresponding dimension $s_{i}$ of the latent variable, with mutual connections between the attractor networks storing the prior distribution $p(s)$. When the neural system receives the observation $O$, each attractor network samples the marginal distribution $p(s_{i}|O)$ by Hamiltonian sampling, while the network as a whole samples the joint distribution by Hamiltonian sampling (Figure 3). As a simple example, in the contour-integration task paradigm, subjects can identify continuous contour information $p(s_{i}|O)$ (Fig. 4 right) from isolated, noisy image patches $O$ (Fig. 4 middle). This is because the contours we see are mostly continuous (Fig. 4 left), giving the prior $p(s)$ a strong continuity bias.

Figure 3. Implementing the Hamiltonian sampling method with a Continuous Attractor Neural Network (CANN).

Figure 4. An example of contour integration, where subjects can generate contour information after seeing observations.