NorbertZheng / read-papers

My paper reading notes.

arXiv '22 | How to build a cognitive map: insights from models of the hippocampal formation. #20

Closed NorbertZheng closed 2 years ago

NorbertZheng commented 2 years ago

Whittington, James CR, et al. How to build a cognitive map: insights from models of the hippocampal formation.

NorbertZheng commented 2 years ago

Abstract

Learning and interpreting the structure of the environment is integral to guiding flexible behaviors for evolutionary viability, and the concept of a cognitive map has emerged as one of the leading metaphors for these capacities. Theorists have been busy building models to bridge the divide between neurons, computations, and behavior. These models can account for a variety of known representations and neural phenomena, but provide:

In this perspective, we

NorbertZheng commented 2 years ago

Introduction

Cognitive maps were originally proposed as internal neural representations affording flexible behavior, such as planning routes or taking never-before-seen shortcuts. More recent descriptions formalize this view with the key concept of generalization. Here, the fundamental role of cognitive maps is to organize knowledge, facilitating the generalization of this knowledge to novel experiences and thus enabling the rapid inference from sparse observations that characterizes biological intelligence. So here we characterize the role as few-shot learning, which is already implemented by James #16 . This terminology is also similar to "meta-learning" in psychology, which is not implemented by James. Meta-learning requires learning the rule behind the rules, i.e. there is a latent rule behind the explicit rule. It seems that we have a set of rules that we can strengthen according to how well each matches the various tasks being performed, and the match can be partial. For example, a certain area may conform to the square transition rule, with its scale changed to adapt, and the whole map is naturally grouped together in the hippocampus.

NorbertZheng commented 2 years ago

Entorhinal cortex is known to support spatial cognition, and the characteristic hexagonal firing pattern of entorhinal grid cells is also found when animals navigate abstract spaces, e.g. human ERC & mPFC, monkey mPFC, etc. These parallels in representation suggest the mechanism for understanding the spatial cognitive map might, in fact, be an instance of a more general coding mechanism capable of building abstract cognitive maps covering any domain. Looks like we need to refer to the whimsy from Numenta!

NorbertZheng commented 2 years ago

In recent years, many models of the hippocampal formation have attempted to do this, providing explanations of neural data and offering falsifiable predictions. While greatly informative, these models differ in their focus and the language of their formalism, obscuring the overall direction and vast potential of this work. The aim of this Perspective is to clarify the common theory underlying these models, while providing novel results offering normative explanations for a range of old and new neural phenomena, just like #19 does. We note that there are many other theoretical accounts of cognitive maps which do not address this issue of representation learning, e.g. Radulescu et al. 2021, Sanders et al. 2020, Stoianov et al. 2020. While these models have provided mechanistic insights, they are not discussed in detail here.

NorbertZheng commented 2 years ago

The cognitive mapping problem

Cognitive maps organize knowledge to afford flexible behavior:

In order to achieve this, there are some requirements and desiderata for the neural representations of the cognitive map. Here, we describe these computational considerations and explain the models relevant to each. We aim to provide a clear conceptual understanding of the interlinked ideas.

NorbertZheng commented 2 years ago

Reinforcement Learning and planning

To afford successful behavior, cognitive maps must represent state (a particular configuration of the world). Reinforcement learning (RL) is a formalism of this concept: actions are taken based on the current world state. Representing the entire world state is often infeasible due to the "curse of dimensionality", so we need an appropriate state abstraction; learning, or attending to, the appropriate abstraction is a central issue of the cognitive mapping problem.

We have to note that RL typically assumes that the underlying state-space is fixed, and that the state-space in RL is tightly linked to behavior (through rewards, values, and policies). Classic (or model-free) RL slowly learns the value of states (value-based) or which actions are good in which states (policy-based), and requires no knowledge of how states relate to each other. While this is provably optimal in the long term, value-based learning is often inflexible and slow, especially in situations where the dynamics of the environment keep changing.

Knowing the relationships allows you to flexibly plan routes between any start and any goal state. Unfortunately, traditional planning mechanisms (such as tree-search, which only uses local transition information) are computationally costly, but alternatives do exist, e.g. Silver et al. 2018, Botvinick et al. 2012, Bush et al. 2015. More broadly, with a clever representation of the state-space (see next subsection), the cost of planning can be reduced, or even completely avoided (e.g. the successor representation, though it is passive and policy-dependent, so not ideal).

This is a powerful way to formalize the central goal of cognitive maps:

NorbertZheng commented 2 years ago

Space as a state-space

The abstract location can be represented in a variety of ways; for example,

The choice of which representation to use has major consequences. For example,

These two representation types are analogous to place and grid cells in the hippocampal formation:

By a clever choice of representation, grid cells prevent the need for computation.

NorbertZheng commented 2 years ago

Non-spatial state-spaces

While it is easy to intuit good state-spaces in physical space, this problem becomes less clear in non-space. One promising approach, derived from RL, is to cast spatial learning as understanding relationships on a graph, and the previously mentioned coordinates are built on such a graph (instead of continuous space).

This is a re-conceptualization of a map in terms of its connections (topology), as opposed to distances (geometry). Graphs afford planning via transition matrix T (just like tree-search).

The problem of building graphs for cognitive maps is the same problem as building state-spaces in reinforcement learning. However, once the state space is defined, there is a further choice of how each state is actually represented. This is fine as long as there is no state coupling, but the representation of states can vary widely, with different representations suitable for different functions. A clever choice of representation can reduce online value/policy computations. This has allowed normative mathematical theories to predict neural representations.
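
As a toy illustration (not from the paper), here is a minimal numpy sketch of a state-space as a graph: the adjacency matrix captures topology, and the row-normalised transition matrix T supports tree-search-style planning by composing one-step transitions.

```python
import numpy as np

# A 4-state ring graph as a toy state-space (illustrative values only).
# A encodes topology (connections), not geometry (distances).
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
T = A / A.sum(axis=1, keepdims=True)       # row-normalised transition matrix

# Tree-search-style planning composes one-step transitions:
# (T @ T)[i, j] is the probability of reaching j from i in exactly two steps.
two_step = np.linalg.matrix_power(T, 2)
print(two_step[0])
```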

Reinforcement learning is concerned with taking appropriate actions at specific states s to maximize the expected (discounted by γ) sum of future rewards: image If you can assign credit to each state (like these equations do, using the Bellman equation as the update rule), you obtain specific statistics of the reward distribution, e.g. its mean. Then taking good (not necessarily optimal!) actions is easy: just go to the neighboring state with the highest value v(s'). This is often called the prediction problem, and we can go further to solve the control problem: as in #13 , credit assignment is called policy evaluation, and taking the actions with the highest value is called policy improvement. Iterating this policy iteration yields the optimal policy. And if we have the transition model, as mentioned in #19 , we can do mental simulation (or offline planning) to reduce the burden of online updates (or planning). As we mentioned before, RL typically assumes that the underlying state space is fixed. Obviously, this cannot always be satisfied, for either the transition statistics or the state space itself; maybe the latter can be reflected in the former, e.g. via state degeneracy (maybe not, for it might break the Markov property of the transition statistics, making them history-dependent, which the transition statistics cannot handle).
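
A minimal sketch of the prediction/control loop described above, on a hypothetical 5-state chain with made-up rewards: policy evaluation via the Bellman update, after which policy improvement just means moving to the neighbour with the highest value.

```python
import numpy as np

n, gamma = 5, 0.9
r = np.array([0.0, 0.0, 0.0, 0.0, 1.0])        # reward only in the final state (toy values)
T_pi = np.eye(n, k=1); T_pi[-1, -1] = 1.0      # current policy: always step right

v = np.zeros(n)
for _ in range(100):                           # policy evaluation (prediction problem)
    v = r + gamma * T_pi @ v                   # Bellman update under policy pi

# Policy improvement (control problem): from state s, act greedily towards the
# neighbouring state s' with the highest v[s'].
print(np.round(v, 3))
```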

RL state-spaces define graphs via the transition matrix, which means that whatever the graph representation is, we represent connections between states in terms of transition distance. If we train a model (e.g. TEM) to predict the next state (only one step away from the current state), then this graph representation will emerge.

Another graph representation, the successor representation (SR) #11 , is particularly relevant to cognitive maps. Critically, if we represent connections between states in the world in terms of SR-distance, then computing value is easy, since the SR is one half of the value computation v=Sr. This factorizes the computation of v, but the SR is policy-dependent. The SR is well suited to planning tasks, since it considers both the effect of distance (via γ) and the probability of transferring from the start location to the goal location (possibly over multiple steps). If we train a model (e.g. TEM-OVC) to navigate to a shiny object, then this graph representation will emerge in the factorized local OVC map. image
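
A minimal sketch of this factorization on the same toy chain as above (assuming the "always step right" policy): the SR matrix S = (I − γ T_π)^{-1} is computed once, and value is then just v = S r.

```python
import numpy as np

n, gamma = 5, 0.9
r = np.array([0.0, 0.0, 0.0, 0.0, 1.0])
T_pi = np.eye(n, k=1); T_pi[-1, -1] = 1.0       # same toy policy as above

# Successor representation under policy pi: expected discounted future state occupancy.
S = np.linalg.inv(np.eye(n) - gamma * T_pi)     # S = (I - gamma * T_pi)^(-1)
v = S @ r                                       # value factorizes as v = S r
print(np.round(v, 3))
```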

Is the OVC a kind of successor representation? But OVCs often carry a vector, so they resemble p(s'|s,a) rather than p(s'|s). Because it derives from a theory of learning, the SR can also account for behavioral phenomena that are otherwise hard to explain, as shown in Momennejad et al. 2017.

One prominent issue with the SR, however, is its policy-dependence. This means that when rewards move - or, worse, when obstacles appear - value calculations using the SR are no longer optimal. Piray et al. 2020 addresses this problem using linear RL. The required default representation (DR) resembles the SR. The model further provides a novel account of how to build world representations compositionally out of component cell representations (e.g. how grid and border cells interact to represent the insertion of a barrier; see Mark et al. 2020 for more details).

NorbertZheng commented 2 years ago

Latent states and sequence learning

How do we know which graphs to build? Our world is not "fully observable"; instead, we face "partially observable" problems and must infer latent state representations. We have to infer latent states from sensory sequences, e.g. the clone-structured cognitive graph (CSCG); building a latent state-space map can be used to afford different behaviors in sensorially identical situations. Neural representations in the hippocampal formation disambiguate states using latent representations. For example, rodent grid cells will initially code identically for two identical boxes. However, after realizing that the boxes are connected by a corridor, the grid representation changes to become consistent with the global two-box-and-corridor space. These are latent state representations that disambiguate sensorially aliased boxes due to their different futures (TEM also does this activation transformation, though maybe not in exactly the same way). Splitter cells, place cells, grid cells, lap cells, and others are all cellular examples of the cognitive map disambiguating the world into latent states. image

The CSCG model is an elegant approach for building de-aliased state-spaces. Here, the hippocampus contains multiple "clone cells" for each sensory observation. This model uses Bayes to:

These transition weights are analogous to the transition matrix for graphs, but, critically, the state-space is learned rather than provided by the modeller - this is the key difference between CSCG and the following models. image

CSCG infers the whole latent space within the hippocampus (as opposed to the cortical input to the hippocampus, which may be designed by the modeller). This enables learning rules to be local, biologically plausible, and fast. It looks like the proposal I mentioned in #16 : the hippocampus may have its own dynamics, which supports the graph-learning process of entorhinal cortex. By contrast, CSCG has to learn each map de novo and cannot benefit from having learnt similar maps before. It is exciting to think how these benefits may be combined, e.g. see the "Complementary maps in hippocampus and cortex" section below. image

CSCG is closely related to hidden Markov models. From a sequence of sensory observations X={x_1,x_2,x_3,..,x_T} and actions A={a_1,a_2,a_3,...,a_T}, we wish to infer discrete latent states Z={z_1,z_2,z_3,...,z_T}.

Modelling the full sequence of observations is then: image And we have p(x|z_i∈C(x))=1 whereas p(x|z_i∉C(x))=0, where C(x) are the clones of x. CSCG marginalizes over z and uses the expectation-maximization (EM) algorithm to train the model - that is, to learn an appropriate set of transition probabilities p(z_t,a_t|z_{t-1}) and infer z_t. Once trained, this model can be used for planning, though not in the same way as the successor representation.
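
A heavily simplified sketch of the clone structure (not the published CSCG code; actions are omitted and all sizes are made up): each observation owns a fixed set of clone states, emissions are deterministic, and a forward pass restricts probability mass to the clones of the current observation.

```python
import numpy as np

n_obs, clones_per_obs = 3, 2
n_z = n_obs * clones_per_obs
clones = {x: list(range(x * clones_per_obs, (x + 1) * clones_per_obs))
          for x in range(n_obs)}               # C(x): clone states of observation x

rng = np.random.default_rng(0)
T = rng.random((n_z, n_z))
T /= T.sum(axis=1, keepdims=True)              # in CSCG these transitions are learned by EM

def forward(observations):
    """p(z_t | x_1..x_t): deterministic emissions keep mass on clones of x_t."""
    alpha = np.zeros(n_z); alpha[clones[observations[0]]] = 1.0 / clones_per_obs
    for x in observations[1:]:
        alpha = alpha @ T                      # propagate through transitions
        mask = np.zeros(n_z); mask[clones[x]] = 1.0
        alpha *= mask                          # zero out clones of other observations
        alpha /= alpha.sum()
    return alpha

print(np.round(forward([0, 1, 0, 2]), 3))
```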

NorbertZheng commented 2 years ago

Path integration and compression

Inferring latent states is really a problem of understanding where you are in an abstract space. Entorhinal grid cells are considered an attractive substrate for path integration of two-dimensional spaces since:

Using (x,y)-coordinates to organize graphs offers a benefit compared to representing every individual connection between nodes: adding a new node immediately implies all other connections without needing to observe those relationships explicitly (even without offline planning). Path integration doesn't require knowledge of the entire world; it treats all nodes equally, and relationships are structured (it does not care about the specific meaning of a relationship, only about how to integrate among relationships). As such, only the few rules of path integration need to be known, not every possible relationship - that's why TEM can transfer between environments where the same rules apply.
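
A toy illustration of this point: once a new node is given path-integrated (x, y) coordinates, its relation to every existing node is implied by vector differences, without any of those relationships being observed directly.

```python
import numpy as np

# Known nodes with path-integrated (x, y) coordinates (toy values).
nodes = {"A": np.array([0.0, 0.0]),
         "B": np.array([2.0, 1.0]),
         "C": np.array([1.0, 3.0])}

# A new node D is reached by path integration; its coordinate alone implies
# its relation to every other node, without observing those relations explicitly.
nodes["D"] = nodes["A"] + np.array([3.0, -1.0])      # displacement accumulated en route
relations = {k: nodes["D"] - v for k, v in nodes.items() if k != "D"}
print(relations)
```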

Not all graphs, however, can be path-integrated, since consistent actions do not always exist across graphs (for instance, social networks merely describe generic relationships).

To do path integration, continuous attractor neural networks (CANNs) receive a velocity input a. The neural dynamics are given by: image

With an appropriate set of weights, CANNs path integrate, with different cell classes (head-direction cells, place cells, grid cells) modeled with different weights (can they not be unified in one single framework?). Remarkably, CANNs really exist in nature; attractor manifolds have been found in rodents.
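
A minimal sketch of generic rate-based CANN dynamics with a velocity input (the exact form and, crucially, the recurrent weights W differ between published models and are hand-tuned rather than learned):

```python
import numpy as np

# Generic continuous-attractor update with velocity input a (assumed form for illustration).
def cann_step(h, a, W, B, dt=0.1, tau=1.0):
    """h: neural activity; a: velocity input; W: recurrent weights; B: velocity coupling."""
    f = np.maximum(h, 0.0)                     # rectified firing rates
    dh = (-h + W @ f + B @ a) / tau            # leaky dynamics + recurrence + velocity drive
    return h + dt * dh

n = 64
rng = np.random.default_rng(0)
h = rng.normal(scale=0.1, size=n)              # initial activity
W = rng.normal(scale=0.05, size=(n, n))        # placeholder weights: hand-tuned in real CANNs
B = rng.normal(scale=0.05, size=(n, 2))
h = cann_step(h, a=np.array([1.0, 0.0]), W=W, B=B)
```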

Other path-integrating models exist. Velocity-coupled oscillators (VCOs) suggest path integration (along an axis) via interference between theta oscillations and velocity-dependent dendritic oscillations, with their phase difference indicating path-integrated distance along an axis (this looks like a plane wave!). Here, grid cells are the sum of three such neurons with preferred axes at π/3 relative angles.

One major limitation of CANNs and VCOs, however, is that the weights of the recurrent weight matrix, W, are carefully selected and not learned from sensory experience. Nevertheless, it is easy enough to set up path integration as a learning problem via predicting observations x: path integrate the latent state variable z and then predict observations x from the latent states: image
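
A minimal sketch of this generative setup (all names and shapes are hypothetical): an action-dependent transition path-integrates the latent state z, and a decoder predicts the observation x; training would adjust both mappings to minimize prediction error.

```python
import numpy as np

rng = np.random.default_rng(1)
n_z, n_x, n_a = 16, 8, 2
W_a = rng.normal(scale=0.1, size=(n_a, n_z, n_z))    # one transition matrix per action
W_out = rng.normal(scale=0.1, size=(n_x, n_z))       # decoder from latent state to observation

def step(z, action):
    z_next = np.tanh(W_a[action] @ z)                # path integrate the latent state
    x_pred = W_out @ z_next                          # predict the observation
    return z_next, x_pred

z = np.zeros(n_z)
z, x_pred = step(z, action=0)
# Training (not shown) would adjust W_a and W_out to minimize ||x_pred - x_true||^2.
```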

Neural units in these models form periodic representations, but these are often amorphous or 4-fold symmetric (square) grids rather than hexagonal ones. image

Sorscher et al. 2019, however, demonstrated that the 4- to 6-fold symmetry transition is governed by a single property: a third-order regularization term on grid cells, like the following regularization loss term?

import tensorflow as tf
# g_inf: list of inferred grid-cell activation tensors (assumed shape [batch, n_grid])
L_reg_g3 = tf.reduce_sum(tf.stack([tf.reduce_sum(g ** 3, axis=1)
                                   for g in g_inf], axis=0), axis=0)

Indeed, this is easily implemented by the biological constraint of ensuring neural activity is positive.

NorbertZheng commented 2 years ago

Generalization

Generalization, or the transfer of knowledge from one situation to another, is the substrate of the profound behavioral flexibility exhibited by animals.

Generalizing with graphs, however, is hard, as it requires perfect alignment, which is NP-hard and thus impractical in most situations. Generalizing with periodic path-integration representations, on the other hand, is easy since all positions are treated equally, e.g. we bind rules, rather than the entire graph, to each part. This is generalization of relational knowledge.

What kinds of cells support generalization?

Spatial generalization, at least, seems to exist in entorhinal cortex and is consistent with path integration.

To actually make sensory predictions, however, you need to know more than just abstract knowledge. You need to know how it interacts with real world representations. One influential proposal is that hippocampal cells reflect this interaction, with abstract knowledge from MEC and sensory knowledge from LEC combined rapidly (fast-mapped) in hippocampus. image

We have seen models that build latent state representations, and models that path integrate. If these principles could be combined, we could build a powerful system that

Recall the probabilistic interpretation of path integration: image

Previously, p(z_t|z_{t-1},a_t) was fixed (corresponding to _gen_g), and so each abstract location z could only predict a single sensory observation x. If, instead, we had an address book (according to the index theory of hippocampus) of relational memories M, we could remember what is where in each environment, e.g. "what" did I see the last time I was "here".
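
A heavily simplified sketch of such an address book, in the spirit of TEM-style relational memory (not the actual TEM implementation): bind the abstract location code z to the sensory code x with a Hebbian outer product, then retrieve "what I saw here" by querying the memory with z.

```python
import numpy as np

n_z, n_x = 16, 8
rng = np.random.default_rng(2)
M = np.zeros((n_z, n_x))                         # relational memory ("address book")

z_here, x_here = rng.normal(size=n_z), rng.normal(size=n_x)
M += np.outer(z_here, x_here)                    # Hebbian binding: "what" is "here"

x_recalled = M.T @ z_here                        # later, query memory with the same z
print(np.corrcoef(x_recalled, x_here)[0, 1])     # close to 1 for a single stored pair
```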

While TEM and SMP are conceptually the same model, they have different implementations:

NorbertZheng commented 2 years ago

Novel interpretations, integrations, and predictions

These models often do so in seemingly divergent ways, and there are many neural phenomena that remain perplexing. Here we consider how these ideas can be integrated in order to model and understand cognitive maps at a deeper level, and offer novel accounts of several neural phenomena through a formal lens.

Non-spatial hippocampal cells are latent state representations for generalization

We have argued latent state representations serve two purposes:

These arguments suggest two things:

As a didactic example, the spatial alternation task can be "un-rolled" into a "big-loop" state space, which is the latent space for the task and de-aliases the common "trunk" section. This "big-loop", however, ignores spatial knowledge - understanding the big-loop alone does not let you know you are back in the same place - so to generalize spatial knowledge you additionally need a spatial representation. Hippocampal cells in this task indeed code for both space (place cells) and the big-loop (splitter cells).

image

NorbertZheng commented 2 years ago

Complementary maps in hippocampus and cortex

Does the hippocampus map space, or is its role one of memory?

The observation above offers a distinction that potentially unifies the hippocampus's roles in mapping and memory: it is easier to learn how to generalize if each (latent) state-space is already built - just like prediction error: once the association has been built, the prediction error will be backpropagated (maybe GFlowNet will help?). More precisely, should all states of the world be appropriately separated, and the relationships between states known, cortex can receive high-fidelity training signals (since predictions, e.g. the generative process of cortex, can be compared to a de-aliased state-space), thereby significantly reducing the burden of learning.

This means entirely novel sets of relationships can be efficiently learned as follows:

image

This proposal follows complementary learning systems theory, where cortex slowly learns the statistics of hippocampal episodes. We take note of an interesting model proposed by Evans et al. 2019 that, while not involving structural learning or generalization, leverages two independent systems for self-localization:

This integrated approach is realizable within the existing models. Since both TEM and CSCG utilize multiple "clone" hippocampal cells for each sensory observation, it is particularly easy to combine these models. This would be formulated as a TEM-like model, but where the hippocampus is predictive of future hippocampal states. Such an approach combines the best of both models - learning novel maps fast (CSCG), but also leveraging past knowledge to understand similarities between maps (TEM/SMP).

NorbertZheng commented 2 years ago

Cognitive maps and behavior

The models discussed here interact with behavior in different ways (including using eigenspaces for various behaviors).

The observation that grid cells resemble eigenvectors of place cells (or of the spatial transition matrix; if place cells use the successor representation, then of the SR matrix) has led to interesting suggestions about mechanisms for planning and exploration.

To plan the future, you need to look across multiple transitions. Eigenvectors simplify this problem because all multi-step transition matrices (e.g. the successor representation, which already contains a fixed weighting over numbers of steps) share the same eigenvectors.

Intuitively, this means these eigenvectors can be used for exploration, planning, sampling in replay, or any other type of multi-step navigation. Different sampling patterns differ only in the eigenvalue matrix Λ, as mentioned in #18 . By making a clever alternative choice of eigenvalue matrix (using a bespoke diagonal matrix ϒ rather than the diffusion eigenvalues, but still using the same eigenvectors: VϒV^Ts), very different strategies emerge, such as turbulence or super-diffusion, and these can be seen in rodent hippocampal replay.

It seems that an upstream brain area could modify the spectrum in MEC to generate hippocampal representations suitable for the current task. In fact, with another choice of weighting matrix, image you can exactly compute the SR under a diffusive policy (because the underlying transition matrix is diffusive), which is closely related to the distance between states, e.g. the probability corresponds to the distance between the current state and the goal state. This is particularly interesting, as when you have distances, planning is easy - just go to the neighboring state with the lowest distance to the goal. Of course, this is also useful for generalization, because it is policy-free, with no task bias at all.
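
A small numpy sketch of this point, using a toy 4-state ring with a diffusive (random-walk) transition matrix: multi-step propagation and the SR reuse the same eigenvectors and differ only in how the eigenvalues are re-weighted.

```python
import numpy as np

A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
T = A / A.sum(axis=1, keepdims=True)              # diffusive (random-walk) transitions

lam, V = np.linalg.eig(T)                         # T = V diag(lam) V^{-1}
V_inv = np.linalg.inv(V)

k_step = V @ np.diag(lam ** 3) @ V_inv            # 3-step diffusion: same V, powered eigenvalues
gamma = 0.9
SR = V @ np.diag(1.0 / (1.0 - gamma * lam)) @ V_inv   # SR under the diffusive policy

print(np.allclose(SR, np.linalg.inv(np.eye(4) - gamma * T)))   # True
```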

So far, we have been considering diffusive transition matrices, i.e. matrices without actions. However, by making transition matrices action-dependent (remember that path integration has action-dependent matrices too), we can play the same game as path integration. In space, at least, the transition matrices for different actions all have exactly the same eigenvectors, but different eigenvalues. Maybe this is suitable for all graphs? After all, the transition matrix is not limited. Hence, path integration can be reduced to successively adding the eigenvalues associated with each action, which is also a Bellman equation conditioned on the policy used in the following environment (it bootstraps!). This way of thinking unifies path integration with SR-like planning, e.g. learning so that we don't have to learn. Interestingly, it also brings different models of path integration into a common framework since, in this case, the eigenvectors are plane waves (not grids, as the transitions are unidirectional!) just like those required for VCOs, and the transition matrix is just like the weight matrices required for CANNs.

Maybe we generate superdiffusion because the distance between the current state and the target state has already been updated in the probability representation; this could be used to explain both the reverse replay after encountering a shiny object, as demonstrated in Eldar et al. 2020, and the superdiffusive behavior demonstrated in McNamee et al. 2021. image

NorbertZheng commented 2 years ago

Credit assignment through generalization and the interplay with striatal RL

RL typically assumes that the underlying state-space is fixed and that values are slowly assigned to these states. There is no requirement for state representations to be fixed, however (classical RL just focuses on the algorithm, not on the representation learning problem); they can change to better represent value. For example, after encountering a goal, goal-vector cells (GVCs) form - cells that are active at certain distances and directions from goals (is that an object vector cell?). image

This can be interpreted as a state representation augmentation (not only r(s,a)), with more information about direction (distance is already considered in classical RL via the discount factor γ); it is just a different representation of different information, though maybe a different representation of the same information would also help, as demonstrated in #14 . Importantly, since GVCs path integrate, once a single GVC forms at a goal, all others can be built for free as the animal navigates the map (this looks like the generalization of grid cells: two adjacent grid cells are always adjacent).
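
A toy illustration of "building GVCs for free" by path integration: once the goal-vector is zero at the goal, subtracting each subsequent action yields the correct vector back to the goal at every visited state.

```python
import numpy as np

goal_vector = np.array([0.0, 0.0])               # at the goal, the vector to the goal is zero
for action in [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 0.0])]:
    goal_vector -= action                        # moving away updates the vector "for free"
print(goal_vector)                               # vector pointing back to the goal: [-2, -1]
```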

Pre-learned goal-vector representations can be immediately composed with spatial representations to generate an accurate and flexible representation of any goal state; this is driven by a reward signal - how about composition of subgraphs, driven by prediction error? The only online role of the cognitive map is inferring which pre-learned, pre-credit-assigned representations to compose. This is credit assignment through generalization, and it is akin to meta-RL, since prior statistical knowledge (e.g. GVCs) can be integrated on-the-fly to solve novel tasks.

Where do these representations come from in the first place? The cognitive map models suggest that such representations can be learned from the statistics of behavior:

In general, to train these "pre-credit-assigned" compositional representations, cortex must learn from sequences of behavior (which are generated via classical RL, perhaps in the striatum). Understandably, initial striatal actions will be bad (when encountering entirely novel tasks), but as RL learns good policies, actions will be directed towards goals (which provide high-fidelity training signals). The cortico-hippocampal system can then learn compositional representations of these policies (e.g. GVCs) from the statistics of these sequences - an action-version of complementary learning systems - which relates to recent machine learning methods in offline RL. Here, sequence models learn the statistics of behavioral sequences from conventional RL algorithms, after which the sequence model can be used for planning in a manner analogous to planning by inference, as in Chen et al. 2021, Janner et al. 2021.

NorbertZheng commented 2 years ago

Replay: offline state-space construction

If behavior control in a new world is reduced to a state-space composition problem, it becomes important to construct state-spaces rapidly and accurately (offline seems good!), and to store them in memory so they can inform future decisions.

An appealing substrate for this composition is replay. For example, when an animal receives reward, it is important that all other states in the environment are aware of their relative location to the reward. Replay can path integrate away from the reward (that is exactly what TEM-OVC does: OVC-replay!), successively tying (composing) each new goal-vector cell to its respective hippocampal/cortical location (perhaps building landmark cells in hippocampus; this is a similar mechanism to simultaneous grid and place cell replay, but now used to instantiate rewarding policies instead of ensuring consistency between place and grid representations). After encountering a goal, we want the goal-vector representations to exist across all of space, and especially at any start locations. Replay trajectories provide an offline solution: path integrate (offline) GVCs and bind them (via memory) to important locations such as the start state. Now, should the animal return to a state, that state representation already "knows" about its relation to the reward. It is no longer necessary to hold all goal locations in mind, as the state-space composition is stored in memory. image

This idea relates to previous ideas from RL that cast replay as

However, in a generalization framework (outlined in the section above, e.g. TEM, SMP), these two computational processes are subsumed by the single process of composing state-spaces from pre-learnt bases (looks like TD-learning can be used widely, maybe GFlowNet will help?). To test this framework against data, it will be interesting to build a formal understanding of optimal replay patterns under these assumptions. Notably, it will make predictions about

NorbertZheng commented 2 years ago

When neural representations factorize

Spatial representations found in entorhinal cortex, such as grid cells, OVCs, and BVCs, are seemingly factorized, since they compositionally augment the entorhinal grid representation to represent different environment configurations. Recent evidence, however, has shown that grid cells warp towards consistently rewarded locations, as demonstrated in Boccara et al. 2018 and Butler et al. 2019. Factorized representations do not warp, since warping is an environment-specific phenomenon; warping around rewards does not transfer to different spatial configurations of rewards (a trade-off between generalization and maximization of reward - people are not purely rational!).

Specifically, there is a computational trade-off between using factorized compositional bases and using bespoke warped representations, e.g. a pressure to generalize versus precisely representing a single task. image

NorbertZheng commented 2 years ago

Open questions

The role of time in memory and cognitive maps

The discussion of cognitive map models so far assumes that learned representations remain stable over time. This clearly cannot be the case, due to representation drift, as mentioned in #5 .

But how can hippocampus maintain a stable representation of space if the cellular basis of this representation is drifting over time? Generalization models offer a natural solution as, here, hippocampal cells bind multiple factors of the input. Only one factor needs to change for the entire hippocampal representation to change. Representation drift, in this view, is just hippocampal remapping, but now it is not sensory observations or space that has changed, but time instead (this is weird - why time? space evolving with time makes more sense!). image

The hippocampus represents time through more than just drift. Pure "time cells", for example, emerge when rodents are required to stay still, or run on a wheel, for a particular duration in a task; maybe we can see this representation as a sensory input, like a clock interrupt in a computer.

NorbertZheng commented 2 years ago

Interacting levels of abstraction

The real power of abstraction comes when this process can happen repeatedly, so that abstractions can themselves lead to further abstractions, e.g. memory consolidation. The latent space and the corresponding transition rule (due to the configuration of actions) would not have generalized if the T-maze became a W-maze, so we need something fundamentally new in the models to account for this.

One intriguing possibility is that the different representations observed in fronto-temporal cortices might reflect such a factorization, with entorhinal representations grounded in interactions with the physical environment, while neurons in PFC represent abstract, task-related invariances, such as "location in task", e.g. PFC modulates MEC to form the grid characterization. Interestingly, though, the very same vectors can be reused whether it be the oven or the chopping board. This makes a prediction: vector cells that are contextually modulated depending on "location in task". image

NorbertZheng commented 2 years ago

From sequences to other domains of cognition

The models we have described translate the problem of building maps into problems of understanding the structure of possible sequences. This raises two interesting points: