emer / axon

Axon is a spiking, biologically-based neural model driven by predictive error-driven learning, for systems-level models of the brain
BSD 3-Clause "New" or "Revised" License
20 stars 7 forks source link
amygdala backpropagation basal-ganglia error-driven hippocampus neural-network neuroscience neuroscience-inspired-ai prefrontal-cortex spiking-neural-networks

Axon Neural Modeling Framework

Go Report Card GoDoc codecov CI

This is the Go implementation of the Axon algorithm for spiking, biologically based models of cognition, based on the emergent framework. Development of Axon is supported by the Obelisk project at https://astera.org/ and by collaborations with scientists at the University of California Davis, and other institutions around the world.

Axon is the spiking version of Leabra, with several advances. As a backcronym, axon could stand for Adaptive eXcitation Of Noise, reflecting the ability to learn using the power of error-backpropagation in the context of noisy spiking activation. The spiking function of the axon is what was previously missing from Leabra. Axon is used to develop large-scale systems-neuroscience models of the brain, i.e., Computational Cognitive Neuroscience, centered around the Rubicon model of goal-driven, motivated cognition.

Axon and emergent use the Cogent Core GUI framework. See install instructions there. Once those prerequisites are in place, then the simplest way to run a simulation is:

$ core run [platform]

where [platform] is optional (defaults to your local system), and can include android, ios and web!

See the ra25 example for a complete working example, which is intended to be a good starting point for creating your own models. The emergent wiki install page has a tutorial for how to create your own simulation starting from the ra25 example.

Current Status / News

Development

Release

Use core next-release to push a tag at the next patch increment, or core release v* for a specific version tag.

Design and Organization

Data Parallel

As of v1.8.0, data parallel processing of multiple input patterns in parallel using the same weights is supported, as detailed below. For models with simple "one step" independent inputs (i.e., no context required across trials -- iid), a single copy of the existing environment can be used, simply stepping through it in a for loop for each di data parallel index. For models with temporal context (e.g., all deep predictive models, rl, pvlv, pcore, boa), NData copies of the environment must be created and used in turn for each di. See the deep_* models and boa for examples. There is support in the emergent/env code for managing these environments. As usual, see examples/ra25 or other examples as relevant for specific implementation.

    net.SetMaxData(&ss.Context, ss.NData)
net.InitExt(ctx) // clear any existing inputs
for di := uint32(0); di < ctx.NData; di++ {
    ev.Step()
    for _, lnm := range lays {
        ly := ss.Net.AxonLayerByName(lnm)
        pats := ev.State(ly.Nm)
        if pats != nil {
            ly.ApplyExt(ctx, di, pats)
        }
    }
}
net.ApplyExts(ctx) // now required for GPU mode
    trls := int(math32.IntMultipleGE(25, float32(ss.NData))) // 25 nominal trials per epoch

    man.AddStack(etime.Train).AddTime(etime.Run, 5).AddTime(etime.Epoch, 200).AddTimeIncr(etime.Trial, trls, ss.NData).AddTime(etime.Cycle, 200)
    for _, m := range man.Stacks {
            m.Loops[etime.Cycle].OnEnd.InsertBefore("GUI:UpdateNetView", "GUI:CounterUpdate", func() {
                ss.NetViewCounters(etime.Cycle)
            })
            m.Loops[etime.Trial].OnEnd.InsertBefore("GUI:UpdateNetView", "GUI:CounterUpdate", func() {
                ss.NetViewCounters(etime.Trial)
            })
    }

This ensures that the current data index counters are displayed:

func (ss *Sim) NetViewCounters(tm etime.Times) {
    if ss.GUI.ViewUpdate.View == nil {
        return
    }
    di := ss.GUI.ViewUpdate.View.Di
    if tm == etime.Trial {
        ss.TrialStats(di) // get trial stats for current di
    }
    ss.StatCounters(di)
    ss.ViewUpdate.Text = ss.Stats.Print([]string{"Run", "Epoch", "Trial", "TrialName", "Cycle", "UnitErr", "TrlErr", "PhaseDiff"})
    // note: replace above with relevant counters -- from prior StatCounters
}
    case time == etime.Cycle:
        return
    case time == etime.Trial:
        trl := ss.Stats.Int("Trial")
        row = trl
        for di := 0; di < int(ctx.NData); di++ {
            ss.Stats.SetInt("Trial", trl+di)
            ss.TrialStats(di)
            ss.StatCounters(di)
            ss.Logs.LogRowDi(mode, time, row, di)
        }
        return // don't do reg

Overview of the Axon Algorithm

Axon is the spiking version of Leabra, which uses rate-code neurons instead of spiking. Like Leabra, Axon is intended to capture a middle ground between neuroscience, computation, and cognition, providing a computationally effective framework based directly on the biology, to understand how cognitive function emerges from the brain. See Computational Cognitive Neuroscience for a full textbook on the principles and many implemented models.

Pseudocode as a LaTeX doc for Paper Appendix

You can copy the markdown source of this README into a file, and run pandoc on it to convert to LaTeX (or other formats) for inclusion in a paper. As this page is always kept updated, it is best to regenerate from this source -- very easy:

curl "https://raw.githubusercontent.com/emer/axon/main/README.md" -o appendix.md
pandoc appendix.md -f gfm -t latex -o appendix.tex

You can then edit the resulting .tex file to only include the parts you want, etc.

Functional Advantages of Spikes

Aside from biological fidelity, does discrete spiking actually afford any computational / functional advantages over the kind of rate code activation used in Leabra and most abstract neural networks? Perhaps surprisingly, there isn't a particularly strong answer to this question in the broader field, and more work needs to be done to specifically test and document the differences below (we plan to create an Axoff package which is as identical as possible, just with rate codes instead of spikes, to help address this question here).

Meanwhile, based on extensive experience with Axon and Leabra, here are some likely advantages of spikes:

  1. Graded behavior with fast initial responding: Spiking networks can exhibit both very fast initial communication of information and finely graded, proportional responses. In Leabra, if you turn down the activation gain factor, you can get finely graded responses, but the network is slow to respond as activation takes a while to build and propagate, or you can get fast initial responses with high gain, but then it tends to exhibit strongly bimodal, non-graded behavior (especially in the context of bidirectional attractor dynamics). Simon Thorpe and colleagues have emphasized this point about the significant information value carried in the timing of the first wave of spikes (Thorpe et al., 1996). After this first wave, reasonable rates of subsequent spiking (max around 100 Hz or every 10 msec) can send a graded rate-code-like signal over time.

  2. Graded, permeable attractors: In rate-code networks, each neuron is continuously broadcasting its graded activation value on every time step, creating a "wall of activation" that is relatively impermeable to new signals. By contrast, in spiking networks, there are always gaps in the signal, which can allow new information to more easily penetrate and shape the ongoing "conversation" among neurons. One manifestation of this difference is that Leabra models typically require significant amounts of decay between trials, to allow new inputs to shape the network response, while Axon models do not. This difference is particularly notable in large, deep networks.

  3. Stochastic sampling: The thresholded nature of spiking creates natural variability where small differences in timing can be magnified chaos-like into larger differences in which neurons end up spiking (the first to spike generally inhibits other neurons), contributing to an observed high level of variability in spiking network responding approximated by the Poisson distribution. This is a "natural" form of stochasticity without requiring actual added random noise, and it may be that it ends up retaining more of the relevant underlying signal as a result. Specifically, attempts to add noise in Leabra never ended up improving network performance, presumably because it directly diluted the signal with noise. By contrast, the natural stochasticity in Axon networks appears to do a good job of probabilistically sampling noise distributions (McKee et al, 2021), and still allows high levels of performance (although consistently perfect responding is generally not observed; nor is it in people or other animals).

  4. Time is a real additional dimension: Spiking networks exhibit a number of additional time-dependent effects that are absent in rate-code models with their continuous communication, including synchrony, temporal summation, bursting vs. pausing, etc. A number of theories have speculated about the additional signaling capabilities of synchrony or bursting, e.g., as an additional attentional or binding factor. In Axon, we don't specifically build in any such mechanisms, and the relevant learning mechanisms required to leverage this additional dimension are not necessarily obviously present in the biology (see the kinase model and discussion). Nevertheless it is likely that there are important emergent temporal dynamics, and the version of the learning rule that does afford at least some sensitivity to coincident neural firing does actually work better in practice, so more work needs to be done to understand these issues in the context of the Axon model.

Activation: AdEx Conductance-based Spiking

Temporal and Spatial Dynamics of Dendritic Integration

A key dividing line in biological realism of neural models concerns the inclusion of separate dynamics for dendrites versus the soma, with a considerable literature arguing that significant computational functionality arises from nonlinear dynamics in the dendritic integration process. AdEx is a single compartment "point neuron" model (soma only), and obviously there is a major tradeoff in computational cost associated with modeling dendritic dynamics within individual neurons in any detail. In Axon, we have taken a middle ground (as usual), by including a separate dendritic membrane potential VmDend that better reflects the dynamics of depolarization in the dendrites, relative to the standard AdEx Vm which reflects full integration in the soma. Voltage-gated channels localized in the dendrites, including NMDA and GABA-B, are driven by this VmDend, and doing so results in significantly better performance vs. using the somatic Vm.

Furthermore, synaptic inputs are integrated first by separate pathways, and then integrated into the full somatic conductances, and thus it is possible to implement nonlinear interactions among the different dendritic branches where these different pathways may be organized. This is done specifically in the MSN (medium spiny neurons) of the basal ganglia in the pcore algorithm, and in the PT (pyramidal tract, layer 5IB intrinsic bursting) neurons also implemented in pcore. As noted above, each pathway is also subject to different scaling factors, which while still linear, is critical for enabling models to function properly (e.g., top-down pathways must in general be significantly weaker than bottom-up pathways, to keep the models from hallucinating). These ways of capturing dendritic dynamics probably capture a reasonable proportion of the relevant functional properties in the biology, but more work with direct comparisons with fully detailed compartmental models is necessary to understand these issues better. See the Appendix: Dendritic Dynamics for more discussion.

Inhibitory Competition Function Simulating Effects of Interneurons

The pyramidal cells of the neocortex that are the main target of axon models only send excitatory glutamatergic signals via positive-only discrete spiking communication, and are bidirectionally connected. With all this excitation, it is essential to have pooled inhibition to balance things out and prevent runaway excitatory feedback loops. Inhibitory competition provides many computational benefits for reducing the dimensionality of the neural representations (i.e., sparse distributed representations) and restricting learning to only a small subset of neurons, as discussed extensively in the Comp Cog Neuro textbook. It is likely that the combination of positive-only weights and spiking activations, along with inhibitory competition, is essential for enabling axon to learn in large, deep networks, where more abstract, unconstrained algorithms like the Boltzmann machine fail to scale (paper TBD).

Inhibition is provided in the neocortex primarily by the fast-spiking parvalbumin positive (PV+) and slower-acting somatostatin positive (SST+) inhibitory interneurons in the cortex (Cardin, 2018). Instead of explicitly simulating these neurons, a key simplification in Leabra that eliminated many difficult-to-tune parameters and made the models much more robust overall was the use of a summary inhibitory function. This function directly computes a pooled inhibitory conductance Gi as a function of the feedforward (FF) excitation coming into a Pool of neurons, along with feedback (FB) from the activity level within the pool. Fortuitously, this same FFFB function works well with spiking as well as rate code activations, but it has some biologically implausible properties, and also at a computational level requires multiple layer-level for loops that interfere with full parallelization of the code.

Thus, we are now using the FS-FFFB fast & slow FFFB function that more explicitly captures the contributions of the PV+ and SST+ interneurons, and is based directly on FF and FB spikes, without requiring access to the internal Ge and Act rate-code variables in each neuron. See above link for more info. This function works even better overall than the original FFFB, in addition to providing a much more direct mapping onto the underlying biology.

While most layers and models use the FS-FFFB approximation, Axon does support explicit modelling of inhibitory neurons by using Path.Type = InhibPath.

See the examples/inhib model (from the CCN textbook originally) for an exploration of the basic excitatory and inhibitory dynamics in these models, comparing interneurons with FS-FFFB.

Kinase-based, Trace-enabled Error-backpropagation Learning

A defining feature of Leabra, and Axon, is that learning is error driven, using a temporal difference to represent the error signal as a difference in two states of activity over time: minus (prediction) then plus (outcome). This form of error-driven learning is biologically plausible, by virtue of the use of bidirectional connectivity to convey error signals throughout the network. Any temporal difference arising anywhere in the network can propagate differences throughout the network -- mathematically approximating error backpropagation error gradients (O'Reilly, 1996).

In the original Leabra (and early Axon) formulation, a version of the simple, elegant contrastive hebbian learning (CHL) rule was used:

$$ dW = (x^+ y^+) - (x^- y^-) $$

where the weight change $dW$ is proportional to difference of two hebbian-like products of sending (x) and receiving (y) activity, between the plus and minus phase.

Axon now uses a different formulation of error-driven learning, that accomplishes three major objectives relative to CHL:

  1. Enabling greater sensitivity to small error gradients that can accumulate over time, by computing the error in part based on linear net-input terms instead of the highly non-linear activation terms in CHL (representing something like a spiking rate).
  2. Supporting a temporally extended eligibility trace factor that provides a biologically plausible way of approximating the computationally powerful backprop-through-time (BPTT) algorithm (Bellec et al, 2020).
  3. More directly connecting to the underlying biochemical mechanisms that drive synaptic changes, in terms of calcium-activated kinases, as explored in the more biophysically detailed kinase model.

The detailed derivation of this kinase trace learning mechanism is provided in the Appendix: Kinase-Trace Learning Rule Derivation, and summarized here. The algorithm is presented first at an abstract mathematical level, and then in terms of the underlying biological mechanisms that actually implement it.

Also, deep_{net.go, layer.go, paths.go} implement the extra anatomically motivated mechanisms for predictive error-driven learning OReilly et al., 2021, where the minus phase represents a prediction and the plus phase represents the actual outcome. Biologically, we hypothesize that the two pathways of connectivity into the Pulvinar nucleus of the thalamus convey a top-down prediction and a bottom-up ground-truth outcome, respectively, providing an abundant source of error signals without requiring an explicit teacher.

Error Gradient and Credit Assignment

Mathematically, error-driven learning has two components: the error gradient, which reflects the contribution of the receiving neuron to the overall network error, and the credit assignment factor that determines how much credit / blame for this error gradient to assign to each sending neuron:

dW = Error * Credit

In the simplest form of error-driven learning, the delta rule, these two terms are:

$$ dW = (y^+ - y^-) x $$

where $y^+$ is the target activity of the receiving neuron in the plus phase vs. its actual activity $y^-$ in the minus phase (this difference represents the error gradient), and $x$ is the sending neuron activity, which serves as the credit assignment. Thus, more active senders get more of the credit / blame (and completely inactive neurons escape any).

In the mathematics of backpropagation, with a rearrangement of terms as in (O'Reilly, 1996), the error gradient factor in a bidirectionally connected network can be computed as a function of the difference between net input like terms in the plus and minus phases, which are the dot product of sending activations times the weights:

$$ g = \sum_i w_i x_i $$

as:

$$ \text{Error} = g^+ - g^- $$

while the credit assignment factor is the sending activation times the derivative of the receiving activation:

$$ \text{Credit} = x_i y' $$

Error Gradient: Linear Net-Input and Nonlinear Activation

The actual form of error-gradient term used in Axon includes a contribution of the receiving unit activity in addition to the net-input term:

$$ \text{Error} = (g^+ + \gamma y^+) - (g^- + \gamma y^-) $$

where $\gamma$ (about .3 in effect by default) weights the contribution of receiving activity to the error term, relative to the net input.

The inclusion of the receiving activation, in deviation from the standard error backpropagation equations, is necessary for learning to work effectively in the context of inhibitory competition and sparse distributed representations, where neurons have a highly nonlinear activation dynamic, and most are not active at all, despite having significant excitatory net input. In particular, it is often the case that a given neuron will be out-competed (inhibited) in the plus phase by other neurons, despite having relatively consistent levels of excitatory input across both phases. By including the activations in the error signal factor, this change in receiving activity level will show up in the error term, and cause the weight changes to reflect the fact that this neuron may not be as useful for the current trial. On the other hand, using purely activation-based terms in the error signal as in the CHL version above makes the learning too nonlinear and loses the beneficial graded nature of the linear net-input based gradient, where small error gradients can accumulate over time to continually drive learning. In models using CHL, learning would often just get "stuck" at a given intermediate level. Thus, as is often the case, a balanced approach combining both factors works best.

It is notable that the standard backpropagation equations do not include a contribution of the receiving activity in the error signal factor, and thus drive every unit to learn based on linearly re-weighted versions of the same overall error gradient signal. However, a critical difference between backprop and Axon is that backprop nets rely extensively on negative synaptic weight values, which can thus change the sign of the error-gradient factor. By contrast, the positive-only weights in Axon mean that the net-input factor is monotonically and positively related to the strength of the sending activations, resulting in much more homogenous error signals across the layer. Thus, the inclusion of the activation term in the error-signal computation can also be seen as a way of producing greater neural differentiation in the context of this positive-only weight constraint.

Biological basis

From a biological perspective, the combination of net-input and receiving activity terms reflects the two main sources of Ca in the receiving neuron's dendritic spines: NMDA and VGCCs (voltage-gated calcium channels). NMDA channels are driven by sending-neuron glutamate release as in the net-input factor, and VGCCs are driven exclusively by receiving neuron spiking activity. Although the NMDA channel also reflects receiving neuron membrane depolarization due to the need for magnesium ion unblocking, in practice there is often sufficient depolarization in the dendritic compartments, such that Ca influx is mostly a function of sending activity. Furthermore, the receiving depolarization factor in NMDA is also captured by the receiving activity term, which thus reflects both VGCC and the receiving activity dependent aspect of the NMDA channel.

In the Axon implementation, the NMDA and VGCC Ca influx is directly used, instead of the abstract equation above, providing an entirely biologically grounded basis for the learning mechanism. This Ca influx is integrated over multiple cascaded variables that reflect the biochemical steps from Ca to calcium calmodulin (CaM) to CaMKII (CaM kinase II), which is a central kinase involved in controlling synaptic plasticity bidirectionally (LTP = long-term potentiation and LTD = long-term depression; Coultrap & Bayer, 2012).

The critical subtraction necessary to extract the error gradient from the temporal difference in this Ca signal over time is hypothesized to arise from the competitive binding dynamics between CaMKII and another kinase called DAPK1 (death-associated protein kinase 1), which has been shown to drive LTD when it out-competes CaMKII for binding to a particular location (N2B) on the NMDA receptor (Goodell et al., 2017). Mathematically, if DAPK1 integrates at a slower rate than CaMKII, this competitive dynamic computes a temporal derivative exactly as required for the error gradient equation. For example, when Ca increases in the later plus phase relative to the earlier minus phase, then the faster CaMKII responds more quickly, and thus out-competes DAPK1 and drives LTP (i.e., the temporal derivative is positive). Conversely, if Ca decreases in the plus vs. minus phase, DAPK1 remains more activated than CaMKII, and LTD ensues. When Ca is consistent over time, the two play to a draw, and the synaptic weight remains the same. For more details, see the biophysically detailed kinase model. This model also shows how a fully continuous-time mechanism with no artificial knowledge of the theta cycle can drive appropriate error-driven learning.

Critically, there is now direct evidence supporting the idea that pyramidal neuron synapses actually follow this temporal difference dynamic. In the experiment in Jang et al., 2023, neurons were driven pre and postsynaptically with patterns of minus -- plus phase activity (e.g., 25 Hz then 50 Hz) and changes in synaptic efficacy recorded. The results showed all of the qualitative patterns for the temporal difference mechanism: LTP for increasing minus-plus, LTD for decreasing, and no change for consistent activity across a 200 msec theta cycle. These results hold despite differences in overall firing rates.

Credit Assignment: Temporal Eligibility Trace

The extra mathematical steps taken in (O'Reilly, 1996) to get from backpropagation to the CHL algorithm end up eliminating the factorization of the learning rule into clear Error vs. Credit terms. While this produces a nice simple equation, it makes it essentially impossible to apply the results of Bellec et al., (2020), who showed that backprop-through-time can be approximated through the use of a credit-assignment term that serves as a kind of temporal eligibility trace, integrating sender-times-receiver activity over a window of prior time steps.

By adopting the above form of error gradient term, we can now also adopt this trace-based credit assignment term as:

$$ \text{Credit} = \langle x y \rangle_t y' $$

where the angle-bracket expression indicates an exponentially weighted moving average (EWMA) of the sender and receiver activation product.

The most computationally effective form of learning goes one step further in computing this credit-assignment trace factor, by integrating spike-driven activity traces (representing calcium currents) within a given theta cycle of activity, in addition to integrating across multiple such theta-cycle "trials" of activity for the eligibility trace. This is significantly more computationally expensive, as it requires synapse-level integration of the calcium currents on the millisecond-level timescale (a highly optimized form of computation is used so updates occur only at the time of spiking, resulting in a roughly 2x increase in computational cost overall).

The final term in the credit assignment factor is the derivative of the receiving activation function, $y'$, which would be rather difficult to compute exactly for the actual AdEx spiking dynamics used in Axon. Fortunately, using the derivative of a sigmoid-shaped logistic function works very well in practice, and captures the essential functional logic for this derivative: learning should be maximized when the receiving neuron is in its most sensitive part of its activation function (i.e., when the derivative is the highest), and minimized when it is basically "pinned" against either the upper or lower extremes. Specifically the derivative of the logistic is:

$$ y' = y (1-y) $$

which is maximal at y = .5 and zero at either 0 or 1. In Axon, this is computed using a time-integrated spike-driven Ca-like term (CaSpkD), with the max value across the layer used instead of the fixed 1 constant. In addition, it is useful to use an additional factor that reflects the normalized difference in receiving spiking across the minus and plus phase, which can be thought of as an empirical measure of the sensitivity of the receiving neuron to changes over time:

$$ y' = y (1-y) \frac{y^+ - y^-}{\text{MAX}(y^+, y^-)} $$

Stabilization and Rescaling Mechanisms

A collection of biologically motivated mechanisms are used to provide a stronger "backbone" or "spine" for the otherwise somewhat "squishy" learning that emerges from the above error-driven learning mechanisms, serving to stabilize learning over longer time scales, and prevent parasitic positive feedback loops that otherwise plague these bidirectionally connected networks. These positive feedback loops emerge because the networks tend to settle into stable attractor states due to the bidirectional, generally symmetric connectivity, and there is a tendency for a few such states to get broader and broader, capturing more and more of the "representational space". The credit assignment process, which is based on activation, contributes to this "rich get richer" dynamic where the most active neurons experience the greatest weight changes. We colloquially refer to this as the "hog unit" problem, where a small number of units start to hog the representational space, and it represents a major practical barrier to effective learning if not managed properly. Note that this problem does not arise in the vast majority of purely feedforward networks used in the broader neural network field, which do not exhibit attractor dynamics. However, this kind of phenomenon is problematic in other frameworks with the potential for such positive feedback loops, such as on-policy reinforcement learning or generative adversarial networks.

Metaphorically, various forms of equalizing taxation and wealth redistribution are required to level the playing field. The set of stabilizing, anti-hog mechanisms in Axon include:

  1. SWt: structural, slowly adapting weights. In addition to the usual learning weights driven by the above equations, we introduce a much more slowly adapting, multiplicative SWt that represents the biological properties of the dendritic spine -- these SWts "literally" give the model a spine! Spines are structural complexes where all the synaptic machinery is organized, and they slowly grow and shrink via genetically controlled, activity-dependent protein remodeling processes, primarily involving the actin fibers also found in muscles. A significant amount of spine remodeling takes place during sleep -- so the SWt updating represents a simple model of sleep effects.

    The SWt is multiplicative in the sense that larger vs. smaller spines provide more or less room for the AMPA receptors that constitute the adaptive weight value. The net effect is that the more rapid trial-by-trial weight changes are constrained by this more slowly adapting multiplicative factor, preventing more extreme changes. Furthermore, the SWt values are constrained by a zero-sum dynamic relative to the set of receiving connections into a given neuron, preventing the neuron from increasing all of its weights higher and hogging the space. The SWt is also initialized with all of the randomness associated with the initial weights, and preserving this source of random variation, preventing weights from becoming too self-similar, is another important function.

  2. Target activity levels: There is extensive evidence from Gina Turrigiano and collaborators, among others, that synapses are homeostatically rescaled to maintain target levels of overall activity, which vary across individual neurons (e.g., Torrado Pacheco et al., 2021). Axon simulates this process, at the same slower timescale as updating the SWts (likewise associated with sleep), which are also involved in the rescaling process. The target activity levels can also slowly adapt over time, similar to an adaptive bias weight that absorbs the "DC" component of the learning signal in Schraudolph (1998), but this adaptation is typically subject to a zero-sum constraint, so any increase in activity in one neuron must be compensated for by reductions elsewhere.

    This is similar to a major function performed by the BCM learning algorithm in the Leabra framework -- by moving this mechanism into a longer time-scale outer-loop mechanism (consistent with Turrigiano's data), it worked much more effectively. By contrast, the BCM learning ended up interfering with the error-driven learning signal, and required relatively quick time-constants to adapt responsively as a neuron's activity started to change.

  3. Zero-sum weight changes: In some cases it can also be useful to constrain the faster error-driven weight changes to be zero-sum, which is supported by an optional parameter. This zero-sum logic was nicely articulated by Schraudolph (1998), and is implemented in the widely used ResNet models.

  4. Soft bounding and contrast enhancement: To keep individual weight magnitudes bounded, we use a standard exponential-approach "soft bounding" dynamic (increases are multiplied by $1 - \text{weight}$; decreases by $\text{weight}$). In addition, as developed in the Leabra model, it is useful to add a contrast enhancement mechanism to counteract the compressive effects of this soft bounding, so that effective weights span the full range of weight values.

Axon Algorithm Equations

The pseudocode for Axon is given here, showing exactly how the pieces of the algorithm fit together, using the equations and variables from the actual code. Optimizations and special cases are omitted -- for that level of detail, the actual code is the definitive reference. For example, the time constants Tau are shown using division but in the code a pre-computed Dt = 1/Tau is used because multiplication is generally significantly faster than division.

Timing and Organization of Computation

Axon is organized around a 200 msec theta cycle (5 Hz), which is perhaps not coincidently the modal peak for the duration of an eye fixation, and can be thought of as two 100 msec alpha cycles, which together comprise the minimal unit for predictive error driven learning according to the deep predictive learning framework. Note that Leabra worked well with just a 100 msec alpha cycle, but it has not been possible to get the temporal difference error-driven learning mechanism to work at that scale with spiking in Axon, while it works very well at 200 msec. The time scales are:

Variables

Neuron

The axon.Neuron struct contains all the neuron (unit) level variables, and the axon.Layer contains a simple Go slice of these variables.

Spiking, Activation

Major conductances, Vm

Calcium for learning

Stats, aggregate values

Long-term average activation, set point for synaptic scaling

ISI for computing rate-code activation

Noise

Ge, Gi integration

SST somatostatin inhibition factors

AHP channels: Mahp, Sahp, Gkna

NMDA channels

GABA channels

VGCC voltage gated calcium channels

SKCa small conductance calcium-gated potassium channels

Special layer type variables

Inhib Pools

There is always at least one axon.Pool pool of neurons over which inhibition is computed, and there can also be pools for subsets of neurons that correspond to hypercolumns (or max pooling in a convolutional neural network), and support more local inhibitory dynamics.

Synapse

Neurons are connected via synapses parameterized with the following variables, contained in the axon.Synapse struct. The axon.Path contains all of the synaptic connections for all the neurons across a given layer -- there are no Neuron-level data structures.

Activation Update Cycle (every 1 msec): Ge, Gi, Vm, Spike

The axon.Network CycleImpl method in axon/network.go calls the following functions in order:

All of the relevant parameters and most of the equations are in the axon/act.go, axon/inhib.go, and axon/learn.go which correspond to the Act, Inhib and Learn fields in the Layer struct. Default values of parameters are shown in comments below.

PathGatherSpikes: for each Path

Path.GValues integrates two synaptic conductance values G per receiving neuron index, using Ge time constants for excitatory synapses, and Gi for inhibitory. These values were sent before in SendSpike and stored in the Path.GBuf slice (see below). In principle synaptic conductance is computed using a standard alpha function double-exponential with one time constant for the rise and another for the decay. In practice, the rise time is < 1 msec and thus it is simpler to just add the new raw and decay (decay tau = 5 msec):

GiFmSpikes: for each Layer

Layer.Pool[*].Inhib pools are updated based on FFsRaw and FBsRaw which are accumulated during SendSpike

Normalize raw values:

Fast spiking (FS) PV from FFs and FBs, with decay:

Slow spiking (SS) SST from FBs only, with facilitation factor SSf:

Overall inhibition:

CycleNeuron

There are three major steps here: GInteg, SpikeFmG, and CaFmSpike, the last of which updates Ca values used in learning.

The logic of spike communication is complicated by the presence of synaptic delays, in Path.Com.Delay, such that spikes are accumulated into one place on a ring buffer that is organized by recv neuron and then delay per each neuron, while being read out of another location in this buffer:

GBuf: [RecvNeuron][MaxDelay]

RN: 0     1     2         <- recv neuron indexes
DI: 0 1 2 0 1 2 0 1 2     <- delay indexes
C0: ^ v                   <- cycle 0, ring index: ^ = store, v = read
C1:   ^ v                 <- cycle 1, shift over by 1 -- overwrite last read
C2: v   ^                 <- cycle 2: read out value stored on C0 -- index wraps around

Because there are so few neurons spiking at any time, it is very efficient to use a sender-based dynamic to write spikes only for senders that spiked -- this is what the SendSpike method does. However, multiple senders could be trying to write to the same place in the GBuf. We have a GBuf per each pathway, so that means that threading can only be pathway-parallel (fairly coarse-grained). For the GPU, an atomic add operation is used to aggregate to GBuf, operating with sending neuron-level parallelism.

The first-pass reading of recv spikes happens in PathGatherSpikes at the Path level, and it must always operate on all neurons (dense computation). It iterates over recv neurons and accumulates the read-out value from GBuf based on synaptic delay, into the GValues, which grabs a GRaw value from GBuf current read position, and then does temporal integration of this value into GSyn which represents the synaptic conductance with exponential decay (and immediate rise -- nominally an alpha function with exponential rise but that rise time is below the 1 msec resolution).

Finally, NeuronGatherSpikes iterates over all recv neurons, and gathers the GRaw and GSyn values across the RecvPaths into the relevant Neuron-level variables: GeRaw, GeSyn; GiRaw, GiSyn; GModRaw, GModSyn for the three different types of pathways: ExcitatoryG, InhibitoryG, and ModulatoryG.

TODO: inhibition!

GInteg: Integrate G*Raw and G*Syn from Recv Paths, other G*s

Iterates over Recv Paths:

Then all the special conductances:

And add in the pool inhib Gi computed above.

External (Ext) Input from Clamped layers

For "visible" layers that are driven by external input (InputLayer, TargetLayer), their input comes from the Ext neuron value (set via ApplyInputs method), which drives the GeExt variable as a temporally integrated excitatory synaptic conductance value. TargetLayers are driven by regular synaptic inputs during the minus phase, but then only by Ext input during the plus phase. Likewise, the PulvinarLayer in the Deep predictive learning framework is driven by driver inputs in the plus phase.

It is important that the FS-FFFB inhibition function properly adjusts for the Target-like layers, so that inhibition is computed on all the synaptic input in the minus phase, but only on the GeExt-based input in the plus phase. The Clamped flag on the fsfffb.Inhib state makes this happen.

Importantly, if GeExt is weak (less than fsfffb ClampExtMin), then all the synaptic input is used, so that the plus phase activity matches the activity in the minus phase.

SpikeFmG: Compute Vm and Spikes from all the G's

Vm is incremented by the net current Inet summing all the conductances and an exponential factor capturing the Hodgkin Huxley spiking dynamics:

In the Axon implementation, 2 smaller steps are taken in integrating the Vm (i.e., .5 msec resolution), and a midpoint value is used in computing the exponential factor, to avoid numerical instability while maintaining a 1 msec overall update rate.

If the neuron has just spiked within the Tr refractory time window (3 msec default), then Vm just decays toward a refractory potential VmR (0.3 = rest potential) with a time constant of RTau (1.6667 msec), with the last step constrained to reach VmR exactly.

VmDend is updated in the same way as Vm, except that the exponential term is muted by an additional factor of GbarExp (0.2), and it is still updated by conductances during the refractory period, with an additional leak conductance of GbarR (3) added to drive the potential downward toward the resting potential. Thus, consistent with detailed compartmental models and electrophysiological data, VmDend exhibits more sustained depolarization, which keeps the NMDA and GABA-B currents more stabily activated, as shown in the Appendix: Dendritic Dynamics.

VmDend also has an additional contribution from the SSGi slow-spiking inhibition (2 * SSGi by default), reflecting the fact that the SST+ neurons target the distal dendrites. This is important functionally to counter a positive feedback loop from NMDA channels, as discussed here: FS-FFFB.

Spike is set to 1 when Vm > ExpThr (0.9, for default case where Exp function is being used), and the ISI (inter-spike-interval) counter, and time-averaged ISIAvg are updated:

If the neuron did not spike, then ISI++ is incremented.

CaFmSpike: CaLrn (NMDA + VGCC) and Simple Spike-driven Ca Signals

The core Ca calcium value that drives the trace - kinase learning rule is stored in the CaLrn neuron variable, as a sum of NMDA and VGCC calcium influx:

Where Norm (80) renormalizes the concentration-based factors to a range that works well for learning.

In larger networks, directly using the calcium flux from the VGCC channel (VgccCa) works well, but in smaller networks it typically works better to use a simpler approximation to the VGCC Ca that purely reflects spiking rate and not the other voltage-related factors that affect Ca in the actual channels. In either case, VgccCa is EWMA of with a factor reflecting buffering and diffusion of these highly transient Ca signals driven by spiking:

    if SpkVGCC {
        VgccCa = SpkVgccCa * Spike   // SpkVgccCa = 35
    }
    VgccCaInt += VgccCa - VgccCaInt / VgccTau  // VgccTau = 10 msec

This immediate CaLrn value is then subject to multiple levels of additional integration processes, reflecting the CaM calmodulin -> CaMKII -> DAPK1 cascades, into the CaM, CaP and CaD variables. The same time constants are used for as smoothing factors for these processes across various different variables, and are defined in the kinase package, as follows:

The cascading update looks like this:

    CaM += (CaLrn - CaM) / MTau
    CaP += (CaM - CaP) / PTau
    CaD += (CaP - CaD) / DTau

As shown below for the DWt function, the Error gradient is:

In addition, the Ca trace used for synaptic-level integration for the trace-based Credit assignment factor (see SynCaSend and SynCaRecv below) is updated with a smoothing factor of 1/SynTau (1/30 msec) which defines the time window over which pre * post synaptic activity interacts in updating the synaptic-level trace:

Finally, various peripheral aspects of learning (learning rate modulation, thresholds, etc) and some performance statistics use simple cascaded time-integrals of spike-driven Ca at the Neuron level, in the CaSpk variables. The initial Ca level from spiking is just multiplied by a gain factor:

The cascaded integration of these variables is:

    CaSpkM += (SpikeG * Spike - CaSpkM) / MTau
    CaSpkP += (CaSpkM - CaSpkP) / PTau
    CaSpkD += (CaSpkP - CaSpkD) / DTau

SendSpike

For each Neuron, if Spike != 0, then iterate over SendPaths for that layer, and for each sending Synapse, Path.GScale.Scale (computed pathway scaling, see Projection scaling) is multiplied by the synaptic weight Wt, and added into the GBuf buffer for each receiving neuron, at the ring index for Com.Delay (such that it will be added that many cycles later in PathGatherSpikes). The PIBuf for the inhibitory pool of each receiving neuron is also incremented.

This is expensive computationally because it requires traversing all of the synapses for each sending neuron, in a sparse manner due to the fact that few neurons are typically spiking at any given cycle.

SynCaSend, SynCaRecv

If synapse-level calcium (Ca) is being used for the trace Credit assignment factor in learning, then two pathway-level functions are called across all pathways, which are optimized to first filter by any sending neurons that have just spiked (SynCaSend) and then any receiving neurons that spiked (SynCaRecv) -- Ca only needs to be updated in these two cases. This major opmitimization is only possible when using the simplified purely spike-driven form of Ca as in the CaSpk vars above. Another optimization is to exclude any neurons for which CaSpkP and CaSpkD are below a low update threshold UpdateThr = 0.01.

After filtering, the basic cascaded integration shown above is performed on synapse-level variables where the immediate driving Ca value is the product of CaSyn on the recv and send neurons times a SpikeG gain factor:

The cascading is optimized to occur only at the time of spiking by looping over intervening time steps since last spike-driven update.

Learning: DWt, WtFmDWt

DWt

After 200 cycles of neuron updating per above, the synaptic weight changes DWt are computed according to the trace-kinase learning algorithm:

The Error gradient component of this weight change was shown above, in terms of the receiving neuron's CaP (CaMKII LTP) and CaD (DAPK1 LTD) kinase activity:

The Credit assignment component is the trace, based on the longest time-scale cascaded synaptic Ca value, CaD as updated in the above functions:

Along with a RLRate factor that represents the derivative of the receiving activation, which is updated for each neuron at the end of the plus phase prior to doing DWt:

Thus, the complete learning function is:

The soft weight bounding is applied at the time of computing the DWt, as a function of the Linear weight value LWt (explained below in WtFmDWt) as follows:

    if DWt > 0 {
        DWt *= (1 - LWt)
    } else {
        DWt *= LWt
    }

There are also alternative learning functions supported, that use neuron-level Ca instead of synaptic Ca (selected by the NeuronCa option) for the trace factor, and are thus significantly faster, but do not work as well in general, especially in large networks on challenging tasks.

WtFmDWt

Synaptic weights Wt are typically updated after every weight change, but multiple DWt changes can be added up in a mini-batch (often when doing data-parallel learning across multiple processors). With the SWt and contrast enhancement required to compensate for the soft weight bounding (which was also a long-time part of the Leabra algorithm), there are three different weight values at each synapse:

First, the LWt is updated from the DWt:

Then Wt is updated therefrom:

Thus, the SWt provides a multiplicative constraint on the weights, and the LWt drives a more extreme, contrast-enhanced value on the weights, which counteracts the compression created by the soft weight bounding.

SlowAdapt Updates: Target Activity Rescaling, SWt

Every SlowInterval (100) Trials, the SlowAdapt methods are called on all Layers and then Projections, which perform the following. These are essential constraints on learning that break the positive feedback loops while preserving effective error-driven learning.

Target vs. Average Activity

First, when the network is initialized, a TrgAvg value is assigned to each neuron by uniformly sampling within a range of target values (0.5 - 2.0) and permuting the values among the set of neurons. This target is then updated as a function of the receiving unit error-gradient, subject to a zero-sum constraint across the relevant Pool of neurons:

After every Trial, the neuron's actual average activation ActAvg is updated in an EWMA of manner:

Then, in SlowAdapt, the ActAvg values are normalized into a proportion relative to the averages within the Pool a neuron belongs to, and the difference between this and the TrgAvg recorded:

This AvgDif value then drives synaptic rescaling per below.

SWt Update

The SWt is updated from DSWt which is accumulated from all the ensuing DWt values, with soft bounding applied and zero-sum:

    if DSWt >= 0 {
        DSWt *= (SWt.Limit.Max - SWt)
    } else {
        DSWt *= (SWt - SWt.Limit.Min)
    }
    SWt += SWt.Adapt.LRate * (DSWt - AVG(DSWt)) // AVG over Recv synapses per Path
    LWt = SigInverse(Wt / SWt)   // inverse of sigmoid

The learning rate here is typically slow, on the order 0.001 or even lower in large networks.

The updating of LWt in this way preserves the current Wt value despite changes in the SWt multiplier -- in effect the LWt absorbs the changes in SWt to preserve the current Wt value.

Synaptic Rescaling

Finally, the LWt values are rescaled as a function of the AvgDif values reflecting the deviation in average activity relative to the target as computed above, using a soft-bounding update:

    if AvgDif > 0 {
        LWt += SynScaleRate * (1 - LWt) * AvgDif * SWt
    } else {
        LWt += SynScaleRate * LWt * AvgDif * SWt
    }
    Wt = SWt * Sigmoid(LWt)

This updates all the learned weights, and consequently the effective weights, moving in the direction to reduce the difference between the actual average activation and the target.

Projection scaling

The Ge and Gi synaptic conductances computed from a given pathway from one layer to the next reflect the number of receptors currently open and capable of passing current, which is a function of the activity of the sending layer, and total number of synapses. We use a set of equations to automatically normalize (rescale) these factors across different pathways, so that each pathway has roughly an equal influence on the receiving neuron, by default.

The most important factor to be mindful of for this automatic rescaling process is the expected activity level in a given sending layer. This is set initially to Layer.Inhib.ActAvg.Nominal, and adapted from there by the various other parameters in that Inhib.ActAvg struct. It is a good idea in general to set that Nominal value to a reasonable estimate of the proportion of activity you expect in the layer, and in very small networks, it is typically much better to just set the Fixed flag and keep this Nominal value as such, as otherwise the automatically computed averages can fluctuate significantly and thus create corresponding changes in input scaling. The default UseFirst flag tries to avoid the dependence on the Nominal values but sometimes the first value may not be very representative, so it is better to set Nominal and turn off UseFirst for more reliable performance.

Furthermore, we add two tunable parameters that further scale the overall conductance received from a given pathway (one in a relative way compared to other pathways, and the other a simple absolute multiplicative scaling factor). These are some of the most important parameters to configure in the model -- in particular the strength of top-down "back" pathways typically must be relatively weak compared to bottom-up forward pathways (e.g., a relative scaling factor of 0.1 or 0.2 relative to the forward pathways).

The scaling contributions of these two factors are:

Thus, all the Rel factors contribute in proportion to their relative value compared to the sum of all such factors across all receiving pathways into a layer, while Abs just multiplies directly.

In general, you want to adjust the Rel factors, to keep the total Ge and Gi levels relatively constant, while just shifting the relative contributions. In the relatively rare case where the overall Ge levels are too high or too low, you should adjust the Abs values to compensate.

Typically the Ge value should be between .5 and 1, to maintain a reasonably responsive neural response, and avoid numerical integration instabilities and saturation that can arise if the values get too high. You can record the Layer.Pools[0].Inhib.Ge.Avg and .Max values at the epoch level to see how these are looking -- this is especially important in large networks, and those with unusual, complex patterns of connectivity, where things might get out of whack.

Automatic Rescaling

Here are the relevant factors that are used to compute the automatic rescaling to take into account the expected activity level on the sending layer, and the number of connections in the pathway. The actual code is in layer.go GScaleFmAvgAct() and act.go SLayActScale.

This sc factor multiplies the GScale factor as computed above.

Important Stats

The only way to manage the complexity of large spiking nets is to develop advanced statistics that reveal what is going on, especially when things go wrong. These include:

Appendix: Specialized Algorithms: BG, PFC, DA, etc

This repository contains specialized additions to the core algorithm described above, which are implemented via specific LayerTypes and PathTypes, in _net.go, _layers.go, and _paths.go files:

Appendix: Kinase-Trace Learning Rule Derivation

To begin, the original GeneRec (OReilly, 1996) derivation of CHL (contrastive hebbian learning) from error backpropagation goes like this:

$$ \frac{\partial E}{\partial w}= \frac{\partial E}{\partial y} \frac{\partial y}{\partial g} \frac{\partial g}{\partial w}$$

where E is overall error, w is the weight, y is recv unit activity, g is recv conductance (net input), and x is sending activity. For a simple neural network:

$$ g = \sum x w $$

$$ y = f(g) $$

This chain rule turns into:

$$ dW = \frac{\partial E}{\partial w} = \left[ \left( \sum_i x_i^+ - \sum_i x_i^- \right )w \right] y' x = (g^+ - g^-) y' x$$

Thus, the Error factor is $(g^+ - g^-)$ and $y' x$ is the Credit factor. In words, the error signal is received by each unit in the form of their weighted net input from all other neurons -- the error is the temporal difference in this net input signal between the plus and minus phases. And the credit assignment factor is the sending unit activity x times the derivative of activation function.

The presence of this derivative is critical -- and has many tradeoffs embedded within it, as discussed later (e.g., the ReLU eliminates the derivative by using a mostly linear function, and thereby eliminates the vanishing gradient problem that otherwise occurs in sigmoidal activation functions).

The original GeneRec derivation of CHL mixes these factors by approximating the derivative of the activation function using the discrete difference in receiving activation state, such that:

$$ (g^+ - g^-) y' \approx y^+ - y^- $$

In the GeneRec derivation, the approximate midpoint integration method, and symmetry preservation, cause the terms to get mixed together with the sending activations, producing the CHL algorithm.

To derive the new trace-enabling rule, we avoid this mixing, and explore learning using the more separable Error Credit form. In practice, the key issue is on what variable is the temporal difference computed*: just using raw net input turns out to be too diffuse -- the units end up computing too similar of error gradients, and the credit assignment is not quite sufficient to separate them out.

In the Axon framework in particular, the weights are constrained to be positive, and especially at the start of learning, the net input terms are all fairly close in values across units. The lateral inhibition provides the critical differentiation so that only a subset of neurons are active, and thus having some contribution of the actual receiving activity is critical for a learning rule that ends up having different neurons specializing on different aspects of the problem. The relative lack of this kind of differential receiver-based credit assignment in backprop nets is a critical difference from the CHL learning rule -- in the GeneRec derivation, it arises from making the learning rule symmetric, so that the credit assignment factor includes both sides of the synapse.

In short, backprop is at one end of a continuum where the only credit assignment factor is presynaptic activity, and existing weights provide a "filter" through which the Error term is processed. At the other end is the symmetric CHL equation where pre post (xy*) is the credit assignment factor in effect, and the "trace" equation is somewhere in between.

Appendix: Neural data for parameters

See chans for good sources for many constants and equations.

Axonal conduction delays:

AMPA rise, decay times:

GABAa rise, decay times:

Appendix: Dendritic Dynamics

A series of models published around the year 2000 investigated the role of active dendritic channels on signal integration across different dendritic compartments (Migliore et al, 1999; Poirazi et al, 2003; Jarsky et al, 2005) -- see Spruston (2008), Poirazi & Papoutsi (2020) for reviews. A common conclusion was that the A-type K channel can be inactivated as a result of elevated Vm in the dendrite, driving a nonlinear gating-like interaction between dendritic inputs: when enough input comes in (e.g., from 2 different pathways), then the rest of the inputs are all integrated more-or-less linearly, but below this critical threshold, inputs are much more damped by the active A-type K channels. There are also other complications associated with VGCC L-type and T-type voltage-gated Ca channels which can drive Ca spikes, to amplify regular AMPA conductances, relative weakness and attenuation of active HH Na spiking channels in dendrites, and issues of where inhibition comes in, etc. See following figure from Spruston (2008) for a summary of some key traces:

Figure: Figure 5 from Spruston (2008), showing mutual dependence on two dendritic compartments for synaptic integration in panel a, and also significant differences in temporal duration of elevated Vm in dendrites vs. soma.

Here are some specific considerations and changes to capture some of these dynamics:

Figure: VmDend now has dynamics that reflect weaker trace of soma spiking -- initial results suggest this improves performance by better engaging the NMDA / GABAB channels.

TODO: GaoGrahamZhouEtAl20

References