floybix / comportex-archived

Private fork of Comportex

Substitution pooling experiments #1

Open robjfr opened 8 years ago

robjfr commented 8 years ago

Posting this in a third place! It struck me it makes most sense to discuss it close to the code.

@floybix as part of these substitution pooling experiments you've implemented a concatenated "context space" representation, which concatenates all sequences (or maybe you allow some decay??) into a single SDR.

As a first application you used this to give extended context (as I understand it) to input states via feedback.

And in these last commits you implement code to find overlaps between cells in different substrings in this concatenated "context space" representation.

Those are two applications of such a concatenated "context space" representation.

I now think full substitution pooling might be implemented in as few as two combined steps applied to such a representation (two steps: suggestive of oscillation in the cortex??)

Those steps would be, for a presented sequence:

1) Generalize columns based on cell states

2) Group columns by number of cell states

To make it concrete, in the context of your latest "overlap" commits:

You've identified overlaps (between strings or subsequences.) If we now merge overlapping sequences together based on these overlaps that would be something like my step 1).

Now, I'm not sure what it means to find overlaps out of a single undifferentiated sequence and merge them. I guess what that might mean in concrete terms is that we run over the columns of the concatenated "context space", and anywhere there are shared cells (your "overlap") you take the rest of the cells from one column and add them to the other (so if two columns have enough context similarity, we overlay (stack) all the contexts of each on the other.)

This is the "sensitivity" (thinking) step, because it varies as we decide whether the cells of two columns have "enough" similarity.
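A minimal sketch of step 1), using plain Python sets as stand-ins for columns and their cells. The column ids, cell numbers, and overlap threshold are all invented for illustration; this is not Comportex's actual representation:

```python
# Hypothetical sketch of step 1: wherever two columns share enough cells
# ("overlap"), overlay the rest of each column's cells onto the other.
# Column ids and cell numbers are invented for illustration.

def generalize_columns(columns, threshold):
    """columns: dict of column id -> set of context cells.
    Merge the cell sets of any two columns sharing >= threshold cells."""
    ids = list(columns)
    merged = {cid: set(cells) for cid, cells in columns.items()}
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if len(columns[a] & columns[b]) >= threshold:
                union = columns[a] | columns[b]
                merged[a] |= union
                merged[b] |= union
    return merged

# C1 and C2 share cells {1, 2}; with threshold 2 they stack their contexts.
cols = {"C1": {1, 2, 3}, "C2": {1, 2, 4}, "C3": {9}}
out = generalize_columns(cols, threshold=2)
# out["C1"] == out["C2"] == {1, 2, 3, 4}; out["C3"] is untouched.
```

Lowering `threshold` is the "sensitivity" knob: fewer shared cells count as "enough" similarity, so more columns get merged.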

For step 2) I currently think we would want to take an average of cells per column, as we follow paths through the columns of the concatenated context space (a little like our original "path counting"! But with context information not only kept, but enhanced by generalization this time.) I think this average should go down as more columns are added internal to a pooled state by step 1), so go down as more states are substituted/generalized. And the average should go up as we approach a state boundary/sequence division point, which by definition will be marked by the possibility for the (pooled) state to occur in many contexts.

We might do an experiment to plot this average as we trace all sequences through the "context space". My hunch is it should resolve itself into clear boundaries in some way (probably as highs in the average cells per column, averaged over all paths to a given point.)
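A toy version of the step 2) statistic might look like the following; the columns and cell sets are invented so that a state occurring in many contexts ("rat" here) carries more cells:

```python
# Toy sketch: count cells per column along one path through the
# concatenated context space; local maxima are candidate boundaries.
# The columns and cell sets below are invented for illustration.

def cells_per_column(path, columns):
    """path: ordered column ids; columns: column id -> set of cells."""
    return [len(columns[c]) for c in path]

def boundary_candidates(counts):
    """Indices that are local maxima - candidate segmentation points."""
    return [i for i in range(1, len(counts) - 1)
            if counts[i] > counts[i - 1] and counts[i] > counts[i + 1]]

columns = {"the": {1, 2}, "rat": {1, 2, 3, 4, 5}, "bit": {1, 2, 3}}
counts = cells_per_column(["the", "rat", "bit"], columns)  # [2, 5, 3]
# index 1 ("rat") is a local maximum, so it is a candidate boundary
```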

But step 2) would be basically a "decoding" step, in the sense of our previous discussions: identifying a sequence of states and unfolding it from the concatenated "context space" representation. (The only difference would be that now the context space representation will have been generalized by step 1) -- note: we might need another layer to put this generalization -- and so its "states", and the sequences they occur in, will have been generalized/simplified/enlarged, too.) You may have other insights how to perform "decoding" which make more sense than my "average cells states per column over all paths" state boundary criterion.

floybix commented 8 years ago

Rob,

I'm afraid I can't make sense of your description of merging overlapping sequences.

The following is how I imagine it.

There are two phases, let's call them observation and reflection. In the observation phase we build up the concatenated context (in a higher layer). In the reflection phase we shut off sensory input and allow the higher layer to drive the lower layer. If we seed the lower layer with a "start" signal (that we would always use to begin sequences), it should produce some kind of reformulation of the original sequence. Like what you have been calling "decoding".

However, my guess is it wouldn't work well to jump straight into producing a complete sequence; we should first allow small parts to be reformulated or generalised, thus clarifying / stabilising the context.

Here's where generalisation comes in. Thanks to our context layer, whenever cells come active in the lower layer they will learn to be predicted by their context (prior context). Consequently, since the whole context is active during the reflection phase, it will predict all cells -- interpretations -- that have ever appeared in some sufficient subset of that context. These interpretations can be ordered by predictive strength.
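The "predicted by any sufficient subset of context, ordered by predictive strength" idea could be sketched as follows; the cell names, learned contexts, and match threshold are all invented:

```python
# Sketch: a cell is predicted when its learned context overlaps the
# currently active (pooled) context by at least min_match bits; predictions
# are then ranked by overlap strength. All names/values are illustrative.

def rank_predictions(active_context, learned, min_match):
    """learned: cell -> set of context bits that predict it."""
    scored = [(cell, len(ctx & active_context))
              for cell, ctx in learned.items()]
    predicted = [(cell, s) for cell, s in scored if s >= min_match]
    return sorted(predicted, key=lambda cs: -cs[1])

active = {1, 2, 3, 4}
learned = {"ran": {1, 2, 3}, "squealed": {2, 3, 4, 5}, "cat": {9}}
preds = rank_predictions(active, learned, min_match=2)
# "ran" and "squealed" are both predicted (strength 3); "cat" is not.
```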

For everything I've said so far, we could have only a single cell per column of the higher context layer. Each higher column is driven by specific cells from the lower layer, not cells from the same column. This mapping could be a kind of compression/generalisation too. But I haven't thought it out.

Cutting this off now as arriving at Ocean Park. School holidays this week.

Felix Andrews / 安福立 http://www.neurofractal.org/felix/

robjfr commented 8 years ago

On Thu, Oct 22, 2015 at 12:06 PM, Felix Andrews notifications@github.com wrote:

Rob,

I'm afraid I can't make sense of your description of merging overlapping sequences.

The following is how I imagine it.

There are two phases, let's call them observation and reflection. In the observation phase we build up the concatenated context (in a higher layer). In the reflection phase we shut off sensory input and allow the higher layer to drive the lower layer. If we seed the lower layer with a "start" signal (that we would always use to begin sequences), it should produce some kind of reformulation of the original sequence. Like what you have been calling "decoding".

You could split it like that.

In that labelling I guess my Step 1) would actually be an extra process in the middle which generalizes according to partial context matches.

It is the "sensitivity" step. In the "context-spaces blog draft" thread, 13 Oct, I talked about it like this:

"Where it gets interesting is when we reduce the sensitivity. Then we will partition up the columns in the same way, but they will not be perfectly ordered. Say we have columns C_1 which are followed by XYZ, and others C_2 which are followed by XYM. Then while C_1 and C_2 would come in different places in the sequence in a full reconstruction, by lowering the sensitivity we might throw away the distinction of the final M, and have C_1 and C_2 both come in both positions, i.e. they would stack up beside each other as newly formed "clouds" just before both XYZ and XYM in the original string."

But instead of projecting out longer contexts in a CLA state, I'm thinking we can just use the "context-space" representation (which you use to give the states more context anyway.) Instead of comparing XYZ and XYM, we can compare all contexts for each state (actually each column of each state) directly in that representation.

And then, to effectively "stack" C_1 and C_2 (in my above description) all you would need to do would be to combine all their contexts in the concatenated space.

However, my guess is it wouldn't work well to jump straight into producing a complete sequence; we should first allow small parts to be reformulated or generalised first, thus clarifying / stabilising the context.

That's my Step 1) above.

Here's where generalisation comes in. Thanks to our context layer, whenever cells come active in the lower layer they will learn to be predicted by their context (prior context). Consequently, since the whole context is active during the reflection phase, it will predict all cells -- interpretations -- that have ever appeared in some sufficient subset of that context. These interpretations can be ordered by predictive strength.

I think I see that. But that is only generalization over the history of a single cell, right?? (And by association a single column, and a single state. Generalization extended to other states with overlapping columns at most, not states with overlapping contexts.)

The kind of generalization I'm thinking about is generalization between different states which share the same contexts = they can be substituted for each other. That will mean generalization between different states which share the same contexts, not the same state in different contexts (which is what I think your proposal above will reduce to.)

To get what I want (substitution pooling) I think at some point we will need to swap cell state information between columns, not just allow all the cells in a column to influence any decision.

For everything I've said so far, we could have only a single cell per column of the higher context layer. Each higher column is driven by specific cells from the lower layer, not cells from the same column. This mapping could be a kind of compression/generalisation too. But I haven't thought it out.

Cutting this off now as arriving at Ocean Park. School holidays this week.

I'm rushing too. Coolangatta today. Flying out v. early tomorrow morning for NZ... So there's probably logical errors in there. If you find them you've nailed me :-)

-R

robjfr commented 8 years ago

More musings on this.

On my Step 1):

I've been talking of cells and contexts interchangeably, but I think in practice the cells predicting a given context are different at each different occurrence (true??)

If so we might need to consolidate cells on the columns they predict, in order to get meaningful similarity measures.

On my Step 2):

To segment the concatenated (context space) representation generalized by Step 1), instead of following paths and averaging cells per column, we might want to go to the other extreme and simply try to split the entire sequence successively into halves on the strongest transition.

The "strength" of a transition might be measured by the variety of context cells mapping to a given column context (if the cells predicting a given context are different at each different occurrence, as above.)

I'll try to work through an example to make it more concrete.

I'm thinking of a specific example sentence:

"The rat the cat bit squealed."

We want a result which splits this sequence structurally into meaningful parts:

(The rat the cat bit squealed)
        /                   \
(The rat the cat bit)      (squealed)
     /              \
(The rat)    (the cat bit)
  /     \       /       \
(The) (rat)  (the cat)  (bit)
             /      \
           (the)  (cat)

By the proposition of substitution pooling, the basis for this split will be generalization according to substitutions from columns which tend to occur in similar contexts (Step 1).

In this case "the rat" should get lots of substitutions from things which occur in similar contexts (other animals), and "squealed" get lots of substitutions from other actions (which will occur in similar contexts.)

So, say the concatenated sequence "The rat the cat bit squealed" has columns with context cells for "rat". And searching over the concatenated representation for all sequences met until this point, those context cells (perhaps consolidated to the contexts they predict) should be similar to those for the columns of "mouse". Then we take the rest of the cells for "mouse" too, and overlay them on our concatenated sequence for "The rat the cat bit squealed".

Now we have context information for sequences with "mouse" overlaid on our sequence with "rat".

Some of those transition cells for mouse will likely be predictions of "squealed", or something else "squealed" has been generalized with.

If we do this with enough animals, I'm betting we will have a sequence with a very strong transition between the columns of "rat" and the columns of "squealed" (because those have inherited transitions for all other animals and actions, too.)

Having now generalized, we proceed to Step 2): find the columns with the greatest number of transition cells between them and split the sequence between these columns. These should be the columns for "rat" and "squealed" in this case. This gives us the first level split between (The rat ...) and (squealed).

It's essential to the method that when we perform this split, the columns of "the cat bit" come with those for "rat" (because conceptually they actually describe the "rat"). We might do that negatively by showing that the next strongest generalized transition is between columns contextually similar to "rat" and columns contextually similar to... "bit" (I'm guessing "rat" and "bit" have a greater number of transition cells than other pairs. E.g. we might be able to swap context cells from sequences like "the rat bit squealed", but we won't find *"the rat the cat squealed".) And then successively splitting on transition cells between "bit" and "the cat", and between "cat" and "the".

I think that works logically. We just have to figure how it might be implemented practically.

Finding columns which have similar context cells, and then overlaying those context cells seems straightforward enough (perhaps with the complication that we have to consolidate cells to the columns they predict.)

Splitting on the "strongest" transition within a (generalized) concatenated sequence might not be too hard either. It will depend on the details of the transition coding implementation.
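Assuming we could extract a strength for each adjacent transition, the top-down halving itself is simple to sketch. The strength numbers below are invented so that the strongest splits reproduce the example tree above:

```python
# Sketch of the top-down segmentation: recursively bisect the sequence at
# the transition with the greatest (generalized) strength. The strengths
# are invented; in the proposal they would come from counting transition
# cells between adjacent columns.

def split_tree(words, strengths):
    """words: tokens; strengths: len(words)-1 transition strengths."""
    if len(words) == 1:
        return words[0]
    i = max(range(len(strengths)), key=lambda k: strengths[k])
    return (split_tree(words[:i + 1], strengths[:i]),
            split_tree(words[i + 1:], strengths[i + 1:]))

words = ["the", "rat", "the", "cat", "bit", "squealed"]
# strengths[k] is the transition between words[k] and words[k+1]:
# bit->squealed strongest (5), then rat->the (3), then cat->bit (2).
strengths = [1, 3, 1, 2, 5]
tree = split_tree(words, strengths)
# ((("the", "rat"), (("the", "cat"), "bit")), "squealed")
```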

floybix commented 8 years ago

I continue to be confused by your use of the words "context", "cell" and "column".

It is helpful to work through a concrete example, so thanks for that.

Let's not get too far ahead of ourselves. I'm uncomfortable with speculation about the behaviour of complex systems; we know how surprising they are. Especially when distributed state and feedback is involved.

To start with I would like to see if we can in fact reproduce a given sequence from its higher-level "concatenated" representation. My initial attempts have failed. Investigating, it seems that predictions are dominated by feedback connections rather than by lateral (immediate-context) connections.

A solution to this might be to somehow weight local lateral connections higher than feedback connections, or limit the influence of feedback.

floybix commented 8 years ago

Demonstrating what I said in my last comment, the failure to reproduce a sequence by feedback from its higher-level pooled context: http://viewer.gorilla-repl.org/view.html?source=gist&id=996b5f56ef3f2d39532f

robjfr commented 8 years ago

I'm not sure how you've cut it up. But by your results you are finding that words have multiple predictions. Which reflects the multiple contexts they are presented with in the data.

Though I don't see where "squealed" predicts "chased" in your data. Perhaps that is some column generalization occurring?

Actually, my guess is that the whole problem is too much generalization appearing from somewhere, even at this early stage. With full context, each successive state should be completely specified.

Maybe you shouldn't reset context representations with each sentence. Ideally (for this problem) each transition should have a unique cell representation, and that cell representation should uniquely generate the next transition. If you never throw anything away and just keep driving the CLA through the input state after state, I can't see how that wouldn't happen.
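A toy model of that "never throw anything away" regime, with a fresh cell for every step of the stream (a hypothetical stand-in, not Comportex's mechanism):

```python
# Sketch: each occurrence in the stream gets a unique cell; each cell
# learns exactly one successor, so following the chain reproduces the
# stream exactly - even through repeated words. Illustrative only.
import itertools

def learn_stream(words):
    fresh = itertools.count()
    cells = [next(fresh) for _ in words]
    transitions = {a: b for a, b in zip(cells, cells[1:])}
    meanings = dict(zip(cells, words))
    return cells[0], transitions, meanings

def replay(first, transitions, meanings):
    out, cell = [], first
    while cell is not None:
        out.append(meanings[cell])
        cell = transitions.get(cell)
    return out

stream = ["the", "rat", "bit", "the", "rat"]   # repeats, but unique cells
first, trans, mean = learn_stream(stream)
# replay(first, trans, mean) == stream: perfect recall, no smearing
```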

And I see something like this reflected in your comments at the end of the Gorilla-REPL for this expt:

This suggests we need to

- limit the influence of feedback - balance it with lateral transitions.
- actually concatenate (union) states in the higher layer without decay, maybe?

Yeah, what you said. If you actually concatenate states in the higher level, you should be able to keep exact transitions, and just follow them through from one set of columns to another.

robjfr commented 8 years ago

Though actually to make this true you would need different cell states for every occurrence of a context.

So that's another comment on your proposal in the public thread (which I didn't notice was public till too late. Oops.) If you have a different cell state for each occurrence of a context then the cell state not only codes the context, but that context at that exact position in the sequence. So actually it codes much more than a single transition, it codes that transition in the context of the whole sequence to that point. Long distance contexts.

floybix commented 8 years ago

Though I don't see where "squealed" predicts "chased" in your data. Perhaps that is some column generalization occurring?

One of the training sequences was:

(">" "the" "mouse" "the" "cat" "chased" "ran" ".")

So that presentation of "chased" would have learned to be predicted by the higher-layer context bits for (">", "the") (as well as "mouse" "the" "cat").

The test sequence up to the point you are referring to is:

(">" "the" "rat" "the" "cat" "bit" "squealed")

Which shares some context (">" "the") (and possibly some bits of "cat") with the training sequence above, which is why "chased" is one of the predictions. As I said, the predictions are dominated by feedback rather than lateral connections.

If you actually concatenate states in the higher level, you should be able to keep exact transitions, and just follow them through from one set of columns to another.

Actually I don't think that's the problem.

btw on my gorilla page I mistakenly had listed :use-feedback? false in the spec but that was not what I used to generate the results, it was of course :use-feedback? true and I've now corrected the link.

robjfr commented 8 years ago

Right. OK thanks. I now see where the generalization is coming in. So that is extended context influencing predictions, but not enough extended context.

I agree(?), feedback learning of extended context is not helping us here. Or rather, it is not giving us enough extended context to specify that squealed occurs following (">", "the") etc. ONLY in the.... 4th sentence, or whatever order you present the sentences.

You would get this information if you just presented sentences seamlessly one after another, and had different cells for the same transition at different occurrences of that transition in the whole data stream. No?

floybix commented 8 years ago

I think the main problem is that the extended context is influencing predictions in a very smeared-out way, where all words seen in any part of that extended context are simultaneously and continuously predicted. So we need to strengthen the influence of lateral connections, which represent the immediate context.

A couple of reminders, in case they help:

- predictions are caused by any sufficient subset of context (prior active cells in either layer) - not by a match to the whole context.
- when a new sequence is presented, it will not get a unique cell representation indefinitely; each step will burst until a recognised transition (first-order transition) is seen - at that point the representation will collapse to the recognised one.

mrcslws commented 8 years ago

I'd think there are two reasons that this problem would occur:

  1. As you mention, because of decay, using the SDR from the end of the sequence means the SDR will be weighted toward the more recent patterns in the sequence.
  2. Because it's union pooling, the SDRs later in the sequence will be larger. Because distal connections in the lower region are selected randomly from the available active bits, they'll be skewed toward feedback connections as the SDRs grow.
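The second point can be put in back-of-envelope numbers: if distal inputs are sampled uniformly from all active bits, the chance of landing on a feedback bit grows with the union pool. The counts below are illustrative only:

```python
# Sketch: fraction of randomly sampled presynaptic bits that are feedback
# bits, as the union pool grows. All numbers are invented for illustration.

def feedback_fraction(lateral_bits, union_bits):
    return union_bits / (lateral_bits + union_bits)

lateral = 40                  # active cells in the lower layer
early, late = 40, 400         # union pool near start vs. end of a sequence
f_early = feedback_fraction(lateral, early)   # 0.5
f_late = feedback_fraction(lateral, late)     # ~0.91
# a 10x larger union skews random distal sampling heavily toward feedback
```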

Do we expect to be able to replay a sequence in the lower layer by holding the final SDR constant? I'd think we'd have to replay the changing higher SDR.

floybix commented 8 years ago

That's certainly what I was aiming for - to produce a single higher-level (temporal pooled) representation from which we could replay a sequence or at least the salient parts of it. Once we can do that we should be able to do things like rephrasing/reformulating, generalising, translating.

Yes the much larger number of cells in union pooling is the bias. I am thinking of splitting "apical" (feedback) bits from "distal" (lateral) bits, and maintaining separate synapse segment graphs for each. So each cell would have distal segments and apical segments, and I guess we would require that both are activated for a replay prediction.
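That split might be sketched like this: each cell keeps separate distal and apical segment lists, and counts as replay-predicted only when a segment of each kind is active. Segment contents and the activation threshold are invented:

```python
# Sketch: a cell is replay-predicted only if at least one distal (lateral)
# segment AND at least one apical (feedback) segment reach threshold.
# Bit sets and threshold are invented for illustration.

def segment_active(segment, active_bits, threshold):
    return len(segment & active_bits) >= threshold

def replay_predicted(cell, lateral_active, feedback_active, threshold=2):
    """cell: {'distal': [set, ...], 'apical': [set, ...]}"""
    distal_ok = any(segment_active(s, lateral_active, threshold)
                    for s in cell["distal"])
    apical_ok = any(segment_active(s, feedback_active, threshold)
                    for s in cell["apical"])
    return distal_ok and apical_ok

cell = {"distal": [{1, 2, 3}], "apical": [{10, 11, 12}]}
ok = replay_predicted(cell, {1, 2, 9}, {10, 11})     # True: both match
weak = replay_predicted(cell, {1, 2, 9}, {10, 99})   # False: apical fails
```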

mrcslws commented 8 years ago

Hmm. Maybe the apical learning (wow I just leveled up by saying that) should happen at the end of the sequence.

floybix commented 8 years ago

Do you mean at the end of a sequence, run back through all the activated cells in the lower layer and do their apical learning on the final union-pooled state in the higher layer? The idea being to capture following context as well as prior context?

mrcslws commented 8 years ago

Correct. (Sorry, I could have said that much more clearly.)

robjfr commented 8 years ago

I think the main problem is that the extended context is influencing predictions in a very smeared-out way, where all words seen in any part of that extended context are simultaneously and continuously predicted. So we need to strengthen the influence of lateral connections, which represent the immediate context.

A couple of reminders, in case they help:

predictions are caused by any sufficient subset of context (prior active cells in either layer) - not by a match to the whole context.

I think I understand that there is no "match on the whole context" explicitly in the processing. But if cells coding a transition are unique to a given occurrence of a transition, they will, in effect, code for the entire context of the string to that point. So the cells which code the transition "mouse" -> "squealed" at position X, if those cells are unique to position X, will code the transition "mouse" -> "squealed", but also the position for that transition at X, and implicitly all that came before X too.

when a new sequence is presented, it will not get a unique cell representation indefinitely; each step will burst until a recognised transition (first-order transition) is seen - at that point the representation will collapse to the recognised one.

I'm not sure how bursting will help us here. It may not matter, and it may come to the same thing, but for the purposes of argument at this point I would like to see "learning" happen immediately, so that on the first presentation of a new state it "learns" cells which predict it, and we can reproduce it by following those cells. That's how I'm thinking about it at this point anyway.

Also, a general comment on your feedback mechanism. You implemented this to supply longer context to individual states in the CLA. I wonder if explicitly exporting this information to the CLA is necessary at this point though. I was thinking it would be necessary originally because I thought states needed more context information so we could generalize, and I thought of that information as being associated with the states in the CLA. But now we see all this information is in the concatenated state anyway. I see everything happening in the concatenated state. I don't even see a need to export it to the CLA at this point. The CLA can just code input states and the transitions between them. All the interesting stuff will happen in the concatenated state.

And as a particular issue for the problems we are seeing now, I think the solution is to, yes, not have any fade in the representation of the concatenated layer. And also to have different cells for each new occurrence of each transition, so that not only the transition, but its ordinal place in the original stream of data, will be preserved.

mrcslws commented 8 years ago

I'll try to answer one part. Low-hanging fruit, as they say.

I'm not sure how bursting will help us here. It may not matter, and it may come to the same thing, but for the purposes of argument at this point I would like to see "learning" happen immediately, so that on the first presentation of a new state it "learns" cells which predict it, and we can reproduce it by following those cells. That's how I'm thinking about it at this point anyway.

Felix is talking about the lower layer. It bursts because the sequence is novel. Eventually it stumbles on a recognized transition. This transition may have been learned from the recent series of bursts. Regardless, the recognized transition causes the higher layer's SDR to reset and be completely determined by feedforward connections from lower layer's SDR (which was mostly predicted), and it begins union pooling from that point.

floybix commented 8 years ago

Regardless, the recognized transition causes the higher layer's SDR to reset and be completely determined by feedforward connections from lower layer's SDR (which was mostly predicted), and it begins union pooling from that point.

Oh, no, that's not what I meant. In fact for these experiments (using the private context-space branch) there is no resetting of the union pooling by bursting. Only manual resetting.

I meant -- in the lower layer -- that we can continue to assign unique cell representations (choices of cells in columns) while the sequence is bursting=unrecognised, but as soon as a transition is recognised then the cell representation is not unique to that sequence.

Rob I think you are suggesting that we should not burst at all, but continue to ingest everything literally as being completely new, without trying to recognise previously learned sequences. But I don't really get how that could work.

I see everything happening in the concatenated state. I don't even see a need to export it to the CLA at this point. The CLA can just code input states and the transitions between them.

By "export to the CLA" I take it you mean "expose to the lower layer for its prediction and learning". But we need the lower layer to learn connections from the extended context, otherwise it can not be entrained to replay sequences from above.

mrcslws commented 8 years ago

Sorry, yeah, I forgot that you removed the newly-engaged? part. Forget the part where I said it gets reset.

robjfr commented 8 years ago

...we can continue to assign unique cell representations (choices of cells in columns) while the sequence is bursting=unrecognised, but as soon as a transition is recognised then the cell representation is not unique to that sequence.

Yeah, you're right Felix. I just came to this realization while composing a reply to Marcus: "...as soon as a transition is recognised then the cell representation is not unique to that sequence."

I've realized I'm talking about a "super-sensitive" CLA which bursts at every time step. Every state is new to it.

(Also I want "learning" to be instant, so even on first presentation you can recover the state from an instantly learned cell transition. I don't know if that is realistic in the context of current processing. I don't see why not as a first approximation.)

Rob I think you are suggesting that we should not burst at all, but continue to ingest everything literally as being completely new, without trying to recognise previously learned sequences. But I don't really get how that could work.

What I just realized above is that actually I want it to burst every time.

What we're doing while trying to reproduce the original sequence, is essentially regressing and removing ALL generalization from the CLA. The current CLA generalizes on state. But we don't want generalization. Not for this first problem of "recall" which you presented today. We want perfect recall.

To get that we need to up CLA sensitivity to the max, and burst on each time step.

By "export to the CLA" I take it you mean "expose to the lower layer for its prediction and learning". But we need the lower layer to learn connections from the extended context, otherwise it can not be entrained to replay sequences from above.

Oh, I see what role you're giving the CLA here. I wasn't thinking of the replay as occurring anywhere in particular. But yes, it makes sense you want it to replay into the CLA.

OK, express replay in the CLA.

But for this "perfect recall" problem, we don't need any explicit "extended context" processing. That will just generalize and mess up our perfect recall. Instead burst at every time step on encoding, and then just produce the current (unique) transition on replay, one transition, no "extended context".

(Note: I don't think there's ever going to be any "extended context" processing now, actually, not in the sense of pushing context into states in the CLA "bottom up", so to speak, building hierarchies from the bottom. That is too hard. Instead we'll unfold "states in context" from the full concatenated state, "top down." Successively breaking a concatenated state into halves on the "strongest" transition. But that's not what we are talking about here. You wanted to forget all generalization and just perform a perfect recall. To get your bearings. That should happen as above, with bursting at every time step, and no generalization, particularly no "extended context" generalization, at all.)

robjfr commented 8 years ago

Still dithering on the split criterion when we build this top down tree.

It seems a lot to base a judgement about the entire meaning of a sequence on one transition. This struck me most forcefully considering the extension to continuous states with speech. Then the entire split would need to be one instantaneous sound transition.

So I'm reconsidering making the split on some kind of sum again.

Maybe it should be bottom up to the extent that we average cells per column following paths along lines of weakest resistance... So not averaging cells per column along the timeline, but along lines of (reverse) transition diversity.

Since we only have columns, we have to add from the bottom, column to column. Perhaps the addition should be backwards, not along transitions, but back, least diverse transitions first. Then select the final, most (least least) diverse transition at the top.

floybix commented 8 years ago

Tantalisingly close: http://viewer.gorilla-repl.org/view.html?source=gist&id=996b5f56ef3f2d39532f

robjfr commented 8 years ago

Good that you're making progress.

In common with the public discussion, I'm not sure why you are resetting between each sentence.

Anyway, the key to perfect recall, to the extent that is desirable as a goal (and note this is not something humans perform well), seems to me to be to make column representations absurdly sensitive, so they distinguish each unique occurrence of a word, distinguishing it anew each time it is used, and thus also distinguishing the next word predicted by it.

So I would say, to get unique recall, throw in a couple of bits to distinguish each use of a word in each sentence, and between sentences. You get to code the word. There is no reason why you can't give it a few extra bits for each occurrence. In the real world there would be lots of extra bits floating around. The normal challenge is to filter them out and keep the commonality.

Or, perhaps, it is more meaningful to engage the concatenated representation already and distinguish the column representation for each occurrence of a word on the entire previous sequence.

Actually, for this task, you could make the column representation for each new occurrence of a word be the entire previous sequence. However that would make going back to generalization harder. The real difficulty is always not how to make stuff specific, but how to generalize about it.

If you just add a few bits from the preceding sequence that should be enough. Then you can distinguish on it and get perfect recall, or throw it away and go back to generalizing about a word independently of its context, when, and as much, as you want.

robjfr commented 8 years ago

In fact this representation for each occurrence of a word identified with the entire preceding sequence may be what you are getting with your apical excitation.

Though for the concatenated sequence to distinguish predictions better than each sequential state alone, you'd need to have cell predictions (segments?) from the entire state of the concatenated sequence (?? I think. Sound reasonable?) So the concatenated cell state would not just be a concatenation of cell predictions for individual states, but would have predictions about the next state growing from all the columns of the concatenated sequence so far, giving the prediction far more context.

The real issue is how this is going to affect generalization, next up. As I say, humans don't do perfect recall well, so there may be a trade-off on the generalization side which is more important.

robjfr commented 8 years ago

Actually @floybix this should all come to the same thing. What I described above looks like one or other form of your "extended context", whether kept in the concatenated representation, or fed down to the CLA. Is that what is effectively being used to make more and more precise predictions using "apical" excitation?

floybix commented 8 years ago

I don't have time to work on this today but it sounds like you have understood the apical feedback. I think it will work if I tune the parameters.

On Saturday, 31 October 2015, Rob Freeman notifications@github.com wrote:

Actually @floybix https://github.com/floybix this should all come to the same thing. What I described above looks like one or other form of your "extended context", whether kept in the concatenated representation, or fed down to the CLA. Is that what is effectively being used to make more and more precise predictions using "apical" excitation?


Felix Andrews / 安福立 http://www.neurofractal.org/felix/

robjfr commented 8 years ago

Yes, I think I understand the value of your apical feedback now.

I rejected it above because it was causing over-generalization, only to reintroduce it earlier today to generate greater specificity! I now see it can do either; it's a matter of tuning to make sure the information is used in the right way.

I now think your message from 5 days ago captures the problem well.

I think the main problem is that the extended context is influencing predictions in a very smeared-out way, where all words seen in any part of that extended context are simultaneously and continuously predicted. So we need to strengthen the influence of lateral connections, which represent the immediate context.

A couple of reminders, in case they help:

predictions are caused by any sufficient subset of context (prior active cells in either layer) - not by a match to the whole context. ...

Your idea of strengthening the influence of lateral connections might work.

Though conceptually I now see strengthening lateral connections (or ignoring extended contexts altogether which was what I suggested) as removing information. Instead what I came back to is that we need more context information, not less. We just want to prevent one part of that context from dominating.

So if you can't get it to work by strengthening lateral connections, you might try going the other way, away from a dominance of any one part, even the closest, and forcing something more like a match on the whole context.

The problem is a dominant match caused by just one part of the context. Instead of trying to fix that by favouring the closest (lateral) context, you might go the other way and make the sensitivity of the prediction such that a larger subset (percentage?) of prior active cells is required, so no single part of the context can dominate.
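That "larger subset of prior active cells" gate could look like this in a stripped-down sketch (plain Python with hypothetical names; real Comportex matching works on per-segment stimulus thresholds, so the fractional requirement here is the proposed change, not the existing behaviour):

```python
# Sketch of the whole-context match: a cell becomes predictive only when
# its segment's matched synapses cover a minimum *fraction* of the prior
# active cells, so no single part of the extended context can dominate.

def predictive(segment_synapses, prior_active, min_fraction=0.5):
    """segment_synapses: presynaptic cell ids on one distal segment.
    prior_active: set of cells active at the previous time step."""
    matched = len(segment_synapses & prior_active)
    return matched >= min_fraction * len(prior_active)

prior = set(range(40))               # 40 active cells of prior context
narrow = set(range(8))               # segment keyed to one small part
broad = set(range(0, 40, 2))         # segment sampling the whole context
assert not predictive(narrow, prior)  # 8/40 = 0.2 < 0.5: suppressed
assert predictive(broad, prior)       # 20/40 = 0.5: allowed
```

With a fixed absolute threshold both segments would fire; the fractional form is what suppresses the "maverick" prediction sourced from one corner of the context.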

floybix commented 8 years ago

I've followed up my approach of prioritising lateral distal connections. This is where I got up to: http://viewer.gorilla-repl.org/view.html?source=gist&id=95da4401dc7293e02df3

I'm still confused by the stimulus threshold issue at the end.

Rob, I don't fully understand what you are proposing. I did try increasing the stimulus and learning thresholds which provides a more complete match to context (both lateral and apical) but it didn't seem to make much difference in my tests. I suspect the problem is actually in the sequence learning within one layer, and how representative (winner) cells are picked there. So your idea of a "super-sensitive" CLA is interesting. I assume you mean picking a random set of winner cells on every exposure. How do you recover those cells at replay time though? It must be from a match to the pooled state by apical feedback.

robjfr commented 8 years ago

Rob, I don't fully understand what you are proposing. I did try increasing the stimulus and learning thresholds which provides a more complete match to context (both lateral and apical) but it didn't seem to make much difference in my tests.

I'm not sure if you are still doing resets between training sentences??

If you are, then the broader context will be lost (I guess), and requiring a more complete match won't make any difference, because there will be no broader context to force a match to.

If forcing a more complete match isn't making any difference, this might be the reason.

How about trying to leave out the resets between test sentences?

I might be wrong. I'm still not sure what the effect of resets is.

Whatever, you just need to find a way to have prior context leave a shadow on each new occurrence of a word, to distinguish it from all previous and following occurrences.

I suspect the problem is actually in the sequence learning within one layer, and how representative (winner) cells are picked there. So your idea of a "super-sensitive" CLA is interesting. I assume you mean picking a random set of winner cells on every exposure. How do you recover those cells at replay time though? It must be from a match to the pooled state by apical feedback.

Not so much random, I think, as relevant only to that position in sequence. To get perfect recall you just need something, anything, to distinguish one occurrence of a word from another. It must be reproducible though, so you're right it couldn't just be microphone noise.

I still think just using the context of prior sentences should do it. But (as I understand it) you'll need to not reset between training sentences, so the system knows that in the sentence which comes after "The mouse the chased ran", the word after "cat" is "bit".

You need to have the preceding sentence leave a shadow on your representations, so the system can distinguish which sentence in sequence it has got up to.

floybix commented 8 years ago

To illustrate my point about the problems with sequence learning, I added this comment to the Go section:

Check that out. There were several training sentences including "bit" but we see only two different cells were ever activated (since only two cells have grown distal segments). So despite the differences between

  • "> the bird the cat bit"
  • "> the cat the bird bit"
  • "> the mouse the cat bit"
  • "> the rat the cat bit"

only two of these occurrences had distinguishable representations. The apical feedback representations from each complete sentence were distinguishable (note 4 apical segments) but that is deferred so did not affect the original cell choice.

I think this is just the CLA bursting behaviour - novel sequences burst until recognised; "the cat" was recognised so at that point in each sequence we had collapsed to an identical cell SDR.

Remembering that my "training" was presenting each sentence just once.

About leaving out resets. They are actually two different things - breaking the sequence transition, and clearing pooled cells.

If we don't clear the pooled cells, then we would always be at the maximum level of active cells in the pooled layer, and the results would depend on how we turn off old pooled cells. Maybe it could be a time limit, or the current approach of a small decay coupled with a competitive selection. But currently we don't have a good story for how to do this. If we start fresh in each sentence then the pooled cells are built up without such issues, until we reach 5 or 10 steps or whatever the parameter is.
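The decay-plus-competition option in the paragraph above can be sketched abstractly (plain Python; the `decay` rate and top-k sparsity are assumed parameters, not Comportex's actual ones):

```python
# Sketch of pooled-layer turnover without a hard reset: pooled cells carry
# an excitation that decays each step; fresh input enters at full strength,
# and a competitive top-k selection holds the active set at fixed sparsity,
# so old cells fall out gradually instead of being cleared.
import heapq

def pool_step(excitation, new_cells, decay=0.9, k=5):
    """excitation: dict cell -> level. Returns (updated dict, active set)."""
    exc = {c: v * decay for c, v in excitation.items()}
    for c in new_cells:
        exc[c] = 1.0                     # fresh input at full strength
    active = set(heapq.nlargest(k, exc, key=exc.get))
    return exc, active

exc = {}
for cells in [{"a1", "a2"}, {"b1", "b2"}, {"c1", "c2"}, {"d1", "d2"}]:
    exc, active = pool_step(exc, cells, k=5)
# the freshest cells win; the oldest have decayed out of the top-k
assert {"d1", "d2", "c1", "c2"} <= active
assert not {"a1", "a2"} <= active
```

This is exactly the "small decay coupled with a competitive selection" story; the open question it leaves is the same one raised above, namely how to choose the decay rate and sparsity so that useful old context survives.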

As for continuing sequence transitions across sentence boundaries. Maybe, but that does not mean that we will necessarily get distinct cell representations: see the first part of this comment.

Another possibility I just thought of. Observe each sentence in two passes.

robjfr commented 8 years ago

To illustrate my point about the problems with sequence learning, I added this comment to the Go section:

Check that out. There were several training sentences including "bit" but we see only two different cells were ever activated (since only two cells have grown distal segments). So despite the differences between

    "> the bird the cat bit"
    "> the cat the bird bit"
    "> the mouse the cat bit"
    "> the rat the cat bit"

only two of these occurrences had distinguishable representations. The apical feedback representations from each complete sentence were distinguishable (note 4 apical segments) but that is deferred so did not affect the original cell choice.

I'm probably not understanding something. (Lots! But this is another thing!)

Apical bits come from the concatenated representation, right? But by the time we get to any given "bit", the concatenated representation must be wildly different not only in cells, but also in columns.

Perhaps this is it. You are talking about "bit" only getting two cells different. So you're saying only the cells of "bit" govern predictions. But I'm thinking of the wildly differing cells (and columns) sourcing apical feedback from the concatenated layer. I'm mostly thinking of predictions sourced from the concatenated representation, not predictions sourced from the CLA.

The way I understand it, predictions about what follows "bit" will be governed not just by the SDR for "bit" but apical feedback from the whole concatenated representation at "bit".

About leaving out resets. They are actually two different things - breaking the sequence transition, and clearing pooled cells.

If we don't clear the pooled cells, then we would always be at the maximum level of active cells in the pooled layer, and the results would depend on how we turn off old pooled cells. Maybe it could be a time limit, or the current approach of a small decay coupled with a competitive selection. But currently we don't have a good story for how to do this. If we start fresh in each sentence then the pooled cells are built up without such issues, until we reach 5 or 10 steps or whatever the parameter is.

Are our columns just too short? If sparse cells are not able to hold new information without discarding old, the alternative is to find a more efficient way to store the information.

But the easy way seems to be to make the columns taller.

As for continuing sequence transitions across sentence boundaries. Maybe, but that does not mean that we will necessarily get distinct cell representations: see the first part of this comment.

As above. There must be a difference in the way you see the concatenated representation and the way I see it. The way I see it, the concatenated layer will have widely differing columns by the time we get to each occurrence of "bit". And apical(?) feedback should provide all the information we need to distinguish predictions.

Another possibility I just thought of. Observe each sentence in two passes.

1) The first pass is forced to be all bursting in the lower layer, so all cells in the columns are activated. A higher layer pools these. The pooled layer is then like a hash of the content of the sentence, without any context encoding.

2) In the second pass, do choose cells in each column of the lower layer; use the pooled feedback via apical segments to weight the selection.

Yes, that sounds more like what I was originally thinking. Allow the states to have unique (hash) representations (though I hate the idea of a separate "pass" to assign them.)

But I've gone off that idea. I think we can get all the uniqueness we need from the preceding sequence.

It sounds to me like you are:

1) Throwing away most of the information from the preceding sequence because you don't have enough cells to code for it.

2) Not using information from the preceding sequence (in the concatenated representation) as the primary source of predictions about what comes next. Instead you are trying to squeeze all that information into the CLA state??

floybix commented 8 years ago

By the way, I'd like to present this example of replaying a sequence from its pooled layer as a short talk at the Numenta community meetup. I think you'll agree nothing there is specific to your ideas on meaning, generalisation etc? Let me know if you have concerns.

robjfr commented 8 years ago

I do think concatenating everything in one SDR is crucial to what we want to do. And personally it was something of an epiphany for me that we could represent a sequence of states like that. It exposes the contexts as a parameter for generalization on an equal basis with the columns, and I realized we could get a context-based non-linear rearrangement of states using it.

But I don't think we're giving too much away by presenting it. Numenta may have stumbled to something similar in their own temporal pooling efforts. I wasn't sure whether that was how they were coding sequences anyway (still not, not sure how they code sequences!)

And thinking about it, both of our simultaneous realizations that we would want to code sequences this way (my post, and your reply saying you were working on something similar you called "context spaces") are in the open thread anyway!

If someone sees it, immediately groks the power it gives us, and goes on to write a generalization over contexts which pools based on substitution the way we are trying to do, I'll be delighted to see it implemented!

The real question is what you present of perfect recall. Whether you agree with my points above to fix it.

And moving on from that, I hope it won't distract too much from the generalization task. I think perfect recall is a degenerate task, and not something we do well cognitively, anyway. It might fit in the category of underwhelming them. But that won't matter, if you want something to present, sure, why not present this.

floybix commented 8 years ago

Some quick points on the run...

But by the time we get to any given "bit", the concatenated representation must be wildly different not only in cells, but also in columns. I'm mostly thinking of predictions sourced from the concatenated representation, not predictions sourced from the CLA.

In my section From smeared-out to orderly I explained why I am treating predictions as primarily from lateral connections, not apical connections. However I'm reconsidering:

New idea - cells are only predictive if both lateral and apical segments are activated. The result is the continuous bursting (thus unique representations) on novel sequences I proposed above, as long as the apical context doesn't match a known one. But as soon as the context is matched we get recognition. The second pass would be optional, to clarify (refine) the pooled representation if it was initially bursting.
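That conjunction could be sketched as follows (a hypothetical set-based representation with an assumed `threshold` parameter, just to pin down the logic):

```python
# Sketch of the "lateral AND apical" gate: a cell is predictive only when
# both a lateral (same-layer sequence) segment and an apical (pooled
# feedback) segment are sufficiently activated. Novel contexts match
# neither, so columns keep bursting into unique states; a recognised
# context satisfies both gates.

def predictive_cells(cells, lateral_active, apical_active, threshold=2):
    out = set()
    for cell, (lateral_syns, apical_syns) in cells.items():
        lat = len(lateral_syns & lateral_active) >= threshold
        api = len(apical_syns & apical_active) >= threshold
        if lat and api:                 # conjunction, not a sum
            out.add(cell)
    return out

cells = {
    "c0": ({1, 2, 3}, {10, 11, 12}),    # matches both: recognised context
    "c1": ({1, 2, 3}, {90, 91, 92}),    # lateral only: apical context novel
    "c2": ({7, 8, 9}, {10, 11, 12}),    # apical only: sequence novel
}
pred = predictive_cells(cells, {1, 2, 3, 4}, {10, 11, 13})
assert pred == {"c0"}
```

Under an additive scheme c1 and c2 could each reach threshold from one pathway alone; the conjunction is what keeps a novel apical context bursting even when the local sequence is familiar.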

Are our columns just too short? If sparse cells are not able to hold new information without discarding old, the alternative is to find a more efficient way to store the information.

Well there has to be a limit at some point if you just keep adding cells every time step. And we need the sparsity properties. Note in my experiment I had the higher layer with only 1 cell per column because I was thinking of it just as a reflection of the lower cells, I wasn't looking at sequence transitions at the higher level.

robjfr commented 8 years ago

Some quick points on the run...

But by the time we get to any given "bit", the concatenated representation must be wildly different not only in cells, but also in columns.
I'm mostly thinking of predictions sourced from the concatenated representation, not predictions sourced from the CLA.

In my section From smeared-out to orderly I explained why I am treating predictions as primarily from lateral connections, not apical connections.

I think this retreat from "smearing" was my first reaction to broader contexts predicting wildly too. I suggested you forget the extended context and make a unique state by bursting each time.

Then I decided I did want extended context, because it would be the easiest way to get unique, but reproducible bits at each time step, and thought to solve the "smearing" problem by requiring a match over a really broad range of that context, to prevent any one part from making maverick predictions.

You said (somewhere?) you tried that and it didn't work(??)

I'm still thinking that can work. Perhaps what I see as the key bottleneck now is this throwing of context information away, especially between sentences, because you don't want the representations to get too dense.

However I'm reconsidering:

New idea - cells are only predictive if both lateral and apical segments are activated. The result is the continuous bursting (thus unique representations) on novel sequences I proposed above, as long as the apical context doesn't match a known one. But as soon as the context is matched we get recognition. The second pass would be optional, to clarify (refine) the pooled representation if it was initially bursting.

Too much complexity and talk of multiple passes makes me very uncomfortable. If we have to force any solution too much we are probably going in the wrong direction.

However, this does underline that the key problem is to find sufficient distinguishing bits at each time step.

Are our columns just too short? If sparse cells are not able to hold new information without discarding old, the alternative is to find a more efficient way to store the information.

Well there has to be a limit at some point if you just keep adding cells every time step. And we need the sparsity properties. Note in my experiment I had the higher layer with only 1 cell per column because I was thinking of it just as a reflection of the lower cells, I wasn't looking at sequence transitions at the higher level.

Well, you know I want to do almost everything at that higher (concatenated) level. I think a very comprehensive concatenated representation is going to be necessary for generalization. So if we're meeting problems of information density now, we are certainly going to meet them later.

I don't know what the engineering limits are, but I think before we're finished we're going to need to have something like thousands of states sitting together with their cells, in an SDR at the same time. I don't see why it should be a problem in principle. As we increase the number of columns and cells we should get an exponential increase in information capacity (at least?), so we should be able to deal with any length problem, which will only add linear complexity as we try to store the information from ever longer sequences.
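The capacity intuition holds up on standard SDR combinatorics (the 2048-bit, 40-active figures below are illustrative assumptions, not measurements from this code base):

```python
# Back-of-envelope check: the number of distinct sparse codes grows
# combinatorially with size, so representational capacity far outruns
# the linear growth in cells used to store longer sequences.
from math import comb

n, w = 2048, 40                 # columns/cells and active bits (assumed)
codes = comb(n, w)              # distinct SDRs at this sparsity
assert codes > 10 ** 80         # astronomically many codes

# Doubling the bit count multiplies the code count by more than 2^40,
# while the cost of a longer concatenated sequence grows only linearly.
assert comb(2 * n, w) > codes * 10 ** 10
```

So, at least in principle, running out of distinct codes is not the constraint; the practical limits are the density of simultaneously active cells and the matching thresholds, as discussed above.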

Given the capacity to store longer sequences, conceptually I can't see why a really comprehensive concatenated representation, together with the requirement that predictive connections be spread over a really wide range of that concatenated representation, to prevent maverick predictions from individual states within it, wouldn't work.

floybix commented 8 years ago

I'm planning to shut down my paid github account which means this private repo will be deleted. If you want to keep stuff from this thread, please copy it now.

robjfr commented 8 years ago

That seems a pity Felix. To me this is a work in progress and many of the insights here still constructive. Is deletion the only option? How would you feel about moving the repo to another account, private or public?

floybix commented 8 years ago

Ok, I made it public for now and renamed to comportex-archived.