calumroy / HTM


High level commands #8

Open calumroy opened 10 years ago

calumroy commented 10 years ago

The HTM hierarchy has been implemented in the balancer project. Feedback commands have not been tested against any results yet. An issue with the current design (commit 51b2037db9faa75eb0501dc670202b5491f0bc88) is that there is no way to direct the commands coming from the highest level.

A possible solution is to add some sort of SDR recognizer. It would recognize SDRs that are "desirable" and then attempt to issue only commands that are known to produce the desired SDR. This could be something the thalamus does in the real neocortex, by gating the output of SDRs from different levels. It could be thought of as the thalamus remembering a desirable past experience and attempting to change the output of the neocortex to reproduce the same experience.
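A minimal sketch of what such a recognizer might look like, assuming SDRs are flat binary numpy arrays and that each desirable SDR is remembered together with the feedback command that was active when it occurred. The class and method names here are illustrative only and are not part of the current code base.

```python
import numpy as np

class DesirableSDRRecognizer:
    """Hypothetical sketch of the 'desirable SDR' recognizer idea.

    SDRs are assumed to be flat binary numpy arrays. The recognizer remembers
    which feedback command was active when a desirable SDR appeared, and later
    gates the top-level output so that only those commands are reissued.
    """

    def __init__(self, overlap_threshold=0.8):
        self.overlap_threshold = overlap_threshold
        self.desirable_sdrs = []   # SDRs that were marked as desirable
        self.commands = []         # command active when each SDR occurred

    def mark_desirable(self, sdr, command):
        """Remember a desirable SDR and the command associated with it."""
        self.desirable_sdrs.append(sdr.astype(bool))
        self.commands.append(command)

    def gate_command(self, target_sdr):
        """Return the command whose remembered SDR best overlaps the SDR we
        want to reproduce, or None if nothing is close enough."""
        target = target_sdr.astype(bool)
        best_overlap, best_command = 0.0, None
        for sdr, command in zip(self.desirable_sdrs, self.commands):
            overlap = np.sum(sdr & target) / max(np.sum(sdr), 1)
            if overlap > best_overlap:
                best_overlap, best_command = overlap, command
        return best_command if best_overlap >= self.overlap_threshold else None
```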

calumroy commented 9 years ago

After some research I have decided to try implementing Q-learning within the thalamus class. The idea is that the top level's output is sent to the thalamus, which assigns a Q value to each of the input cell grid squares. Then normal Q-learning is performed, and an output is selected by the thalamus and sent back to the HTM as a top-level feedback command.
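A minimal sketch of this idea, assuming the top level's output arrives as a boolean cell grid and that a single chosen cell forms the feedback command. The class name, shapes and parameters below are assumptions for illustration, not the actual thalamus class.

```python
import numpy as np

class ThalamusQLearner:
    """Sketch: one Q value per cell in the top level's output grid.

    The active cells of the current output form the state; the chosen feedback
    command is the active cell with the highest Q value (epsilon-greedy).
    """

    def __init__(self, grid_shape, alpha=0.1, gamma=0.9, epsilon=0.1):
        self.q = np.zeros(grid_shape)   # one Q value per input cell grid square
        self.alpha = alpha              # learning rate
        self.gamma = gamma              # discount factor
        self.epsilon = epsilon          # exploration rate

    def select_command(self, active_cells):
        """Pick a cell to reinforce, given the boolean grid of active cells."""
        if np.random.rand() < self.epsilon or not active_cells.any():
            idx = np.unravel_index(np.random.randint(self.q.size), self.q.shape)
        else:
            masked = np.where(active_cells, self.q, -np.inf)
            idx = np.unravel_index(np.argmax(masked), self.q.shape)
        command = np.zeros_like(self.q, dtype=bool)
        command[idx] = True             # feedback command sent back to the HTM
        return command, idx

    def update(self, idx, reward, next_active_cells):
        """Standard one-step Q update for the cell that was chosen."""
        next_best = self.q[next_active_cells].max() if next_active_cells.any() else 0.0
        self.q[idx] += self.alpha * (reward + self.gamma * next_best - self.q[idx])
```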

Here is a post on one way of combining the two: https://cireneikual.wordpress.com/2015/01/08/continuous-htm-multiple-layers-and-reinforcement-learning/

I think a better solution is not to use a feedforward neural network and instead just use the output of the HTM. Here is a post from the email discussion about Q-learning and the HTM.

Hi Eric

Gideon (also on the list) and I have been working on this for a while. We are very keen on assigning Q values to each HTM cell. This seems to work really well. However, in practice we have faced the following difficulties with making the idea work properly as a complete agent:

  1. A deep hierarchy is needed to create long-term, abstract concepts to which we can assign meaningful Q values. This means temporal pooling and hierarchical learning must be working really well. At the moment it seems hierarchically-scalable temporal pooling is a Work In Progress for HTM-like algorithms. If we can't create a deep hierarchy, we can't link causes that occur a long time before Rewards, except by discounting (where the signal rapidly becomes weak in a "flat" hierarchy, due to the large number of intermediate states).
  2. If you have hierarchical Q-values, you will want hierarchical action selection. If you have hierarchical action selection, you need to be able to execute actions hierarchically. This poses a number of problems, such as maintaining the agency of actions represented at higher levels of the hierarchy. (see http://a-mpf.blogspot.com.au/2014/12/agency-and-hierarchical-action-selection.html )
  3. "Closing the loop" and allowing the agent's actions to determine future inputs, changes the dynamics of the system, and can lead to runaway feedback effects. For example, say the agent discovers a mildly adaptive action. Does it endlessly repeat that strategy, or keep exploring the space to discover better actions? This exploration-exploitation balance is a well known and unsolved problem ( http://en.wikipedia.org/wiki/Multi-armed_bandit ). Of course, the dilemma applies to organisations and society as well ( http://vserver1.cscs.lsa.umich.edu/~pjlamber/Complexity%20Course_files/exploration_exploitation.pdf ). By definition there is no perfect solution to this problem. Humans are pretty good at it most of the time.

regards
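For reference, the exploration-exploitation balance mentioned in point 3 can be illustrated with a toy multi-armed bandit and an epsilon-greedy policy. This is only an illustration and is unrelated to the HTM code.

```python
import numpy as np

def epsilon_greedy_bandit(true_means, steps=1000, epsilon=0.1, seed=0):
    """Toy multi-armed bandit run with an epsilon-greedy policy.

    With epsilon = 0 the agent locks onto the first mildly good arm it finds;
    with a small epsilon it keeps exploring and usually finds the best arm.
    """
    rng = np.random.default_rng(seed)
    n_arms = len(true_means)
    estimates = np.zeros(n_arms)   # running estimate of each arm's reward
    counts = np.zeros(n_arms)      # how often each arm has been pulled
    total_reward = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.integers(n_arms)         # explore: random arm
        else:
            arm = int(np.argmax(estimates))    # exploit: best arm so far
        reward = rng.normal(true_means[arm], 1.0)
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward
    return total_reward, estimates

# e.g. epsilon_greedy_bandit([0.2, 0.5, 0.8], epsilon=0.1)
```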