BindsNET / bindsnet

Simulation of spiking neural networks (SNNs) using PyTorch.
GNU Affero General Public License v3.0

Reward prediction error instead of reward #224

Closed · Huizerd closed this issue 5 years ago

Huizerd commented 5 years ago

As suggested by Fremaux et al. (2010), the learning performance of reward-modulated STDP can be greatly enhanced by using the 'reward prediction error' (RPE) instead of the actual reward in the weight update (w += gamma * dt * reward * e_trace). I tested this on a simple 1D navigation task, with the following result without RPE:

[Image: final_reward_plot (without RPE)]

And for RPE:

[Image: final_reward_plot (with RPE)]

Here, blue is the predicted reward for an episode and red is the actual obtained reward. The maximum reward that could have been achieved in one episode is 10000, so the RPE variant got very close.

I implemented the predicted reward as a simple moving average of past episodes (per step, so episode_reward / steps) and fed this to Network.run. I used Pipeline, which I had to modify so that the reward can be adjusted between receiving it from the Gym environment and feeding it to Network.run.
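
Concretely, the idea amounts to something like the following (a minimal sketch with illustrative names and values; run_episode and num_episodes stand in for the actual Gym loop, this is not the exact code I used):

# Sketch: predicted reward as a moving average of past episodes' per-step reward.
alpha = 0.1           # smoothing factor (illustrative value)
reward_predict = 0.0  # running estimate of the per-step reward

for episode in range(num_episodes):
    episode_reward, steps = run_episode()  # hypothetical helper returning episode totals
    per_step_reward = episode_reward / steps

    # Reward prediction error: actual minus predicted per-step reward.
    rpe = per_step_reward - reward_predict

    # Feed `rpe` (instead of the raw reward) to the learning rule, i.e.
    # w += gamma * dt * rpe * e_trace

    # Update the prediction for the next episode.
    reward_predict = (1 - alpha) * reward_predict + alpha * per_step_reward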

I would like to implement this in BindsNET as well, so my questions are: have you heard of this? Has anyone tried to implement it yet, and if so, how? Do you have a preferred way of implementing it?

I think it would be cool if users could decide for themselves how they want to implement the predicted reward, so we might need the ability to pass a function to Pipeline / Network. This might also cover the reward features discussed in #217.

djsaunde commented 5 years ago

> I implemented the predicted reward as a simple moving average of past episodes (per step, so episode_reward / steps) and fed this to Network.run. I used Pipeline, which I had to modify so that the reward can be adjusted between receiving it from the Gym environment and feeding it to Network.run.

I think it would be good to have. I'm envisioning a new module (perhaps bindsnet.reward or bindsnet.learning.reward) that contains functions to this effect.

Are there other possible implementations of predicted reward? A moving average seems a bit simplistic; you might instead want the network itself to predict the reward, e.g., as a function of the history of its inputs.
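
For instance, something along these lines (just a rough sketch of the kind of interface I have in mind; class, method, and argument names are illustrative, not final):

# Rough sketch of a possible bindsnet.learning.reward module.
from abc import ABC, abstractmethod


class AbstractPredictedReward(ABC):
    """Base class for objects that transform the raw environment reward."""

    @abstractmethod
    def compute(self, reward: float, **kwargs) -> float:
        """Map the raw reward to the value used by reward-modulated learning rules."""


class MovingAvgRPE(AbstractPredictedReward):
    """Returns the reward prediction error w.r.t. a moving average of past rewards."""

    def __init__(self, alpha: float = 0.1) -> None:
        self.alpha = alpha          # smoothing factor
        self.reward_predict = 0.0   # moving-average estimate of the reward

    def compute(self, reward: float, **kwargs) -> float:
        rpe = reward - self.reward_predict
        self.reward_predict = (1 - self.alpha) * self.reward_predict + self.alpha * reward
        return rpe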

On another note, there's no requirement to use Pipeline for RL experiments. We typically find it too restrictive. We implemented it due to a request from our funding agency at the time.

> I would like to implement this in BindsNET as well, so my questions are: have you heard of this? Has anyone tried to implement it yet, and if so, how? Do you have a preferred way of implementing it? I think it would be cool if users could decide for themselves how they want to implement the predicted reward, so we might need the ability to pass a function to Pipeline / Network. This might also cover the reward features discussed in #217.

I've heard of it, but we haven't tried to implement it. I agree it'd be good to optionally pass a constant or a callable (a function, or an instance of a class that defines __call__) to Network.run, which can be referenced / called on each timestep. By default (no reward argument), it would be None, and reward-modulated LearningRules would throw an error. Alternatively, it might be better implemented as a method / attribute / @property of the Network object; e.g., Network.reward would either reference a constant or invoke a callable object. This is something that might require some debate.
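
The dispatch inside Network.run might look roughly like this (just a sketch; the helper name _resolve_reward is hypothetical):

# Sketch: resolve a constant vs. callable reward on each timestep.
def _resolve_reward(self, **kwargs):
    if self.reward is None:
        # Reward-modulated learning rules would raise an error in this case.
        return None
    if callable(self.reward):
        # A function, or an instance of a class defining __call__.
        return self.reward(**kwargs)
    # Otherwise, treat it as a constant.
    return self.reward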

djsaunde commented 5 years ago

By the way, the plots you shared are really cool! Are they from experiments similar to those in the Fremaux paper?

Hananel-Hazan commented 5 years ago

> I would like to implement this in BindsNET as well, so my questions are: have you heard of this? Has anyone tried to implement it yet, and if so, how? Do you have a preferred way of implementing it? I think it would be cool if users could decide for themselves how they want to implement the predicted reward, so we might need the ability to pass a function to Pipeline / Network. This might also cover the reward features discussed in #217.

That would be a great addition to the BindsNET framework. There are plans to continue in this direction, but you can be the first to add it. We may change it later on.

Huizerd commented 5 years ago

Ok, I have a non-working implementation on this branch: https://github.com/Huizerd/bindsnet/tree/reward.

I'm having some trouble with the use of (abstract) classes, and I hope you can see what's going wrong. For some reason, when self.reward.compute is called here https://github.com/Huizerd/bindsnet/blob/d9913a6e105ac948eb97b380c367eb907c3d687c/bindsnet/network/__init__.py#L268

self ends up bound to reward, so inside the function https://github.com/Huizerd/bindsnet/blob/d9913a6e105ac948eb97b380c367eb907c3d687c/bindsnet/learning/reward.py#L31

it throws AttributeError: 'Tensor' object has no attribute 'reward_predict'. Any idea what could cause self to be overwritten like that? Also, my interpreter warns that MovingAvgRPE is not of type AbstractPredictedReward when I pass it, even though it clearly inherits from it. Maybe that has something to do with it. Thanks in advance!

djsaunde commented 5 years ago

@Huizerd I'll look into this tonight.

djsaunde commented 5 years ago

@Huizerd do you have a script that demonstrates the error?

Huizerd commented 5 years ago

I have a minimal example here:

import numpy as np
import torch

from bindsnet.network import Network
from bindsnet.network.nodes import LIFNodes, Input
from bindsnet.network.topology import Connection
from bindsnet.learning import MSTDPET
from bindsnet.learning.reward import MovingAvgRPE
from bindsnet.encoding import poisson

# Seed
seed = 0
torch.manual_seed(seed)
np.random.seed(seed)

# Build
net_lif = Network(dt=1.0, reward=MovingAvgRPE)
net_lif.add_layer(name='Input', layer=Input(5))
net_lif.add_layer(name='LIF', layer=LIFNodes(1))
net_lif.add_connection(
    connection=Connection(source=net_lif.layers['Input'], target=net_lif.layers['LIF'], update_rule=MSTDPET),
    source='Input', target='LIF')

# Start
spikes = poisson(torch.ones(5) * 40, time=1, dt=1.0)
net_lif.run(inpts={'Input': spikes[:, None, :]}, time=1, reward=10.0)

which resulted in AttributeError: 'float' object has no attribute 'reward_predict'.

djsaunde commented 5 years ago

Okay, I finally got around to looking at this. Your Network constructor:

    def __init__(self, dt: float = 1.0, learning: bool = True, reward: AbstractPredictedReward = None) -> None:
        # language=rst
        """
        Initializes network object.

        :param dt: Simulation timestep.
        :param learning: Whether to allow connection updates. True by default.
        :param reward: Class allowing for modification of reward in case of reward-modulated learning.
        """
        self.dt = dt
        self.layers = {}
        self.connections = {}
        self.monitors = {}
        self.learning = learning
        self.reward = reward

It expects reward to be an instance of AbstractPredictedReward. However, you were passing in the class MovingAvgRPE itself, not an instance of it (MovingAvgRPE()), and it is never instantiated in the Network logic or before calling Network.run(). When I tried to instantiate MovingAvgRPE myself, I got an error because subclasses must implement all abstract methods of their parent abstract class (on your branch, its compute method is commented out).
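
That also explains the AttributeError: because self.reward holds the class rather than an instance, the call inside Network.run effectively becomes

# self.reward is the class itself, so the reward value is bound to `self`:
MovingAvgRPE.compute(10.0)
# -> inside compute, self is the float 10.0, hence
#    AttributeError: 'float' object has no attribute 'reward_predict'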

Try this:

import numpy as np
import torch

from bindsnet.network import Network
from bindsnet.network.nodes import LIFNodes, Input
from bindsnet.network.topology import Connection
from bindsnet.learning import MSTDPET
from bindsnet.learning.reward import MovingAvgRPE
from bindsnet.encoding import poisson

# Seed
seed = 0
torch.manual_seed(seed)
np.random.seed(seed)

# Build
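# Note the change below: pass an instance, MovingAvgRPE(), not the class itself.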
net_lif = Network(dt=1.0, reward=MovingAvgRPE())
net_lif.add_layer(name='Input', layer=Input(5))
net_lif.add_layer(name='LIF', layer=LIFNodes(1))
net_lif.add_connection(
    connection=Connection(source=net_lif.layers['Input'], target=net_lif.layers['LIF'], update_rule=MSTDPET),
    source='Input', target='LIF')

# Start
spikes = poisson(torch.ones(5) * 40, time=1, dt=1.0)
net_lif.run(inpts={'Input': spikes[:, None, :]}, time=1, reward=10.0)

Also uncomment the compute method of MovingAvgRPE, and change the following in Network.run:

# Compute reward prediction error.
if self.reward is not None:
    if 'reward' not in kwargs:
        raise KeyError('Reward should be specified!')
    reward = kwargs.pop('reward')
    kwargs['reward'] = self.reward.compute(reward, **kwargs)

Also, I recommend renaming Network.reward to Network.reward_fn, to disambiguate it from the per-timestep reward passed to Network.run.
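
I.e., something like this (sketch only; not committing to the exact names):

# In Network.__init__:
self.reward_fn = reward_fn  # object that transforms the per-timestep reward

# In Network.run:
if self.reward_fn is not None:
    reward = kwargs.pop('reward')  # raw per-timestep reward from the caller
    kwargs['reward'] = self.reward_fn.compute(reward, **kwargs)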