
ANNUBeS: training Artificial Neural Networks to Uncover Behavioral Strategies in neuroscience
https://annubs.github.io/annubes/
Apache License 2.0

Explore the training procedure and eventually define the edits needed #19

gcroci2 closed this issue 1 month ago

gcroci2 commented 8 months ago

As a reference for the code, see PR #16

Data in each epoch

`trials = task.generate_trials()` is always the same across one network's training (the rng is fixed once per network) and constitutes the entire training set. It is then reused for thousands of epochs to train the net. Are we fine with that? If so, there is no point in regenerating the same trials at every epoch.

What do we think makes more sense to do here?
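For concreteness, a minimal sketch of the two options; `task`, `model`, and `train_one_epoch` are placeholders standing in for the actual code in PR #16, not its real API:

```python
def train(task, model, train_one_epoch, n_epochs: int, resample_every_epoch: bool) -> None:
    """Sketch of the two options being discussed:
    option A: generate the trials once and reuse them for every epoch (current behaviour),
    option B: regenerate trials at every epoch so the rng keeps advancing."""
    trials = task.generate_trials()            # option A: one fixed training set
    for _ in range(n_epochs):
        if resample_every_epoch:
            trials = task.generate_trials()    # option B: fresh draw from the task each epoch
        train_one_epoch(model, trials)         # hypothetical single-epoch update step
```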

Mini Batch vs Batch Gradient Descent

Validation

Is there any particular reason why validation starts only after epoch 200? I think we should properly implement an early-stopping callback.
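A minimal sketch of what such a callback could look like; nothing here exists in the codebase yet, and `patience`, `min_delta`, and the `step` interface are made up for illustration:

```python
class EarlyStopping:
    """Stop training when the validation loss has not improved for `patience` epochs."""

    def __init__(self, patience: int = 20, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.epochs_without_improvement = 0

    def step(self, val_loss: float) -> bool:
        """Return True if training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience
```

With something like this, validation could run from the first epoch and training would simply stop once the validation loss stalls, instead of hard-coding a start at epoch 200.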


Useful notes:

- Batch size: the number of samples to work through before updating the internal model parameters.
- Number of epochs: the number of times the learning algorithm works through the entire training dataset.
- Batch, Mini Batch & Stochastic Gradient Descent
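To make the distinction concrete, a toy NumPy sketch of one epoch with a configurable batch size; the linear-regression loss and data are placeholders, not the annubes task:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))           # 1000 samples, 10 features (toy data)
y = X @ rng.normal(size=10)               # toy regression target
w = np.zeros(10)                          # model parameters
lr = 0.01

def run_epoch(batch_size: int) -> None:
    """One epoch of gradient descent on a squared-error loss.
    batch_size == len(X): batch GD (one parameter update per epoch).
    1 < batch_size < len(X): mini-batch GD.
    batch_size == 1: stochastic GD (one update per sample)."""
    global w
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        grad = 2 * X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
        w -= lr * grad

run_epoch(batch_size=32)                  # e.g. mini-batch gradient descent
```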

This issue blocks #15

We decided to start with the reinforcement learning part first, in particular by trying to mimic the procedure of this article: https://elifesciences.org/articles/21492

gcroci2 commented 4 months ago

Useful links

Rationale

A major goal in neuroscience is to understand the relationship between an animal’s behavior and how that behavior is encoded in the brain. A typical experiment: train an animal to perform a task and record the activity of its neurons while the animal carries out the task.

To complement these experimental results, researchers “train” artificial neural networks to simulate the same tasks on a computer. Unlike real brains, artificial neural networks provide complete access to the “neural circuits” responsible for a behavior, offering a way to study and manipulate the behavior in the circuit.

You can use:

Reward-based training of RNNs

Song et al.'s networks consisted of two parts:

Other info:

The environment $\epsilon$ represents the experimentalist, while the agent $A$ represents the animal. At each time $t$ the agent chooses to perform actions after observing inputs provided by the environment, and the probability of choosing actions is given by the agent’s policy $\pi_{\theta}$ with parameters $\theta$. Here the policy is implemented as the output of an RNN, so that $\theta$ comprises the connection weights, biases, and initial state of the decision network.

In this work they only consider cases where the agent chooses one out of $N_a$ possible actions at each time, so that $\pi_{\theta}(a_t \mid u_{1:t})$ for each $t$ is a discrete, normalized probability distribution over the possible actions $a_1, \ldots, a_{N_a}$.
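As a rough illustration of "policy implemented as the output of an RNN", a minimal PyTorch sketch; PyTorch and the class below are assumptions for illustration, not the article's actual implementation. Here $\theta$ is the GRU weights, the readout weights, and a learnable initial state:

```python
import torch
import torch.nn as nn

class PolicyRNN(nn.Module):
    """π_θ(a_t | u_{1:t}): a recurrent net mapping observations to a
    normalized distribution over N_a discrete actions at every time step."""

    def __init__(self, n_inputs: int, n_hidden: int, n_actions: int):
        super().__init__()
        self.rnn = nn.GRU(n_inputs, n_hidden, batch_first=True)
        self.readout = nn.Linear(n_hidden, n_actions)
        self.h0 = nn.Parameter(torch.zeros(1, 1, n_hidden))      # learnable initial state

    def forward(self, u):                                        # u: (batch, time, n_inputs)
        h0 = self.h0.expand(-1, u.shape[0], -1).contiguous()
        h, _ = self.rnn(u, h0)                                   # (batch, time, n_hidden)
        return torch.softmax(self.readout(h), dim=-1)            # (batch, time, n_actions)
```

Sampling $a_t$ from this output at each step and feeding the resulting observation back in gives the agent–environment loop described above.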

After each set of actions by the agent at time $t$, the environment provides a reward (or special observable) $\varrho_{t+1}$ at time $t+1$, which the agent attempts to maximize.
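In standard policy-gradient (REINFORCE-style) terms this amounts to maximizing the expected summed reward; the exact objective, baseline, and discounting used in the article are not reproduced here, so the following is only the generic form:

$$
J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\sum_{t=1}^{T} \varrho_{t}\right],
\qquad
\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\sum_{t=1}^{T}
\nabla_\theta \log \pi_\theta\!\left(a_t \mid u_{1:t}\right)
\sum_{t'=t+1}^{T} \varrho_{t'}\right]
$$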

gcroci2 commented 1 month ago

We've decided to implement the functionalities originally planned for the annubes package within the NeuroGym package instead, which represents the current state of the art.