In the code above, the bootstrapped value is used thus:

It's returned from the model:

```python
policy, value = model(torch.from_numpy(observation).float())
```

It's then stored for optimisation:

```python
predicted_values.append(value)
```

Then the action is taken in the environment.

It's then also used for bootstrapping:

```python
predicted_next_value = value.detach()
```
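For orientation, here is a minimal sketch of how a detached value like this typically feeds a one-step bootstrapped target during optimisation. All names and numbers below are illustrative assumptions, not the repository's actual code:

```python
import torch

# Stand-in tensors for a single transition (hypothetical values).
predicted_value = torch.tensor([0.7], requires_grad=True)        # V(s)
predicted_next_value = torch.tensor([0.9])                       # V(s'), detached
log_action_probability = torch.tensor([-0.3], requires_grad=True)
reward, done, gamma = 1.0, False, 0.99

# Bootstrapped target r + gamma * V(s'); the detach above stops gradients
# from flowing back through the bootstrap term.
target = reward + gamma * (0.0 if done else 1.0) * predicted_next_value
advantage = target - predicted_value
critic_loss = advantage.pow(2).mean()                       # value regression
actor_loss = -log_action_probability * advantage.detach()   # policy gradient
```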
The book code looks like this:
```python
def run_episode(worker_env, worker_model, N_steps=10):
    raw_state = np.array(worker_env.env.state)
    state = torch.from_numpy(raw_state).float()
    values, logprobs, rewards = [], [], []
    done = False
    j = 0
    G = torch.Tensor([0])  #A
    while (j < N_steps and done == False):  #B
        j += 1
        policy, value = worker_model(state)
        values.append(value)
        logits = policy.view(-1)
        action_dist = torch.distributions.Categorical(logits=logits)
        action = action_dist.sample()
        logprob_ = policy.view(-1)[action]
        logprobs.append(logprob_)
        state_, _, done, info = worker_env.step(action.detach().numpy())
        state = torch.from_numpy(state_).float()
        if done:
            reward = -10
            worker_env.reset()
        else:  #C
            reward = 1.0
            G = value.detach()
        rewards.append(reward)
    return values, logprobs, rewards, G
```
In the code above, the bootstrapped value is used thus:

It's returned from the model:

```python
policy, value = worker_model(state)
```

It's then stored for optimisation:

```python
values.append(value)
```

Then the action is taken in the environment.

It's then also used for bootstrapping:

```python
G = value.detach()
```
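To see where `G` goes from here, a minimal sketch of how a bootstrap value typically seeds the discounted-return calculation during optimisation. The function name and `gamma=0.95` are assumptions for illustration, not the book's exact code:

```python
import torch

def n_step_returns(rewards, G, gamma=0.95):
    """Accumulate discounted returns backwards, seeded by the bootstrap G."""
    returns = []
    ret = G
    for r in reversed(rewards):   # walk the rollout back to front
        ret = r + gamma * ret     # G_t = r_t + gamma * G_{t+1}
        returns.append(ret)
    returns.reverse()             # restore front-to-back order
    return torch.stack(returns).view(-1)
```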
So they're doing the same thing. The book states:
"If the episode finishes before N steps, the last return value will be set to 0 (since there is no next state when the game is over) as it was in the Monte Carlo case. However, if the episode hasn’t finished by N steps, we’ll use the last state value as our prediction for what the return would have been had we kept playing—that’s where the bootstrapping happens."
The alignment of the vectors seems good too: at each step we have the latest log action probability, predicted value, and entropy, along with the reward for taking the associated action. It's that reward that requires `step` to be called afterwards. Next time through the loop we will pick up the `policy` and `value` for the new state.
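A minimal sketch of the loop shape that keeps those lists aligned (illustrative names only, not code from either repository):

```python
import torch

# Illustrative rollout skeleton: one append per list per timestep, so
# index t always refers to the same (state, action, reward) triple.
values, logprobs, entropies, rewards = [], [], [], []
for t in range(5):
    value = torch.randn(1)                  # stand-in for V(s_t)
    dist = torch.distributions.Categorical(logits=torch.randn(2))
    action = dist.sample()
    values.append(value)                    # V(s_t)
    logprobs.append(dist.log_prob(action))  # log pi(a_t | s_t)
    entropies.append(dist.entropy())        # H(pi(. | s_t))
    rewards.append(1.0)                     # r_t, known only after stepping
assert len(values) == len(logprobs) == len(entropies) == len(rewards)
```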
So, in summary, I think the code here and the code in the book are in alignment, and are correct.
In https://github.com/alpine-chamois/actor-critic/blob/main/src/actorcritic/actor_critic_agent.py `train`, the `predicted_next_value` looks like it uses an out-of-date `value` from before the latest action was taken (i.e. a value that has already been taken into account in the values for optimisation). Suggest checking the DRL in Action code to see how they avoided this, and check that `log_action_probabilities`, `rewards`, `predicted_values`, and `entropies` are aligned.
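For reference, a minimal sketch of the pattern that concern describes (hypothetical names, not the repository's actual `train` code). The detached value is computed before `step`, so it estimates the pre-action state:

```python
import torch
import torch.nn as nn


class TwoHeadedModel(nn.Module):
    """Hypothetical actor-critic model returning (policy logits, value)."""

    def __init__(self, obs_dim=4, n_actions=2):
        super().__init__()
        self.policy_head = nn.Linear(obs_dim, n_actions)
        self.value_head = nn.Linear(obs_dim, 1)

    def forward(self, x):
        return self.policy_head(x), self.value_head(x)


model = TwoHeadedModel()
state = torch.randn(4)                    # s_t

policy, value = model(state)              # value estimates V(s_t)
predicted_next_value = value.detach()     # detached BEFORE the step...

next_state = torch.randn(4)               # ...then env.step(action) yields s_{t+1}

# The question raised: predicted_next_value still holds V(s_t), not V(s_{t+1}).
# As concluded above, the book's run_episode seeds G from the same pre-step
# value, so the two implementations match.
```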