In the code above, the bootstrapped value is used thus:

It's returned from the model:

```python
policy, value = model(torch.from_numpy(observation).float())
```

It's then stored for optimisation:

```python
predicted_values.append(value)
```

Then the action is taken in the environment.

It's then also used for bootstrapping:

```python
predicted_next_value = value.detach()
```
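For orientation, here is a minimal sketch of how a detached value like this typically feeds a one-step bootstrapped target during optimisation. All names and numbers below are illustrative assumptions, not the repository's actual code:

```python
import torch

# Stand-in tensors for a single transition (hypothetical values).
predicted_value = torch.tensor([0.7], requires_grad=True)        # V(s)
predicted_next_value = torch.tensor([0.9])                       # V(s'), detached
log_action_probability = torch.tensor([-0.3], requires_grad=True)
reward, done, gamma = 1.0, False, 0.99

# Bootstrapped target r + gamma * V(s'); the detach above stops gradients
# from flowing back through the bootstrap term.
target = reward + gamma * (0.0 if done else 1.0) * predicted_next_value
advantage = target - predicted_value
critic_loss = advantage.pow(2).mean()                       # value regression
actor_loss = -log_action_probability * advantage.detach()   # policy gradient
```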
The book code looks like this:
```python
def run_episode(worker_env, worker_model, N_steps=10):
    raw_state = np.array(worker_env.env.state)
    state = torch.from_numpy(raw_state).float()
    values, logprobs, rewards = [], [], []
    done = False
    j = 0
    G = torch.Tensor([0])  #A
    while (j < N_steps and done == False):  #B
        j += 1
        policy, value = worker_model(state)
        values.append(value)
        logits = policy.view(-1)
        action_dist = torch.distributions.Categorical(logits=logits)
        action = action_dist.sample()
        logprob_ = policy.view(-1)[action]
        logprobs.append(logprob_)
        state_, _, done, info = worker_env.step(action.detach().numpy())
        state = torch.from_numpy(state_).float()
        if done:
            reward = -10
            worker_env.reset()
        else:  #C
            reward = 1.0
            G = value.detach()
        rewards.append(reward)
    return values, logprobs, rewards, G
```
In the code above, the bootstrapped value is used thus:

It's returned from the model:

```python
policy, value = worker_model(state)
```

It's then stored for optimisation:

```python
values.append(value)
```

Then the action is taken in the environment.

It's then also used for bootstrapping:

```python
G = value.detach()
```
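To see where `G` goes from here, a minimal sketch of how a bootstrap value typically seeds the discounted-return calculation during optimisation. The function name and `gamma=0.95` are assumptions for illustration, not the book's exact code:

```python
import torch

def n_step_returns(rewards, G, gamma=0.95):
    """Accumulate discounted returns backwards, seeded by the bootstrap G."""
    returns = []
    ret = G
    for r in reversed(rewards):   # walk the rollout back to front
        ret = r + gamma * ret     # G_t = r_t + gamma * G_{t+1}
        returns.append(ret)
    returns.reverse()             # restore front-to-back order
    return torch.stack(returns).view(-1)
```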
So they're doing the same thing. The book states:
"If the episode finishes before N steps, the last return value will be set to 0 (since there is no next state when the game is over) as it was in the Monte Carlo case. However, if the episode hasn’t finished by N steps, we’ll use the last state value as our prediction for what the return would have been had we kept playing—that’s where the bootstrapping happens."
The alignment of the vectors seems good too: at each step we have the latest log action probability, predicted value, and entropy, along with the reward for taking the associated action. It's that reward that requires `step` to be called afterwards. Next time through the loop we will pick up the `policy` and `value` for the new state.
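A minimal sketch of the loop shape that keeps those lists aligned (illustrative names only, not code from either repository):

```python
import torch

# Illustrative rollout skeleton: one append per list per timestep, so
# index t always refers to the same (state, action, reward) triple.
values, logprobs, entropies, rewards = [], [], [], []
for t in range(5):
    value = torch.randn(1)                  # stand-in for V(s_t)
    dist = torch.distributions.Categorical(logits=torch.randn(2))
    action = dist.sample()
    values.append(value)                    # V(s_t)
    logprobs.append(dist.log_prob(action))  # log pi(a_t | s_t)
    entropies.append(dist.entropy())        # H(pi(. | s_t))
    rewards.append(1.0)                     # r_t, known only after stepping
assert len(values) == len(logprobs) == len(entropies) == len(rewards)
```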
So, in summary, I think the code here and the code in the book are in alignment, and are correct.
In https://github.com/alpine-chamois/actor-critic/blob/main/src/actorcritic/actor_critic_agent.py `train`, the `predicted_next_value` looks like it uses an out-of-date `value` from before the latest action was taken (i.e. a value that has already been taken into account in the values for optimisation). Suggest checking the DRL in Action code to see how they avoided this, and check that `log_action_probabilities`, `rewards`, `predicted_values`, and `entropies` are aligned.
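For reference, a minimal sketch of the pattern that concern describes (hypothetical names, not the repository's actual `train` code). The detached value is computed before `step`, so it estimates the pre-action state:

```python
import torch
import torch.nn as nn


class TwoHeadedModel(nn.Module):
    """Hypothetical actor-critic model returning (policy logits, value)."""

    def __init__(self, obs_dim=4, n_actions=2):
        super().__init__()
        self.policy_head = nn.Linear(obs_dim, n_actions)
        self.value_head = nn.Linear(obs_dim, 1)

    def forward(self, x):
        return self.policy_head(x), self.value_head(x)


model = TwoHeadedModel()
state = torch.randn(4)                    # s_t

policy, value = model(state)              # value estimates V(s_t)
predicted_next_value = value.detach()     # detached BEFORE the step...

next_state = torch.randn(4)               # ...then env.step(action) yields s_{t+1}

# The question raised: predicted_next_value still holds V(s_t), not V(s_{t+1}).
# As concluded above, the book's run_episode seeds G from the same pre-step
# value, so the two implementations match.
```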