Taking 'done' into consideration while calculating returns

bentrevett / pytorch-rl

Tutorials for reinforcement learning in PyTorch and Gym by implementing a few of the popular algorithms. [IN PROGRESS]

MIT License


    def calculate_returns(self, rewards, dones, normalize=True):
        returns = []
        R = 0
        for r, d in zip(reversed(rewards), reversed(dones)):
            if d:
                R = 0
            R = r + R * self.gamma
            returns.insert(0, R)
        returns = torch.tensor(returns).to(device)
        if normalize:
            returns = (returns - returns.mean()) / returns.std()
        return returns

Notebooks 1-7 all use Monte Carlo methods. That is, each environment is run for a single episode (until the environment returns done = True), and only then do we calculate the returns/advantages and update the policy parameters.
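As a rough illustration of that single-episode loop, here is a minimal sketch. `FixedLengthEnv`, `run_episode` and the constant policy are hypothetical stand-ins, not code from the notebooks, and the `step` method returns a simplified `(state, reward, done)` tuple rather than Gym's full return value.

```python
class FixedLengthEnv:
    """Hypothetical stand-in for a Gym environment: every episode lasts max_steps steps."""

    def __init__(self, max_steps=5):
        self.max_steps = max_steps
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        # Simplified signature (assumption): returns (state, reward, done) only.
        self.t += 1
        done = self.t >= self.max_steps
        return self.t, 1.0, done


def run_episode(env, policy):
    """Roll out exactly one episode and record the reward and done flag of each step."""
    state = env.reset()
    rewards, dones = [], []
    done = False
    while not done:
        action = policy(state)
        state, reward, done = env.step(action)
        rewards.append(reward)
        dones.append(done)
    return rewards, dones


rewards, dones = run_episode(FixedLengthEnv(max_steps=5), policy=lambda state: 0)
```

Because the loop stops as soon as `done` is true, the recorded `dones` list contains `True` only in its last position.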

Within a single episode there is no need to check done when calculating the returns/advantages: only the last step has done = True, and the backward pass starts at that step with R initialized to zero, so resetting R there changes nothing.
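To make this concrete, here is a small sketch in plain Python. The function `discounted_returns` is a hypothetical, simplified version of the snippet in the question (no tensors, no normalization), and the reward lists and `gamma=0.5` are made-up values chosen so the arithmetic is easy to follow. It suggests that the done check only matters once several episodes are concatenated into one buffer, which is an assumption about a different setup than the one used in notebooks 1-7.

```python
def discounted_returns(rewards, dones=None, gamma=0.99):
    """Discounted returns computed backwards; R is reset wherever a done flag is set."""
    if dones is None:
        # Ignore episode boundaries entirely.
        dones = [False] * len(rewards)
    returns = []
    R = 0.0
    for r, d in zip(reversed(rewards), reversed(dones)):
        if d:
            R = 0.0
        R = r + gamma * R
        returns.insert(0, R)
    return returns


# One episode: only the final step has done = True, so the check is redundant.
single_with = discounted_returns([1.0, 1.0, 1.0], [False, False, True], gamma=0.5)
single_without = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)

# Two episodes stored back to back: without the check, rewards leak across the boundary.
batch_with = discounted_returns([1.0, 1.0, 1.0, 1.0], [False, True, False, True], gamma=0.5)
batch_without = discounted_returns([1.0, 1.0, 1.0, 1.0], gamma=0.5)
```

For the single episode both calls give `[1.75, 1.5, 1.0]`. For the two concatenated episodes the results differ: `[1.5, 1.0, 1.5, 1.0]` with the check versus `[1.875, 1.75, 1.5, 1.0]` without it.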

I'll add an explanation of GAE when I get around to adding more detail to the notebooks. For now I'd recommend these two links:
