A simple and well-styled PPO implementation. Based on my Medium series: https://medium.com/@eyyu/coding-ppo-from-scratch-with-pytorch-part-1-4-613dfc1b14c8.
MIT License
How to fix: Broken with latest gym pip package #16
The `env.step` return values changed in recent gym releases, so this is the collection loop code to get it going now:
```python
# Number of timesteps run so far this batch
t = 0
while t < self.timesteps_per_batch:
    # Rewards this episode
    ep_rews = []

    obs = self.env.reset()
    if isinstance(obs, tuple):
        obs = obs[0]  # Assuming the first element of the tuple is the relevant data
    terminated = False

    for ep_t in range(self.max_timesteps_per_episode):
        # Increment timesteps ran this batch so far
        t += 1

        # Collect observation
        batch_obs.append(obs)

        action, log_prob = self.get_action(obs)
        obs, rew, terminated, truncated, _ = self.env.step(action)
        if isinstance(obs, tuple):
            obs = obs[0]  # Assuming the first element of the tuple is the relevant data

        # Collect reward, action, and log prob
        ep_rews.append(rew)
        batch_acts.append(action)
        batch_log_probs.append(log_prob)

        if terminated or truncated:
            break
```
Note that you now have to check both the `terminated` and `truncated` return values. The latest documentation is at https://www.gymlibrary.dev/api/core/. Without this fix, if you follow along with the blog post, it will fail at the end of Blog 3 at this step.
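As a quick sanity check, here is a minimal standalone rollout against the new API (just a sketch, assuming gym >= 0.26 and using `Pendulum-v1` purely as an example environment):

```python
import gym

env = gym.make("Pendulum-v1")

# reset() now returns an (observation, info) tuple
obs, info = env.reset()

terminated = truncated = False
while not (terminated or truncated):
    action = env.action_space.sample()
    # step() now returns five values instead of four
    obs, reward, terminated, truncated, info = env.step(action)

env.close()
```

`terminated` flags a true terminal state, while `truncated` flags cutoffs such as time limits, which is why the collection loop above breaks on either.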
Also, you need to update `Pendulum-v0` to `Pendulum-v1`.
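For example, wherever the environment is created (e.g. in main.py, if your copy matches the repo), only the id string changes; this is a sketch and the exact call site may differ:

```python
import gym

# Old id from the blog series; no longer registered in recent gym releases
# env = gym.make("Pendulum-v0")

# Current id for the same environment
env = gym.make("Pendulum-v1")
```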