araffin / learning-to-drive-in-5-minutes

Implementation of reinforcement learning approach to make a car learn to drive smoothly in minutes
https://towardsdatascience.com/learning-to-drive-smoothly-in-minutes-450a7cdb35f4
MIT License

logic of mb_rewards #35

Closed. eliork closed this issue 3 years ago.

eliork commented 3 years ago

https://github.com/araffin/learning-to-drive-in-5-minutes/blob/ccb27e66d593d6036fc1076dcec80f74a3f5e239/algos/custom_ppo2.py#L165

Hi, I think the logic of updating mb_rewards is wrong here. For example, say we run 1 environment for 2048 steps. Since mb_rewards.append(rewards) appends every step's rewards, we can end up with a list longer than 2048. There is a check that breaks the loop once len(mb_rewards) is greater than 2048, but by then mb_rewards is already longer than 2048. This results in an error on this line, because numpy can't reshape an array with more than 1*2048 elements into shape (1, 2048).
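A minimal sketch of what I mean (the sizes here are made up for illustration, this is not the repo's code):

import numpy as np

# Suppose 1 environment and n_steps = 2048, but the rollout loop appended a
# few extra per-step rewards before the length check finally broke out of it.
n_envs, n_steps = 1, 2048
true_reward = np.zeros(2051)  # len(mb_rewards) ended up > n_steps

try:
    true_reward.reshape((n_envs, n_steps))
except ValueError as err:
    # numpy cannot fit 2051 elements into shape (1, 2048)
    print(err)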

araffin commented 3 years ago

Hello,

This results in an error on this line, because numpy can't reshape an array with more than 1*2048 elements into shape (1, 2048).

this is only for logging, so not a real issue.

However, the GAE computation is probably wrong. This code was an attempt to have an episodic PPO (which can be applied on a robot), but it was only partially tested... Anyway, unless you want to learn end-to-end, I would recommend using AE + SAC, and if you want to use PPO, the Stable-Baselines3 version should be easier to tweak ;)
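Very roughly, the AE + SAC route looks like this with Stable-Baselines3 (just a sketch, Pendulum-v1 only stands in for an env whose observations are the encoder's latent vector):

import gym
from stable_baselines3 import SAC

# Placeholder continuous-action env; in practice the observation would be
# the autoencoder latent vector of the camera image.
env = gym.make("Pendulum-v1")

model = SAC("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=10_000)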

eliork commented 3 years ago

Hey, thanks for your answer. After reading the code again I understood that it's not a real issue, though it does create a runtime error from time to time and kills the training process. To avoid that, I slice the true_reward and masks lists to the size of self.n_steps, like this:

if writer is not None:
    self.episode_reward = total_episode_reward_logger(self.episode_reward,
         true_reward[:self.n_steps].reshape((self.n_envs, self.n_steps)),
         masks[:self.n_steps].reshape((self.n_envs, self.n_steps)),
         writer, n_timesteps)

I am trying to relax your approach from the continuous action space to the discrete action space, hence I am trying to use the PPO algorithm. I trained a VAE on my own set of images, with a z_size of 512. I don't understand why the GAE computation is wrong; it looks identical to the one here. Could you please elaborate a little bit? Thank you

araffin commented 3 years ago

After reading the code again I understood that it's not a real issue, though it does create a runtime error from time to time and kills the training process. To avoid that, I slice the true_reward and masks lists to the size of self.n_steps, like this

yes, that's one solution, or you can simply comment that line out.

I don't understand why the GAE computation is wrong; it looks identical to the one here

for step in reversed(range(self.n_steps)): should use the true number of steps (len(mb_rewards), for instance), and the rest of the code may need to be adapted accordingly. A quick solution would be to truncate mb_obs, mb_rewards, ... to n_steps.
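Something along these lines (a rough sketch of the idea for a single env, not a tested patch for custom_ppo2.py):

import numpy as np

def compute_gae(mb_rewards, mb_values, mb_dones, last_value, last_done,
                gamma=0.99, lam=0.95):
    # Backward GAE pass over however many steps were actually collected,
    # instead of a fixed self.n_steps.
    n_collected = len(mb_rewards)
    mb_advs = np.zeros(n_collected, dtype=np.float32)
    last_gae_lam = 0.0
    for step in reversed(range(n_collected)):
        if step == n_collected - 1:
            next_non_terminal = 1.0 - last_done
            next_value = last_value
        else:
            next_non_terminal = 1.0 - mb_dones[step + 1]
            next_value = mb_values[step + 1]
        delta = mb_rewards[step] + gamma * next_value * next_non_terminal - mb_values[step]
        mb_advs[step] = last_gae_lam = delta + gamma * lam * next_non_terminal * last_gae_lam
    # returns = advantages + values
    return mb_advs + mb_values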

I am trying to relax your approach from the continuous action space to the discrete action space, hence I am trying to use the PPO algorithm.

you should maybe take a look at DQN / QR-DQN then (cf Stable-Baselines3 / SB3 contrib).
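For example (sketch only, CartPole-v1 is just a placeholder for your discretized driving env):

import gym
from stable_baselines3 import DQN
from sb3_contrib import QRDQN

env = gym.make("CartPole-v1")  # placeholder discrete-action env

model = DQN("MlpPolicy", env, verbose=1)      # Stable-Baselines3
# model = QRDQN("MlpPolicy", env, verbose=1)  # SB3 contrib variant
model.learn(total_timesteps=10_000)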

eliork commented 3 years ago

(cf Stable-Baselines3 / SB3 contrib).

Thank you very much for your answers. Excuse me for being a little bit new here :-) but what does that mean? What is cf? Thanks

araffin commented 3 years ago

Excuse me for being a little bit new here :-) but what does that mean? What is cf?

sorry, I meant take a look at https://github.com/DLR-RM/stable-baselines3 (for DQN) and https://github.com/Stable-Baselines-Team/stable-baselines3-contrib (for QR-DQN). I should have written cf. (which stands for "confer", a Latin word used to refer to something for more information ;))

eliork commented 3 years ago

Thank you very much for your help! Keep up the good work!