araffin / learning-to-drive-in-5-minutes

Implementation of a reinforcement learning approach to make a car learn to drive smoothly in minutes
https://towardsdatascience.com/learning-to-drive-smoothly-in-minutes-450a7cdb35f4
MIT License
287 stars 87 forks

Why does SAC training stop after train_freq steps? [question] #6

Closed tleyden closed 5 years ago

tleyden commented 5 years ago

I noticed that in this code it resets the environment after hitting train_freq steps: https://github.com/araffin/learning-to-drive-in-5-minutes/blob/c46338cfbfd7b316b1992247c302783a8cb6d36a/algos/custom_sac.py#L122-L126

whereas in the baseline implementation, it does not:

https://github.com/hill-a/stable-baselines/blob/fddf169875154f6129071045f0a6f99614c490a5/stable_baselines/sac/sac.py#L416-L434

                if step % self.train_freq == 0:
                    mb_infos_vals = []
                    # Update policy, critics and target networks
                    for grad_step in range(self.gradient_steps):
                        if self.num_timesteps < self.batch_size or self.num_timesteps < self.learning_starts:
                            break
                        n_updates += 1
                        # Compute current learning_rate
                        frac = 1.0 - step / total_timesteps
                        current_lr = self.learning_rate(frac)
                        # Update policy and critics (q functions)
                        mb_infos_vals.append(self._train_step(step, writer, current_lr))
                        # Update target network
                        if (step + grad_step) % self.target_update_interval == 0:
                            # Update target network
                            self.sess.run(self.target_update_op)
                    # Log losses and entropy, useful for monitor training
                    if len(mb_infos_vals) > 0:
                        infos_values = np.mean(mb_infos_vals, axis=0)

I was surprised to see the environment reset during training on a track even though the car was doing well. It seemed to be caused by this code, since I noticed the "Additional training" log output line.

I'm curious, what is the reasoning behind the env.reset() here?

araffin commented 5 years ago

Hello,

This is a hack to keep training from time to time: since this custom SAC version only trains after each reset (i.e. at the end of an episode), it would otherwise not train at all until an episode ends. You can remove that call, or set a high `train_freq` so it does not happen.
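To make the trade-off concrete, here is a minimal sketch of the pattern being discussed: a trainer that only runs gradient updates at reset time, plus the forced reset every `train_freq` steps. The class and method names (`EpisodicTrainer`, `end_episode`, `step`) are illustrative, not the actual `custom_sac.py` API.

```python
class EpisodicTrainer:
    """Sketch: optimization happens only at episode end (reset time),
    so a long, successful episode would starve training entirely
    without a forced periodic reset."""

    def __init__(self, train_freq=3000):
        self.train_freq = train_freq
        self.steps_since_reset = 0
        self.updates = 0  # stand-in for gradient updates performed

    def end_episode(self):
        # In the custom SAC, this is where the replay-buffer
        # gradient steps actually run.
        self.updates += self.steps_since_reset
        self.steps_since_reset = 0

    def step(self):
        self.steps_since_reset += 1
        # The hack: force a reset (and thus a training phase) every
        # `train_freq` environment steps, even mid-episode.
        if self.steps_since_reset >= self.train_freq:
            self.end_episode()
```

With a large `train_freq` the forced reset effectively never triggers, which is why raising it (or removing the reset) disables the mid-episode interruption.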

tleyden commented 5 years ago

Makes sense, thanks!