IntelLabs / coach

Reinforcement Learning Coach by Intel AI Lab enables easy experimentation with state of the art Reinforcement Learning algorithms
Apache License 2.0

PolicyOptimization Agents do not log signals to csv #448

Open crzdg opened 4 years ago

crzdg commented 4 years ago

I encountered some strange behavior.

For ClippedPPO, PPO and ActorCritic I was not able to get the signals defined in their init methods logged: loss, gradients, likelihood, KL divergence, etc.

I'm not sure whether it is an issue in my Environment implementation, but DQN does log its signals. I also checked the signals dumped by update_log. For the mentioned agents, self.episode_signals contains duplicate entries for the signals that are not logged: the signals are defined in several classes along the agent's inheritance chain, and each definition is saved to self.episode_signals again. Obviously only the most recently created one will be updated with values in the train method.

It could also be due to the signals being reset before every episode. As gradients are only available after training, they might get reset when a new episode starts after the last training iteration.
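
The duplicate entries described above could arise as in the following minimal sketch. It is a toy model, not Coach code: ToySignal, ToyBaseAgent and ToyPolicyAgent are hypothetical names, and the assumption is simply that every class in the inheritance chain registers a signal with the same name in its init method.

```python
class ToySignal:
    """Hypothetical stand-in for a logged signal."""

    def __init__(self, name):
        self.name = name
        self.values = []

    def add_sample(self, value):
        self.values.append(value)


class ToyBaseAgent:
    def __init__(self):
        self.episode_signals = []
        self.loss = self.register_signal("Loss")

    def register_signal(self, name):
        # every call appends a new object, even if the name already exists
        signal = ToySignal(name)
        self.episode_signals.append(signal)
        return signal

    def train(self):
        # only the object currently bound to self.loss receives values
        self.loss.add_sample(0.42)


class ToyPolicyAgent(ToyBaseAgent):
    def __init__(self):
        super().__init__()
        # registers "Loss" a second time and rebinds self.loss
        self.loss = self.register_signal("Loss")


agent = ToyPolicyAgent()
agent.train()
summary = [(s.name, s.values) for s in agent.episode_signals]
print(summary)  # [('Loss', []), ('Loss', [0.42])]
```

In this toy model the first "Loss" entry stays empty forever, which matches the observation that only the latest created signal is updated.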

However, I do have experiments with ClippedPPO where those signals were logged, but I can't recreate this.

Any suggestions?

crzdg commented 4 years ago

I found the cause.

Consider the following setup:

ClippedPPOAgent with num_consecutive_playing_steps = EnvironmentEpisodes(15)

The CSV dumper is set to dump_signals_to_csv_every_x_episodes = 5

After 15 episodes, before training starts, the CSV is dumped because 15 % 5 = 0. The last 5 episodes are written, including episode 15, which has no training values yet (loss, gradients, etc.).

Then training happens and the training values are generated. They are saved to the row for episode 15 in the logger's pandas DataFrame. As this row has already been dumped, it will never be written to the CSV.

I assume this was caused by #113

I updated the code as follows, simply adding a decrement of the logger's last_line_idx_written_to_csv.
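
The ordering problem can be reproduced with a small simulation. This is a sketch under assumptions, not the Coach logger: ToyLogger and run are hypothetical, the "CSV" is a plain list, and the only behavior modeled is an append-only dump that tracks the index of the last written row.

```python
class ToyLogger:
    """Hypothetical stand-in for an episode logger with an append-only CSV."""

    def __init__(self):
        self.rows = []       # in-memory table, one dict per episode
        self.csv_lines = []  # rows appended to the "CSV" so far
        self.last_line_idx_written_to_csv = 0

    def start_episode(self, episode):
        self.rows.append({"episode": episode, "loss": None})

    def set_signal(self, name, value):
        # training signals land in the most recent episode row
        self.rows[-1][name] = value

    def dump(self):
        # append only the rows that have not been written yet
        for row in self.rows[self.last_line_idx_written_to_csv:]:
            self.csv_lines.append(dict(row))
        self.last_line_idx_written_to_csv = len(self.rows)


def run(num_episodes=15, dump_every=5, train_every=15):
    logger = ToyLogger()
    for episode in range(1, num_episodes + 1):
        logger.start_episode(episode)
        if episode % dump_every == 0:
            logger.dump()                    # the dump happens first
        if episode % train_every == 0:
            logger.set_signal("loss", 0.42)  # training values arrive afterwards
    return logger


logger = run()
in_memory = logger.rows[-1]["loss"]
on_disk = [r for r in logger.csv_lines if r["episode"] == 15][0]["loss"]
print(in_memory, on_disk)  # 0.42 None
```

The value exists in memory but the row for episode 15 was already written without it, and a later dump starts after that row.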

    def train(self):
        if self._should_train():
            for network in self.networks.values():
                ...  # loop body not shown

            dataset = self.memory.transitions
            update_internal_state = self.ap.algorithm.update_pre_network_filters_state_on_train
            dataset = self.pre_network_filter.filter(dataset, deep_copy=False,
                                                     update_internal_state=update_internal_state)
            batch = Batch(dataset)

            for training_step in range(self.ap.algorithm.num_consecutive_training_steps):

                # take only the requested number of steps
                if isinstance(self.ap.algorithm.num_consecutive_playing_steps, EnvironmentSteps):
                    dataset = dataset[:self.ap.algorithm.num_consecutive_playing_steps.num_steps]
                batch = Batch(dataset)

                self.train_network(batch, self.ap.algorithm.optimization_epochs)

            for network in self.networks.values():
                ...  # loop body not shown

            self.training_iteration += 1
            # should be done in order to update the data that has been accumulated * while not playing *
            # added: step the CSV write index back by one row so the row that just
            # received the training values is written again on the next dump
            self.agent_logger.last_line_idx_written_to_csv -= 1
            return None
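
The effect of the decrement can be checked on a toy model. This is a sketch only: ToyLogger is a hypothetical stand-in whose dump appends every row from last_line_idx_written_to_csv onward, and whether the real logger behaves the same way is an assumption.

```python
class ToyLogger:
    """Hypothetical append-only logger; not the Coach implementation."""

    def __init__(self):
        self.rows = []
        self.csv_lines = []
        self.last_line_idx_written_to_csv = 0

    def dump(self):
        # append every row that has not been written yet
        self.csv_lines.extend(
            dict(row) for row in self.rows[self.last_line_idx_written_to_csv:]
        )
        self.last_line_idx_written_to_csv = len(self.rows)


logger = ToyLogger()
logger.rows = [{"episode": e, "loss": None} for e in range(1, 16)]
logger.dump()                             # episode 15 is written without a loss

logger.rows[-1]["loss"] = 0.42            # training fills in the value afterwards
logger.last_line_idx_written_to_csv -= 1  # the workaround from the post

logger.rows.append({"episode": 16, "loss": None})
logger.dump()                             # next dump rewrites episode 15

episode_15 = [row["loss"] for row in logger.csv_lines if row["episode"] == 15]
print(episode_15)  # [None, 0.42]
```

In this toy model the trained values do reach the file, but the earlier stale row for episode 15 stays in place, so anything reading the CSV would need to keep the last occurrence per episode.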