IntelLabs / coach

Reinforcement Learning Coach by Intel AI Lab enables easy experimentation with state of the art Reinforcement Learning algorithms
https://intellabs.github.io/coach/
Apache License 2.0

[Question] Heatup + Exploration Policy #243

Closed redknightlois closed 5 years ago

redknightlois commented 5 years ago

I wrote my own exploration policy which performs some guided exploration. However, after a checkpoint restore, when I run heatup the behavior doesn't look like the output of either the restored policy or the guided exploration I wrote. In fact, it looks awkwardly random.

For reference, in case I did something wrong, this is the exploration policy code:

from typing import List

import numpy as np

from rl_coach.core_types import ActionType, RunPhase
from rl_coach.exploration_policies.greedy import Greedy
from rl_coach.schedules import Schedule
from rl_coach.spaces import ActionSpace


class ExpertGuided(Greedy):

    def __init__(self, action_space: ActionSpace, epsilon_schedule: Schedule, preheat: int):
        """
        :param action_space: the action space used by the environment
        :param epsilon_schedule: a schedule for the epsilon values
        :param preheat: number of initial episodes during which the expert is always followed
        """
        self.epsilon_schedule = epsilon_schedule
        self.environment = None
        self.run_whole = False
        self.preheat = preheat
        self.iteration = 0
        super().__init__(action_space)

    # BaseEnvironment comes from elsewhere in this project
    def set_environment(self, environment: BaseEnvironment):
        self.environment = environment

    def reset(self):
        # at the start of each episode, decide whether the whole episode should follow the expert
        rnd = np.random.uniform()
        if rnd < self.epsilon_schedule.current_value or self.iteration < self.preheat:
            self.run_whole = True
        else:
            self.run_whole = False
        self.iteration += 1

    def get_action(self, action_values: List[ActionType]) -> ActionType:

        # action_values is None when the exploration policy is about to select a random action
        if action_values is not None and self.environment is not None:
            if self.requires_action_values():
                if self.phase == RunPhase.TRAIN:
                    if self.run_whole:
                        return int(self.environment.expert_action)
                elif self.phase == RunPhase.HEATUP:
                    return int(self.environment.expert_action)

        return super().get_action(action_values)
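In case it is relevant, here is a minimal sketch of how a policy like this is typically wired into the agent, following the parameters-class pattern that Coach's built-in exploration policies use; the module path, class name, and default values below are assumptions for illustration, not part of my actual setup:

from rl_coach.exploration_policies.exploration_policy import ExplorationParameters


class ExpertGuidedParameters(ExplorationParameters):
    def __init__(self):
        super().__init__()
        # attributes here typically correspond to the policy's constructor arguments
        self.epsilon_schedule = None
        self.preheat = 0

    @property
    def path(self):
        # 'module_path:ClassName' of the custom exploration policy (hypothetical module name)
        return 'expert_guided:ExpertGuided'

# in the preset / agent setup:
# agent_params.exploration = ExpertGuidedParameters()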

And the startup code looks like:

    if args.restore or args.play:
        graph_manager.task_parameters.checkpoint_restore_dir = os.path.join(args.checkpoint, config.algorithm, 'checkpoint')
        graph_manager.restore_checkpoint()

    if not args.play:

        graph_manager.improve_steps.num_steps = num_steps
        iterations = int(training_config.timesteps / num_steps)

        graph_manager.heatup(EnvironmentSteps(1000 * env.max_steps))

        for i in range(0, iterations):
            printout(f'Executing {i+1}/{iterations} iterations')
            graph_manager.improve()
            graph_manager.save_checkpoint()

Any ideas?

gal-leibovich commented 5 years ago

If what you are looking at after the restore are the actions selected during the heatup phase, then that is likely the source of the issue. By default, heatup does not run the agent's choose_action, and therefore never reaches the exploration policy code; it just selects random actions. To make heatup use the agent's decisions, set the flag AgentParameters.algorithm.heatup_using_network_decisions.
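For reference, a minimal sketch of setting that flag in a preset; DQNAgentParameters is just an example here, any agent's parameters object exposes the same algorithm section:

from rl_coach.agents.dqn_agent import DQNAgentParameters

agent_params = DQNAgentParameters()

# run heatup through the agent's choose_action (and therefore the exploration policy)
# instead of sampling purely random actions
agent_params.algorithm.heatup_using_network_decisions = True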

redknightlois commented 5 years ago

That works.