astooke / rlpyt

Reinforcement Learning in PyTorch

Handling Early Resets in Procgen Envs #169

Open jakegrigsby opened 4 years ago

jakegrigsby commented 4 years ago

I've been trying to set up a multiworker Rainbow DQN baseline for procgen, similar to what's described in Leveraging Procedural Generation for Benchmarking Reinforcement Learning. This is roughly how I'm handling the setup (based on example 7):

import gym
import numpy as np


def make_env(game, num_levels, distribution_mode):

    class RlpytProcgenWrapper(gym.Wrapper):
        """
        Handle issues with procgen seeding and image axis order.
        """

        def step(self, *args):
            # Transpose observations from HWC to CHW for rlpyt/PyTorch.
            o, r, d, i = self.env.step(*args)
            return np.transpose(o, (2, 0, 1)), r, d, i

        def reset(self):
            return np.transpose(self.env.reset(), (2, 0, 1))

        def seed(self, seed):
            # Procgen handles seeding via num_levels/start_level, so ignore rlpyt's seed calls.
            return

    env = gym.make(f"procgen:procgen-{game}-v0", num_levels=num_levels, distribution_mode=distribution_mode)
    env = RlpytProcgenWrapper(env)
    return env

# rlpyt imports (added for completeness; module paths per the rlpyt package layout):
from rlpyt.utils.launching.affinity import make_affinity
from rlpyt.samplers.parallel.gpu.sampler import GpuSampler
from rlpyt.samplers.parallel.gpu.collectors import GpuResetCollector
from rlpyt.samplers.collections import TrajInfo
from rlpyt.envs.gym import GymEnvWrapper
from rlpyt.algos.dqn.cat_dqn import CategoricalDQN
from rlpyt.agents.dqn.catdqn_agent import CatDqnAgent
from rlpyt.models.dqn.atari_catdqn_model import AtariCatDqnModel
from rlpyt.runners.sync_rl import SyncRlEval
from rlpyt.utils.logging.context import logger_context

args = parse_args()  # the script's own argparse helper (definition omitted here)

affinity = make_affinity(
    run_slot=args.run_slot,
    n_gpu=args.n_gpu,
    n_cpu_core=args.n_cpu_core,
    gpu_per_run=args.gpu_per_run,
)

sampler = GpuSampler(
    EnvCls=GymEnvWrapper,
    env_kwargs=dict(env=make_env(args.game, args.num_levels, args.distribution_mode)),
    TrajInfoCls=TrajInfo,
    batch_T=args.batch_T,
    batch_B=args.batch_B,
    CollectorCls=GpuResetCollector,
    max_decorrelation_steps=args.max_decorrelation_steps,
    eval_n_envs=10,
    eval_env_kwargs=dict(env=make_env(args.game, args.eval_num_levels, args.eval_distribution_mode)),
    eval_max_steps=int(10e5),
    eval_max_trajectories=20,
)

algo = CategoricalDQN(...)

num_actions = make_env(args.game, args.num_levels, args.distribution_mode).action_space.n

agent = CatDqnAgent(
    n_atoms=args.n_atoms,
    eps_final=args.eps_final,
    ModelCls=AtariCatDqnModel,
    model_kwargs={'image_shape': (3, 64, 64), 'output_size': num_actions, 'dueling': True},
)

runner = SyncRlEval(
    algo=algo,
    agent=agent,
    sampler=sampler,
    n_steps=args.n_steps,
    log_interval_steps=1e4,
    affinity=affinity,
)

config = vars(args)
name = f"rainbow_{args.game}_{args.distribution_mode}"
log_dir = f"{args.game}"
with logger_context(log_dir, args.run_ID, name, config, snapshot_mode="last",
                    override_prefix=True, use_summary_writer=True):
    runner.train()

Everything seems to run fine, but I get a lot of "Warning: Early Reset Ignored" messages from the procgen env, since procgen doesn't normally allow resets before a trajectory is finished. What is the best way to handle that with rlpyt? I've tried different samplers and the Gpu/CpuWaitResetCollector, but no luck.

astooke commented 4 years ago

OK interesting...the environment should only be reset when the done signal comes out True. Does this happen for a procgen env before the trajectory is finished?

Possibly related: the Atari env has some logic for episodic lives, where for RL purposes we want to consider the episode done, but we don't reset the environment. First, the AtariEnv can output done=True while also setting env_info(traj_done=False): https://github.com/astooke/rlpyt/blob/85d4e018a919118c6e42fac3e897aa346d84b9c5/rlpyt/envs/atari/atari_env.py#L127-L129
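
For illustration, the same pattern outside of AtariEnv might look roughly like this (a hypothetical gym wrapper, not part of rlpyt or procgen; whether done and traj_done should ever differ depends on the env):

import gym

class EpisodicSegmentWrapper(gym.Wrapper):
    """Marks an RL episode boundary without forcing an environment reset."""

    def step(self, action):
        o, r, d, info = self.env.step(action)
        # rlpyt's collectors look for "traj_done" in env_info; the env is only
        # reset when it is True (it defaults to the value of done).
        info["traj_done"] = d and self._trajectory_really_finished(info)
        return o, r, d, info

    def _trajectory_really_finished(self, info):
        # Placeholder for env-specific logic (e.g. lives == 0 in Atari).
        return True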

Second, inside the collector, the environment does not get reset if done=True but env_info["traj_done"] is present and False (traj_done defaults to the value of done if it is not found). If done=True, the agent still gets reset regardless, in case it is recurrent, since its state is tied to RL episodes: https://github.com/astooke/rlpyt/blob/85d4e018a919118c6e42fac3e897aa346d84b9c5/rlpyt/samplers/parallel/cpu/collectors.py#L45-L50
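
Roughly, the linked collector logic amounts to the following (paraphrased, not a verbatim copy):

if getattr(env_info, "traj_done", d):   # traj_done defaults to done
    self.traj_infos[b] = self.TrajInfoCls()
    o = env.reset()                     # environment only reset on traj_done
if d:
    self.agent.reset_one(idx=b)         # agent state reset on every done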

Of course, you can change this logic any way needed for procgen.
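
For example, one untested workaround at the wrapper level (a sketch, not something from rlpyt or procgen; the names here are made up) would be to only forward reset() when the previous step actually ended an episode, so procgen never sees an early reset:

import gym

class SuppressEarlyResetWrapper(gym.Wrapper):
    """Only forward reset() when an episode has actually ended."""

    def __init__(self, env):
        super().__init__(env)
        self._needs_reset = True
        self._last_obs = None

    def reset(self, **kwargs):
        if self._needs_reset:
            self._last_obs = self.env.reset(**kwargs)
            self._needs_reset = False
        # Otherwise hand back the most recent observation instead of forcing
        # a reset that procgen would ignore (and warn about) anyway.
        return self._last_obs

    def step(self, action):
        o, r, d, info = self.env.step(action)
        self._last_obs = o
        self._needs_reset = d
        return o, r, d, info

Whether something like that is appropriate depends on where the early resets are actually coming from (e.g. decorrelation steps or eval env resets), which gets at the question below.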

Let us know if any of this helps, and where the early resets are coming from.