BayesWatch / RL-GATE


Implement an evaluator that can unroll a model into an RL environment and get metrics #3

Open AntreasAntoniou opened 4 months ago

AntreasAntoniou commented 4 months ago

Write an evaluator that receives an environment, a model, and a seed, then unrolls the model in the environment and collects rewards and other metrics.

AntreasAntoniou commented 4 months ago

Waiting for a template to be posted by either @AdamJelley or @trevormcinroe so I can get started.

AdamJelley commented 4 months ago

Hi @AntreasAntoniou, here's a simple eval function that you can use as a template:

import gym
import numpy as np
import torch

# `Actor` is the policy network (maps state -> action); its `act` method is
# assumed to handle moving the state to the actor's device and returning a
# np.ndarray action (see the device discussion below).

@torch.no_grad()
def eval_actor(
    env: gym.Env, actor: Actor, device: str, n_episodes: int, seed: int
) -> np.ndarray:
    env.seed(seed)
    actor.eval()  # disable dropout/batchnorm updates during evaluation
    episode_rewards = []
    for _ in range(n_episodes):
        state, done = env.reset(), False
        episode_reward = 0.0
        while not done:
            action = actor.act(state, device)
            state, reward, done, _ = env.step(action)
            episode_reward += reward
        episode_rewards.append(episode_reward)

    actor.train()  # restore training mode before returning
    return np.array(episode_rewards)

It takes in an env and an actor (network that maps state->action). Hopefully pretty straightforward (not much has changed since 2019 here!).
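For reference, calling it might look something like this (the environment name and numbers below are placeholders, not choices from this repo, and `actor` is assumed to be an already-constructed Actor):

import gym

env = gym.make("Hopper-v3")  # placeholder task; any gym env works
rewards = eval_actor(env, actor, device="cuda", n_episodes=10, seed=0)
print(f"return: {rewards.mean():.1f} +/- {rewards.std():.1f}")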

The interesting part is probably device usage. The env normally lives on the CPU and expects a np.ndarray action, but the actor is normally on the GPU (unless you have a fancy jax/gpu env setup). Therefore you can either: 1) move the actor to the CPU at the beginning and do everything on the CPU, or 2) move the state to the GPU, run the actor's forward pass on the GPU, and move the action back to the CPU at each timestep. For the small networks often used in RL, 1) tends to be faster, but for the larger networks we'll likely be using we may need to do 2), as is done above (the act function is a wrapper around the forward pass that handles the transfer of the state and action to and from the GPU respectively). Hope that helps!
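As a rough sketch of what that act wrapper could look like under option 2 (the network body here is made up for illustration and may not match the repo's actual Actor; only the state/action device handling is the point):

import numpy as np
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Hypothetical deterministic policy network; the real Actor may differ."""

    def __init__(self, state_dim: int, action_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(), nn.Linear(256, action_dim)
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

    def act(self, state: np.ndarray, device: str) -> np.ndarray:
        # Move the state to the actor's device, run the forward pass there,
        # then bring the resulting action back to the CPU as a np.ndarray
        # so the (CPU-side) env can consume it.
        state_t = torch.as_tensor(state, dtype=torch.float32, device=device).unsqueeze(0)
        action = self(state_t)
        return action.squeeze(0).cpu().numpy()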

trevormcinroe commented 4 months ago

Small nitpick @AdamJelley @AntreasAntoniou -- it might be good to pass seeds: Sequence[int] and then randomly select a starting seed at the top of the eval loop. Perhaps this could allow for a more robust eval of the agent. Maybe like:

for _ in range(n_episodes):
    state, done = env.reset(seed=np.random.choice(seeds)), False

AFAIK, all envs can take a seed in .reset(), but not all actually use it. If that is the case here, then perhaps the random seed choice would go something like:

for _ in range(n_episodes):
    env.seed(np.random.choice(seeds))
    state, done = env.reset(), False
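Putting that together, one possible sketch of the signature change (using the env.seed() fallback from the second snippet; everything else is kept as in Adam's template, so treat this as an assumption rather than the final implementation):

from typing import Sequence

import gym
import numpy as np
import torch

@torch.no_grad()
def eval_actor(
    env: gym.Env, actor: Actor, device: str, n_episodes: int, seeds: Sequence[int]
) -> np.ndarray:
    actor.eval()
    episode_rewards = []
    for _ in range(n_episodes):
        # Re-seed before every episode so each rollout starts from a
        # randomly chosen (but reproducible) initial condition.
        env.seed(int(np.random.choice(seeds)))
        state, done = env.reset(), False
        episode_reward = 0.0
        while not done:
            action = actor.act(state, device)
            state, reward, done, _ = env.step(action)
            episode_reward += reward
        episode_rewards.append(episode_reward)
    actor.train()
    return np.array(episode_rewards)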