Grid2op / grid2op

Grid2Op, a testbed platform to model sequential decision making in power systems.
https://grid2op.readthedocs.io/
Mozilla Public License 2.0

Reproducibility in agent evaluation without using Runner #413

Closed: AvisP closed this issue 1 year ago

AvisP commented 1 year ago

Environment

Bug description

I am not getting the same results when evaluating an agent on an environment, even after fixing the initial environment seed value. However, the Runner class is able to generate the same results across multiple executions. I have provided some code snippets using the DoNothingAgent.

How to reproduce

Code snippet

from grid2op.Reward import LinesCapacityReward  # or any other rewards
from lightsim2grid import LightSimBackend  # highly recommended for training !
import grid2op
from grid2op.Agent import DoNothingAgent
from grid2op.Runner import Runner

env_name = "l2rpn_case14_sandbox"
env = grid2op.make(env_name,
                   reward_class=LinesCapacityReward,
                   backend=LightSimBackend()
                  )

agent = DoNothingAgent(env.action_space)

env.seed(0)
agent.seed(0)

obs = env.reset()
agent.reset(obs)

## Multiple executions of the block below produce different reward values and time steps
reward = float(env.reward_range[0])
sum_reward = reward
done = False
time_step = 0

while not done:
    act = agent.act(obs, reward, done)
    obs, reward, done, info = env.step(act)  
    sum_reward += reward
    time_step += 1

print("Reward : ", sum_reward, " Time Steps : ", time_step)

## However, the output of the Runner class is consistent across multiple executions
runner_params = env.get_params_for_runner()
runner = Runner(**runner_params)   # if nothing is specified, agentClass defaults to DoNothingAgent and backendClass to PandaPowerBackend
res = runner.run(nb_episode=1,
                 nb_process=1,
                 add_detailed_output=True,
                 path_save="./tmp",
                 )

# Print summary
print("Evaluation summary for DN:")
for _, chron_name, cum_reward, nb_time_step, max_ts, episode_data in res:
    msg_tmp = "chronics at: {}".format(chron_name)
    msg_tmp += "\ttotal score: {:.6f}".format(cum_reward)
    msg_tmp += "\ttime steps: {:.0f}/{:.0f}".format(nb_time_step, max_ts)
    print(msg_tmp)

Any suggestions on why this is happening would be appreciated.

BDonnot commented 1 year ago

Hi,

The Runner class has been made specifically to solve this kind of issue. So my first answer would be: "if it works with the runner, try to use it as much as you can and you won't have any issues regarding this kind of reproducibility". For evaluation purposes it's really the best thing to do and it can handle all types of agents.

Then I would also make sure to set the ID of the "chronics" (ie the time series of generation and loads) you want to use before calling env.reset(), with env.set_id(0) for example (or env.set_id(42); see the documentation for more information).

Finally, I would read more about the gym API and the concept of "episode" in RL. Each time you call env.reset(...) you go to the next "episode", which is not the same as the previous one (it follows the same dynamics but it's not the same realization of the POMDP, to be precise).

A simpler example is an environment where you are asked to land a robot on the moon (see the LunarLander-v2 doc): the gravity and the forces applied to the robot when you take an action do not change between episodes (that's the dynamics of the MDP), but at each new episode the robot starts at a different position and the flags are moved too. This is why you do not get the same reward in this environment if you follow exactly the same policy (in your case, "do nothing" all the time). It's exactly the same in grid2op.
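
To make this concrete, here is a minimal sketch of the pattern described above (it reuses the env and agent from your snippet; this is only an illustration, the Runner remains the recommended workflow for evaluation):

env.seed(0)        # fix the randomness used by the environment
env.set_id(0)      # always replay the same chronic (time series)
agent.seed(0)
obs = env.reset()  # the episode starting here is now reproducible
agent.reset(obs)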

AvisP commented 1 year ago

Hi Benjamin,

Thank you for taking the time to provide a detailed explanation. I understand it better now; I was not aware of how to set the chronic ID through the set_id() function.

But I did find a weird behavior while setting the seed in a Jupyter notebook. Essentially, if I set the seed, then reset and print the observation in different cells, the environment for some reason doesn't start from the same initial point. I can explain it with the help of the Pendulum-v1 environment. In the attached screenshot, the seed is defined in cell 3, but when I am resetting in cells 4 and 5, the values are different; when I define the seed and reset together in cells 6 and 7, the values are the same. I do not know the possible reason for this weird behavior. My gym version is 0.21 and I can't upgrade it, as the latest stable-baselines3 needs this version.

So I discovered that when I evaluate the performance of an agent in the same cell with the following code, I am able to reproduce the same results.

env.seed(0)
env.set_id(42)
agent.seed(0)
obs = env.reset()
agent.reset(obs)
print(obs.gen_p)

reward = float(env.reward_range[0])
sum_reward = reward
done = False
time_step = 0

while not done:
    act = agent.act(obs, reward, done)
    obs, reward, done, info = env.step(act)  
    sum_reward += reward
    time_step += 1

print("total reward ", sum_reward)

However, there is one more issue that I ran into and need your help with (even if it's a quick fix). When I try to convert the grid2op environment into a gym one and then set the seed, I get an error.

Code:

import os
import grid2op
import numpy as np
from lightsim2grid import LightSimBackend  # highly recommended for training !
from grid2op.gym_compat import GymEnv

env_name = "l2rpn_case14_sandbox"
env = grid2op.make(env_name,
                   backend=LightSimBackend())

env_gym = GymEnv(env)
env_gym.observation_space.close()

env_gym.seed(1234)

The last line gives an error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In [4], line 1
----> 1 env_gym.seed(1234)

File ~/.................../L2PRN/lib/python3.10/site-packages/grid2op/gym_compat/gymenv.py:136, in GymEnv._aux_seed(self, seed)
    132 def _aux_seed(self, seed=None):
    133     # deprecated in gym >=0.26
    134     if seed is not None:
    135         # seed the gym env
--> 136         super().reset(seed=seed)
    137         # then seed the underlying grid2op env
    138         max_ = np.iinfo(dt_int).max 

TypeError: Env.reset() got an unexpected keyword argument 'seed'

Also, I noticed that env_gym doesn't have a set_id() function.

BDonnot commented 1 year ago

Hello

Here you can see that seed is defined in cell 3, but when I am resetting it in cells 4 and 5, then the values are different

This is expected. As I told you, each time you reset an environment, you generate a new instance of your MDP. As this is random, it's... not the same.

If you want to always get "the same instance of the environment", use the env.seed(...) function before each call to env.reset(...); this is exactly what it is for.

Imagine that env.reset() generates random numbers (it's basically what it does). If you call:

env.seed(...)
env.reset()

you translate it to:

random.seed(...)
random.random()  # a fixed number, say a
random.random()  # a different number, fixed too, say b
random.random()  # a different number, fixed too, say c

In this case you expect a, b and c to be different. But if you do again:

random.seed(...)
random.random()  # same as above: a
random.random()  # same as above: b
random.random()  # same as above: c

So if you want to have always "a" then you need to do:

random.seed(...)
random.random()  # will be a

random.seed(...)
random.random()  # will be a

random.seed(...)
random.random()  # will be a

# etc.

This is a perfectly normal and expected behaviour. All stochastic environments behave the same way.
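
Translated back into grid2op terms, a minimal sketch of that idea (reusing the env from your first snippet) would be:

for _ in range(3):
    env.seed(0)        # re-seed before every reset...
    env.set_id(0)      # ...and pin the chronic (time series) to replay
    obs = env.reset()  # every iteration now starts from the exact same state
    print(obs.gen_p)   # identical printout each time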

BDonnot commented 1 year ago

Also I noticed the env_gym doesn't have a set_id() function

This is because the gym env follows the API of OpenAI gym.

And the "set_id" method is not part of this API.

I'm not sure what you want to do exactly, but the "set_id" method can probably be called with gym_env.init_env.set_id(...). Or even better, if you want to always evaluate on the same scenarios, use env.train_val_split(...) (see the grid2op doc) to "extract" a training, a validation and a test environment, with some time series different from one another, and you will not have to worry about that. You can even, with this method, create an environment with only one "chronic" / "scenario" / "time series" and tadaaa, you will always use that one.
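
For instance, a minimal sketch of reaching set_id through the wrapped environment (assuming the env_gym built with GymEnv(env) from your snippet):

env_gym.init_env.set_id(0)  # the underlying grid2op env is exposed as init_env
obs = env_gym.reset()       # this reset will use the chronic selected above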

Or even better, the recommended solution: use the runner and "convert" your gym agent to a grid2op agent thanks to the action_space of your gym_env:

from grid2op.Agent import BaseAgent

class Grid2opAgentFromGymAgent(BaseAgent):
    """Wrap an agent acting on the gym side so it can be used as a regular grid2op agent."""
    def __init__(self, grid2op_action_space, gym_env, gym_agent):
        super().__init__(grid2op_action_space)
        self.gym_env = gym_env
        self.gym_agent = gym_agent

    def act(self, observation, reward, done=False):
        # convert the grid2op observation to its gym representation
        gym_obs = self.gym_env.observation_space.to_gym(observation)
        # let the gym-side agent choose an action
        gym_act = self.gym_agent.act(gym_obs)
        # convert the gym action back into a grid2op action
        return self.gym_env.action_space.from_gym(gym_act)

and use the runner with this agent.
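
For example, a minimal usage sketch (assuming the env and env_gym from your earlier snippet and some trained gym_agent; passing agentInstance together with agentClass=None is how you give an already-built agent to the Runner):

from grid2op.Runner import Runner

g2op_agent = Grid2opAgentFromGymAgent(env.action_space, env_gym, gym_agent)
runner = Runner(**env.get_params_for_runner(),
                agentClass=None,          # disable the default DoNothingAgent
                agentInstance=g2op_agent)
res = runner.run(nb_episode=1, nb_process=1)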

AvisP commented 1 year ago

Hi Benjamin,

Thanks for explaining in detail the behavior of stochastic environments and also showing a demo example of how to derive a custom grid2op agent from BaseAgent.

I was trying to see how agent training works with new algorithms (e.g. SAC) using the stable-baselines3 package. I followed the example of PPO_SB3 in the l2rpn_baselines package and came up with one for SAC_SB3. During training I wanted to evaluate how the agent performs on the same environment after every 500 iterations. I tried to use the EvalCallback to do this but am having some issues. I hope I was able to explain why I was doing all this.

You have answered all the queries related to this issue so I will close it.

BDonnot commented 1 year ago

Hello,

Oh, this clarifies lots of my interrogations! When you think the SAC_SB3 is available, do you think you could share it by making a PR on L2RPN baselines?

AvisP commented 1 year ago

Haha, 'interrogations', I like your sense of humour. Yes, of course, with pleasure: I will create a PR in L2RPN baselines and add the SAC_SB3.

BDonnot commented 1 year ago

The best way to perform what you want to do is first to split the environment into training and validation.

Call this type of script ONCE:

import os
import grid2op

env_name = "l2rpn_case14_sandbox"  # or any other, really

env = grid2op.make(env_name)

full_path_data = env.chronics_handler.subpaths
chron_names = [os.path.split(el)[-1] for el in full_path_data]

nm_train, nm_val, nm_test = env.train_val_split(add_for_test="test",
                                                test_scen_id=chron_names[-10:],
                                                val_scen_id=chron_names[-20:-10]
                                                )
# NB of course I strongly recommend you to choose carefully which scenarios you put in the "test env"
# and which you put in the "val env", because here you will have only one "type of month" (eg
# November) in the test and val envs... which might not be ideal

Then in your training script you call this type of thing:

import grid2op
from grid2op.gym_compat import GymEnv
env_name = "l2rpn_case14_sandbox"  # or any other, really

g2op_train = grid2op.make(f"{env_name}_train", backend=...)
g2op_eval = grid2op.make(f"{env_name}_val", backend=...)

env_train = GymEnv(g2op_train)
env_eval = GymEnv(g2op_eval)

And you are sure that the same scenarios (the ones with the names in val_scen_id) will be used.

Now you just need to "hack" the callback of stable baselines to call "env_eval.seed()" and "env_eval.init_env.set_id(0)" (for example). If you give me an example I might be able to help you.
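
To sketch the idea in the meantime (this is only an illustration: FixedChronicEvalCallback, env_eval and the small evaluation loop below are assumptions on my side, not something shipped with grid2op or l2rpn_baselines, and they use the old gym 0.21 reset/step API you mentioned):

from stable_baselines3.common.callbacks import BaseCallback

class FixedChronicEvalCallback(BaseCallback):
    """Every eval_every training steps, replay the exact same chronic with the current policy."""
    def __init__(self, env_eval, eval_every=500, verbose=0):
        super().__init__(verbose)
        self.env_eval = env_eval          # a GymEnv wrapping the "_val" grid2op environment
        self.eval_every = eval_every

    def _on_step(self) -> bool:
        if self.n_calls % self.eval_every == 0:
            self.env_eval.init_env.seed(0)    # seed the underlying grid2op env
            self.env_eval.init_env.set_id(0)  # pin the chronic used for evaluation
            obs = self.env_eval.reset()
            done, total_reward = False, 0.0
            while not done:
                action, _ = self.model.predict(obs, deterministic=True)
                obs, reward, done, info = self.env_eval.step(action)
                total_reward += reward
            if self.verbose:
                print(f"step {self.n_calls}: reward on the fixed chronic = {total_reward}")
        return True  # keep training

You would then pass an instance of it to model.learn(..., callback=FixedChronicEvalCallback(env_eval)) in your training script.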

AvisP commented 1 year ago

Hi Benjamin,

Thanks for mentioning the train_val_split function; it is very useful for splitting up the chronics into different groups. I took a look into the folders that get created and understood that the chronics folder contains some folders with index numbers that contain the chronics.

I tried to do a simple test of training an agent on a single instance of the chronic and thereafter comparing it with a DoNothing agent. I created a folder called l2rpn_case14_sandbox_train_sample where the chronics folder contains only 0000 and nothing else. I was expecting it to overfit on this instance.

But after training on this single instance for 100,000 iterations, the trained agent sometimes failed to beat the DoNothing agent. I tried changing the NN architecture, the learning rate, and normalizing the observations and actions, but it still either only outperforms the DoNothing agent by a small margin or fails to do so. The code that I used:

import re
import copy
import grid2op
from grid2op.Reward import LinesCapacityReward  # or any other rewards
from grid2op.Chronics import MultifolderWithCache  # highly recommended
from lightsim2grid import LightSimBackend  # highly recommended for training !
from l2rpn_baselines.PPO_SB3 import train, evaluate
from grid2op.Runner import Runner

env_name = "l2rpn_case14_sandbox_train_sample"
obs_attr_to_keep = ["day_of_week", "hour_of_day", "minute_of_hour", "prod_p", "prod_v", "load_p", "load_q",
                    "actual_dispatch", "target_dispatch", "topo_vect", "time_before_cooldown_line",
                    "time_before_cooldown_sub", "rho", "timestep_overflow", "line_status"]
act_attr_to_keep = ["redispatch"]

env = grid2op.make(env_name,
                    reward_class=LinesCapacityReward,
                    backend=LightSimBackend(),
                    chronics_class=MultifolderWithCache)
env.chronics_handler.real_data.set_filter(lambda x: re.match(".*00$", x) is not None)
env.chronics_handler.real_data.reset()

try:
    train(env,
            iterations=100_000,  # any number of iterations you want
            logs_dir="./logs/PPO_SB3",  # where the tensorboard logs will be put
            save_path="./saved_model/PPO_SB3",  # where the NN weights will be saved
            name="Single_chronic",  # name of the baseline
            net_arch=[200, 200, 200],  # architecture of the NN
            save_every_xxx_steps=1000,  # save the NN every 1k steps
            obs_attr_to_keep=copy.deepcopy(obs_attr_to_keep),
            act_attr_to_keep=copy.deepcopy(act_attr_to_keep),
            # normalize_obs=True,
            # normalize_act=True,
            )
finally:
    env.close()

print("******** Training Finished, Evaluation Starting *********")

env_val = grid2op.make(env_name,
                    reward_class=LinesCapacityReward,
                    backend=LightSimBackend()
                    )

trained_agent, res_eval = evaluate(
                    env_val,
                    nb_episode=1,
                    load_path="./saved_model/PPO_SB3", 
                    name="Single_chronic",
                    nb_process=1,
                    verbose=True,
                    )

# you can also compare your agent with the do nothing agent relatively
# easily
runner_params = env_val.get_params_for_runner()
runner = Runner(**runner_params)

res = runner.run(nb_episode=1,
                nb_process=1,
                )

print("Evaluation summary for DN:")
for _, chron_name, cum_reward, nb_time_step, max_ts in res:
    msg_tmp = "chronics at: {}".format(chron_name)
    msg_tmp += "\ttotal score: {:.6f}".format(cum_reward)
    msg_tmp += "\ttime steps: {:.0f}/{:.0f}".format(nb_time_step, max_ts)
    # msg_tmp += "\tEpisode data: {:.6f}".format(episode_data)
    print(msg_tmp)

This led me to believe that the agent is possibly not learning at all. Kindly provide some advice or suggestions if possible. Thanks.

BDonnot commented 1 year ago

Hello,

Managing a powergrid is a difficult task.

Some things might help you:

Unfortunately we don't have at RTE the "manpower" to release examples that perform better than do nothing for all grids. I hope to work on this "at some point" but there are lots of things to do before that, unfortunately 😔

Maybe a good place to start would be to look at the previous code and see what meta-parameters they're using, the size of the neural net, etc.

I hope that helps

BDonnot commented 1 year ago

I'm closing this issue as it appears to have been fixed.