IBM / rl-testbed-for-energyplus

Reinforcement Learning Testbed for Power Consumption Optimization using EnergyPlus
MIT License

[Question] Make the environment work with stable_baselines #34

Open maxvfischer opened 5 years ago

maxvfischer commented 5 years ago

I've tried to make the environment work with the Baselines fork stable_baselines (https://github.com/hill-a/stable-baselines). It runs, but the results shown when I run plot_energyplus are always the same for all episodes (both for PPO2 and TRPO):

Reward                    ave=-11.80, min=-16.24, max= 0.11, std= 1.76
westzone_temp             ave=67.03, min=23.86, max=86.97, std= 7.80
eastzone_temp             ave=66.73, min=23.86, max=87.65, std= 7.82
Power consumption         ave=322,506.33, min=177,024.88, max=362,582.24, std=24,882.39
pue                       ave= 1.01, min= 1.01, max= 1.11, std= 0.00

westzone_temp distribution
    degree 0.0-0.9 0.0   0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9
    -------------------------------------------------------------------------
    17.0C  0.0%    0.0%  0.0%  0.0%  0.0%  0.0%  0.0%  0.0%  0.0%  0.0%  0.0%
    18.0C  0.0%    0.0%  0.0%  0.0%  0.0%  0.0%  0.0%  0.0%  0.0%  0.0%  0.0%
    19.0C  0.0%    0.0%  0.0%  0.0%  0.0%  0.0%  0.0%  0.0%  0.0%  0.0%  0.0%
    20.0C  0.0%    0.0%  0.0%  0.0%  0.0%  0.0%  0.0%  0.0%  0.0%  0.0%  0.0%
    21.0C  0.0%    0.0%  0.0%  0.0%  0.0%  0.0%  0.0%  0.0%  0.0%  0.0%  0.0%
    22.0C  0.0%    0.0%  0.0%  0.0%  0.0%  0.0%  0.0%  0.0%  0.0%  0.0%  0.0%
    23.0C  0.0%    0.0%  0.0%  0.0%  0.0%  0.0%  0.0%  0.0%  0.0%  0.0%  0.0%
    24.0C  0.0%    0.0%  0.0%  0.0%  0.0%  0.0%  0.0%  0.0%  0.0%  0.0%  0.0%
    25.0C  0.0%    0.0%  0.0%  0.0%  0.0%  0.0%  0.0%  0.0%  0.0%  0.0%  0.0%
    26.0C  0.0%    0.0%  0.0%  0.0%  0.0%  0.0%  0.0%  0.0%  0.0%  0.0%  0.0%
    27.0C  0.0%    0.0%  0.0%  0.0%  0.0%  0.0%  0.0%  0.0%  0.0%  0.0%  0.0%

The environment seems to run properly; it outputs "Continuing Simulation at MM/DD for WHOLEYEARDAY" and "Updating Shadowing Calculations, Start Date=MM/DD".

Questions:

1) I'm using vectorised environments to be able to parallelise the training (one env = DummyVecEnv, many envs = SubprocVecEnv). Could this be a problem?
2) Is there some magic integration with Baselines when building the patched EnergyPlus and installing the executables?
3) Can someone point me in the direction of how to get it to work with stable_baselines?

biemann commented 4 years ago

I have tried to update to Stable Baselines as well and ran into the same issues.

The problem comes from the fact that Stable Baselines and Baselines do not normalise the action space in the same way. In stable-baselines/common/runners.py, they do the following (l.151):

# Clip the actions to avoid out of bound error
if isinstance(env.action_space, gym.spaces.Box):
    clipped_action = np.clip(action, env.action_space.low, env.action_space.high)

Here, the actions are normalised (they have values in [-1, 1] plus some noise) before we clip them. However, env.action_space.low is not normalised. Hence, clipped_action will always be [10, 10, 1.75, 1.75].

The action will be scaled back in energyplus_model.py in the set_action function and then clipped again. Hence, the resulting action will always be [40, 40, 7, 7], and therefore the results of the training are so bad that they don't appear in the plot anymore (the original plot only shows temperatures in the range [0, 40]).
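To make the mismatch concrete, here is a small illustrative snippet (the bounds are the ones of this environment; the action values are made up):

import numpy as np

# Unnormalised bounds of the environment's action space
low = np.array([10.0, 10.0, 1.75, 1.75])
high = np.array([40.0, 40.0, 7.0, 7.0])

# A typical policy output, already normalised to [-1, 1]
action = np.array([-0.3, 0.8, -1.0, 0.5])

# Stable Baselines clips against the unnormalised bounds...
clipped = np.clip(action, low, high)
print(clipped)  # [10.   10.    1.75  1.75]

# ...and set_action() then rescales as if these values were still in [-1, 1],
# saturating at the upper bound after the final clip:
scaled = low + (clipped + 1.0) * 0.5 * (high - low)
print(np.clip(scaled, low, high))  # [40. 40.  7.  7.]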

So for on-policy methods (PPO, TRPO), we could just comment out this part of the code.

Edit: For off-policy methods (SAC, DDPG), my solution is to use a normalised gym environment via the VecNormalize class of Stable Baselines:

I modified run_energyplus.py to:

from stable_baselines.common.vec_env import DummyVecEnv, VecNormalize

env = make_energyplus_env(env_id, workerseed)
env = DummyVecEnv([lambda: env])
env = VecNormalize(env)

Since the action will be automatically scaled back, we should not do it again in energyplus_model.py, so I modified the set_action function to:

def set_action(self, normalized_action):
    self.action_prev = self.action
    self.action = normalized_action
triton99 commented 3 years ago

Hi, I saw in your results above that your power consumption was approximately 322 kW. Did you change any parameters of the HVAC system or IT load in the EnergyPlus IDF file? My simulation is only around 120 kW, but I need a higher IT load in my model.

biemann commented 3 years ago

Hi,

I am not sure if I can answer for him, but I had almost identical results when trying to use Stable Baselines initially, so I think he used the same original file to obtain these results.

The terrible results were due to the fact that Stable Baselines and Baselines deal with unnormalised states in different ways. A clearly better solution than the one I outlined above (which is a dirty hack) would be to normalise the states and scale back the actions directly in the gym environment.

triton99 commented 3 years ago

Hi, thank you for the information. I am using Stable Baselines too; maybe the reason is that the IT load in my EnergyPlus model is not high enough.

biemann commented 3 years ago

I am not sure what exactly the problem is. What algorithm are you using? Assuming you use the same case study as provided, the results should be better than 120 kW (if trained long enough) for most algorithms.

We can modify the IT load in the file 2ZoneDataCenterHVAC_wEconomizer_Temp_Fan.idf around line 2900. I am not completely sure how realistic it would be to change these parameters without changing others; I am not too familiar with EnergyPlus and HVAC modelling.

triton99 commented 3 years ago

I ran the TRPO algorithm before and I am running the DQN algorithm now to see the result. Have you tried DQN yet? Yeah, I have better results than 120 kW (120 kW was mostly the highest power consumption when I trained it long enough).

What is at line 2900 for the IT load you mentioned above (do you change the 200 W/m2 value or something else)? I am still trying to find out why my simulation did not work when I increased the value from 200 W/m2 to over 600 W/m2.

My problem is described here in detail (you can have a look if you want): https://unmethours.com/question/54422/limit-value-of-density-ite-load-in-zone/

biemann commented 3 years ago

Ok, then I guess you are far more into EnergyPlus than I am. I am from the ML side, so I cannot really help you here.

But at first guess, I think the problem is that if you increase the IT load significantly without changing the other parameters, the HVAC equipment is no longer able to cool down the building accordingly. If you modify it slightly, does it still work? You would probably need to also increase the size of the data centre, as well as maybe consider more powerful coolers. But changing these parameters seems to require a lot of domain knowledge in order to keep the simulation realistic.

I did not try DQN, mostly because it assumes a discrete action space. You would probably need to discretise the action space (maybe Stable Baselines does something like this automatically, I am not sure). This can however cause problems, as a fine enough discretisation requires more and more actions, and we have a 4-dimensional action space here. This would make the algorithm quite difficult to train in practice.
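If one really wanted to try DQN anyway, a rough sketch of such a discretisation is a gym action wrapper like the one below (the class name and bin count are made up for illustration; this is not part of the testbed):

import itertools
import numpy as np
import gym

class DiscretizedActionWrapper(gym.ActionWrapper):
    # Hypothetical wrapper: expose a Discrete action space whose entries
    # index points on a per-dimension grid of the original Box space.
    def __init__(self, env, bins_per_dim=5):
        super().__init__(env)
        low, high = env.action_space.low, env.action_space.high
        grids = [np.linspace(l, h, bins_per_dim) for l, h in zip(low, high)]
        self._actions = np.array(list(itertools.product(*grids)))
        # With 4 dimensions and 5 bins each this is already 5**4 = 625 actions.
        self.action_space = gym.spaces.Discrete(len(self._actions))

    def action(self, act):
        # Map the discrete index back to a continuous action vector.
        return self._actions[act]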

revathij commented 3 years ago

@biemann and @triton99 ,

I am trying to use the TRPO algorithm and getting the following error. Could anyone help with this issue?

File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/ecocc/rl-testbed-for-energyplus/baselines_energyplus/run_mpi/run_energyplus.py", line 181, in <module>
    main()
  File "/home/ecocc/rl-testbed-for-energyplus/baselines_energyplus/run_mpi/run_energyplus.py", line 177, in main
    train(args.env, num_timesteps=args.num_timesteps, seed=args.seed)
  File "/home/ecocc/rl-testbed-for-energyplus/baselines_energyplus/run_mpi/run_energyplus.py", line 106, in train
    model.learn(total_timesteps=num_timesteps)
  File "/home/ecocc/.local/lib/python3.6/site-packages/stable_baselines/trpo_mpi/trpo_mpi.py", line 333, in learn
    seg = seg_gen.__next__()
  File "/home/ecocc/.local/lib/python3.6/site-packages/stable_baselines/common/runners.py", line 114, in traj_segment_generator
    action, vpred, states, _ = policy.step(observation.reshape(-1, *observation.shape), states, done)
  File "/home/ecocc/.local/lib/python3.6/site-packages/stable_baselines/common/policies.py", line 576, in step
    {self.obs_ph: obs})
  File "/home/ecocc/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 950, in run
    run_metadata_ptr)
  File "/home/ecocc/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1149, in _run
    str(subfeed_t.get_shape())))
ValueError: Cannot feed value of shape (1, 1, 5) for Tensor 'input/Ob:0', which has shape '(?, 5)'

Exception ignored in: <bound method EnergyPlusEnv.__del__ of <gym_energyplus.envs.energyplus_env.EnergyPlusEnv object at 0x7fed1e3fc7f0>>
Traceback (most recent call last):
  File "/home/ecocc/rl-testbed-for-energyplus/gym_energyplus/envs/energyplus_env.py", line 81, in __del__
  File "/home/ecocc/rl-testbed-for-energyplus/gym_energyplus/envs/energyplus_env.py", line 177, in stop_instance
  File "/home/ecocc/rl-testbed-for-energyplus/gym_energyplus/envs/energyplus_env.py", line 151, in count_severe_errors
NameError: name 'open' is not defined

@Biemann,

I have used your code to try Stable Baselines with the RL testbed and got the above error when running the TRPO algorithm.

Could you please explain the changes that I need to make for TRPO (on-policy)?

I am also trying to run TD3 and getting the following error. Please let me know.

biemann commented 3 years ago

I am not sure I understand. Are you trying to use TRPO with Stable Baselines, or is it the original Baselines you are using? Was this the error from TD3?

When I did experiments with TRPO, I used the original Baselines implementation (I was actually using Stable Baselines 3 in PyTorch, which does not have TRPO implemented yet, but this may change in the coming weeks).

Actually, my solution when I posted this was a workaround and not a practical solution. What I did instead was to normalise the environment directly in the EnergyPlus gym environment. This approach is clearly more scalable, as it makes it possible to use the same code for other libraries (e.g. RLlib) in a uniform way. Normalising states is generally good practice, and some libraries assume normalised states as input.

What I modified was the following:

In energyplus_model_2ZoneDataCenterHVAC_wEconomizer_Temp_Fan.py in setup_spaces()

I changed

  lo = 10.0
  hi = 40.0
  flow_hi = 7.0
  flow_lo = flow_hi * 0.25
  self.action_space = spaces.Box(low =   np.array([ lo, lo, flow_lo, flow_lo]),
                                 high =  np.array([ hi, hi, flow_hi, flow_hi]),
                                 dtype = np.float32)
  self.observation_space = spaces.Box(low =   np.array([-20.0, -20.0, -20.0,          0.0,          0.0,          0.0]),
                                      high =  np.array([ 50.0,  50.0,  50.0, 1000000000.0, 1000000000.0, 1000000000.0]),
                                      dtype = np.float32)

to:

self.low_action = np.array([10.0, 10.0, 1.75, 1.75])
self.high_action = np.array([40.0, 40.0, 7.0, 7.0])

self.low_obs = np.array([-20.0, -20.0, -20.0, 0.0, 0.0, 0])
self.high_obs = np.array([50.0, 50.0, 50.0, 1e9, 1e9, 1e9])

self.action_space = spaces.Box(low=-1, high=1, shape=(4,), dtype=np.float32)
self.observation_space = spaces.Box(low=-1, high=1, shape=(6,), dtype=np.float32)

(Also add these as additional class variables.)

I had then to modify in energyplus_model.py:

def set_action(self, normalized_action):
    # In Stable Baselines, the action seems to be normalized to [-1.0, 1.0],
    # so it must be scaled back into action_space by the environment.
    self.action_prev = self.action
    normalized_action = np.clip(normalized_action, self.action_space.low, self.action_space.high)
    self.action = self.low_action + (normalized_action + 1.) * 0.5 * (self.high_action - self.low_action)

I don't remember if this was everything that was needed to make it work, so please let me know if it works. I guess with this, all algorithms should work out of the box without having to use VecNormalize as I suggested before. This works with Stable Baselines, but not with the original Baselines, for example.
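One piece that may also be needed is mapping the raw EnergyPlus observations into the new [-1, 1] observation space. A minimal sketch, assuming the low_obs/high_obs arrays defined above (the method name is illustrative, not the repo's actual API):

import numpy as np

def normalize_observation(self, raw_obs):
    # Clip to the physical bounds, then rescale linearly into [-1, 1].
    raw_obs = np.clip(raw_obs, self.low_obs, self.high_obs)
    return 2.0 * (raw_obs - self.low_obs) / (self.high_obs - self.low_obs) - 1.0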

This answers the question quite indirectly. Your problem is that the shape of the input tensor is wrong, so I suggest trying to use flatten() (or a reshape) somewhere to get the shapes right (the exact function differs depending on whether you are using PyTorch or TensorFlow).
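As a rough illustration of what such a reshape could look like with NumPy (where exactly to apply it depends on where the extra batch dimension is introduced):

import numpy as np

obs = np.zeros((1, 1, 5))                   # shape reported in the error
obs_fixed = obs.reshape(-1, obs.shape[-1])  # shape (1, 5), matching '(?, 5)'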

revathij commented 3 years ago

Thank you so much for your quick response.

Hi, I have used TRPO from Stable Baselines 2. I thought you had successfully implemented TRPO using Stable Baselines 2.

In fact, I have checked your stable-baselines branch, as below.

In EnergyPlusModel2ZoneDataCenterHVAC_wEconomizer_Temp.py, it has the code below

def setup_spaces(self):
    # Bound action temperature
    lo = 10.0
    hi = 40.0
    self.low_action = np.array([10.0, 10.0])
    self.high_action = np.array([40.0, 40.0])
    self.low_obs = np.array([-20.0, -20.0, -20.0, 0.0, 0.0])
    self.high_obs = np.array([50.0, 50.0, 50.0, 1e5, 1e5])
    #self.action_space = spaces.Box(low =   np.array([ lo, lo]),
    #                               high =  np.array([ hi, hi]),
    #                               dtype = np.float32)
    #self.observation_space = spaces.Box(low =   np.array([-20.0, -20.0, -20.0,          0.0,          0.0,          0.0]),
    #                                    high =  np.array([ 50.0,  50.0,  50.0, 1000000000.0, 1000000000.0, 1000000000.0]),
    #                                    dtype = np.float32)

    #self.action_space = spaces.Box(low=-1, high=1, shape=(2,), dtype=np.float32)
    #self.observation_space = spaces.Box(low=-1, high=1,shape=(5,), dtype=np.float32)
    self.action_space = spaces.Box(low=-1, high=1, shape=(2,), dtype=np.float32)
    self.observation_space = spaces.Box(low=-1, high=1,shape=(5,), dtype=np.float32)

In energyplus_model.py it has below

def set_action(self, normalized_action):
    print("Inside set_action", normalized_action)
    # In TRPO/PPO1/PPO2 in Baselines, the action seems to be normalized to [-1.0, 1.0].
    # So it must be scaled back into action_space by the environment.

    # ON-POLICY METHOD:
    #print(self.high_action - self.low_action)
    #self.action_prev = self.action
    #self.action = self.low_action + (normalized_action + 1.) * 0.5 * (self.high_action - self.low_action)
    print("Inside set_action after converting action values", self.action )
    # self.action = np.clip(self.action, self.action_space.low, self.action_space.high)

    # OFF-POLICY METHOD:
    self.action_prev = self.action
    self.action = normalized_action

In run_energyplus.py,

env = make_energyplus_env(env_id, workerseed)
env = DummyVecEnv([lambda: env])
#env = VecNormalize(env)
print("env",env)

model_trpo = TRPO('MlpPolicy', env, verbose=1,tensorboard_log="/home/ecocc/eplog/tensorboard/")
model_trpo.learn(total_timesteps=num_timesteps)
model_trpo.save("/home/ecocc/eplog/model/trpo_model")

Please help me correct my understanding.

revathij commented 3 years ago

Yep. In fact, I tried TD3 as well and got the error below.

Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/ecocc/rl-testbed-for-energyplus/baselines_energyplus/run_mpi/run_energyplus.py", line 179, in <module>
    main()
  File "/home/ecocc/rl-testbed-for-energyplus/baselines_energyplus/run_mpi/run_energyplus.py", line 175, in main
    train(args.env, num_timesteps=args.num_timesteps, seed=args.seed)
  File "/home/ecocc/rl-testbed-for-energyplus/baselines_energyplus/run_mpi/run_energyplus.py", line 112, in train
    model_td3.learn(total_timesteps=num_timesteps)
  File "/home/ecocc/.local/lib/python3.6/site-packages/stable_baselines3/td3/td3.py", line 204, in learn
    reset_num_timesteps=reset_num_timesteps,
  File "/home/ecocc/.local/lib/python3.6/site-packages/stable_baselines3/common/off_policy_algorithm.py", line 273, in learn
    log_interval=log_interval,
  File "/home/ecocc/.local/lib/python3.6/site-packages/stable_baselines3/common/off_policy_algorithm.py", line 469, in collect_rollouts
    action, buffer_action = self._sample_action(learning_starts, action_noise)
  File "/home/ecocc/.local/lib/python3.6/site-packages/stable_baselines3/common/off_policy_algorithm.py", line 329, in _sample_action
    scaled_action = np.clip(scaled_action + action_noise(), -1, 1)
ValueError: operands could not be broadcast together with shapes (1,2) (4,)

Exception ignored in: <bound method EnergyPlusEnv.__del__ of <gym_energyplus.envs.energyplus_env.EnergyPlusEnv object at 0x7fa2b4a1d828>>
Traceback (most recent call last):
  File "/home/ecocc/rl-testbed-for-energyplus/gym_energyplus/envs/energyplus_env.py", line 81, in __del__
  File "/home/ecocc/rl-testbed-for-energyplus/gym_energyplus/envs/energyplus_env.py", line 177, in stop_instance
  File "/home/ecocc/rl-testbed-for-energyplus/gym_energyplus/envs/energyplus_env.py", line 151, in count_severe_errors
NameError: name 'open' is not defined

In run_energyplus.py:

    action_noise = NormalActionNoise(mean=np.zeros(4), sigma=0.1 * np.ones(4))
    model_td3 = TD3('MlpPolicy', env, verbose=1, action_noise=action_noise,
                    tensorboard_log="/home/ecocc/eplog/tensorboard/")
    model_td3.learn(total_timesteps=num_timesteps)
    model_td3.save("/home/ecocc/eplog/model/td3_model")

biemann commented 3 years ago

Hi,

I assume you were looking at my fork. I indeed used that code once, so I guess you were looking at an older version of it (I did not upload much to GitHub and was doing experiments mostly locally), which was the first version that worked for me. A few months ago I uploaded a newer version, but I modified the case study to fit my needs.

I guess from a quick glance that you are using a different version of the environment (the one that controls only temperature, whereas I used a version that controls both temperature and airflow rate). So in your case the action space would be two-dimensional, which is probably the reason for the error message.
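If that is the case, a likely fix is to make the action-noise dimension match the 2-dimensional action space (a sketch, assuming Stable Baselines 3 as in your traceback):

import numpy as np
from stable_baselines3.common.noise import NormalActionNoise

n_actions = 2  # the temperature-only environment has a 2-dimensional action space
action_noise = NormalActionNoise(mean=np.zeros(n_actions), sigma=0.1 * np.ones(n_actions))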

I think I tested TRPO with Stable Baselines (but not Stable Baselines 2) and I think it worked, but I am not so sure anymore. I was mostly using Stable Baselines 3 for my experiments (I moved there because I wanted an implementation of SAC and TD3 that was not available in the original Baselines, and chose the PyTorch version out of personal preference. As TRPO is not implemented there yet, I used TRPO from the original implementation).

I guess that it is quite confusing, as there are 4 versions of Baselines:

- Baselines, implemented by OpenAI, which is very poorly documented and not scalable at all (but the first open-source implementation of many algorithms).
- Stable Baselines, which uses the same code base as Baselines, but with a better interface and documentation.
- Stable Baselines 2, which is the TensorFlow 2 version of Stable Baselines, but with only few changes otherwise.
- Stable Baselines 3, which is written in PyTorch with the whole code rewritten from scratch, but is not as complete yet.

revathij commented 3 years ago

Yep, I am using the temperature-control environment; that's the reason. I have set it as below:

    self.action_space = spaces.Box(low=-1, high=1, shape=(2,), dtype=np.float32)
    self.observation_space = spaces.Box(low=-1, high=1,shape=(5,), dtype=np.float32)

Thanks a lot for the explanation of the differences between the Baselines versions.

a) In the code, the imports are still as below.

from stable_baselines3 import TD3, SAC, PPO, DDPG, A2C
from sb3_contrib import TQC
from stable_baselines3.common import logger
from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize

from stable_baselines.common.policies import MlpPolicy
from stable_baselines import TRPO

For me, I don't need specific Stable Baselines requirements. I just want to run the temperature environment with at least 2 to 3 algorithms and check which one performs better. So far PPO runs fine; however, TD3 is giving an error. Could you please provide help for TD3?

Once again, thanks a lot for your reply and help. I am also waiting for the TD3 issue to be fixed.

2) Below are the details of the installed packages:

pycodestyle (2.7.0)
pycparser (2.20)
pycrypto (2.6.1)
pycups (1.9.73)
pyenchant (3.2.0)
pyflakes (2.3.1)
pyglet (1.5.0)
Pygments (2.9.0)
pygobject (3.26.1)
pymacaroons (0.13.0)
PyNaCl (1.1.2)
pyparsing (2.4.7)
pyRFC3339 (1.0)
pytest (6.2.3)
pytest-cov (2.11.1)
pytest-env (0.6.2)
pytest-forked (1.3.0)
pytest-xdist (2.2.1)
python-apt (1.6.5+ubuntu0.5)
python-dateutil (2.8.1)
python-debian (0.1.32)
python-utils (2.5.6)
pytype (2021.4.26)
pytz (2021.1)
pyxdg (0.25)
...
stable-baselines (2.10.2)
stable-baselines3 (1.0)
system-service (0.3)
systemd-python (234)
tensorboard (2.5.0)
tensorboard-data-server (0.6.0)
tensorboard-plugin-wit (1.8.0)
tensorflow-estimator (1.14.0)
tensorflow-gpu (1.14.0)
termcolor (1.1.0)
testresources (2.0.0)
toml (0.10.2)
torch (1.8.1)
tornado (6.1)
tqdm (4.60.0)
traceback2 (1.4.0)
typed-ast (1.4.3)
typing-extensions (3.10.0.0)
ubuntu-drivers-common (0.0.0)
ufw (0.36)
unattended-upgrades (0.1)
unittest2 (1.1.0)
urllib3 (1.26.4)

Do I need to install PyTorch?

biemann commented 3 years ago

Do you still have the same error with TD3? If you use action noise, did you change the dimensions there as well?

Note that the dirty hack I did with separating the on-policy and off-policy methods was only because I used unnormalised states initially. If the states are normalised, this is not necessary. I am a bit ashamed of the code I wrote :)

You don't need to install PyTorch for Stable Baselines, but you do need it for Stable Baselines 3.

PS: I have been working on a paper comparing different algorithms on this case study; it is currently under revision.

revathij commented 3 years ago

Hi, I have changed the action noise and that error is gone. However, I am now getting an EnergyPlus error. Do you need to modify anything in the IDF file?

Severe  DualSetPointWithDeadBand: Effective heating set-point higher than effective cooling set-point - increase deadband if using unmixed air model
  ~~~ occurs in Zone=SOUTH ZONE Environment=WHOLEYEARDAY, at Simulation time=01/03 08:00 - 09:00
  ~~~ LoadToHeatingSetPoint=-120495.947, LoadToCoolingSetPoint=-120737.064
  ~~~ Zone TempDepZnLd=13290.33
  ~~~ Zone TempIndZnLd=386302.60
  ~~~ Zone Heating ThermostatSetPoint=20.00
  ~~~ Zone Cooling ThermostatSetPoint=23.00
Fatal  Program terminates due to above conditions.
...Summary of Errors that led to program termination:
..... Reference severe error count=1
..... Last severe error=DualSetPointWithDeadBand: Effective heating set-point higher than effective cooling set-point - increase deadband if using unmixed air model

Do you have any idea?

How do I fix the normalisation cleanly in the code? Sorry, I am new to EnergyPlus and RL implementations. If you guide me, I can try it as well.

Have you succeeded in any particular algorithm in the comparison?

Thanks

biemann commented 3 years ago

Is this error specific to one algorithm, or the same for all? Does the code crash, or are the results just wrong, as for the OP?

It is a bit difficult to judge without code. Which file are you using? In any case, it seems that there is something wrong with the values you are sending to EnergyPlus. The load-to-heating setpoints are clearly too low :)

I did not change the IDF file (but I used the version with the 4-dimensional action space).

From my experiments, SAC performed best. But at the beginning I would suggest using PPO: it is stable to train and quite fast in wall-clock time. It is not ideal for real-world applications due to its poor data efficiency, but for testing new ideas it is the one I am using. TD3 was the trickiest to make work (for example, action noise may not be a good idea in this environment).
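For reference, a minimal PPO training sketch with Stable Baselines 3 could look like the following (assuming env is the normalised DummyVecEnv built as in run_energyplus.py above; the paths and timestep count are just examples):

from stable_baselines3 import PPO

model = PPO('MlpPolicy', env, verbose=1, tensorboard_log="/home/ecocc/eplog/tensorboard/")
model.learn(total_timesteps=num_timesteps)
model.save("/home/ecocc/eplog/model/ppo_model")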

Otherwise, I would also suggest taking a closer look at the Stable Baselines documentation, which is excellent. Testing on easier OpenAI Gym toy environments first is also a good idea to gain intuition. RL is a very difficult part of ML, where a lot of components interact.

Edit: maybe you are normalising the environment twice. I remember making this mistake once. If the environment is normalised, you should not use VecNormalize again.

revathij commented 3 years ago

Hi biemann,

Sorry for the late reply. Yes, I managed to fix the error and am able to run SAC, TD3 and PPO. In the TensorBoard results, it seems that PPO is doing better in the ep_rew_mean graph.

(image: TensorBoard ep_rew_mean graph)

Please let me know whether the normalisation part below is correct:

In run_energyplus.py

env = make_energyplus_env(env_id, workerseed)
env = DummyVecEnv([lambda: env])
env = VecNormalize(env)

In EnergyPlusModel2ZoneDataCenterHVAC_wEconomizer_Temp.py

def setup_spaces(self):
    print("inside setup_spaces")
    # Bound action temperature
    lo = 10.0
    hi = 40.0
    self.low_action = np.array([10.0, 10.0])
    self.high_action = np.array([40.0, 40.0])
    self.low_obs = np.array([-20.0, -20.0, -20.0, 0.0, 0.0])
    self.high_obs = np.array([50.0, 50.0, 50.0, 1e5, 1e5])

    self.action_space = spaces.Box(low=-1, high=1, shape=(2,), dtype=np.float32)
    self.observation_space = spaces.Box(low=-1, high=1,shape=(5,), dtype=np.float32)

Could you please give guidance on the above? Do we need to change the normalisation part for on-policy and off-policy methods?

biemann commented 3 years ago

How do the results look? Are they close to those of TRPO in the original implementation?

If you normalise the state space and action space, you don't need to normalise them again with VecNormalize.
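In other words, with the normalised setup_spaces() above, the wrapper chain in run_energyplus.py can simply be (sketch):

env = make_energyplus_env(env_id, workerseed)
env = DummyVecEnv([lambda: env])
# env = VecNormalize(env)  # not needed: the env already exposes [-1, 1] spaces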

revathij commented 3 years ago

Sorry, do you mean the results from the CSV?

In the above code, can we remove the env = VecNormalize(env) line? Is this applicable for both on-policy and off-policy methods?

revathij commented 3 years ago

@biemann ,

I have removed VecNormalize and tabulated the zone air temperature and setpoint temperature. The values look strange; could you please clarify?

ModelValues.xlsx

For example, for the values below: if we set the action (setpoint) for one zone to 10, the resulting zone air temperature is 21. However, if we set the temperature to 38, how come the zone air temperature is 22? Have you experienced this issue?

Setpoint action | Zone air temperature
--------------- | --------------------
10              | 21.0346
38.78527        | 22.55974
40              | 28.13745
10              | 25.80909
10              | 21.81964

antoine-galataud commented 2 years ago

@takaomoriyama maybe it would be nice to provide a working example that uses stable-baselines (SB), since OpenAI Baselines has been in maintenance mode for quite some time now.

A few questions / challenges I have in mind:

What do you think?