IBM / rl-testbed-for-energyplus

Reinforcement Learning Testbed for Power Consumption Optimization using EnergyPlus
MIT License
179 stars · 74 forks

[Question] Could training at system timestep frequency be avoided? #9

Closed · antoine-galataud closed this issue 5 years ago

antoine-galataud commented 5 years ago

EnergyPlus has both a Timestep (4 per hour, i.e. 15 minutes, in this repo's models) and "system timesteps" (internal to EnergyPlus), as you described. System timestep frequency varies from 1 per Timestep down to 1 per minute. The system timestep is used as the default step frequency during agent training: on every system timestep, an action is sent, a state is observed, and a reward is computed.
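The relationship between the two step sizes described above can be sketched as follows (a minimal illustration; the variable names are mine, not the repo's):

```python
# Illustrative arithmetic only; names are mine, not the repo's.
# Timestep = 4 means 4 zone timesteps per hour, i.e. 15-minute steps.
zone_timesteps_per_hour = 4
zone_timestep_minutes = 60 // zone_timesteps_per_hour  # 15

# EnergyPlus subdivides each zone timestep into system timesteps,
# anywhere from 1 per zone timestep down to 1 per minute.
min_system_steps_per_zone_step = 1
max_system_steps_per_zone_step = zone_timestep_minutes  # 15
```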

My intuition is that what matters here is to see how much good an action did by observing the state after a "full" Timestep has elapsed. Couldn't (shouldn't) system timesteps be skipped until the simulation moves on to the next "full" Timestep? For now, every action on a system timestep leads to the same state (not exactly: some variables vary, but temperature doesn't), hence the same reward. Not sure it does any harm.

The potential benefits would be better applicability to real-world deployment and faster training.

Did you have the opportunity to test that?

takaomoriyama commented 5 years ago

@antoine-galataud, good point. Let me check what is happening. As you may know, by doing "touch /tmp/verbose2" you can see a dump of the observation and action for each system timestep, as follows (don't forget "rm /tmp/verbose2" when done):

...
compute_reward: Tenv=  4.150, Tz1= 23.640, Tz2= 23.551, PUE=  1.041, Whole_Powerd2= 49832.8, ITE_Power= 47859.7, HVAC_Power=  1973.1, Act1= 10.000, Act2= 14.124, Act3=  1.750, Act4=  2.270
compute_reward: Tenv=  4.150, Tz1= 23.640, Tz2= 23.551, PUE=  1.041, Whole_Powerd2= 51626.7, ITE_Power= 49495.0, HVAC_Power=  2131.7, Act1= 10.000, Act2= 15.115, Act3=  1.847, Act4=  2.194
compute_reward: Tenv=  4.150, Tz1= 23.640, Tz2= 23.551, PUE=  1.041, Whole_Powerd2= 51659.3, ITE_Power= 49495.0, HVAC_Power=  2164.3, Act1= 10.000, Act2= 17.625, Act3=  1.937, Act4=  2.493
compute_reward: Tenv=  4.150, Tz1= 23.640, Tz2= 23.551, PUE=  1.041, Whole_Powerd2= 51630.3, ITE_Power= 49495.0, HVAC_Power=  2135.3, Act1= 10.000, Act2= 11.215, Act3=  1.756, Act4=  1.943
compute_reward: Tenv=  4.150, Tz1= 23.640, Tz2= 23.551, PUE=  1.041, Whole_Powerd2= 51626.0, ITE_Power= 49495.0, HVAC_Power=  2131.0, Act1= 10.000, Act2= 14.005, Act3=  1.750, Act4=  2.275
compute_reward: Tenv=  4.150, Tz1= 23.640, Tz2= 23.551, PUE=  1.041, Whole_Powerd2= 56696.3, ITE_Power= 49495.0, HVAC_Power=  7201.3, Act1= 10.000, Act2= 13.977, Act3=  1.803, Act4=  2.212
compute_reward: Tenv=  4.150, Tz1= 23.640, Tz2= 23.551, PUE=  1.041, Whole_Powerd2= 51602.1, ITE_Power= 49495.0, HVAC_Power=  2107.1, Act1= 10.000, Act2= 12.767, Act3=  1.750, Act4=  1.972
compute_reward: Tenv=  4.150, Tz1= 23.640, Tz2= 23.551, PUE=  1.041, Whole_Powerd2= 51664.5, ITE_Power= 49495.0, HVAC_Power=  2169.5, Act1= 10.000, Act2= 10.000, Act3=  2.037, Act4=  2.431

compute_reward: Tenv=  3.725, Tz1= 23.667, Tz2= 23.863, PUE=  1.043, Whole_Powerd2= 51629.3, ITE_Power= 49495.0, HVAC_Power=  2134.2, Act1= 10.000, Act2= 12.965, Act3=  1.855, Act4=  2.217
compute_reward: Tenv=  3.725, Tz1= 23.667, Tz2= 23.863, PUE=  1.043, Whole_Powerd2= 53426.1, ITE_Power= 51074.2, HVAC_Power=  2351.9, Act1= 10.000, Act2= 12.691, Act3=  1.750, Act4=  2.608
compute_reward: Tenv=  3.725, Tz1= 23.667, Tz2= 23.863, PUE=  1.043, Whole_Powerd2= 53416.2, ITE_Power= 51074.2, HVAC_Power=  2342.0, Act1= 10.000, Act2= 10.000, Act3=  2.000, Act4=  2.256
compute_reward: Tenv=  3.725, Tz1= 23.667, Tz2= 23.863, PUE=  1.043, Whole_Powerd2= 53417.7, ITE_Power= 51074.2, HVAC_Power=  2343.5, Act1= 10.000, Act2= 11.016, Act3=  1.907, Act4=  2.384

compute_reward: Tenv=  3.300, Tz1= 23.264, Tz2= 23.758, PUE=  1.046, Whole_Powerd2= 53441.6, ITE_Power= 51074.2, HVAC_Power=  2367.4, Act1= 10.000, Act2= 12.630, Act3=  1.797, Act4=  2.730
compute_reward: Tenv=  3.300, Tz1= 23.264, Tz2= 23.758, PUE=  1.046, Whole_Powerd2= 49070.3, ITE_Power= 46555.5, HVAC_Power=  2514.9, Act1= 10.000, Act2= 11.300, Act3=  1.750, Act4=  2.292
compute_reward: Tenv=  3.300, Tz1= 23.264, Tz2= 23.758, PUE=  1.046, Whole_Powerd2= 49056.9, ITE_Power= 46555.5, HVAC_Power=  2501.5, Act1= 10.000, Act2= 14.658, Act3=  1.750, Act4=  2.126
compute_reward: Tenv=  3.300, Tz1= 23.264, Tz2= 23.758, PUE=  1.046, Whole_Powerd2= 49075.7, ITE_Power= 46555.5, HVAC_Power=  2520.2, Act1= 10.000, Act2= 10.134, Act3=  1.750, Act4=  2.305

compute_reward: Tenv=  3.300, Tz1= 23.499, Tz2= 22.837, PUE=  1.054, Whole_Powerd2= 49049.8, ITE_Power= 46555.5, HVAC_Power=  2494.3, Act1= 10.000, Act2= 14.906, Act3=  1.750, Act4=  2.035
compute_reward: Tenv=  3.300, Tz1= 23.499, Tz2= 22.837, PUE=  1.054, Whole_Powerd2= 51632.6, ITE_Power= 49140.7, HVAC_Power=  2491.9, Act1= 10.000, Act2= 14.680, Act3=  1.750, Act4=  2.002
compute_reward: Tenv=  3.300, Tz1= 23.499, Tz2= 22.837, PUE=  1.054, Whole_Powerd2= 51643.3, ITE_Power= 49140.7, HVAC_Power=  2502.6, Act1= 10.000, Act2= 14.950, Act3=  1.750, Act4=  2.141
compute_reward: Tenv=  3.300, Tz1= 23.499, Tz2= 22.837, PUE=  1.054, Whole_Powerd2= 51661.3, ITE_Power= 49140.7, HVAC_Power=  2520.6, Act1= 10.000, Act2= 12.230, Act3=  1.825, Act4=  2.284
compute_reward: Tenv=  3.300, Tz1= 23.499, Tz2= 22.837, PUE=  1.054, Whole_Powerd2= 51639.5, ITE_Power= 49140.7, HVAC_Power=  2498.8, Act1= 11.100, Act2= 13.165, Act3=  1.750, Act4=  2.092
compute_reward: Tenv=  3.300, Tz1= 23.499, Tz2= 22.837, PUE=  1.054, Whole_Powerd2= 51631.9, ITE_Power= 49140.7, HVAC_Power=  2491.2, Act1= 10.000, Act2= 14.368, Act3=  1.753, Act4=  1.990
compute_reward: Tenv=  3.300, Tz1= 23.499, Tz2= 22.837, PUE=  1.054, Whole_Powerd2= 51661.9, ITE_Power= 49140.7, HVAC_Power=  2521.3, Act1= 10.000, Act2= 12.013, Act3=  1.893, Act4=  2.230
compute_reward: Tenv=  3.300, Tz1= 23.499, Tz2= 22.837, PUE=  1.054, Whole_Powerd2= 51636.5, ITE_Power= 49140.7, HVAC_Power=  2495.8, Act1= 10.000, Act2= 11.829, Act3=  1.815, Act4=  1.990
compute_reward: Tenv=  3.300, Tz1= 23.499, Tz2= 22.837, PUE=  1.054, Whole_Powerd2= 51675.1, ITE_Power= 49140.7, HVAC_Power=  2534.4, Act1= 10.000, Act2= 13.549, Act3=  1.850, Act4=  2.430
compute_reward: Tenv=  3.300, Tz1= 23.499, Tz2= 22.837, PUE=  1.054, Whole_Powerd2= 51627.9, ITE_Power= 49140.7, HVAC_Power=  2487.2, Act1= 10.000, Act2= 10.255, Act3=  1.775, Act4=  1.898

compute_reward: Tenv=  3.300, Tz1= 23.656, Tz2= 23.663, PUE=  1.051, Whole_Powerd2= 51641.3, ITE_Power= 49140.7, HVAC_Power=  2500.7, Act1= 10.000, Act2= 10.000, Act3=  1.750, Act4=  2.116
compute_reward: Tenv=  3.300, Tz1= 23.656, Tz2= 23.663, PUE=  1.051, Whole_Powerd2= 53898.1, ITE_Power= 51381.8, HVAC_Power=  2516.3, Act1= 10.000, Act2= 13.112, Act3=  1.750, Act4=  2.309
compute_reward: Tenv=  3.300, Tz1= 23.656, Tz2= 23.663, PUE=  1.051, Whole_Powerd2= 53920.3, ITE_Power= 51381.8, HVAC_Power=  2538.5, Act1= 10.000, Act2= 11.929, Act3=  1.968, Act4=  2.371
compute_reward: Tenv=  3.300, Tz1= 23.656, Tz2= 23.663, PUE=  1.051, Whole_Powerd2= 53895.5, ITE_Power= 51381.8, HVAC_Power=  2513.7, Act1= 10.000, Act2= 10.630, Act3=  1.750, Act4=  2.278
compute_reward: Tenv=  3.300, Tz1= 23.656, Tz2= 23.663, PUE=  1.051, Whole_Powerd2= 53927.2, ITE_Power= 51381.8, HVAC_Power=  2545.5, Act1= 10.000, Act2= 10.000, Act3=  1.750, Act4=  2.633
compute_reward: Tenv=  3.300, Tz1= 23.656, Tz2= 23.663, PUE=  1.051, Whole_Powerd2= 59131.9, ITE_Power= 51381.8, HVAC_Power=  7750.1, Act1= 10.000, Act2= 10.000, Act3=  1.753, Act4=  2.172

compute_reward: Tenv=  3.300, Tz1= 23.631, Tz2= 23.692, PUE=  1.049, Whole_Powerd2= 53878.9, ITE_Power= 51381.8, HVAC_Power=  2497.1, Act1= 10.000, Act2= 17.955, Act3=  1.750, Act4=  2.071
compute_reward: Tenv=  3.300, Tz1= 23.631, Tz2= 23.692, PUE=  1.049, Whole_Powerd2= 50309.8, ITE_Power= 47762.1, HVAC_Power=  2547.7, Act1= 10.000, Act2= 10.000, Act3=  1.750, Act4=  2.319
compute_reward: Tenv=  3.300, Tz1= 23.631, Tz2= 23.692, PUE=  1.049, Whole_Powerd2= 50292.2, ITE_Power= 47762.1, HVAC_Power=  2530.1, Act1= 10.000, Act2= 15.821, Act3=  1.750, Act4=  2.468
compute_reward: Tenv=  3.300, Tz1= 23.631, Tz2= 23.692, PUE=  1.049, Whole_Powerd2= 50322.1, ITE_Power= 47762.1, HVAC_Power=  2560.0, Act1= 10.000, Act2= 10.000, Act3=  2.097, Act4=  2.405

compute_reward: Tenv=  3.300, Tz1= 23.529, Tz2= 22.661, PUE=  1.053, Whole_Powerd2= 50290.3, ITE_Power= 47762.1, HVAC_Power=  2528.1, Act1= 10.000, Act2= 13.191, Act3=  1.880, Act4=  2.332
compute_reward: Tenv=  3.300, Tz1= 23.529, Tz2= 22.661, PUE=  1.053, Whole_Powerd2= 48077.3, ITE_Power= 45573.1, HVAC_Power=  2504.2, Act1= 10.000, Act2= 10.000, Act3=  1.750, Act4=  2.138
compute_reward: Tenv=  3.300, Tz1= 23.529, Tz2= 22.661, PUE=  1.053, Whole_Powerd2= 48067.5, ITE_Power= 45573.1, HVAC_Power=  2494.4, Act1= 10.000, Act2= 11.096, Act3=  1.750, Act4=  2.035

compute_reward: Tenv=  3.175, Tz1= 23.514, Tz2= 22.818, PUE=  1.055, Whole_Powerd2= 48062.3, ITE_Power= 45573.1, HVAC_Power=  2489.2, Act1= 10.000, Act2= 12.508, Act3=  1.750, Act4=  1.967
compute_reward: Tenv=  3.175, Tz1= 23.514, Tz2= 22.818, PUE=  1.055, Whole_Powerd2= 51150.1, ITE_Power= 48606.4, HVAC_Power=  2543.7, Act1= 10.000, Act2= 14.416, Act3=  1.750, Act4=  1.943
...

I inserted several blank lines to group rows with the same Tz1 and Tz2. Each group probably corresponds to a "Timestep". As you said, ITE_Power does not change within a group, but HVAC_Power does.

Synchronizing with the RL agent at every "Timestep" instead of every "System Timestep" seems like a good idea. Let me think about how to detect when a "Timestep" advances in the EMS script.

antoine-galataud commented 5 years ago

@takaomoriyama I was hoping to find a simple way to differentiate system timesteps from zone timesteps, but so far I couldn't. The only way I found is to compare temperatures between steps: if they don't change, we're still in the same timestep. This isn't very reliable if the observed variables change, but it could also be an abstract method to implement for each model (returning the list of variable names to compare).

I've implemented this solution for a different model than the 2-zone DC and it works quite well: it does reduce training time, and the trained policy performs slightly better. I also found references to such an implementation in Deep Reinforcement Learning for Building HVAC Control (they refer to it as the "control time step").

I can propose it as a PR if you judge it interesting.
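The detection idea described above could be sketched as an environment wrapper along these lines (hypothetical names and observation layout; this is not the repo's actual API, just an illustration of repeating an action until the watched variables change):

```python
# Hypothetical sketch: keep stepping at system-timestep granularity, but
# only return control to the agent once the watched variables (e.g. zone
# temperatures Tz1, Tz2) change, i.e. once a new zone timestep begins.
# Names, indices, and the 4-tuple step API are assumptions for this sketch.

def watched(obs, indices=(1, 2)):
    """Variables compared to detect a new zone timestep (here Tz1, Tz2
    by position in the observation vector)."""
    return tuple(obs[i] for i in indices)

class SkipSystemTimesteps:
    def __init__(self, env, max_skips=60):
        self.env = env
        self.max_skips = max_skips  # safety bound against infinite loops
        self.last = None

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        skips = 0
        # Repeat the same action while the watched variables are unchanged,
        # i.e. while EnergyPlus is still inside the same zone timestep.
        # (In practice a float tolerance may be safer than exact equality.)
        while (not done and self.last is not None
               and watched(obs) == self.last and skips < self.max_skips):
            obs, reward, done, info = self.env.step(action)
            skips += 1
        self.last = watched(obs)
        return obs, reward, done, info
```

As noted above, the list of watched variables could be supplied per model via an abstract method rather than hard-coded indices.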

OnedgeLee commented 5 years ago

> @takaomoriyama I was hoping to find a simple way to differentiate system timesteps from zone timesteps, but so far I couldn't. The only way I found is to compare temperatures between steps: if they don't change, we're still in the same timestep. This isn't very reliable if the observed variables change, but it could also be an abstract method to implement for each model (returning the list of variable names to compare).
>
> I've implemented this solution for a different model than the 2-zone DC and it works quite well: it does reduce training time, and the trained policy performs slightly better. I also found references to such an implementation in Deep Reinforcement Learning for Building HVAC Control (they refer to it as the "control time step").
>
> I can propose it as a PR if you judge it interesting.

Controlling based on your reference would have some problems, I think. If I check the minutes, the same time can recur twice (example: 15m -> 30m -> 20m -> 30m ..., 30m repeated).

Because the policy includes a random process, a different action would be applied each of those two times.

I was trying to detect timestep temperature differences over 0.3 °C and compute the system timestep length from them, but I got unexpected results.

unexpected result

The maximum zone temperature change is 0.5293 °C. According to the reference and the EnergyPlus code, the number of system timesteps should be 2, but 6 system timesteps were applied. (screenshot: 2019-07-12 12:03 PM)

Could you give me some advice? I can't explain why this is happening. I'm struggling with this and having a hard time applying RL the right way.
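The estimate being applied above can be sketched as follows. This is my reading of the adaptive system-timestep rule referenced in the thread (EnergyPlus shortens the system timestep so the zone temperature change per step stays under MaxZoneTempDiff, 0.3 °C by default, with the step never shorter than 1 minute); treat the rule and the function name as assumptions, not the repo's code:

```python
import math

# Hedged sketch: assumes EnergyPlus picks enough system timesteps that
# the zone temperature change per step stays under MaxZoneTempDiff
# (default 0.3 C), bounded below by 1 step and above by 1 step/minute.
def expected_system_steps(max_zone_temp_change,
                          zone_timestep_minutes=15,
                          max_zone_temp_diff=0.3):
    n = math.ceil(max_zone_temp_change / max_zone_temp_diff)
    return min(max(n, 1), zone_timestep_minutes)

# Under this rule, a 0.5293 C change gives ceil(0.5293 / 0.3) = 2
# expected system steps, which is why observing 6 looks inconsistent.
```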

takaomoriyama commented 5 years ago

@antoine-galataud I checked with your code and it seems OK. There is some speed-up in convergence time, as shown below. Original code:

(plot: eprl-910)

With Antoine's patch:

(plot: eprl-910-skip-system-timesteps)