IBM / rl-testbed-for-energyplus

Reinforcement Learning Testbed for Power Consumption Optimization using EnergyPlus
MIT License
177 stars 74 forks

Shift in collected observations by one timestep #64

Closed antoine-galataud closed 2 years ago

antoine-galataud commented 2 years ago

Found this problem while working on a model where the HVAC system shuts off at night and the EnergyPlus built-in variable CurrentTime is used as an observation.

Problem description

On this model, HVAC starts at 6:00 AM. Current time is reported as an observation, as fractional time. On a given episode, I get the following incorrect trajectory:

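| tmp1 | tmp2 | tmp3 | time | remaining ts | rew | action |
|------|------|------|------|--------------|----------|--------|
| 16.0 | 11.3 | 5.3 | 6.25 | 51.0 | 1.000000 | 20.0 |
| 16.1 | 15.7 | 15.6 | 6.50 | 50.0 | 0.373034 | 23.5 |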

What happens:

The trajectory we feed to the agent is the following:

ts 1 (e.g. 6:15), obs vector at ts -2 (before HVAC on), rew for obs at ts -2, default action
ts 2 (e.g. 6:30), obs vector at ts -2 (default action), rew for obs at ts -2, act 1
ts 3 (e.g. 6:45), obs vector at ts -2 (act 1), rew for obs at ts -2, act 2
...

This is a shift of 1 timestep for the observations sent to the agent and for computing the reward.

This can be confirmed by asking EnergyPlus to output results in CSV or SQL:

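| time | tmp1 | tmp2 | tmp3 | tmp4 | tmp5 |
|---------------------|-----------|-----------|-----------|-----------|-----------|
| 2020-01-01 06:00:00 | 7.740999 | 7.270150 | 15.277604 | 4.518456 | 6.046737 |
| 2020-01-01 06:15:00 | 15.487861 | 15.406199 | 16.012610 | 15.586937 | 15.586152 |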

Here we can see that there is no such shift (indoor temperatures have increased already at 06:15).
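One way to quantify such a shift is to compare the temperatures logged on the agent side with the ones EnergyPlus reports, and look for the lag that makes the two series match best. The sketch below is only an illustration: `estimate_lag` is a hypothetical helper, and the two series are made-up values, not data from this issue.

```python
def mean_abs_error(a, b):
    """Mean absolute difference between two equally long sequences."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def estimate_lag(logged, reference, max_lag=3):
    """Return the lag k (in timesteps) for which logged[t] best matches reference[t - k]."""
    best_lag, best_err = 0, float("inf")
    for lag in range(max_lag + 1):
        shifted_logged = logged[lag:]
        trimmed_ref = reference[:len(reference) - lag]
        n = min(len(shifted_logged), len(trimmed_ref))
        if n == 0:
            continue
        err = mean_abs_error(shifted_logged[:n], trimmed_ref[:n])
        if err < best_err:
            best_lag, best_err = lag, err
    return best_lag


# Illustrative values only: "reference" stands for the EnergyPlus CSV output,
# "logged" for what the agent received, delayed by one timestep.
reference = [7.7, 15.5, 16.1, 16.4, 16.6, 16.7]
logged = [reference[0]] + reference[:-1]

print(estimate_lag(logged, reference))
```

A result of 0 would mean both series are aligned; any positive value is the number of timesteps by which the agent-side observations trail the simulation output.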

The correct trajectory should be:

ts 1 (e.g. 6:15), obs vector at ts -1 (default action), rew for obs at ts -1, default action
ts 2 (e.g. 6:30), obs vector at ts -1 (act 1), rew for obs at ts -1, act 1
ts 3 (e.g. 6:45), obs vector at ts -1 (act 2), rew for obs at ts -1, act 2

Reason

I believe the wrong calling point is used for the EMS program. The one currently used is AfterPredictorAfterHVACManagers. As stated in https://bigladdersoftware.com/epx/docs/8-2/input-output-reference/group-energy-management-system-ems.html#energymanagementsystemprogramcallingmanager and https://bigladdersoftware.com/epx/docs/8-2/ems-application-guide/ems-calling-points.html#ems-calling-points, this calling point happens before HVAC calculations, so variable and meter values are still those from the previous time step.

Solution

Apply the action(s) at the AfterPredictorAfterHVACManagers calling point, but collect observations after the zone time step is done, e.g. at the EndOfZoneTimestepAfterZoneReporting calling point.
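The effect of the collection point can be illustrated with a toy model. This is not EnergyPlus: the thermal model (temperature moves halfway toward the setpoint each timestep), the function `run_episode` and all numbers are assumptions made for the sketch. Only the two calling point names come from the discussion above.

```python
def run_episode(actions, collect_at_end, t0=5.0, gain=0.5):
    """Simulate a toy zone and return the observations seen by the agent.

    collect_at_end=False mimics collecting observations at
    AfterPredictorAfterHVACManagers (before HVAC calculations);
    collect_at_end=True mimics EndOfZoneTimestepAfterZoneReporting.
    """
    temp = t0
    observations = []
    for setpoint in actions:
        # Early calling point: the action is applied here, but the zone
        # temperature is still the one from the previous timestep.
        if not collect_at_end:
            observations.append(temp)
        # Toy stand-in for the HVAC and zone calculations of this timestep.
        temp = temp + gain * (setpoint - temp)
        # Late calling point: values for this timestep are final.
        if collect_at_end:
            observations.append(temp)
    return observations


actions = [20.0, 23.5, 23.5]
early = run_episode(actions, collect_at_end=False)
late = run_episode(actions, collect_at_end=True)
print(early)  # observations trail the simulation by one timestep
print(late)   # observations reflect the action just applied
```

In this toy setting, the early-collected series is the late-collected series delayed by one timestep, which is the kind of shift reported in this issue.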

antoine-galataud commented 2 years ago

@takaomoriyama please let me know your thoughts on this. I have tested the solution on one of my environments, and it does fix the reported issue. Performance (episode reward mean) is only marginally improved, but convergence is faster (tested on 4 trials).

antoine-galataud commented 2 years ago

I added an example of EnergyPlus model change to fix this problem. See #65

ZHANG-QINGANG commented 2 years ago

@antoine-galataud @takaomoriyama

Hi Antoine,

I encountered a similar problem when I checked the trajectory of ITE and HVAC powers.

First, I recorded the observations of the testbed by running the following pseudo-code (the calling managers are all AfterPredictorAfterHVACManagers).

Reset the environment and get observation o
for i in range(n), run:
    Generate action a according to o
    Execute a and obtain new observation o2
    Store (o, a)
    o ← o2

The results are shown below:

image

I found that the power terms in the observation do not follow the CPU utilization rate, and there is always a “1 step latency” (row 10 and row 18) every time the utilization changes.

For row 10, the CPU utilization is 0.75. According to the working process of E+, the zone load predictor predicts “ITEPower” from the scheduled utilization of 0.75. The predicted value should be around 102731, similar to rows 10 to 17. E+ then adjusts the HVAC manager accordingly. However, the value observed by the testbed for row 10 is 76788, which is similar to the value when the utilization is 0.5. The actions (E10, F10) are then generated from the observed values (B10, C10, D10). Thus, the supply flow rate generated by the external controller is around 3.8, similar to the value when the utilization rate is 0.5.

This is where the problem arises. Because the ITEPower predicted by E+ is around 102731 while the supply flow rate is generated from 76788, the zone temperatures will be higher than the desired value (H10 and I10; the target is 25℃). Therefore, I guess that the ITEPower observed through “@ExtCtrlObs” is the result of the previous time step.

I also tested the solution you proposed here. However, I got "0" for all power terms.
image

To find the reason, I read the EMS manual. At time step t_1, the program “ExtCtrlBasedSetpointManager” is called by the calling manager at “AfterPredictorAfterHVACManagers.” However, at this point, the sensor value of “Facility Total Building Electric Demand Power” has not yet been updated according to the CPU utilization rate U_schedule^1. Thus, the value observed by the testbed is still P_ITE^0, which is calculated from U_schedule^0. This is no longer in line with the control logic of E+: predict the load based on the utilization at time step t_1 and then adjust the HVAC actuators.

image image

My proposed solution:

The results show that the ITE power follows the utilization schedule, as indicated in K10 of the Excel sheet.

At the same time, the Markov Decision Process becomes clearer. For example, all terms in s(t) are the result of executing a(t-1) at s(t-1).
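A compact way to write the alignment described above, as a sketch only (the transition kernel $P$ and the lagged observation $\tilde{s}_t$ are notation introduced here, not taken from the testbed):

```latex
s_t \sim P(\,\cdot \mid s_{t-1}, a_{t-1}) \quad \text{(aligned)}, \qquad
\tilde{s}_t = s_{t-1} \sim P(\,\cdot \mid s_{t-2}, a_{t-2}) \quad \text{(one-step lag)}
```

With a one-step lag, the state handed to the agent at time $t$ does not yet reflect the action $a_{t-1}$ it just took.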

image

Hope I made it clear. I'm looking forward to your opinion.

antoine-galataud commented 2 years ago

@ZHANG-QINGANG Hi! Thank you for the detailed issue report.

At first sight, this looks like a slightly different problem than the original one: in the original issue there's a gap of one EnergyPlus time step between action decided by agent and observations collected (observations collected and reward computed on t-2 instead of t-1). In your report, the shift seems to come from within the same observation, but actions and observations seem well aligned, assuming that your excel spreadsheet:

See below for illustration:

image

Then I'd say that your solution should come as a complement.

My question is about your tests of the solution proposed for the original issue, which consists of using EndOfZoneTimestepAfterZoneReporting for observation collection, and the fact that it results in 0 W of power reported by EnergyPlus. Collecting observations near the end of the timestep should benefit zone-related values such as indoor temperatures and meters. This is what I observe in my projects (they mainly use HVAC power), and it is also stated in the E+ documentation:

EndOfZoneTimestepAfterZoneReporting. This calling point happens each zone timestep just after the output reports are updated for zone-timestep variables and meters. This calling point is the last one of a timestep and is useful for making control decisions for the next zone timestep using the final meter values for the current zone timestep.

Could you please share your IDF file, or at least the EMS part?

ZHANG-QINGANG commented 2 years ago

Hi @antoine-galataud,

I attached my modified IDF file. It follows the structure of the 2-zone model provided by this testbed. Note that I generated the Excel sheet using the following pseudo-code, not the CSV file reported by E+. The following figure shows the code I used to record data. In each row, the temperature terms are from "o2". I think this is reasonable, since the temperature in "o2" is the result of executing "a" at "o". What do you think?

I tested your proposed solution by directly running the IDF model you modified. I got "0" when I tried to record the observations. The "0" looks a little odd to me, but I have not found the reason so far.

In my IDF file, I added several sensors to collect the equipment power. I also wrote a program named "CalculatePredictor" to calculate the cooling load.

image

Reset the environment and get observation o
for i in range(n), run:
    Generate action a according to o
    Execute a and obtain new observation o2
    Store (o, a, o2)
    o ← o2
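The recording loop above could look roughly like this in Python. This is a minimal sketch: `StubEnv` is a hypothetical stand-in with a gym-style `reset`/`step` interface and a made-up temperature model, and `record` and the constant policy are illustrative names, not part of the testbed.

```python
import random


class StubEnv:
    """Hypothetical stand-in for the testbed environment (gym-style API)."""

    def __init__(self, seed=0):
        self._rng = random.Random(seed)
        self._temp = 25.0

    def reset(self):
        self._temp = 25.0
        return [self._temp]

    def step(self, action):
        # Made-up dynamics: drift toward the action, plus a little noise.
        self._temp += 0.1 * (action - self._temp) + self._rng.uniform(-0.05, 0.05)
        obs = [self._temp]
        reward = -abs(self._temp - 25.0)
        return obs, reward, False, {}


def record(env, policy, n):
    """Run n steps and store one (o, a, o2) row per step."""
    rows = []
    o = env.reset()
    for _ in range(n):
        a = policy(o)                      # generate action a according to o
        o2, _reward, done, _info = env.step(a)
        rows.append((o, a, o2))            # o2 is the result of executing a at o
        o = o2
        if done:
            break
    return rows


rows = record(StubEnv(), policy=lambda o: 22.0, n=5)
for o, a, o2 in rows:
    print(o, a, o2)
```

Storing `o2` next to `(o, a)` makes each row self-contained: the next-observation column of one row is the observation column of the following row, which makes misalignments easier to spot.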

2ZoneDataCenterHVAC_wEconomizer_Temp_Fan_ZhangQingang.zip

antoine-galataud commented 2 years ago

Thank you @ZHANG-QINGANG for sharing your IDF file. It allowed me to find a problem in the original fix I provided. The statement:

SET tmp_val1 = @ExtCtrlAct 0 7

should be in the ExtCtrlBasedSetpointManager program, and not in ExtCtrlBasedObservationCollector.

Please find attached your file with a fix. Let me know if that works for you.

Thanks

2ZoneDataCenterHVAC_wEconomizer_Temp_Fan_ZhangQingang.idf.zip

ZHANG-QINGANG commented 2 years ago

Hi @antoine-galataud ,

I tested your new solution. The "0" reading issue is solved and the CPU utilization and ITEPower are now aligned.

However, it seems the zone temperature still cannot be well controlled, as illustrated in the following figure.

image

I guess, even though the observation was correct, the corresponding actions will be executed in the next time step.

image

What do you think?

antoine-galataud commented 2 years ago

@ZHANG-QINGANG I would say that the alignment problem is now solved, but the learned policy is sub-optimal, as it is unable to anticipate increases in CPU utilization and/or zone temperatures. But this is a somewhat different problem.

Now the alignment seems correct and we observe in your last spreadsheet, for instance:

It's a partially observable MDP: nothing in the state collected at row 8 can lead to a transition towards a greater decided action.

ZHANG-QINGANG commented 2 years ago

@antoine-galataud
Yes, I agree with your opinion: it is unable to anticipate increases in CPU utilization and/or zone temperatures. Your solution is good.

My only concern is that, if the working process follows the figure in my last message, it might not follow the control logic of E+: predict the equipment load, then adjust the HVAC manager accordingly. The control would then be quite passive. Problems may arise when the utilization changes frequently and drastically. Here I attach a case where the utilization changes from 0.2 to 0.8. Of course, this can be left for future research.

It was a pleasure discussing this with you, very helpful. :-)

image

antoine-galataud commented 2 years ago

@ZHANG-QINGANG thank you, glad I could help. A pleasure for me too!

antoine-galataud commented 2 years ago

@takaomoriyama the problem is now fixed with the new fix I pushed in #77. I suggest we re-close this issue, unless you think otherwise.

takaomoriyama commented 2 years ago

@antoine-galataud OK. That's great. If we need some more discussion, let's open another issue.

@ZHANG-QINGANG Thank you for your contribution.