davidusb-geek / emhass

emhass: Energy Management for Home Assistant, is a Python module designed to optimize your home energy interfacing with Home Assistant.
MIT License
263 stars 51 forks source link

CPU and memory usage increases when solver cannot find a solution #124

Closed sladey closed 3 months ago

sladey commented 8 months ago

I recently added new loads into my home that I am gradually bringing under control of EMHASS. In the meantime they are seen by EMHASS as uncontrolled loads (you can see these loads in the ML load forecast). This means there is reduced "space" to accomodate my existing controlled loads (EV). This means I am getting lots of "infeasible" runs of the solver. This seems to result in some resource leakage. I am see the CPU and memory usage of the EMHASS container increase over time and the same for the host machine (see screenshot below). This eventually leads to waitress task queuing and when the queue reaches a certain limit the CPU for the container gets pegged at 99%. While this is happening the solver is not bring run and so no HASS sensors are changing.

I am also getting an exception in this code that runs the solver which I cannot see the details of (because the code eats the exception) and I think speaks to something else going on, ie something different than just getting an infeasible result from the solver.

try: if self.lp_solver == 'PULP_CBC_CMD': opt_model.solve(PULP_CBC_CMD(msg=0)) elif self.lp_solver == 'GLPK_CMD': opt_model.solve(GLPK_CMD(msg=0)) elif self.lp_solver == 'COIN_CMD': opt_model.solve(COIN_CMD(msg=0, path=self.lp_solver_path)) else: self.logger.error("Invalid solver name passed") except Exception: self.logger.error("It was not possible to find a valid solver for Pulp package")

To Reproduce Repeated get infeasible results from the solver.

Expected behavior Managed CPU and memory usage

Screenshots

Screenshot 2023-10-24 at 6 13 13 pm

Home Assistant installation type

Your hardware

EMHASS installation type

paul-roev commented 8 months ago

I have simplified the setup be returning to the naive load for caster and two small deferrable loads. After 10-20 runs of the optimizer it seems to stop returning (I assume from the solver) and then queuing starts and CPU increases.

davidusb-geek commented 8 months ago

We need to understand why are you getting infeasible runs. That will just mean that there are too many constraints for a possible solution. Do you have a battery? What are your settings for treat_def_as_semi_cont and set_def_constant?

sladey commented 8 months ago

Hey David, here is my config and the MPC call with its runtime params.

` hass_url: empty long_lived_token: empty costfun: self-consumption logging_level: DEBUG optimization_time_step: 30 historic_days_to_retrieve: 5 method_ts_round: first set_total_pv_sell: false lp_solver: COIN_CMD lp_solver_path: /usr/bin/cbc set_nocharge_from_grid: false set_nodischarge_to_grid: false set_battery_dynamic: false battery_dynamic_max: 0.9 battery_dynamic_min: -0.9 load_forecast_method: naive sensor_power_photovoltaics: sensor.powerwall_solar_now_watts sensor_power_load_no_var_loads: sensor.load_with_no_var_loads number_of_deferrable_loads: 2 list_nominal_power_of_deferrable_loads:

Here is the MPC call. the load price forecast is just a fixed values that follow my TOU tariff.

{ "load_cost_forecast": {{(state_attr('input_text.load_cost_tariff', 'rates'))}}, "pv_power_forecast": {{ pv | tojson }}, "prediction_horizon": 48, "alpha": {{states('sensor.emhass_dynamic_alpha')}}, "beta": {{states('sensor.emhass_dynamic_beta')}}, "soc_init": {{states('sensor.powerwall_charge')|float(0)/100}}, "soc_final": 0.15, "num_def_loads": 2, "P_deferrable_nom": [6000, 4500], "treat_def_as_semi_cont": ["false","true"], "set_def_constant": ["false","false"], "def_total_hours": [ {{states('sensor.def_total_hours_ev')}}, {{states('sensor.def_total_hours_pool_filter')}}] }

the PV forecast is from Solcast. the alpha and beta values are as follows:

when battery SOC < 98%, high beta, low alpha. when SOC > 98% low beta, high alpha.

def_hours for the EV is typically < 4 def hours for the pool pump is 10 at the start of the day and reduces according to the actual time the pool pump is run during the day. it then resets to 10 after midnight each day.

The behaviour I see is:

  1. Lots of infeasible status's from the solver.
  2. CPU goes up
  3. MPC runs start to time out. I see "Perform an iteration of a naive MPC controller" in the logs but I don't see "Status: ....".
  4. Waitress queuing starts to happen.
  5. CPU gets to 99% and I need to restart the add on.

The other thing I sometimes see is "It was not possible to find a valid solver for Pulp package" and then additional python errors because many variables have the value of 'None'.

purcell-lab commented 8 months ago

Here is the MPC call. the load price forecast is just a fixed values that follow my TOU tariff.

Can you expand this MPC call inside the developer tools so we can see the expanded variables?

sladey commented 8 months ago

Here is some debug logging from the rest command:

Url: http://localhost:5000/action/naive-mpc-optim. Status code 500. Payload: b'{ "load_cost_forecast": [0.24915, 0.24915, 0.24915, 0.24915, 0.24915, 0.24915, 0.24915, 0.24915, 0.24915, 0.24915, 0.24915, 0.24915, 0.46178, 0.46178, 0.46178, 0.46178, 0.46178, 0.46178, 0.46178, 0.46178, 0.46178, 0.46178, 0.46178, 0.46178, 0.24915, 0.24915, 0.24915, 0.24915, 0.19239, 0.19239, 0.19239, 0.19239, 0.19239, 0.19239, 0.19239, 0.19239, 0.19239, 0.19239, 0.19239, 0.19239, 0.19239, 0.19239, 0.19239, 0.19239, 0.19239, 0.19239, 0.24915, 0.24915], "pv_power_forecast": [4918, 9760, 12225, 13392, 13985, 13321, 12737, 12665, 13219, 13585, 13849, 13968, 13611, 13012, 12101, 10823, 9486, 8035, 6354, 4366, 2271, 583, 73, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 208, 937, 2893, 4746, 6423, 7611, 8852, 10018, 10860, 11482, 12271, 12907, 13264, 13475, 13526, 13416, 12506, 10914, 9266, 7682, 6137, 4889, 3752, 2299, 826, 208, 29, 0, 0, 0, 0, 0, 0, 0, 0, 0], "prediction_horizon": 48, "alpha": 0.2, "beta": 0.8, "soc_init": 0.15, "soc_final": 0.15, "num_def_loads": 2, "P_deferrable_nom": [6000, 4500], "treat_def_as_semi_cont": ["false","true"], "set_def_constant": ["false","false"], "def_total_hours": [ 1, 6] }'

You can see that this call returned a 500 from the action. This is the other error I see (other than "status infeasible") in the EMHASS logs. I think there is something to learn by understanding the details of the exception being thrown when the solver is called. What the code does today is swallow the exception. I assume that the exceptions and the subsequent python errors are resulting in resources being used up in some way.

2023-11-03 08:00:00,230 - web_server - INFO - Setting up needed data 2023-11-03 08:00:00,232 - web_server - INFO - Retrieve hass get data method initiated... 2023-11-03 08:00:00,455 - web_server - INFO - Retrieving weather forecast data using method = list 2023-11-03 08:00:00,456 - web_server - INFO - Retrieving data from hass for load forecast using method = naive 2023-11-03 08:00:00,456 - web_server - INFO - Retrieve hass get data method initiated... 2023-11-03 08:00:00,735 - web_server - INFO - >> Performing naive MPC optimization... 2023-11-03 08:00:00,735 - web_server - INFO - Performing naive MPC optimization 2023-11-03 08:00:00,739 - web_server - INFO - Perform an iteration of a naive MPC controller 2023-11-03 08:00:02,144 - web_server - ERROR - It was not possible to find a valid solver for Pulp package 2023-11-03 08:00:02,144 - web_server - INFO - Status: Not Solved 2023-11-03 08:00:02,145 - web_server - WARNING - Cost function cannot be evaluated, probably None 2023-11-03 08:00:02,147 - web_server - ERROR - Exception on /action/naive-mpc-optim [POST] Traceback (most recent call last): File "/usr/local/lib/python3.9/dist-packages/flask/app.py", line 1455, in wsgi_app response = self.full_dispatch_request() File "/usr/local/lib/python3.9/dist-packages/flask/app.py", line 869, in full_dispatch_request rv = self.handle_user_exception(e) File "/usr/local/lib/python3.9/dist-packages/flask/app.py", line 867, in full_dispatch_request rv = self.dispatch_request() File "/usr/local/lib/python3.9/dist-packages/flask/app.py", line 852, in dispatch_request return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args) File "/usr/local/lib/python3.9/dist-packages/emhass/web_server.py", line 204, in action_call opt_res = naive_mpc_optim(input_data_dict, app.logger) File "/usr/local/lib/python3.9/dist-packages/emhass/command_line.py", line 275, in naive_mpc_optim opt_res_naive_mpc = input_data_dict['opt'].perform_naive_mpc_optim( File "/usr/local/lib/python3.9/dist-packages/emhass/optimization.py", line 575, in perform_naive_mpc_optim self.opt_res = self.perform_optimization(df_input_data, P_PV.values.ravel(), P_load.values.ravel(), File "/usr/local/lib/python3.9/dist-packages/emhass/optimization.py", line 418, in perform_optimization opt_tp["P_grid"] = [P_grid_pos[i].varValue + P_grid_neg[i].varValue for i in set_I] File "/usr/local/lib/python3.9/dist-packages/emhass/optimization.py", line 418, in opt_tp["P_grid"] = [P_grid_pos[i].varValue + P_grid_neg[i].varValue for i in set_I] TypeError: unsupported operand type(s) for +: 'NoneType' and 'NoneType'

sladey commented 8 months ago

I have reduced to one load, here is the call to the optimiser and it results in an infeasible status.

Url: http://localhost:5000/action/naive-mpc-optim. Status code: 201. Payload: b'{ "load_cost_forecast": [0.24915, 0.24915, 0.24915, 0.24915, 0.24915, 0.24915, 0.24915, 0.24915, 0.24915, 0.24915, 0.46178, 0.46178, 0.46178, 0.46178, 0.46178, 0.46178, 0.46178, 0.46178, 0.46178, 0.46178, 0.46178, 0.46178, 0.24915, 0.24915, 0.24915, 0.24915, 0.19239, 0.19239, 0.19239, 0.19239, 0.19239, 0.19239, 0.19239, 0.19239, 0.19239, 0.19239, 0.19239, 0.19239, 0.19239, 0.19239, 0.19239, 0.19239, 0.19239, 0.19239, 0.24915, 0.24915, 0.24915, 0.24915], "pv_power_forecast": [3270, 13392, 13985, 13321, 12737, 12665, 13219, 13585, 13849, 13968, 13611, 13012, 12101, 10823, 9486, 8035, 6354, 4366, 2271, 583, 73, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 208, 937, 2893, 4746, 6423, 7611, 8852, 10018, 10860, 11482, 12271, 12907, 13264, 13475, 13526, 13416, 12506, 10914, 9266, 7682, 6137, 4889, 3752, 2299, 826, 208, 29, 0, 0, 0, 0, 0, 0, 0, 0, 0], "prediction_horizon": 48, "alpha": 0.2, "beta": 0.8, "soc_init": 0.15, "soc_final": 0.15, "num_def_loads": 1, "P_deferrable_nom": [6000], "treat_def_as_semi_cont": ["false"], "set_def_constant": ["false"], "def_total_hours": [ 1] }'

sladey commented 8 months ago

With the above set of def loads I am getting an optimal result. So I guess the main issues here is why the solver sometimes times out, what is the exception that the solver sometimes throws and does that lead to resource leakage. if you want to send a build with more debug info I am happy to install it. I can build an image locally from a dev branch if required.

sladey commented 8 months ago

Here is my latest run with two def loads and low def_total_hours:

Success. Url: http://localhost:5000/action/naive-mpc-optim. Status code: 201. Payload: b'{ "load_cost_forecast": [0.46178, 0.46178, 0.46178, 0.46178, 0.24915, 0.24915, 0.24915, 0.24915, 0.19239, 0.19239, 0.19239, 0.19239, 0.19239, 0.19239, 0.19239, 0.19239, 0.19239, 0.19239, 0.19239, 0.19239, 0.19239, 0.19239, 0.19239, 0.19239, 0.19239, 0.19239, 0.24915, 0.24915, 0.24915, 0.24915, 0.24915, 0.24915, 0.24915, 0.24915, 0.24915, 0.24915, 0.24915, 0.24915, 0.24915, 0.24915, 0.24915, 0.24915, 0.24915, 0.24915, 0.24915, 0.24915, 0.24915, 0.24915], "pv_power_forecast": [475, 393, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 193, 901, 2812, 4591, 6245, 7639, 9116, 10439, 11362, 11894, 12670, 13400, 13876, 14144, 14151, 13879, 13175, 12127, 10955, 9656, 8010, 6414, 4835, 2948, 1177, 264, 29, 0, 0, 0, 0, 0, 0, 0, 0, 0], "prediction_horizon": 48, "alpha": 0.2, "beta": 0.8, "soc_init": 0.75, "soc_final": 0.15, "num_def_loads": 2, "P_deferrable_nom": [4000,4000], "treat_def_as_semi_cont": ["false","false"], "set_def_constant": ["false","false"], "def_total_hours": [ 2, 1] }'

The EMHASS log file shows that it cannot find a feasible solution. Yet if you look at the graph of the published forecasts, it looks very feasible to me.

Screenshot 2023-11-03 at 6 32 44 pm

You can 4x30min blocks for the EV and 2x30min blocks for the pool pump.

sladey commented 7 months ago

@davidusb-geek how can I help you investigate this resource usage issue? With what seems like very feasible scenarios (in terms of PV forecast, un-controllable loads, and def_hours for each controllable load) I keep getting infeasible results and I need to restart the add-on every couple of days because it starts using 99% CPU and eventually calls to the MPC end point do not return.

davidusb-geek commented 7 months ago

Hi. Like said before this is a very strange issue. One thing that you can do is to carefully inspect and analyze the results thrown by EMHASS when obtaining un feasible situations. Copy the result table contents from the webui into an Excel spreadsheet and analyze row by row to look for any data condition that would lead you to something that is just not numerically possible. A second possible thing to try will be a different solver. You can try going into the docker container of the add-on and just install a new solver. Then provide the correct path to the new solver on the add-on configuration.

sladey commented 7 months ago

Hey David, I have tried several solvers since there are installed in the docker image already. The behaviour is the same. I'll do some excel number crunching.

sladey commented 7 months ago

@davidusb-geek when I put the optimizer results in excel here is what I see consistently:

P_PV + P_GRID + P_BAT - P_LOAD = (P_DEF0 + P_DEF1) + 1

Is this what you expect? It seems correct to me. It shows that generation plus grid plus storage is just over the controllable and uncontrollable loads.

davidusb-geek commented 7 months ago

Yes that seems correct. But your CPU problem may be related to some situations that are forcing an unfeasible condition. When that condition happen that Excel check should be failing. The hope is to find why it is failing. Maybe some unexpected values on the input data? Are you using nodered? I have seen in the past that it can be a source of problems...

sladey commented 7 months ago

When I collected those stats in excel, the solver was returning "infeasible". I am not using node-red. I am trying to keep things as simple as possible with the naive load forecast method and solcast providing the PV forecast. Do you think it is worth me running outside of HA? ie do a docker based remote setup on my laptop?

davidusb-geek commented 7 months ago

Ok I understand. We need to inspect what is causing the infeasible condition. Yes you can test all that locally, it will give you the liberty to directly inspect what is the problem. You can print the LP problem to check the constraints. Then add relaxation to see what changed to obtain a feasible status. Here is an example on how to do this: https://stackoverflow.com/questions/76158127/linear-relaxation-infeasible-in-python-using-pulp-but-still-get-an-objective-va

You can install EMHASS using the legacy method not the docker option, it will give you full control to inspect and modify the code to debug. Follow the instructions in the Readme, you will only need a PC with Python and you will need to create a virtual environment with the needed libraries. Its all in the Readme.

sladey commented 6 months ago

I have got things running in legacy mode. As expected the results are "optimal". I have tried to get the setup in legacy mode as close to the HA mode as possible. More work needed :)

davidusb-geek commented 6 months ago

Ok, yes try to get the same configuration. You said you tried different solvers on your HA instance. One guess is some incompatibility between solvers and your hardware. What hardware architecture is your HA instance using?

sladey commented 6 months ago

David, some success. I am now able to get reliable Optimal results on my HA install. I stripped back the configuration:

I am using the COIN / cbc solver on both installs.

Oddly there are still differences between running the same inputs and the result on HA (which happens to be optimal) and on my mac (which is infeasible). Both using the same solver. Would you expect this?

davidusb-geek commented 6 months ago

Oddly there are still differences between running the same inputs and the result on HA (which happens to be optimal) and on my mac (which is infeasible). Both using the same solver. Would you expect this?

I don't understand this bit. Your results on HA are using the add-on on a RPi4 and status is optimal. While on your mac you installed the legacy method and the result is infeasible. Is it that?

paul-roev commented 6 months ago

Yes that is the situation. I'll do more tests today but I am seeing different results for very very similar inputs.

The good thing is that by reducing the complexity of the inputs (I think soc_init and soc_final may be the important changes) I am back to getting optimal solutions.

David now that I have a local dev environment I might add some more error logging to help HA users debug problems with input data. Is that ok?

davidusb-geek commented 6 months ago

Sure any improvements are welcomed ;-)

davidusb-geek commented 4 months ago

Hi, did you managed to overcome these issues?

davidusb-geek commented 3 months ago

Feel free to reopen if needed.