Closed nick-harder closed 1 year ago
Base: Nick's implementation of flexRL
Since it is already integrated in flexABLE, a lot of code can be reused.
Uses a MATD3 algorithm. TD3 is an off-policy algorithm that can only be used for environments with continuous action spaces; it is a further development of DDPG that tackles the problem of overestimating Q-values.
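To make the "overestimating Q-values" point concrete, here is a minimal NumPy sketch of the TD3 target computation (the part MATD3 inherits per agent): two target critics are evaluated and the minimum is used, together with target policy smoothing. The function and argument names are illustrative, not taken from the codebase.

```python
import numpy as np

rng = np.random.default_rng(0)

def td3_target(q1_t, q2_t, actor_t, next_obs, reward, gamma=0.99,
               noise_std=0.2, noise_clip=0.5, act_low=-1.0, act_high=1.0):
    """Sketch of the clipped double-Q target used by TD3/MATD3."""
    # target policy smoothing: add clipped noise to the target action
    a_next = actor_t(next_obs)
    noise = np.clip(rng.normal(0.0, noise_std, a_next.shape), -noise_clip, noise_clip)
    a_next = np.clip(a_next + noise, act_low, act_high)
    # clipped double-Q: the minimum over both target critics counters overestimation
    return reward + gamma * np.minimum(q1_t(next_obs, a_next), q2_t(next_obs, a_next))
```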
Observation space = System Observation (res_load[t-forecast_len:t], res_load_forecast[t:t+forecast_len], price[t-forecast_len:t], price_forecast[t:t+forecast_len]) and Unit Observation (total_scaled_capacity, scaled_marginal_cost)
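The observation layout above can be sketched as a simple concatenation; the array names follow the description, while the function itself and `forecast_len`/`t` handling are assumptions, not the final implementation.

```python
import numpy as np

def make_observation(res_load, res_load_forecast, price, price_forecast,
                     t, forecast_len, total_scaled_capacity, scaled_marginal_cost):
    """Concatenate the system observation (past and forecast windows around t)
    with the unit observation (two scalars), as described above."""
    system_obs = np.concatenate([
        res_load[t - forecast_len:t],
        res_load_forecast[t:t + forecast_len],
        price[t - forecast_len:t],
        price_forecast[t:t + forecast_len],
    ])
    unit_obs = np.array([total_scaled_capacity, scaled_marginal_cost])
    return np.concatenate([system_obs, unit_obs])
```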
Implementation Decisions
- Dynamic learning algorithm specification: based on the algorithm chosen in the config, the init and update policy functions should be set; the rest of the code should work regardless of the algorithm
- Reuse the replay buffer already written in common of ASSUME
- Coordination of GPU & CPU: since the choice of an action, i.e. applying a bidding strategy, is done by a neural net of the learning agent, it runs on the GPU; the rest of the simulation, however, runs on the CPU. Transferring data between GPU and CPU takes quite some time, hence it is currently handled in Nick's code such that all operational windows of all agents are collected and written from the GPU to the CPU in one batch.
One unit operator per GPU: if learning is activated, we have only one specific learning unit operator that handles all learning units, meaning the unit operator column is ignored if learning = 1. The transformation from GPU to CPU is then done in formulate_bids from the operational window.
Detach this from the unit operator and integrate an intermediate step handling it (maybe later): we want to avoid more unnecessary messaging, hence collecting and coordinating info is done in a supplementary function which collects the data. Discussion points: How does the function know that all data has been received, e.g. if we have asynchronous data or can dynamically subscribe to markets? Could it be done in the market? Yes, as long as we have only one GPU; so how do we handle multiple GPUs?
- Forecast generation for the observation space: in the future we want a forecasting role that handles the different needed forecasts and sends them to the agents, so that we can have diverging forecasts (see the respective issue). First, we want to calculate the expected merit-order price from the input files and read it as an observation, similar to the handling of the fuel_prices. The residual load forecast can be taken from SMARD / ENTSO-E Transparency, or we just use perfect foresight.
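The "dynamic learning algorithm specification" decision above could look like a small registry that maps the algorithm name from the config to its init/update functions; everything below the registry stays algorithm-agnostic. All names here are hypothetical placeholders, not the actual ASSUME API.

```python
def init_matd3(config):
    # placeholder: would build actors, critics and replay buffer for MATD3
    return {"algorithm": "matd3", "critics": 2, **config}

def update_matd3(state, batch):
    # placeholder for the MATD3 update step
    return state

# registry: config value -> (init policy fn, update policy fn)
ALGORITHMS = {
    "matd3": (init_matd3, update_matd3),
    # further algorithms plug in here without touching the rest of the code
}

def build_learning_role(config):
    """Resolve the algorithm chosen in the config; the caller only ever
    sees the generic (state, update_fn) pair."""
    try:
        init_fn, update_fn = ALGORITHMS[config["algorithm"]]
    except KeyError:
        raise ValueError(f"unknown learning algorithm: {config['algorithm']!r}")
    return init_fn(config), update_fn
```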
General
- Learning role
- Unit
- Unit Operator
- RL Strategy
- RL Algorithms
Relevant progress was made in #130. Kim is currently working on a functioning sampling method, including the MATD3.
A running version of the learning is in main. Yet the learning itself, especially the update function, does not work. I would suggest the following steps for the further process:
- Visualisation
- Get learning to learn: collect_initial_experience, either it is not on at all or it is never turned off (scenario_loader, or after 4 hours as in learning_role?)
- Clean learning
- Implement evaluation and saving
@maurerle this one is done as well, right?
Only the tests are missing, which are also part of #143, so yes, we can close this. I am currently working on the RL tests.
Start with the implementation of the learning functions