Closed nick-harder closed 1 year ago
Base: Nick's implementation of flexRL
Since it is already integrated in flexABLE, a lot of code can be reused.
Uses a MATD3 algorithm. TD3 is an off-policy algorithm that can only be used for environments with continuous action spaces; it is a further development of DDPG that tackles the problem of overestimating Q-values.
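To make the "overestimating Q-values" point concrete, here is a minimal NumPy sketch of the TD3 target computation (the part MATD3 inherits per agent): two target critics are evaluated and the minimum is used, together with target policy smoothing. The function and argument names are illustrative, not taken from the codebase.

```python
import numpy as np

rng = np.random.default_rng(0)

def td3_target(q1_t, q2_t, actor_t, next_obs, reward, gamma=0.99,
               noise_std=0.2, noise_clip=0.5, act_low=-1.0, act_high=1.0):
    """Sketch of the clipped double-Q target used by TD3/MATD3."""
    # target policy smoothing: add clipped noise to the target action
    a_next = actor_t(next_obs)
    noise = np.clip(rng.normal(0.0, noise_std, a_next.shape), -noise_clip, noise_clip)
    a_next = np.clip(a_next + noise, act_low, act_high)
    # clipped double-Q: the minimum over both target critics counters overestimation
    return reward + gamma * np.minimum(q1_t(next_obs, a_next), q2_t(next_obs, a_next))
```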
Observation space = System Observation (res_load[t-forecast_len:t], res_load_forecast[t:t+forecast_len], price[t-forecast_len:t], price_forecast[t:t+forecast_len]) and Unit Observation (total_scaled_capacity, scaled_marginal_cost)
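The observation layout above can be sketched as a simple concatenation; the array names follow the description, while the function itself and `forecast_len`/`t` handling are assumptions, not the final implementation.

```python
import numpy as np

def make_observation(res_load, res_load_forecast, price, price_forecast,
                     t, forecast_len, total_scaled_capacity, scaled_marginal_cost):
    """Concatenate the system observation (past and forecast windows around t)
    with the unit observation (two scalars), as described above."""
    system_obs = np.concatenate([
        res_load[t - forecast_len:t],
        res_load_forecast[t:t + forecast_len],
        price[t - forecast_len:t],
        price_forecast[t:t + forecast_len],
    ])
    unit_obs = np.array([total_scaled_capacity, scaled_marginal_cost])
    return np.concatenate([system_obs, unit_obs])
```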
Implementation Decisions
- Dynamic learning algorithm specification: based on the algorithm chosen in the config, the init and update policy functions should be set; the rest of the code should work regardless of the algorithm
- Reuse the replay buffer already written in common of ASSUME
- Coordination of GPU & CPU: since the choice of an action, i.e. applying a bidding strategy, is done by a neural net of the learning agent, it runs on the GPU; the rest of the simulation, however, runs on the CPU. Transferring data between GPU and CPU takes quite some time, hence it is currently handled in Nick's code such that all operational windows of all agents are collected and written from the GPU to the CPU in one batch.
One unit operator per GPU: if learning is activated, we have only one specific learning unit operator that handles all learning units, meaning the unit operator column is ignored if learning = 1. The transformation from GPU to CPU is then done in formulate_bids from the operational window.
Detach this from the unit operator and integrate an intermediate step handling it (maybe later): we want to avoid more unnecessary messaging, hence collecting and coordinating info is done in a supplementary function which collects the data. Discussion points: How does the function know that all data has been received, e.g. if we have asynchronous data or can dynamically subscribe to markets? Could it be done in the market? Yes, as long as we have only one GPU; so how do we handle multiple GPUs?
- Forecast generation for the observation space: in the future we want a forecasting role that handles the different needed forecasts and sends them to the agents, so that we can have diverging forecasts (see the respective issue). First, we want to calculate the expected merit-order price from the input files and read it as an observation, similar to the handling of the fuel_prices. The residual load forecast can be taken from SMARD / ENTSO-E Transparency, or we just use perfect foresight.
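The "dynamic learning algorithm specification" decision above could look like a small registry that maps the algorithm name from the config to its init/update functions; everything below the registry stays algorithm-agnostic. All names here are hypothetical placeholders, not the actual ASSUME API.

```python
def init_matd3(config):
    # placeholder: would build actors, critics and replay buffer for MATD3
    return {"algorithm": "matd3", "critics": 2, **config}

def update_matd3(state, batch):
    # placeholder for the MATD3 update step
    return state

# registry: config value -> (init policy fn, update policy fn)
ALGORITHMS = {
    "matd3": (init_matd3, update_matd3),
    # further algorithms plug in here without touching the rest of the code
}

def build_learning_role(config):
    """Resolve the algorithm chosen in the config; the caller only ever
    sees the generic (state, update_fn) pair."""
    try:
        init_fn, update_fn = ALGORITHMS[config["algorithm"]]
    except KeyError:
        raise ValueError(f"unknown learning algorithm: {config['algorithm']!r}")
    return init_fn(config), update_fn
```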
General
- Learning role
- Unit
- Unit Operator
- RL Strategy
- RL Algorithms
Relevant progress was made in #130. Kim is currently working on a functioning sampling method, including the MATD3.
A running version of the learning is in main. Yet the learning itself, especially the update function, does not work. I would suggest the following steps for the further process:
- Visualisation
- Get learning to learn: collect_initial_experience, either it is not on at all or it is never turned off (scenario_loader, or after 4 hours as in learning_role?)
- Clean learning
- Implement evaluation and saving
@maurerle this one is done as well, right?
Only the tests are missing, which are also part of #143, so yes, we can close this. I am currently working on the RL tests.
Start with the implementation of the learning functions