The run_learning function in the CSV loader currently tears down the entire simulation and world and restarts it for the DRL part. As a consequence, a number of parameters required for the continuous learning process must be stored across multiple runs of the same simulation horizon.
We discussed the option of resetting the time in the mango agents. This is technically possible, but it would require rescheduling a number of tasks, such as those from the markets. That rescheduling happens in the world's reset function, which is ultimately called at the end of each simulation and triggers a new set-up. Any solution that bypasses the reset function (which is the one we use now) would lead to some sort of "Frankenstein" solution with a mix of old and new agents. This is bound to cause new, unforeseeable problems, and a messy learning loop does not justify taking that risk.
Instead we agreed on the following:
- collect all inter-episodic data in one dict of the learning unit and hand it to the learning unit at the beginning of each episode (aka simulation run)
- always initialize the learning role, even if we only have an evaluation run; this makes the explicit storage of actors on disk obsolete
- store the last actors and critics in the learning role; do NOT write them to disk, to save space and avoid reloading
- write only the best evaluation strategy to disk, plus the very last current policy after the training is finished
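The agreed flow could be sketched roughly as follows. This is a minimal illustration, not the real implementation: the names `run_learning`, the keys of `inter_episodic_data`, and the `run_episode`/`evaluate` callables are all assumptions made for the example.

```python
import copy

def run_learning(n_episodes, run_episode, evaluate):
    # One dict carries everything that must survive a world restart across
    # episodes: replay buffer, actor/critic weights, etc. (the exact contents
    # are an assumption here).
    inter_episodic_data = {"buffer": [], "actors": {}, "critics": {}}
    best_eval_reward, best_actors = float("-inf"), None

    for episode in range(n_episodes):
        # The world and simulation are rebuilt from scratch each episode; the
        # fresh learning role receives the accumulated data in memory instead
        # of reloading actors from disk.
        inter_episodic_data = run_episode(inter_episodic_data)

        # Keep the best-performing policy as an in-memory snapshot only;
        # nothing is written to disk inside the loop.
        eval_reward = evaluate(inter_episodic_data)
        if eval_reward > best_eval_reward:
            best_eval_reward = eval_reward
            best_actors = copy.deepcopy(inter_episodic_data["actors"])

    # After training, only the best eval strategy and the very last current
    # policy would be persisted (disk-writing omitted in this sketch).
    return best_actors, inter_episodic_data["actors"]
```

The deep copy on the best snapshot matters: later episodes keep mutating the live actor networks, so the stored best policy must not alias them.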