Work in this PR:
[x] remove duplication and switch to a vector-like data structure (Discrete, list, ...) to speed things up
[x] self.dones termination used to be resolvable only at the end of a cycle due to the restriction of the parallel environment; now every variable is updated at each agent's step (we are going to build the training parts ourselves) -- see the step sketch after this list
[x] charging should not happen when the battery is already at full capacity
[x] when an EV hits low battery, only that agent is marked done; the process continues until all demands are responded to
[x] changed the self.responded update: a vehicle should get no reward for a demand that has already been responded to by another vehicle
[x] keep a queue for each CS, since a station cannot accommodate too many EVs: if an EV steps into a fully loaded CS, it waits (still incurring a reward discount) and is not charged until another EV leaves -- see the queue sketch after this list
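
A minimal, self-contained sketch of the per-agent-step update described above (per-step dones, low-battery termination for a single agent, and no reward for already-responded demands). Only `self.dones` and `self.responded` come from this repo; everything else (`EVEnvSketch`, `LOW_FRAC`, the drain and reward values) is a hypothetical stand-in for the real environment's internals:

```python
class EVEnvSketch:
    LOW_FRAC = 0.1  # hypothetical low-battery threshold (fraction of capacity)

    def __init__(self, agents, demands, capacity=100.0):
        self.agents = list(agents)
        self.dones = {a: False for a in self.agents}
        self.battery = {a: capacity for a in self.agents}
        self.capacity = {a: capacity for a in self.agents}
        self.responded = {d: False for d in demands}

    def step_agent(self, agent, demand=None, drain=5.0):
        """Update every per-agent variable at this agent's own step,
        instead of waiting for the end of the cycle."""
        if self.dones[agent]:
            return 0.0
        self.battery[agent] -= drain
        reward = 0.0
        if demand is not None and not self.responded[demand]:
            # First responder claims the demand; later arrivals get nothing.
            self.responded[demand] = True
            reward += 1.0  # hypothetical response reward
        # Low battery marks only this agent done; the others keep going.
        if self.battery[agent] <= self.LOW_FRAC * self.capacity[agent]:
            self.dones[agent] = True
        # The episode ends for everyone once all demands are served.
        if all(self.responded.values()):
            self.dones = dict.fromkeys(self.dones, True)
        return reward
```

For example:

```python
env = EVEnvSketch(agents=["ev0", "ev1"], demands=["d0", "d1"])
print(env.step_agent("ev0", demand="d0"))  # 1.0: first responder is rewarded
print(env.step_agent("ev1", demand="d0"))  # 0.0: d0 already responded to
```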
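Likewise, a hedged sketch of the CS queue; it also enforces the no-charging-at-full-capacity rule from above. `ChargingStation`, `n_plugs`, and `QUEUE_DISCOUNT` are assumed names, not the repo's actual API:

```python
from collections import deque

class ChargingStation:
    QUEUE_DISCOUNT = 0.5  # hypothetical reward discount while waiting

    def __init__(self, n_plugs=2):
        self.n_plugs = n_plugs
        self.charging = set()    # EVs currently occupying a plug
        self.queue = deque()     # EVs waiting for a free plug

    def arrive(self, ev):
        """Admit an EV, or queue it when all plugs are taken.
        Returns the reward multiplier for this step."""
        if len(self.charging) < self.n_plugs:
            self.charging.add(ev)
            return 1.0
        self.queue.append(ev)
        return self.QUEUE_DISCOUNT  # waiting still incurs the discount

    def depart(self, ev):
        """Free a plug and promote the next queued EV, if any."""
        self.charging.discard(ev)
        if self.queue and len(self.charging) < self.n_plugs:
            self.charging.add(self.queue.popleft())

    def charge_step(self, batteries, capacities, rate=10.0):
        """Charge plugged-in EVs only, and never past full capacity."""
        for ev in self.charging:
            batteries[ev] = min(capacities[ev], batteries[ev] + rate)
```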
TBD in next PR:
[ ] adapt the trained model to our own algorithm
[ ] get a decreasing loss curve and an increasing cumulative-reward curve (in the ideal case)
[ ] plot reward curves for all agents under different electricity-cost settings ("stronger works more")
[ ] run some post-training rollout examples based on the learned policy ("stronger works more", and a small-capacity EV should not respond to a long-distance demand)