Work in this PR:
[x] remove duplication and switch to a vector-like data structure (Discrete, list, ...) to speed things up
[x] self.dones termination used to be resolvable only at the end of a cycle due to the restriction of the parallel environment; now every variable is updated at each agent's step (we are going to build the training parts ourselves) -- see the step sketch after this list
[x] charging should not happen when the battery is already at full capacity
[x] when an EV hits low battery, only that agent is marked done; the process continues until all demands are responded to
[x] changed the self.responded update: a vehicle should get no reward for a demand that has already been responded to by another vehicle
[x] keep a queue for each CS, since a station cannot accommodate too many EVs: if an EV steps into a fully loaded CS, it waits (still incurring a reward discount) and is not charged until another EV leaves -- see the queue sketch after this list
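
A minimal, self-contained sketch of the per-agent-step update described above (per-step dones, low-battery termination for a single agent, and no reward for already-responded demands). Only `self.dones` and `self.responded` come from this repo; everything else (`EVEnvSketch`, `LOW_FRAC`, the drain and reward values) is a hypothetical stand-in for the real environment's internals:

```python
class EVEnvSketch:
    LOW_FRAC = 0.1  # hypothetical low-battery threshold (fraction of capacity)

    def __init__(self, agents, demands, capacity=100.0):
        self.agents = list(agents)
        self.dones = {a: False for a in self.agents}
        self.battery = {a: capacity for a in self.agents}
        self.capacity = {a: capacity for a in self.agents}
        self.responded = {d: False for d in demands}

    def step_agent(self, agent, demand=None, drain=5.0):
        """Update every per-agent variable at this agent's own step,
        instead of waiting for the end of the cycle."""
        if self.dones[agent]:
            return 0.0
        self.battery[agent] -= drain
        reward = 0.0
        if demand is not None and not self.responded[demand]:
            # First responder claims the demand; later arrivals get nothing.
            self.responded[demand] = True
            reward += 1.0  # hypothetical response reward
        # Low battery marks only this agent done; the others keep going.
        if self.battery[agent] <= self.LOW_FRAC * self.capacity[agent]:
            self.dones[agent] = True
        # The episode ends for everyone once all demands are served.
        if all(self.responded.values()):
            self.dones = dict.fromkeys(self.dones, True)
        return reward
```

For example:

```python
env = EVEnvSketch(agents=["ev0", "ev1"], demands=["d0", "d1"])
print(env.step_agent("ev0", demand="d0"))  # 1.0: first responder is rewarded
print(env.step_agent("ev1", demand="d0"))  # 0.0: d0 already responded to
```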
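Likewise, a hedged sketch of the CS queue; it also enforces the no-charging-at-full-capacity rule from above. `ChargingStation`, `n_plugs`, and `QUEUE_DISCOUNT` are assumed names, not the repo's actual API:

```python
from collections import deque

class ChargingStation:
    QUEUE_DISCOUNT = 0.5  # hypothetical reward discount while waiting

    def __init__(self, n_plugs=2):
        self.n_plugs = n_plugs
        self.charging = set()    # EVs currently occupying a plug
        self.queue = deque()     # EVs waiting for a free plug

    def arrive(self, ev):
        """Admit an EV, or queue it when all plugs are taken.
        Returns the reward multiplier for this step."""
        if len(self.charging) < self.n_plugs:
            self.charging.add(ev)
            return 1.0
        self.queue.append(ev)
        return self.QUEUE_DISCOUNT  # waiting still incurs the discount

    def depart(self, ev):
        """Free a plug and promote the next queued EV, if any."""
        self.charging.discard(ev)
        if self.queue and len(self.charging) < self.n_plugs:
            self.charging.add(self.queue.popleft())

    def charge_step(self, batteries, capacities, rate=10.0):
        """Charge plugged-in EVs only, and never past full capacity."""
        for ev in self.charging:
            batteries[ev] = min(capacities[ev], batteries[ev] + rate)
```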
TBD in next PR:
[ ] adapt the trained model to our own algorithm
[ ] get a decreasing loss curve and an increasing cumulative-reward curve (in the ideal case)
[ ] plot reward curves for all agents under different electricity-cost settings ("stronger works more")
[ ] run some post-training rollout examples based on the learned policy ("stronger works more", and a small-capacity EV should not respond to a long-distance demand)