Closed: thoithoi58 closed this issue 2 years ago.

Hi, I was trying to use A2C on the ingolstadt21 map, but when I set single_agent=False, I'd receive this error. And also, I noticed that the reward returned in the CSV file (via the out_csv_name argument) is always 0. This only happens when I use the RESCO maps. Is there anything wrong with the route or net file?
Hi, stable-baselines3 is for single-agent algorithms, so it would not work directly by setting single_agent=False in the environment.
Regarding the reward, perhaps you are setting the simulation time in a period where there are no vehicles defined in the route file. See https://github.com/LucasAlegre/sumo-rl/blob/master/sumo_rl/environment/resco_envs.py
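Something like this minimal sketch is what I mean (the paths and time values here are placeholders, not your actual config):

```python
import sumo_rl

# Minimal sketch; file paths and times below are placeholders.
# The key point: the [begin_time, begin_time + num_seconds] window must
# overlap the vehicle departure times defined in the .rou.xml file,
# otherwise the network stays empty and the rewards are trivially 0.
env = sumo_rl.SumoEnvironment(
    net_file="nets/ingolstadt21/ingolstadt21.net.xml",
    route_file="nets/ingolstadt21/ingolstadt21.rou.xml",
    out_csv_name="outputs/ingolstadt21",
    single_agent=True,  # stable-baselines3 only handles single-agent envs
    begin_time=57600,   # placeholder: align with the first departures in the route file
    num_seconds=3600,
)
```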
Hi, I appreciate you replying this fast :). However, I've already set the begin time and num seconds to match the net file config, so maybe the error is somewhere else.
Hey, I just tested A3C again with the RLlib library, and again the reward returned is always 0. I used the exact code from a3c_4x4grid.py, just changing the map to a RESCO benchmark. Can you help?
Sure, can you share the code?
Hi, I'm actually having 2 issues now. First, when I use this code (just a3c_4x4grid.py with a few changes), the returned reward seems to be stuck at -1.9; here is the reward I get at the 9th episode.
Second, when I try to use a RESCO map, it returns this error:
File "/home/ubuntu/Videos/resco.py", line 28, in <module>
num_seconds=61200))
File "/home/ubuntu/miniconda3/envs/sumo/lib/python3.7/site-packages/ray/rllib/env/wrappers/pettingzoo_env.py", line 86, in __init__
"Observation spaces for all agents must be identical. Perhaps "
AssertionError: Observation spaces for all agents must be identical. Perhaps SuperSuit's pad_observations wrapper can help (useage: `supersuit.aec_wrappers.pad_observations(env)`
Code is here. I've already changed the observation and action spaces to match the environment. The configuration for the trainer is here.
It may be that the lanes are all saturated, and therefore the reward is always the minimum possible. It should improve after some episodes, once the agents have learned better policies.
Regarding the second error: in that environment the agents are heterogeneous (they have different observation spaces), and rllib can't handle that. As mentioned in the warning, you can try the SuperSuit wrapper (https://github.com/Farama-Foundation/SuperSuit).
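For reference, applying it would look roughly like this (the paths are placeholders, and I'm using the current SuperSuit wrapper names pad_observations_v0/pad_action_space_v0 rather than the older aec_wrappers path quoted in the error message):

```python
import supersuit as ss
import sumo_rl
from ray.rllib.env.wrappers.pettingzoo_env import ParallelPettingZooEnv

# Placeholder paths to a RESCO scenario; adjust to your setup.
env = sumo_rl.parallel_env(
    net_file="nets/ingolstadt21/ingolstadt21.net.xml",
    route_file="nets/ingolstadt21/ingolstadt21.rou.xml",
    num_seconds=61200,
)

# Pad every agent's observation (and action) space up to the largest one,
# so that rllib's identical-spaces assertion passes.
env = ss.pad_observations_v0(env)
env = ss.pad_action_space_v0(env)

env = ParallelPettingZooEnv(env)
```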
Hi, I tried your advice with SuperSuit and it ran flawlessly. However, even after 16 hours of training, the reward returned is still around 0 (here is the latest reward file). Also, when I debugged it, it showed that only the first traffic light's reward is calculated and the rest are None, even though the ingolstadt21 map has a total of 21 traffic lights. Here is my code, can you take a look?
Hi,
Rewards around 0.0 are actually very good behavior. How did you observe the rewards being None?
See the reward function: if no vehicles are waiting, the total waiting time stays unchanged and W_t - W_{t+1} = 0.
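For context, the default diff-waiting-time reward is computed roughly like this (a simplified sketch of the TrafficSignal reward, not the exact library code):

```python
def diff_waiting_time_reward(self):
    # W_{t+1}: accumulated waiting time over the intersection's lanes now.
    current_wait = sum(self.get_accumulated_waiting_time_per_lane())
    # Reward = W_t - W_{t+1}: positive when waiting time decreased,
    # zero when nothing is waiting, negative when congestion builds up.
    reward = self.last_measure - current_wait
    self.last_measure = current_wait
    return reward
```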
I agree with you on the 0 reward. But the problem here is that the first and the latest episodes are almost identical. The reward has been around 0 from the first episode, with only little fluctuations, until the end. In theory the agent can't find a good optimal policy in only the first episode, right? :)
Hi, I have just run the simulation and there are many congested intersections in the first episode. Perhaps you are observing the reward of an intersection through which very few vehicles pass. If you observe the rewards of the other intersections, they should be negative.
Hi, so how can I change the observation to another intersection? Or maybe observe all intersections and accumulate their rewards? The documentation provided is not very specific :)
Hi, the environment step() returns dictionaries, where each entry corresponds to the state/reward of a different intersection.
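For example, something like this sketch (assuming the multi-agent dict interface; the action_spaces(ts) accessor and the random actions are just for illustration):

```python
obs = env.reset()
done = {"__all__": False}
while not done["__all__"]:
    # One action per traffic signal id (random here, for illustration).
    actions = {ts: env.action_spaces(ts).sample() for ts in obs.keys()}
    obs, rewards, done, info = env.step(actions)
    # rewards is a dict keyed by traffic signal id, one entry per
    # intersection, e.g. {'gneJ0': -1.3, 'gneJ1': 0.0, ...};
    # accumulate over all of them:
    total_reward = sum(rewards.values())
```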
Sorry to bother you, but this is what I got after 16 hours of training, using the plot.py file. It is supposed to go down, right? Also, I found this issue, in which the best reward of every training run is negative. But the default reward is W_t - W_{t+1}, so it should be positive if at time-step t+1 you have a better policy. Or maybe pytorch/pettingzoo/rllib... automatically puts a negative sign on the reward function to minimize it? Is there anything I misunderstood? Thanks :)
Are you averaging the results of all CSVs? To see whether it is improving, you should compare the results of different episodes/runs. W_t is the total accumulated waiting time, so it never decreases between time steps (unless vehicles are leaving the intersection). It is possible to have some positive rewards, but in general the rewards will be negative. No RL library changes the sign of the reward.
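For example, a quick sketch of comparing per-episode CSVs instead of averaging them all together (the glob pattern and the system_total_waiting_time column name are assumptions; check your actual output files):

```python
import glob
import pandas as pd

# One CSV per run/episode; compare their summary statistics instead of
# averaging all files into a single flat curve.
for path in sorted(glob.glob("outputs/ingolstadt21*.csv")):
    df = pd.read_csv(path)
    # 'system_total_waiting_time' is an assumed column name; replace it
    # with whatever metric your CSVs actually contain.
    print(path, df["system_total_waiting_time"].mean())
```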