LucasAlegre / sumo-rl

Reinforcement Learning environments for Traffic Signal Control with SUMO. Compatible with Gymnasium, PettingZoo, and popular RL libraries.
https://lucasalegre.github.io/sumo-rl
MIT License

Multi-agent and reward error #88

Closed · thoithoi58 closed this issue 2 years ago

thoithoi58 commented 2 years ago

Hi, I was trying to use A2C on the ingolstadt21 map, but when I set single_agent=False, I receive this error:

Traceback (most recent call last):
  File "/home/ubuntu/Videos/test.py", line 32, in <module>
    model.learn(total_timesteps=10000000)
  File "/home/ubuntu/miniconda3/lib/python3.7/site-packages/stable_baselines3/a2c/a2c.py", line 200, in learn
    reset_num_timesteps=reset_num_timesteps,
  File "/home/ubuntu/miniconda3/lib/python3.7/site-packages/stable_baselines3/common/on_policy_algorithm.py", line 250, in learn
    continue_training = self.collect_rollouts(self.env, callback, self.rollout_buffer, n_rollout_steps=self.n_steps)
  File "/home/ubuntu/miniconda3/lib/python3.7/site-packages/stable_baselines3/common/on_policy_algorithm.py", line 168, in collect_rollouts
    obs_tensor = obs_as_tensor(self._last_obs, self.device)
  File "/home/ubuntu/miniconda3/lib/python3.7/site-packages/stable_baselines3/common/utils.py", line 448, in obs_as_tensor
    return th.as_tensor(obs).to(device)
TypeError: can't convert np.ndarray of type numpy.object_. The only supported types are: float64, float32, float16, complex64, complex128, int64, int32, int16, int8, uint8, and bool.

Also, I notice that the reward written to the csv file by the out_csv_name argument is always 0. This only happens when I use the RESCO map. Is there anything wrong with the route or net file?

LucasAlegre commented 2 years ago

Hi, stable-baselines3 is for single-agent algorithms, so it would not work directly by setting single_agent=False in the environment.

Regarding the reward, perhaps you are setting the simulation time in a period where there are no vehicles defined in the route file. See https://github.com/LucasAlegre/sumo-rl/blob/master/sumo_rl/environment/resco_envs.py
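
For reference, a minimal single-agent setup would look something like the sketch below. The file paths and time values are placeholders, not the exact RESCO configuration; the important parts are single_agent=True and choosing begin_time/num_seconds so they overlap the departures in the route file.

```python
# Hedged sketch of a single-agent SB3 run; paths and numbers are placeholders.
from stable_baselines3 import A2C
from sumo_rl import SumoEnvironment

env = SumoEnvironment(
    net_file="path/to/your.net.xml",     # placeholder path
    route_file="path/to/your.rou.xml",   # placeholder path
    out_csv_name="outputs/a2c",          # per-episode metrics are written here
    single_agent=True,                   # stable-baselines3 needs a single-agent env
    begin_time=0,                        # must fall inside the period covered by the route file
    num_seconds=3600,                    # simulation length in seconds
    use_gui=False,
)

model = A2C("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=100_000)
```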

thoithoi58 commented 2 years ago

Hi, I appreciate you replying this fast :). However, I’ve already set the begin time and num_seconds to match the net file config, so the error is maybe somewhere else.

thoithoi58 commented 2 years ago

> Hi, stable-baselines3 is for single-agent algorithms, so it would not work directly by setting single_agent=False in the environment.
>
> Regarding the reward, perhaps you are setting the simulation time in a period where there are no vehicles defined in the route file. See https://github.com/LucasAlegre/sumo-rl/blob/master/sumo_rl/environment/resco_envs.py

Hey, I just tested A3C again with the RLlib library, and again the reward returned is always 0. I used the exact code from a3c_4x4grid.py, just changed the map to the RESCO benchmark. Can you help?

LucasAlegre commented 2 years ago

Sure, can you share the code?

thoithoi58 commented 2 years ago

Hi, I'm actually having 2 issues now. First, when I use this code (just a3c_4x4grid.py with a few changes), it seems like the reward is stuck at -1.9; here is the reward I get at the 9th episode.

Second, when I try to use the RESCO map, it returns this error:

File "/home/ubuntu/Videos/resco.py", line 28, in <module>
    num_seconds=61200))
  File "/home/ubuntu/miniconda3/envs/sumo/lib/python3.7/site-packages/ray/rllib/env/wrappers/pettingzoo_env.py", line 86, in __init__
    "Observation spaces for all agents must be identical. Perhaps "
AssertionError: Observation spaces for all agents must be identical. Perhaps SuperSuit's pad_observations wrapper can help (useage: `supersuit.aec_wrappers.pad_observations(env)`

Code is here. I've already changed the observation and action spaces to match the environment. The configuration for the trainer is here.

LucasAlegre commented 2 years ago

> Hi, I'm actually having 2 issues now. First, when I use this code (just a3c_4x4grid.py with a few changes), it seems like the reward is stuck at -1.9; here is the reward I get at the 9th episode.

It may be that the lanes are all saturated, and therefore the reward is always the minimum possible. After some episodes, once the agents have learned better policies, it should probably improve.

> Second, when I try to use the RESCO map, it returns this error:
>
> File "/home/ubuntu/Videos/resco.py", line 28, in <module>
>     num_seconds=61200))
>   File "/home/ubuntu/miniconda3/envs/sumo/lib/python3.7/site-packages/ray/rllib/env/wrappers/pettingzoo_env.py", line 86, in __init__
>     "Observation spaces for all agents must be identical. Perhaps "
> AssertionError: Observation spaces for all agents must be identical. Perhaps SuperSuit's pad_observations wrapper can help (useage: `supersuit.aec_wrappers.pad_observations(env)`

> Code is here. I've already changed the observation and action spaces to match the environment. The configuration for the trainer is here.

In that environment the agents are heterogeneous (they have different observation spaces), and RLlib can't handle that directly. As mentioned in the error message, you can try the SuperSuit wrapper (https://github.com/Farama-Foundation/SuperSuit).
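
A rough sketch of what the wrapping could look like (constructor arguments and paths are placeholders; pad_action_space_v0 is only needed if the action spaces also differ):

```python
# Hedged sketch: pad heterogeneous observation/action spaces before handing the
# PettingZoo env to RLlib. Paths are placeholders, adjust them to your setup.
import supersuit as ss
import sumo_rl
from ray.rllib.env.wrappers.pettingzoo_env import PettingZooEnv
from ray.tune.registry import register_env

def make_env(_config):
    env = sumo_rl.env(                               # PettingZoo AEC environment
        net_file="path/to/ingolstadt21.net.xml",     # placeholder path
        route_file="path/to/ingolstadt21.rou.xml",   # placeholder path
        out_csv_name="outputs/ingolstadt21",
    )
    env = ss.pad_observations_v0(env)   # pad every observation to the largest space
    env = ss.pad_action_space_v0(env)   # likewise for the discrete action spaces
    return PettingZooEnv(env)

register_env("ingolstadt21", make_env)
```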

thoithoi58 commented 2 years ago

Hi, I tried your advice with SuperSuit and it ran flawlessly. However, even after 16 hours of training, the reward returned is still around 0 (here is the latest reward file). Also, when I debugged it, it showed that only the first traffic light's reward is calculated and the rest are None, even though the ingolstadt21 map has a total of 21 traffic lights. Here is my code, can you take a look?

LucasAlegre commented 2 years ago

Hi,

A reward around 0.0 is actually very good behavior. See the reward function: if no vehicles are waiting, the total accumulated waiting time stays unchanged and W_t+1 - W_t = 0.
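
As an illustration of the idea (not the library's exact code), the default reward is just the change in accumulated waiting time between two decisions:

```python
# Illustrative sketch of the "diff waiting time" reward, not the exact implementation:
# the reward is the change in total accumulated waiting time between two decisions,
# so it is 0 when nothing is waiting and negative when the queues grow.
def diff_waiting_time_reward(previous_total_wait: float, current_total_wait: float) -> float:
    return previous_total_wait - current_total_wait

print(diff_waiting_time_reward(10.0, 10.0))  # 0.0   -> no change in waiting time
print(diff_waiting_time_reward(10.0, 25.0))  # -15.0 -> queues grew by 15 seconds
```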

How did you observe the rewards being None?

thoithoi58 commented 2 years ago

I agree with you on the 0 reward. But the problem here is that the first and the latest episodes are almost identical. The reward has been around 0 from the first episode, with little fluctuation, until the end. In theory the agent can't find a good policy in only the first episode, right? :)

LucasAlegre commented 2 years ago

Hi, I have just run the simulation and there are many congested intersections in the first episode. Perhaps you are observing the reward of an intersection through which very few vehicles pass. If you look at the rewards of the other intersections, they should be negative.

thoithoi58 commented 2 years ago

Hi, so how can I change the observation to another intersection? Or maybe observe all intersections and accumulate their rewards? The documentation provided is not very specific :)

thoithoi58 commented 2 years ago

[Figure_1] Sorry to bother you, but this is what I got after 16 hours of training, using the plot.py file. It is supposed to go down, right? Also, I found this issue, in which the best reward of all training runs is always negative. But the default reward is W_t - W_t+1, so it should be positive if at time step t+1 you have a better policy. Or maybe PyTorch/PettingZoo/RLlib... automatically puts a negative sign before the reward function to minimize it? Is there anything I misunderstood? Thanks :)

LucasAlegre commented 2 years ago

> Hi, so how can I change the observation to another intersection? Or maybe observe all intersections and accumulate their rewards? The documentation provided is not very specific :)

Hi, the environment step() returns dictionaries, where each entry corresponds to the state/reward of a different intersection.
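
A rough sketch, assuming the PettingZoo parallel API (reset/step signatures may differ slightly between PettingZoo versions; paths are placeholders):

```python
# Hedged sketch: each dict returned by step() is keyed by traffic-signal id,
# so you can log the reward of every intersection, not only the first one.
import sumo_rl

env = sumo_rl.parallel_env(
    net_file="path/to/ingolstadt21.net.xml",     # placeholder path
    route_file="path/to/ingolstadt21.rou.xml",   # placeholder path
    num_seconds=3600,
)

observations, infos = env.reset()
while env.agents:
    actions = {agent: env.action_space(agent).sample() for agent in env.agents}  # random policy
    observations, rewards, terminations, truncations, infos = env.step(actions)
    for ts_id, r in rewards.items():
        print(ts_id, r)  # one reward entry per intersection
env.close()
```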

LucasAlegre commented 2 years ago

> [Figure_1] Sorry to bother you, but this is what I got after 16 hours of training, using the plot.py file. It is supposed to go down, right? Also, I found this issue, in which the best reward of all training runs is always negative. But the default reward is W_t - W_t+1, so it should be positive if at time step t+1 you have a better policy. Or maybe PyTorch/PettingZoo/RLlib... automatically puts a negative sign before the reward function to minimize it? Is there anything I misunderstood? Thanks :)

Are you averaging the results of all CSVs? To see whether it is improving, you should compare the results of different episodes/runs. W_t is the total accumulated waiting time, so it never decreases between time steps (unless vehicles are leaving the intersection). It is possible to have some positive rewards, but in general the rewards will be negative. No RL library changes the sign of the reward.
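
For example, a quick way to compare episodes (the file pattern and column name below are assumptions, so check the header of one of your CSVs first):

```python
# Hedged sketch: compare the metric CSVs across episodes instead of looking at a
# single run. File naming and column names are assumptions; open one file and
# inspect its header, then adjust the pattern and column accordingly.
import glob
import pandas as pd

for path in sorted(glob.glob("outputs/ingolstadt21*_ep*.csv")):  # assumed one CSV per episode
    df = pd.read_csv(path)
    # a lower total waiting time in later episodes indicates the policies are improving
    print(path, df["system_total_waiting_time"].mean())
```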