Scaling to multiple agents

Achin17 commented 4 years ago

Thanks for creating this easy to use environment for urban scenarios. I wanted to use this environment for multi agent learning. Currently, single agent learning is supported. Are there any plans for scaling it up for multi agents?

eleurent commented 4 years ago

Hi @Achin17 I'm not really familliar with the multi-agent setting. It's easy to create a scene with many controllable vehicles, but I'm not sure how to interface the environment to handle multiple observations/actions. Does it simply amount to having a product spaces of N observations, and N actions concatenated ?

Achin17 commented 4 years ago

For now, yes I am looking for a product spaces of N observations, and N actions concatenated. I think I have found a workaround for this. Thanks!

kargarisaac commented 3 years ago

For now, yes I am looking for a product spaces of N observations, and N actions concatenated. I think I have found a workaround for this. Thanks!

Sorry to ask here in a closed issue. Did you do multi-agent setting? I'm looking for the same thing.

eleurent commented 3 years ago

Hello Isaac, (nice to see you here, I've been following your twitter and medium post with great interest!) No, I haven't investigated the multi-agent setting yet myself, but I am very open to adding the support, since it appears to be a popular request.

kargarisaac commented 3 years ago

Hello Isaac, (nice to see you here, I've been following your twitter and medium post with great interest!) No, I haven't investigated the multi-agent setting yet myself, but I am very open to adding the support, since it appears to be a popular request.

Thank you. So I can start to work on this. I really couldn't find a good simulator for self-driving cars with such diverse scenarios. I think I can consider this simulator as my main env for my research.

To discuss about it and add this feature, do you think we can continue here or create a new issue?

eleurent commented 3 years ago

Yes, let's simply reopen this issue.

The way I see it, the main steps necessary for this feature are:

[x] Replace the ego-vehicle field env.vehicle by a list of vehicles env.controlled_vehicles.
[x] Define a MultiAgentObservation type and MultiAgentAction type, which takes a standard observation/action configuration as input and produces the corresponding observation/action for each controlled vehicle. The corresponding spaces could be defined either as tuples or as dicts (which map vehicle indexes to obs or actions)
[x] Update the vehicle creation methods in each environment to accept an arbitrary (<= vehicle count) number of controlled vehicles. Replace their behavioural model (e.g. IDMVehicle type) by the vehicle class of the corresponding action space (e.g. MDPVehicle).

Do you have some additional insights or requirements?

kargarisaac commented 3 years ago

Thank you for the mentioned features. I think for vehicles it is good to have the option to select the agents to be rule-based (IDMVehicle) or to be controlled by a trained policy (MDPVehicle). Maybe some of them are rule-based and some not.

1- One question about IDMVehicle is that I see some collision between agents, is there any way to reduce these collisions?

2-The other question that I have is: Do the agents act at the same time-step? For example in petting zoo the agents act sequentially and are not completely synced, which I think is not good. I think we don't have this problem in highway env, right?

3- Another question: What is the algorithm you used for the intersection gif in the readme? Is it stable baseline DQN? Or it is the social attention method from your paper?

4- The other point about multi-agent RL is when to terminate the episode?!

We can consider the time limitation for one episode and when each agent reaches its target we reset it to the initial location, or stop it there or go out of the scene and just set the existence flag in other agents' observation to zero.
We can also reset the env when one learning agent has collision or go out of the road (which I think is not the case here but for continuous action space it is possible to go out of the road). Because we want all the agents to reach their target safely and if one of them collides, we reset the episode.

5- About spawning vehicles, I try not to spawn any other vehicle but I cannot. I set initial_vehicle_count to zero, or one or two, but again I see a lot of other vehicles. I also don't know some equations in the _spawn_vehicle() function. The longitudinal value that you pass into that function for example.

6- The other question is about the step() function. When I add another ego_vehicle, the episode length becomes very very short, and I need to change the duration from 13 to sth like 300 to be able to finish the episode and reach the target. What happens in this situation? Isn't one step of all vehicles synced? or maybe I'm doing sth wrong.

eleurent commented 3 years ago

1- One question about IDMVehicle is that I see some collision between agents, is there any way to reduce these collisions?

The IDM/Mobil are only suited for straight roads, and the scheduling/avoidance mechanism I added for intersections is quite rudimentary, see #76 If collisions are too problematic, a smarter scheduling method will have to be implemented (e.g. traffic lights).

2-The other question that I have is: Do the agents act at the same time-step? For example in petting zoo the agents act sequentially and are not completely synced, which I think is not good. I think we don't have this problem in highway env, right?

Yes,

env.step() starts with first loop over vehicles to run their internal decision making / control process, through vehicle.act().
Once all steering/throttle controls have been set, the vehicle states are updated through a second loop of (decoupled) vehicle.step().

3- Another question: What is the algorithm you used for the intersection gif in the readme? Is it stable baseline DQN? Or it is the social attention method from your paper?

I think it was the social attention policy. I'm sure it can be implemented in stable baselines DQN with a custom policy architecture.

4- The other point about multi-agent RL is when to terminate the episode?! We can consider the time limitation for one episode and when each agent reaches its target we reset it to the initial location, or stop it there or go out of the scene and just set the existence flag in other agents' observation to zero.

I think simply having a time limitation for termination is great. I like your proposals when agents reach their targets, it can also be the case that agents do not have targets and simply keep driving (depending on the environment of course). And if an agent has a collision, it can get a terminal penalty and other agents will be able to keep training for the rest of the episode. But sure, we may as well reset the whole env upon collision of any agent, why not.

5- About spawning vehicles, I try not to spawn any other vehicle but I cannot. I set initial_vehicle_count to zero, or one or two, but again I see a lot of other vehicles. I also don't know some equations in the _spawn_vehicle() function. The longitudinal value that you pass into that function for example.

Sorry for the insufficient documentation here. initial_vehicle_count controls how many vehicles are on the scene initially, but then other are created by _spawn_vehicle() at each step() with a given spawn_probability. You may set this configuration to a lower value. The longitudinal parameter controls the position of the created vehicles on their lane (higher is closer to the intersection center).

6- The other question is about the step() function. When I add another ego_vehicle, the episode length becomes very very short, and I need to change the duration from 13 to sth like 300 to be able to finish the episode and reach the target. What happens in this situation? Isn't one step of all vehicles synced? or maybe I'm doing sth wrong.

I am not sure what you did, actually, but it seems to have caused some strange side effect? Maybe you played with the policy_frequency configuration?

kargarisaac commented 3 years ago

Thanks for the answers. I'll check them. I just used a simple DQN but it seems it's not suitable. As you mentioned in the paper and also based on the code, the order of observations will be sorted at each time step and the network cannot learn sth useful from it. I try to use a spatial grid or your social attention model. But first I need to be sure that the env works.

for Q6, I didn't change anything. Couldn't find the reason yet. I will push the code and make a pull request. Then you can check if I'm doing sth wrong.

eleurent commented 3 years ago

I started development of this feature on the dev-multiagent branch

kargarisaac commented 3 years ago

I also pushed my codes in marl branch in my fork. Do I need to make a pull request?

It is possible to have maximum 4 learning agents that can start randomly or you can set the start and end points. It is also possible to have or not other vehicles. But I just did this for the intersection env.

You can test the env using the scripts/test_ma.py file.

eleurent commented 3 years ago

I had a look at your branch , and here's what I found: I think you did everything right, and the problem actually lies in the training/wrapping code. You are probably checking env termination by a condition such as:

while True:
    obs, reward, done, info = env.step(action)
    if done:
       break

Except you changed the done variable from a boolean to a list of booleans (one per agent). Thus, after the first action, you check the condition if [False, False] which evaluates as True, and the environment stops. Changing the condition by any(done) fixes the problem for me. It must also be done in the _simulate() function, which checks self._is_terminal() to stop simulation before action completion.

eleurent commented 3 years ago

marl

kargarisaac commented 3 years ago

Thank you for the feedback. In test_ma.py I use sum(done) to reset env. All agents have the same done flag because it is based on episode length, but for some more general cases that the done flags are different (if we change the _is_terminal() function). I will do what you mentioned.

And the gif is trained using your social attention method again, right? it's nice.

So how can I continue to collaborate now? I'm sure you know the codes better and your implementation is cleaner and more general. I will try to follow the same pattern.

eleurent commented 3 years ago

Thank you for the feedback. In test_ma.py I use sum(done) to reset env. All agents have the same done flag because it is based on episode length, but for some more general cases that the done flags are different (if we change the _is_terminal() function). I will do what you mentioned.

Then the problem is probably caused by the condition if self.done or is_terminal() in _simulate(): this condition will pass with the returned [False, False] value, meaning that only one single time step will be simulated rather than 15 for each action.

And the gif is trained using your social attention method again, right? it's nice.

Aha no, it is actually a cherry-picked collision-free run with a random policy. But I'll train a social attention policy as soon as the remaining issues are fixed:

So how can I continue to collaborate now? I'm sure you know the codes better and your implementation is cleaner and more general. I will try to follow the same pattern.

The branch is almost ready to be merged, I only need to sort out some remaining questions:

Should all environments be configurable as multi-agents, or should we have multi-agent variants of each environment (like you did)?
How should done and rewards be defined? Probably as tuples, like you did, but this will not be compatible with default wrappers from openai gym (e.g. Monitor/Stats Recorder, which expect a float reward rather than a tuple). It is probably fine if it is only for multi-agent variants, though.

kargarisaac commented 3 years ago

Should all environments be configurable as multi-agents, or should we have multi-agent variants of each environment (like you did)?

I think your code is much cleaner. I merged it into my own fork and will continue based on that. I like to add some other capabilities like setting start and end position for ego agents for example which is simple. But I think the way you did is better.

How should done and rewards be defined? Probably as tuples, like you did, but this will not be compatible with default wrappers from openai gym (e.g. Monitor/Stats Recorder, which expects a float reward rather than a tuple). It is probably fine if it is only for multi-agent variants, though.

I think it is possible to have separate rewards (local rewards) for each agent, instead of having a total reward. For example in DeepDrive-Zero env, each agent has its own reward. In particle environments from openai, usually, a team has a global reward. In multi-walker environment, agents can have both a local or team reward. I think for self-driving tasks like this, the local reward is fine, similar to what is at the moment. using self-play and shared weights between agents and also some sort of centralized training, they can learn to cooperate.

About the wrappers, you are right. I have the same problem with Monitor wrapper for multi-agent version. I wanted to modify it for Multi-agent setting but didn't have time yet. Maybe this weekend.

Suggestion: If we consider the episode length as the only factor to terminate the episode, we can have one single done value. And we can also return the sum of all rewards as reward and return the reward for each agent separately in info. It Is also possible to concatenate all the obs and action in one tensor like a single-agent RL and just use the number of learning agents, maybe from info to separate them outside of the environment. I think all of these can be done by a wrapper. I mean we can do everything in a standard way, separate obs, action, reward, done, and then write a wrapper to convert the multi-agent env to a single-agent env, as I explained above, and then use the wrapped env as input to the Monitor wrapper.

eleurent commented 3 years ago

I like to add some other capabilities like setting start and end position for ego agents for example which is simple.

Feel free to open a PR 😃

It Is also possible to concatenate all the obs and action in one tensor like a single-agent RL and just use the number of learning agents, maybe from info to separate them outside of the environment

Yes, for the observation and action it is working already by returning tuples, since there is rarely an assumption on their shapes. I think the main issues are with the reward & done variables, which are often expected to be float and bool respectively in most RL frameworks. I like your suggestion, but I think it would be the other way round:

the default framework is single-agent (cooperative reward), so that it does not break traditional RL pipelines.
but we add local rewards/done signal to the info field
we can add a MultiAgentWrapper that returns theses tuples directly.

kargarisaac commented 3 years ago

Yes, it seems good. The single reward value would be sum of local rewards, right? The user will handle what he wants outside of the env.

eleurent commented 3 years ago

Yes, it seems good. The single reward value would be sum of local rewards, right? The user will handle what he wants outside of the env.

Right. I will finish this dev and merge the branch soon.

eleurent commented 3 years ago

I am running a first DQN training in the multi-agent setting (4 agents + other vehicles) , it's looking quite good so far! (~50% progress)

It converges faster than I expected, due to the x4 number of samples for each environment step.

kargarisaac commented 3 years ago

sounds great. I see that you updated the code too. thanks

eleurent commented 3 years ago

There it is. marl

I think the branch is ready to be merged, even if the feature is not 100% finished. There mostly remains:

Add multi-agent support for all environments (rewards and terminal states must account for every controlled agent)
Add documentation

kargarisaac commented 3 years ago

It looks amazing. The right values are for actions, right? all the green cars are trained separately? any weight sharing? I think at least having one sample environment would be enough and users can do the same thing for other environments. I can do that also.

eleurent commented 3 years ago

Yes, the left pane shows the estimated Q-values for the 1st controlled vehicle, and the green cars all share the same policy.

parvkpr commented 3 years ago

Hi @eleurent , was this feature merged with the main branch? I am trying to control 2 vehicles by setting the controlled_vehicles parameter as 2 in the config for highway-v0 environment. I am also using continuous action space. The observation space is perfect but the action space is of type Box(2,). It is my understanding that this represents the steering and throttle for one controllable vehicle. I am not able to figure out how to control the other vehicle. Please let me know how to resolve this

parvkpr commented 3 years ago

Hi @eleurent , was this feature merged with the main branch? I am trying to control 2 vehicles by setting the controlled_vehicles parameter as 2 in the config for highway-v0 environment. I am also using continuous action space. The observation space is perfect but the action space is of type Box(2,). It is my understanding that this represents the steering and throttle for one controllable vehicle. I am not able to figure out how to control the other vehicle. Please let me know how to resolve this

I was able to resolve this. For anyone else looking for multi-agent control config, Here is the code I used. (Adding this since there are no docs on this yet.) env.configure({ "action": { "type": "MultiAgentAction", "action_config": { "type": "ContinuousAction" } }, "controlled_vehicles": 2, "vehicles_count": 0, "absolute" : True }) The step call would be of the form: env.step((ac_car_1, ac_car_2)) where ac_car_1 and ac_car_2 are the individual actions for car 1 and car 2 respectively. (The step function expects tuple of action spaces)

eleurent commented 3 years ago

Well done for working this out @parvkpr, and sorry for the lack of documentation yet, I'll fix it as soon as possible.

parvkpr commented 3 years ago

Hi @eleurent, thank you! Can you help me out with setting locations of ego cars? I need them to be in the same lane. Should I open a new issue for that?

eleurent commented 3 years ago

Should I open a new issue for that?

Yes please :)

parvkpr commented 3 years ago

Hi @eleurent , I have opened #111 for my use case as suggested.

stefanbschneider commented 3 years ago

@eleurent The trained multi-agent scenario looks great! Just a conceptual question (I might've missed it): Do your multiple agents all take actions simultaneously in each step or in some sequential order?

I'd expect that a large difficulty in multi agent is the variance resulting from other agent's actions. If an agent takes an action and 3 other agents also take actions at the same time, the resulting observation and reward are likely influenced by all 4 actions, but each agent only knows about its own. Is that a problem?

eleurent commented 3 years ago

Hey, thanks :)

Do your multiple agents all take actions simultaneously in each step or in some sequential order?

Simultaneously. More precisely, there is a first loop where each agent make a decision of which control actions to apply given the current (fixed) traffic state. And only then, these actions are executed simultaneously on one timestep (independent forward of the kinematic model for each vehicle).

I'd expect that a large difficulty in multi agent is the variance resulting from other agent's actions. If an agent takes an action and 3 other agents also take actions at the same time, the resulting observation and reward are likely influenced by all 4 actions, but each agent only knows about its own. Is that a problem?

Yes, I think that the fact that the agent only knows about its own action is a central problem in game theory, regardless of whether the players act simultaneously or sequentially.

DongChen06 commented 3 years ago

There it is.

I think the branch is ready to be merged, even if the feature is not 100% finished. There mostly remains:

Add multi-agent support for all environments (rewards and terminal states must account for every controlled agent)

Add documentation

Hi author, does this code open-sourced for the multi-agent version?

eleurent commented 3 years ago

Sure, it is available on rl-agents.

You can run it with

cd scripts
python experiments.py evaluate configs/IntersectionEnv/env_multi_agent.json \
                               configs/IntersectionEnv/agents/DQNAgent/ego_attention_2h.json \
                               --train --episodes=3000

EDIT: Actually, I think this exact run was with 4 attention heads (you can edit it in the ego_attention_2h.json configuration), but I don't think it changes anything.

AizazSharif commented 3 years ago

Hi Eleurent,

First, congratulations on your Ph.D. defense. I have been following your work on highway-env and it's amazing. I was going through this open issue since I also needed a multi-agent environment. I wanted to ask if it is merged as a feature in highway-env itself since the current code is within rl-agents.

Any information will be appreciated.

Thanks.

eleurent commented 3 years ago

Thank you for your kind words!

Yes, this is merged to the main branch. This issue is still open because I need to add some documentation...

To have a multi-agent environment, you must configure it as follows:

env.configure({
    "controlled_vehicles": <desired number of controlled agents, e.g. 3>,
    "action": {
        "type": "MultiAgentAction",
        "action_config": {
            <desired action configuration, e.g. "type": "DiscreteMetaAction">
        }
    },
    "observation": {
        "type": "MultiAgentObservation",
        "observation_config": {
            <desired observation configuration, e.g. "type": "Kinematics">
        }
    }
})

This will set the observation and action spaces as Tuples of traditional observations and actions.

Some agents in rl-agents support these Tuple spaces for the multi-agent setting, for example the DQN agent uses a central policy to map each agent-observation in the observation tuple to its associated agent-action in the action tuple.

Does that help?

AizazSharif commented 3 years ago

Thanks a lot for the response @eleurent.

Yes, it is helpful since I wanted to know how to use env.configure here. I will try this out and let you know.

AizazSharif commented 3 years ago

Hi @eleurent

I tried the snippet but I was unable to reproduce the multi-agent environment in parking_her.ipynb code. Can you help me out with this?

Thanks.

francissunny25 commented 1 year ago

ls when agents reach their targets, it can also be the case that agents do not have targets and simply keep driving (depending on the environment of course). And if an agent has a collision, it can get a terminal penalty and other agents will be able to keep training for the rest of the episode. But sure, we may as well reset the whole env upon collision of any agent, why not.

Can the terminal state for multi-agent setting be configured to end when any of the agent collides? Is it done by adding the logic for any agent collision in _is_terminal() of the required environment or can this be controlled through the config?

eleurent commented 1 year ago

Can the terminal state for multi-agent setting be configured to end when any of the agent collides?

I think this is often the case by default, but no this is non-configurable as of now.

Is it done by adding the logic for any agent collision in _is_terminal() of the required environment

Yes exactly, see e.g.:

https://github.com/eleurent/highway-env/blob/ab209de9c1a0da7524e74eb817beec04e6415a0d/highway_env/envs/intersection_env.py#L83

You can even make it configurable and send a PR :)

francissunny25 commented 1 year ago

Thank you for the reply, and sorry for commenting on a closed issue.

Why do we have two methods for calculating terminal state (_is_terminal and _agent_is_terminal)? Is the method _is_terminal used to terminate the episode and the method _agent_is_terminal used to get the info for a single agent done?

To make terminal state configurable I was thinking of adding a parameter in the default config like 'terminate_on_any_agent_collision' and changing the above line to if terminate_on_any_agent_collision: return any(vehicle.crashed for vehicle in self.controlled_vehicles) \ else: return all(vehicle.crashed for vehicle in self.controlled_vehicles) \

Would this be the right approach?

eleurent commented 1 year ago

Why do we have two methods for calculating terminal state (_is_terminal and _agent_is_terminal)? Is the method _is_terminal used to terminate the episode and the method _agent_is_terminal used to get the info for a single agent done?

Correct!

Would this be the right approach?

Absolutely! You can maybe make it slightly more general by writing something like:

agent_terminal = [self._agent_is_terminal(vehicle) for vehicle in self.controlled_vehicles]
agg_fn = {'any': any, 'all: all}[self.config['termination_agg_fn']]
return agg_fn(agent_terminal)

Farama-Foundation / HighwayEnv

Scaling to multiple agents #35