Farama-Foundation / HighwayEnv

A minimalist environment for decision-making in autonomous driving
https://highway-env.farama.org/
MIT License

Training with DQN('MlpPolicy') on Intersection Environment #586

Open erfun77 opened 2 months ago

erfun77 commented 2 months ago

@eleurent Hi Edouard,

I've trained my model in both the Highway and Intersection environments, with identical hyperparameters, using DQN (MlpPolicy) for both. The problem is that in Highway the agent learns to avoid collisions after 2000-3000 steps, whereas in Intersection it does not learn anything useful even after 8000 steps.

The reward function for Highway is { r_collision = -1, r_speed = 0.4 }, and since I'm only considering longitudinal actions, there is no reward for lane changes.

The reward function for Intersection is { r_collision = -1, r_speed = 0.4, r_arrived = 0.8}.

Observations and rewards are normalized in both, and the observation features are ['presence', 'x', 'y', 'vx', 'vy'] in both. Do you think adding the heading angle could be more important for Intersection?
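Concretely, my setup looks roughly like the sketch below (I'm using stable-baselines3; the config keys and values are an approximation of what I described above, not my exact script):

```python
# Rough sketch of the described setup: intersection-v0, longitudinal-only
# actions, normalized Kinematics observation, SB3 DQN with an MlpPolicy.
# Config key names follow recent highway-env versions and may differ in yours.
import gymnasium as gym
import highway_env  # noqa: F401  (registers the highway-env environments)
from stable_baselines3 import DQN

env = gym.make("intersection-v0")
env.unwrapped.configure({
    "observation": {
        "type": "Kinematics",
        "features": ["presence", "x", "y", "vx", "vy"],
        "normalize": True,
    },
    "action": {"type": "DiscreteMetaAction", "longitudinal": True, "lateral": False},
    "collision_reward": -1.0,   # r_collision
    "high_speed_reward": 0.4,   # r_speed
    "arrived_reward": 0.8,      # r_arrived
    "normalize_reward": True,
})
env.reset()

model = DQN("MlpPolicy", env, verbose=1)  # other hyperparameters omitted
model.learn(total_timesteps=8000)
```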

As you can see, the Highway env converges but Intersection does not. I guess one potential problem could be using the MlpPolicy in Intersection, but do you have any recommendations? In your paper you used Transformers, but I don't know how to implement that. Is there a simpler solution?

Also, do you have any suggestions for better shaping my reward function in Intersection?


The purple plot is for Intersection.

eleurent commented 2 months ago

Indeed, this reward plot is not very good; it's surprising that the reward does not improve at all throughout training.

It is true that the MlpPolicy is not best suited for this task and I had better results with a Transformer model, but in our paper we could still see some progress happening when training with an MlpPolicy and KinematicsObservation (see paper Figure 4, total reward increases from 2.1 to 3.8).

So I'm not sure what is going on exactly. Maybe it would be worth investigating with simpler domains and progressively increasing the difficulty: e.g. remove all other vehicles at first: does the vehicle learn to always drive at maximum speed? Then add a single vehicle (always with the same initial position and velocity): does the vehicle learn to avoid it? If everything is fine so far, and learning only fails when scaling to the full scene with random vehicles at random positions/speeds, then it's probably a problem of representation / policy architecture. But if the algorithm struggles even in these simpler scenarios, there is probably something wrong in the environment definition or learning algorithm.
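In terms of configuration, that curriculum might look roughly like the sketch below (the "initial_vehicle_count" / "spawn_probability" keys are the ones I remember, so check your installed intersection_env.py; fixing the single vehicle's initial position and speed would still require editing the env's vehicle-creation code):

```python
# Hypothetical staging of the debugging curriculum described above.
# Key names ("initial_vehicle_count", "spawn_probability") may vary by version.
import gymnasium as gym
import highway_env  # noqa: F401

stages = [
    {"initial_vehicle_count": 0, "spawn_probability": 0.0},   # no traffic at all
    {"initial_vehicle_count": 1, "spawn_probability": 0.0},   # a single other vehicle
    {"initial_vehicle_count": 10, "spawn_probability": 0.6},  # full randomized scene
]

env = gym.make("intersection-v0")
for overrides in stages:
    env.unwrapped.configure(overrides)
    env.reset()
    # ... train for a few thousand steps at this stage and inspect the reward curve.
```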

Observations and rewards are normalized in both, and the observation features are ['presence', 'x', 'y', 'vx', 'vy'] in both. Do you think adding the heading angle could be more important for Intersection?

The config that I used is this one. So yes, I did include heading angles. I think they are relevant because they help the agent understand whether a vehicle is starting to turn, and in turn whether its path is going to cross yours. See Figure 7 in the paper, which showed a high sensitivity of the trained policy to the heading angle. Of course, part of this information is already contained in the vx/vy velocity (except when it is close to 0), but it doesn't hurt to include it as well.

I also used absolute coordinates for intersection (but not for highway), is that your case too?
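For reference, the observation block of that config looks roughly like this (values close to the intersection-v0 defaults; verify against the linked config file for the exact ones):

```python
# Approximate observation config with heading angles (cos_h, sin_h) and
# absolute coordinates; values are close to the intersection-v0 defaults.
obs_config = {
    "observation": {
        "type": "Kinematics",
        "vehicles_count": 15,
        "features": ["presence", "x", "y", "vx", "vy", "cos_h", "sin_h"],
        "features_range": {
            "x": [-100, 100],
            "y": [-100, 100],
            "vx": [-20, 20],
            "vy": [-20, 20],
        },
        "absolute": True,   # absolute coordinates for the intersection task
        "normalize": True,
    }
}
# applied with env.unwrapped.configure(obs_config) before env.reset()
```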

In your paper you used Transformers, but I don't know how to implement that.

You can take inspiration from this script, where I implemented a custom Transformer policy to be used with PPO and the highway env; it can be ported to DQN and the intersection env.

Alternatively, my original implementation of DQN + Transformer/MLP (the one used in the paper) is available in this colab.
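If porting those is too involved as a first step, here is a hedged sketch (not the architecture from the paper or the linked script, just the general idea) of an attention-based features extractor plugged into SB3's DQN through policy_kwargs. It assumes the Kinematics observation of shape (vehicles_count, n_features), with the ego vehicle in the first row:

```python
import gymnasium as gym
import highway_env  # noqa: F401
import torch as th
from torch import nn
from stable_baselines3 import DQN
from stable_baselines3.common.torch_layers import BaseFeaturesExtractor


class AttentionExtractor(BaseFeaturesExtractor):
    """Hypothetical sketch: the ego vehicle attends over all observed vehicles."""

    def __init__(self, observation_space: gym.spaces.Box, embed_dim: int = 64, n_heads: int = 2):
        super().__init__(observation_space, features_dim=embed_dim)
        n_features = observation_space.shape[-1]
        self.embed = nn.Linear(n_features, embed_dim)
        self.attention = nn.MultiheadAttention(embed_dim, n_heads, batch_first=True)

    def forward(self, obs: th.Tensor) -> th.Tensor:
        # obs has shape (batch, vehicles_count, n_features)
        tokens = self.embed(obs)
        ego = tokens[:, :1, :]                        # ego vehicle is row 0
        out, _ = self.attention(ego, tokens, tokens)  # ego attends to everyone
        return out.squeeze(1)                         # (batch, embed_dim)


env = gym.make("intersection-v0")
model = DQN(
    "MlpPolicy",
    env,
    policy_kwargs=dict(
        features_extractor_class=AttentionExtractor,
        features_extractor_kwargs=dict(embed_dim=64, n_heads=2),
    ),
    verbose=1,
)
model.learn(total_timesteps=20_000)
```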

erfun77 commented 2 months ago

Indeed, this reward plot is not very good; it's surprising that the reward does not improve at all throughout training.

Thanks for your comprehensive and kind response, Edouard!

It is true that the MlpPolicy is not best suited for this task and I had better results with a Transformer model, but in our paper we could still see some progress happening when training with an MlpPolicy and KinematicsObservation (see paper Figure 4, total reward increases from 2.1 to 3.8).

So I'm not sure what is going on exactly. Maybe it would be worth investigating with simpler domains and progressively increasing the difficulty: e.g. remove all other vehicles at first: does the vehicle learn to always drive at maximum speed? Then add a single vehicle (always with the same initial position and velocity): does the vehicle learn to avoid it? If everything is fine so far, and learning only fails when scaling to the full scene with random vehicles at random positions/speeds, then it's probably a problem of representation / policy architecture. But if the algorithm struggles even in these simpler scenarios, there is probably something wrong in the environment definition or learning algorithm.

One problem is that I can't remove the other vehicles: even when I change "initial_vehicles_count" to 1 or 0, there are still other cars! One trick was to set their speed to zero, but in the end I need to control their number. Is there any other part of the code that I should change? I think it may also be related to "spawn_probability".

[Screenshot of the Intersection environment]

Observations and rewards are normalized in both, and the observation features are ['presence', 'x', 'y', 'vx', 'vy'] in both. Do you think adding the heading angle could be more important for Intersection?

The config that I used is this one. So yes, I did include heading angles. I think they are relevant because they help the agent understand whether a vehicle is starting to turn, and in turn whether its path is going to cross yours. See Figure 7 in the paper, which showed a high sensitivity of the trained policy to the heading angle. Of course, part of this information is already contained in the vx/vy velocity (except when it is close to 0), but it doesn't hurt to include it as well.

I also used absolute coordinates for intersection (but not for highway), is that your case too?

Yes, I did that too.

In your paper you used Transformers, but I don't know how to implement that.

You can take inspiration from this script, where I implemented a custom Transformer policy to be used with PPO and the highway env; it can be ported to DQN and the intersection env.

Alternatively, my original implementation of DQN + Transformer/MLP (the one used in the paper) is available in this colab.

And I think there is a mistake in the Testing part of this notebook: instead of evaluation.train(), it should be evaluation.test(), and in "evaluation = Evaluation(env, agent, num_episodes=20, training=False, recover=True, display_agent=False)" we should set recover=True to use the latest model.
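Concretely, I think the testing cell should read something like this (same Evaluation arguments as in the notebook, with env and agent coming from the earlier cells):

```python
# Suggested fix for the notebook's testing cell, using the rl-agents
# Evaluation class as set up earlier in the notebook.
from rl_agents.trainer.evaluation import Evaluation

evaluation = Evaluation(
    env,
    agent,
    num_episodes=20,
    training=False,      # evaluate, don't keep training
    recover=True,        # reload the latest saved model
    display_agent=False,
)
evaluation.test()        # rather than evaluation.train()
```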

eleurent commented 2 months ago

One problem is that I can't remove the other vehicles: even when I change "initial_vehicles_count" to 1 or 0, there are still other cars! One trick was to set their speed to zero, but in the end I need to control their number. Is there any other part of the code that I should change? I think it may also be related to "spawn_probability".

Setting spawn_probability to 0 should help, yes, but if that's not enough you can just edit intersection_env.py and comment out the contents of spawn_vehicles() and most of make_vehicles() (except the ego-vehicle and goal creation).
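If you'd rather not edit the installed package, a runtime alternative is to no-op the spawning hook before resetting. This assumes the hook is the private _spawn_vehicle method in your version of intersection_env.py, so it may break across releases:

```python
# Remove all non-ego traffic at runtime by disabling spawning.
# _spawn_vehicle is a private method and its name may change between versions.
import gymnasium as gym
import highway_env  # noqa: F401

env = gym.make("intersection-v0")
env.unwrapped.configure({"initial_vehicle_count": 0, "spawn_probability": 0.0})
env.unwrapped._spawn_vehicle = lambda *args, **kwargs: None  # no new vehicles
env.reset()
```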

And I think there is a mistake in the Testing part of this notebook: instead of evaluation.train(), it should be evaluation.test(), and in "evaluation = Evaluation(env, agent, num_episodes=20, training=False, recover=True, display_agent=False)" we should set recover=True to use the latest model.

You're right! I wonder how I missed that... I'll fix it, thanks for the feedback.