Farama-Foundation / HighwayEnv

A minimalist environment for decision-making in autonomous driving
https://highway-env.farama.org/
MIT License

Trained vehicle tends to freeze at the entrance of roundabout env #99

Closed juliayun23 closed 3 years ago

juliayun23 commented 3 years ago

Hi @eleurent, recently I've been exploring your excellent project. I used PPO2 from stable-baselines to train a vehicle in the roundabout env, and one weird phenomenon I noticed was that the trained vehicle tends to freeze at the entrance of the roundabout (it does not move anymore). This happened quite often when I evaluated the trained model. I'm not sure whether it's a problem with the hyperparameters, but I didn't change them much. Do you have any idea why this happens, or suggestions on how to solve it?

My testing code was like this:

import gym
import highway_env  # needed so that roundabout-v0 is registered with gym
from stable_baselines import PPO2
from stable_baselines.common.policies import MlpPolicy
env = gym.make("roundabout-v0")
model = PPO2(MlpPolicy, env, verbose=1, learning_rate=lambda f: f * 5.0e-4, cliprange=lambda f: f * 0.1)
model.learn(total_timesteps=int(1e5))
model.save('ppo2_mlp_roundabout')
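
For reference, here is the kind of minimal evaluation loop I use to reproduce the freezing behaviour (a sketch, assuming the standard stable-baselines 2 API and the save path above):

import gym
import highway_env  # registers roundabout-v0
from stable_baselines import PPO2

env = gym.make("roundabout-v0")
model = PPO2.load("ppo2_mlp_roundabout")

obs = env.reset()
done = False
while not done:
    # Deterministic actions make the conservative / frozen behaviour easy to observe
    action, _states = model.predict(obs, deterministic=True)
    obs, reward, done, info = env.step(action)
    env.render()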

Thanks for your help.

eleurent commented 3 years ago

Hi @juliayun23, thank you for that feedback, that's interesting. I have actually never tried model-free algorithms on this roundabout environment so far, only model-based (planning) approaches. I'll try and see if I can reproduce the conservative behaviour that you get.

Hypothesis: the model has trouble figuring out whether or not a vehicle is coming, and settles for the safe option. This could be alleviated by a different choice of observation type and neural network architecture; see e.g. this work on the intersection environment.

eleurent commented 3 years ago

That being said, I just noticed that the default observation for the roundabout env is the Time To Collision observation, which is not appropriate since it was designed for straight roads. I just changed it to a Kinematics observation, which is better suited (OccupancyGrid could work as well).
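
If you want to use the new observation without waiting for an updated release, you can also override it directly; a minimal sketch, assuming the env.configure() helper and the usual observation type names:

import gym
import highway_env

env = gym.make("roundabout-v0")
# Override the default observation; "OccupancyGrid" should work here as well
env.configure({"observation": {"type": "Kinematics"}})
env.reset()  # the new observation space only takes effect after a reset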

juliayun23 commented 3 years ago

Thanks for your reply @eleurent. Following your suggestion, I tried the roundabout env with the OccupancyGrid observation and the Kinematics observation, and the freezing problem was solved in both cases. But it seems like the training curve won't converge (below is the roundabout env with the Kinematics observation, trained using PPO2):

[Figure: PPO2 training reward curve on roundabout-v0 with the Kinematics observation]

The training curves also fail to converge in the intersection env when I train the vehicle with DQN and PPO2; I tried adjusting the learning rate and total_timesteps, but it didn't make a difference. I had a look at your paper, and it seems you customized the DQN algorithm for the intersection env by adding attention layers to your network architecture. So my new question is: in your opinion, is it possible at all for a baseline algorithm to perform/learn well in complex driving environments such as the roundabout and the intersection?

eleurent commented 3 years ago

But it seems like the training curve won't converge

It seems to me that training does converge (i.e. to a local maximum) in about 100k steps? You may be troubled by these downward spikes: they probably correspond to accidents. However,

  1. I do not think that most RL algorithms are expected to always have a smooth and monotonically improving reward. That may be the case for environments where the reward is smooth (e.g. distance to a goal), but this is not the case here because of the discontinuities related to collisions. Since the agent has an incentive to drive in near-collision states (tailgating other vehicles) because they yield a higher reward (related to velocity), these collisions are expected with random exploration.
  2. It may be the case that, for this definition of the reward function and given the uncertainty over the behaviors of other vehicles, it is actually worth it (in terms of expected return) to risk having a collision once in a while, if it allows getting rewards for high velocity the rest of the time. Increasing the collision penalty will decrease the number of accidents, but can in turn result in an overly conservative policy (freezing at the entrance of the roundabout); see the config sketch after this list.
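
As a rough illustration of that trade-off, the reward weights can be adjusted through the environment config; a sketch, assuming the collision_reward and high_speed_reward keys are exposed by your highway-env version (exact keys and default values may differ between releases):

import gym
import highway_env

env = gym.make("roundabout-v0")
# A harsher collision penalty reduces accidents, but may push the policy towards
# the overly conservative (frozen) behaviour; a larger speed reward does the opposite.
env.configure({
    "collision_reward": -5,    # more negative than the usual default of about -1
    "high_speed_reward": 0.2,
})
env.reset()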

The training curves also fail to converge in the intersection env when I train the vehicle with DQN and PPO2; I tried adjusting the learning rate and total_timesteps, but it didn't make a difference. I had a look at your paper, and it seems you customized the DQN algorithm for the intersection env by adding attention layers to your network architecture.

First, note that the curves reported in the social attention paper show the reward averaged over many random seeds, not a single training run. This explains why those spikes are not visible there, but they were still present in individual runs.
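
To illustrate the averaging (a sketch, not the paper's actual code; the .npy file names are hypothetical):

import numpy as np

# One array of per-episode returns for each independent training run (seed),
# assuming all runs recorded the same number of episodes
returns_per_seed = [np.load("run_%d_returns.npy" % seed) for seed in range(5)]
curves = np.stack(returns_per_seed)   # shape: (n_seeds, n_episodes)
mean_curve = curves.mean(axis=0)      # what the averaged figures show
std_curve = curves.std(axis=0)        # spread across seeds; individual spikes average out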

So my new question is: in your opinion, is it possible at all for a baseline algorithm to perform/learn well in complex driving environments such as the roundabout and the intersection?

It depends on what you call perform/learn well. If you mean e.g. reaching a 0% collision rate while still being able to cross the intersection / roundabout (i.e. no freezing robot), then no, I haven't observed such successes with baseline algorithms and architectures :/ I've mostly tried value-based methods (DQN, including the attention-based architecture from the paper mentioned above), but did not experiment much with policy gradients.

Perhaps surprisingly, it seems that even for such simple simulations, baseline algorithms / architectures are not sufficient.