Farama-Foundation / HighwayEnv

A minimalist environment for decision-making in autonomous driving
https://highway-env.farama.org/
MIT License
2.64k stars 753 forks

Some questions about intersection scenes #540

Open yshichseu opened 11 months ago

yshichseu commented 11 months ago

Dear author, hello. Following your code, I introduced the attention mechanism into the default intersection environment and trained it with the PPO algorithm from Stable-Baselines3. After 50k training steps, I noticed the following situations and would like to ask for your advice. Thank you!

1. During training, the average reward fluctuates significantly. Is this normal for intersection scenes, or do I need further improvements (such as increasing the arrival reward and the collision penalty)?
2. When watching the rendered animation, I noticed two issues that I don't know how to solve. First, the ego vehicle accelerates as it approaches the finish line, which often causes it to crash into the vehicle in front. Can this be controlled by adding a minimum headway to the ego vehicle (other vehicles seem to maintain a headway)? Where would I set it? Second, after training, the vehicle still collides with oncoming vehicles from other directions at the intersection, and I don't know what causes this.

https://github.com/Farama-Foundation/HighwayEnv/assets/106796102/8d0ed5c9-f0aa-46c9-bc53-c02e559b2210

In short, compared to before, I have made satisfactory progress. Once again, I would like to express my gratitude for your environment and contribution!

eleurent commented 11 months ago

Hi!

During training, the average reward fluctuates significantly. Is this normal for intersection scenes, or do I need further improvements (such as increasing the arrival reward and the collision penalty)?

I think some fluctuations are expected, unless maybe:

  1. at each evaluation step, you average over a sufficiently large number of episodes. The policy and the initial state are both stochastic, and so is the policy return, so this is needed to accurately estimate the mean of the return distribution (policy value).
  2. you use a very small learning rate and a large batch size for the policy, such that you're more likely to get monotonic improvements from PPO. But that will unfortunately result in longer training times.
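Here is a minimal sketch of that kind of setup, assuming Stable-Baselines3's PPO and EvalCallback on intersection-v0; the specific hyperparameter values and the number of evaluation episodes are illustrative guesses, not tuned recommendations:

```python
import gymnasium as gym
import highway_env  # noqa: F401  -- registers the highway-env environments
from stable_baselines3 import PPO
from stable_baselines3.common.callbacks import EvalCallback

env = gym.make("intersection-v0")
eval_env = gym.make("intersection-v0")

# A small learning rate and a large batch size make PPO updates smoother
# (more nearly monotonic), at the cost of a longer training time.
model = PPO("MlpPolicy", env, learning_rate=1e-4, batch_size=256, n_steps=2048, verbose=1)

# Average the evaluation return over many episodes: the policy and the initial
# state are both stochastic, so a single episode is a noisy estimate of the value.
eval_callback = EvalCallback(eval_env, n_eval_episodes=50, eval_freq=10_000)

model.learn(total_timesteps=200_000, callback=eval_callback)
```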

When watching the rendered animation, I noticed two issues that I don't know how to solve. First, the ego vehicle accelerates as it approaches the finish line, which often causes it to crash into the vehicle in front.

I can confirm that I observed the same thing, and I'm not 100% sure what's going on. My initial thought is that this happens just before the end of the episode (which has a maximum duration), so crashing there does not matter as much because there is not much "reward to go" left anyway (e.g. if it's the last step), hence no incentive to avoid collisions. But that explanation does not really hold up, because there is still a fairly high collision penalty (-5) which the agent should try to avoid. It might be due to a problem in the representation / function approximation for some reason; this could be checked by looking at whether the value function is correctly estimated before an imminent collision (e.g. is the predicted value closer to 0 or to -5?).
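One way to run that check with Stable-Baselines3, as a hedged sketch: predict_values and obs_to_tensor belong to SB3's ActorCriticPolicy/BasePolicy, but the expected scale of the value depends on your reward normalisation and discount, so read the numbers qualitatively. `env` and `model` are the environment and trained agent from the setup above:

```python
import torch as th

# Roll out the trained PPO policy and print the critic's value estimate at each
# step; just before an unavoidable collision it should move towards the
# collision penalty rather than stay near 0.
obs, _ = env.reset()
done = truncated = False
while not (done or truncated):
    action, _ = model.predict(obs, deterministic=True)
    obs_tensor, _ = model.policy.obs_to_tensor(obs)
    with th.no_grad():
        value = model.policy.predict_values(obs_tensor)
    print(f"V(s) = {value.item():.2f}")
    obs, reward, done, truncated, info = env.step(action)
```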

Can this be controlled by adding a minimum headway to the ego vehicle (other vehicles seem to maintain a headway)? Where would I set it?

This is not possible at the moment: with the available action spaces, the longitudinal velocity is pretty much always controlled by the agent, so the agent itself has the responsibility of maintaining this headway. But you may want to look into https://github.com/Farama-Foundation/HighwayEnv/issues/538, where something similar has been tried (defining an action space on top of the IDM model, which performs longitudinal control, is used to drive the other vehicles, and has a headway parameter, MINIMUM_DISTANCE).
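For the other (non-ego) vehicles, the headway behaviour comes from the IDM parameters of their behaviour class. A hedged sketch of tuning those, assuming the DISTANCE_WANTED / TIME_WANTED class attributes of highway_env.vehicle.behavior.IDMVehicle and the other_vehicles_type config key (check the exact names against your installed version):

```python
import gymnasium as gym
import highway_env
from highway_env.vehicle.behavior import IDMVehicle

class CautiousIDMVehicle(IDMVehicle):
    """IDM-driven vehicle with a larger desired gap to its leader
    (attribute names assumed; verify against your highway-env version)."""
    DISTANCE_WANTED = 10.0  # [m] desired jam distance
    TIME_WANTED = 2.0       # [s] desired time headway

env = gym.make("intersection-v0")
# other_vehicles_type takes a dotted class path; "__main__" works when the
# subclass is defined in the running script.
env.unwrapped.configure({"other_vehicles_type": "__main__.CautiousIDMVehicle"})
env.reset()
```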

Second, after training, the vehicle still collides with oncoming vehicles from other directions at the intersection, and I don't know what causes this.

This is caused by insufficient learning. You may want to improve the agent somehow (increase capacity, training time, observation type / architecture, etc.). Of course you can also increase the collision penalty for instance, but this might also lead to overly conservative behaviours and exploration issues.
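If you do want to experiment with the reward weights, a hedged sketch of overriding them through the environment config; collision_reward and arrived_reward are assumed to be the intersection environment's config keys (compare with env.unwrapped.default_config() in your version):

```python
import gymnasium as gym
import highway_env

env = gym.make("intersection-v0")
env.unwrapped.configure({
    "collision_reward": -10,  # stronger penalty than the default mentioned above (-5)
    "arrived_reward": 2,      # larger bonus for reaching the goal
})
env.reset()
print(env.unwrapped.config["collision_reward"], env.unwrapped.config["arrived_reward"])
```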

yshichseu commented 11 months ago

This is caused by insufficient learning. You may want to improve the agent somehow (increase capacity, training time, observation type / architecture, etc.). Of course you can also increase the collision penalty for instance, but this might also lead to overly conservative behaviours and exploration issues.

Thank you for your reply. For the first two questions, I will do some more work to improve performance. For the third question, I believe I have already trained quite a lot, but the results at 50k and 200k steps are very similar, and collisions still occur. I only brought the attention-based PPO setup from the highway environment into the intersection environment. Is there anything else I should pay attention to in this process to avoid such problems, or do I perhaps just need more training and parameter tuning? Thank you for your work, it has been of great help to me.

eleurent commented 11 months ago

If training performance seems to plateau, maybe you are already "saturating" the current policy design and you need to increase the network capacity, or improve the architecture itself, or add additional information to the input observation, etc.
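A hedged sketch of one of those options, increasing the network capacity via Stable-Baselines3's policy_kwargs; the layer sizes are arbitrary illustrations, and if you use a custom attention feature extractor it would be passed through features_extractor_class in the same dict:

```python
from stable_baselines3 import PPO

# env: the intersection-v0 environment created as in the earlier sketches.
# Wider actor and critic networks than the SB3 defaults.
policy_kwargs = dict(net_arch=dict(pi=[256, 256], vf=[256, 256]))

model = PPO("MlpPolicy", env, policy_kwargs=policy_kwargs, verbose=1)
model.learn(total_timesteps=200_000)
```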

yshichseu commented 11 months ago

If training performance seems to plateau, maybe you are already "saturating" the current policy design and you need to increase the network capacity, or improve the architecture itself, or add additional information to the input observation, etc.

Okay, I will try and see if I can solve this problem! Thank you again for your reply!

yshichseu commented 10 months ago

If training performance seems to plateau, maybe you are already "saturating" the current policy design and you need to increase the network capacity, or improve the architecture itself, or add additional information to the input observation, etc.

Sorry to bother you, but I have another question. During training, how can I record data such as the ego vehicle's speed and the collision rate, in the same way that rewards are recorded? How can I achieve this? Thank you for your help!

eleurent commented 10 months ago

Depending on your training library, there has to be some place where the environment is stepped: `obs, reward, done, truncated, info = env.step(action)`

Here, you can look at the returned `info`, which contains the vehicle `speed` and `crashed` flag. And you can also inspect the `env` object itself (e.g. `env.road.vehicles`) to record any desired information about the current simulator state. You can then save this data as needed, e.g. into a file.
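A minimal logging sketch along those lines, assuming the speed and crashed keys of the info dict returned by highway-env (verify against your version) and a plain rollout loop outside any training library:

```python
import csv
import gymnasium as gym
import highway_env

env = gym.make("intersection-v0")

with open("rollout_log.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["step", "speed", "crashed", "reward"])
    obs, _ = env.reset()
    done = truncated = False
    step = 0
    while not (done or truncated):
        action = env.action_space.sample()  # replace with your trained policy
        obs, reward, done, truncated, info = env.step(action)
        writer.writerow([step, info["speed"], info["crashed"], reward])
        step += 1
```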

yshichseu commented 10 months ago

Depending on your training library, there has to be some place where the environment is stepped: `obs, reward, done, truncated, info = env.step(action)`

Here, you can look at the returned `info`, which contains the vehicle `speed` and `crashed` flag. And you can also inspect the `env` object itself (e.g. `env.road.vehicles`) to record any desired information about the current simulator state. You can then save this data as needed, e.g. into a file.

Dear author, hello. I have output the speed value at each step through an SB3 callback. I would like to know whether this value is the instantaneous speed at the end of each step. If I want the average speed over each episode, can I compute it directly in my program? What information do I need, or is additional processing required? Thank you very much!

eleurent commented 10 months ago

Yes, it represents the instantaneous speed. If you want the average speed, you'll have to implement it yourself, as this requires storing speed information during the intermediate simulation frames (between two actions).

You can edit the Vehicle class, for instance by adding a `total_speed` variable, then at the end of `step()` you can add `total_speed += speed`, and then edit the environment's `_info()` method to return `self.vehicle.total_speed / self.time` (or, if you have access to the env object, you can access it directly through `env.vehicle.total_speed / env.time`).
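If a per-action resolution is enough (rather than the per-simulation-frame accumulation described above), a self-contained alternative is a small gymnasium wrapper over the info dict; this is a sketch of a different, simpler technique, assuming the speed key returned by highway-env's _info():

```python
import gymnasium as gym
import highway_env

class EpisodeSpeedLogger(gym.Wrapper):
    """Adds a running per-episode average of info["speed"] to the info dict.

    Note: this averages at the action frequency only; for a per-frame average,
    modify the Vehicle class as described above.
    """

    def reset(self, **kwargs):
        self._speed_sum, self._steps = 0.0, 0
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, done, truncated, info = self.env.step(action)
        self._speed_sum += info["speed"]
        self._steps += 1
        info["episode_average_speed"] = self._speed_sum / self._steps
        return obs, reward, done, truncated, info

env = EpisodeSpeedLogger(gym.make("intersection-v0"))
```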