Hi,
For 1., you can configure the environment as follows:
env.configure({"vehicles_count": 5})
Be careful to ensure that the positions of these vehicles are present in the observation (they are not in the default parking observation); e.g. you could use an image observation, or lidar + goal, etc.
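For reference, here is a minimal sketch of such a configuration. The observation keys and the `LidarObservation` type are from memory and may differ between highway-env versions, so treat them as assumptions:

```python
import gym
import highway_env  # noqa: F401, registers parking-v0

env = gym.make("parking-v0")
# Depending on your gym/highway-env versions you may need env.unwrapped.configure(...)
env.configure({
    "vehicles_count": 5,  # parked vehicles in addition to the ego-vehicle
    # The default KinematicsGoal observation does not include other vehicles,
    # so you could switch to an observation that perceives them, e.g.:
    # "observation": {"type": "LidarObservation"},
})
obs = env.reset()
```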
For 2., it's generally non-trivial to debug an RL algorithm as there are a number of things that could go wrong, so I'm afraid I don't have a silver-bullet solution for you. If you believe your algorithm is converging too early to a suboptimal local minimum, you may want to increase the amount of exploration (e.g. random noise injection, temporally correlated noise, etc.).
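As an illustration of temporally correlated noise, here is a minimal Ornstein-Uhlenbeck sketch (a common choice, not necessarily what you should use; the hyperparameters are purely illustrative):

```python
import numpy as np

class OrnsteinUhlenbeckNoise:
    """Temporally correlated exploration noise: successive samples are
    correlated, so exploratory actions keep pushing in the same direction
    for a while instead of averaging out to zero."""
    def __init__(self, size, theta=0.15, sigma=0.2, dt=1.0):
        self.theta, self.sigma, self.dt = theta, sigma, dt
        self.state = np.zeros(size)

    def reset(self):
        self.state = np.zeros_like(self.state)

    def sample(self):
        dx = -self.theta * self.state * self.dt \
             + self.sigma * np.sqrt(self.dt) * np.random.randn(*self.state.shape)
        self.state = self.state + dx
        return self.state

# usage (clip to the environment's action range):
# action = np.clip(policy(obs) + noise.sample(), -1.0, 1.0)
```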
Hey @eleurent,
Thanks for your response! I did a similar implementation in the virtual space of CARLA. SAC+HER is working fine; there is just some wobbling left in the maneuvering of the vehicle as it gets closer to the parking spot. Can you shed some light on the reward setting you designed for the case where there are nearby parked cars in the parking lot? Also, in the parking-v0 environment you let the agent predict the acceleration directly as one of the action values. Unfortunately, in CARLA it is not possible to command an acceleration; I have to predict throttle, brake, and steering commands instead. These are the two things I am struggling with right now: when the vehicle is far away from the parking spot it brakes so often that it rarely moves in the initial episodes, and then there is the wobbling issue once the vehicle starts learning to reach the parking spot. Could you please share your opinion on these? Also, in CARLA I don't have the flexibility to reverse the vehicle just by applying a negative acceleration (deceleration), since switching to reverse literally requires toggling the gear.
And also I believe the argument passed into "self.road.step(1 / self.config["simulation_frequency"])" is the rate at which the action is predicted by the agent, right? I need to reproduce the exact same time step between two actions in CARLA, since the rate at which the agent's actions are applied also matters a lot. 1) So what time step between two actions should I set up? 2) And secondly, does the steering action predicted by the agent have any relation to the previous steering action predicted by the agent? Like: current_action = current_action(predicted) + last_action?
There is just some wobbling left in the maneuvering of the vehicle as it gets closer to the parking spot. Can you shed some light on the reward setting you designed for the case where there are nearby parked cars in the parking lot?
My reward only has a distance-to-goal term and a collision term, nothing for comfort/jerk. The amount of wobbling that you get will depend a lot on your choice of action space and your dynamics (e.g. you could have a low-pass filter, preventing the steering wheel angle from jumping instantly from -180deg to +180deg). If you don't want to change the dynamics, you will have to go through the reward function, which is probably a bit more difficult because the agent will need to figure it out. You can e.g. add a reward term penalising angular velocity or angular acceleration, but for your agent to be able to measure it you will also need to provide these as measurements (or several frames if you are using images).
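For concreteness, a minimal sketch of what such a reward could look like (the norm, the weights and the `angular_velocity` measurement are illustrative assumptions, not the exact reward of this repo):

```python
import numpy as np

def reward(achieved_goal, desired_goal, crashed, angular_velocity,
           goal_weight=1.0, collision_penalty=5.0, wobble_weight=0.1):
    """Distance-to-goal + collision reward, with an optional comfort term."""
    # Distance-to-goal term: weighted norm between achieved and desired goal
    r = -goal_weight * np.linalg.norm(np.asarray(achieved_goal) - np.asarray(desired_goal))
    # Collision term
    if crashed:
        r -= collision_penalty
    # Optional comfort term penalising wobbling. For the agent to learn from it,
    # angular velocity must also be observable (or inferable from several frames).
    r -= wobble_weight * abs(angular_velocity)
    return r
```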
Also, in the parking-v0 environment you let the agent predict the acceleration directly as one of the action values. Unfortunately, in CARLA it is not possible to command an acceleration; I have to predict throttle, brake, and steering commands instead.
In theory the agent can also find a good policy in this alternative action space, but as you noticed it probably makes exploration difficult, as random actions are not as useful. So either you change how your agent explores (e.g. by using temporally-correlated noise, preventing the brake action from being used if it was used recently, etc.), or you change the action space so that exploration is easier: e.g. you could write a low-level acceleration controller which translates the desired acceleration (output by the agent) into throttle and brake commands, depending on the vehicle state.
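As an illustration of that second option, a rough sketch of such a low-level controller (the gains and the way the current acceleration is estimated are assumptions; in CARLA you would then feed the resulting values into a `carla.VehicleControl`):

```python
def acceleration_to_pedals(desired_acceleration, measured_acceleration,
                           k_ff=0.2, k_p=0.1):
    """Rough feedforward + proportional-feedback controller that translates
    the agent's desired acceleration into throttle/brake commands in [0, 1]."""
    command = k_ff * desired_acceleration + k_p * (desired_acceleration - measured_acceleration)
    throttle = min(max(command, 0.0), 1.0)   # positive command -> throttle pedal
    brake = min(max(-command, 0.0), 1.0)     # negative command -> brake pedal
    return throttle, brake
```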
And also I believe the argument passed into "self.road.step(1 / self.config["simulation_frequency"])" is the rate at which the action is predicted by the agent, right?
Correct!
So what time step between two actions should I set up?
A smaller timestep means:
1. the dynamics are simulated more accurately (smaller integration error, see below);
2. each action is executed over a shorter duration, so it has less impact on the state, and trajectories contain more actions, which makes credit assignment harder;
3. exploration is also harder, since uncorrelated random actions tend to cancel each other out before they have a visible effect.
And secondly, does the steering action predicted by the agent have any relation to the previous steering action predicted by the agent? Like: current_action = current_action(predicted) + last_action?
As of now, this project uses the Kinematic Bicycle Model. The steering action is used to set the front wheel angle delta_f, which is unrelated to the previous one (unlike what you suggested). However, there is an integration step in the dynamics: the front wheel angle delta_f is converted to a slip angle beta, which is then integrated to give the variation of heading. This integration step has a low-pass filtering effect, and contributes to smoothing the jitter in steering actions a bit, kind of like what you wrote.
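Condensed into code, the update described above looks roughly like this (a sketch based on my reading of the vehicle kinematics; the exact factors and variable names may differ from the repo's implementation):

```python
import numpy as np

def bicycle_step(x, y, heading, speed, acceleration, delta_f, length, dt):
    """One Euler step of a kinematic bicycle model."""
    beta = np.arctan(0.5 * np.tan(delta_f))              # slip angle from front wheel angle
    x += speed * np.cos(heading + beta) * dt             # integrate position...
    y += speed * np.sin(heading + beta) * dt
    heading += speed * np.sin(beta) / (length / 2) * dt  # ...and heading: this integration
    speed += acceleration * dt                           # smooths jittery steering commands
    return x, y, heading, speed
```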
First of all, I really appreciate you taking the time to go through these questions! I tried including the angular velocity and even the angular acceleration in the reward function, but it only made the vehicle's performance worse when parking in such a tight space. So I am still playing with the reward function.
Just thinking about your last point on the timestep between two consecutive actions: having a smaller timestep gives a better understanding of the dynamics from a simulation perspective. Does this mean that a smaller time step studies the dynamics of the vehicle in much more detail, i.e. that we are able to closely observe the dynamics of the vehicle more often? But thinking from a learning perspective, if we have such small time steps, doesn't it also mean there is a high possibility that the agent will hardly learn anything, since actions are being output at such a high frequency? Also, could you please elaborate on the harder exploration for the agent (your third point on the list)?
I tried including the angular velocity and even the angular acceleration in the reward function, but it only made the vehicle's performance worse when parking in such a tight space.
Yes, I imagine this can make the problem harder, especially since these signals are derivatives, which makes them quite noisy, which in turn means the value gradients will have a high variance. I would probably try to change the dynamics instead, to force them to be smoother regardless of the actions of the agent?
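A minimal sketch of what smoothing in the dynamics could look like, e.g. a first-order low-pass filter plus a rate limit on the commanded steering (tau and max_rate are illustrative values, not from the repo):

```python
def filtered_steering(previous_steering, commanded_steering, dt, tau=0.3, max_rate=1.0):
    """Smooth the steering command inside the dynamics, regardless of the agent."""
    # First-order low-pass filter towards the commanded value
    target = previous_steering + (dt / tau) * (commanded_steering - previous_steering)
    # Rate limit (rad/s) so the wheel angle cannot jump instantly
    max_step = max_rate * dt
    return previous_steering + max(-max_step, min(max_step, target - previous_steering))
```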
Having a smaller timestep gives a better understanding of the dynamics from a simulation perspective. Does this mean that a smaller time step studies the dynamics of the vehicle in much more detail, i.e. that we are able to closely observe the dynamics of the vehicle more often?
I'm not sure what you mean by study/observe, but essentially at each timestep we are integrating the derivative of the vehicle state with an Euler scheme, which assumes that the derivative is constant over the integration timestep. This assumption is false, which will result in an approximation error in the vehicle trajectory, all the more if the integration timestep is large. If approximation errors get too big, the trajectory can even become unstable / start oscillating even with smooth commands.
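To illustrate (a toy example, not this repo's code): integrating a simple constant-rate turn with forward Euler shows how the endpoint drifts away from the exact circular arc as the timestep grows:

```python
import numpy as np

def integrate(dt, t_end=5.0, speed=5.0, heading_rate=0.5):
    """Forward Euler assumes the derivative is constant over each step of dt."""
    x, y, heading = 0.0, 0.0, 0.0
    for _ in range(int(t_end / dt)):
        x += speed * np.cos(heading) * dt
        y += speed * np.sin(heading) * dt
        heading += heading_rate * dt
    return x, y

print(integrate(dt=0.05))  # close to the exact circular arc
print(integrate(dt=1.0))   # visibly off: the heading is held constant for a whole second
```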
But thinking from a learning perspective, if we have such small time steps, doesn't it also mean there is a high possibility that the agent will hardly learn anything, since actions are being output at such a high frequency? Also, could you please elaborate on the harder exploration for the agent (your third point on the list)?
Yes, exactly. On the other hand, if the frequency is too high then each action is executed over a very small duration, so it barely has an impact on the state compared to an alternative action. The total number of actions in the trajectory also increases, which makes credit assignment harder. Another effect is that, given we explore with random actions, a longer action duration means each action takes the vehicle further before we sample another one. Indeed, since the exploratory action noise is typically zero-mean, uncorrelated random actions tend to wash each other out: e.g. a positive acceleration will on average be followed by a negative acceleration, which cancels out its effect of setting the vehicle in motion. If the positive acceleration command is applied for longer, it has more of an effect before it is cancelled.
I hope this is clear, and btw these are just my own intuitions.
Thanks for all the comments and help, I really appreciate it. I hereby close the issue.
Hey @eleurent, first of all thanks a lot for all these gym environments, which made my life a bit easier, since I was working with CARLA to validate my SAC agent for autonomous parking. Two things: 1) Is there a way I can spawn more vehicles at the other parking spots (other than the goal position)? 2) Secondly, I have my own SAC agent with PER implemented, but I find it difficult to make the vehicle learn to park; I can see from the log that the vehicle (agent) falls into a local optimum.