duckietown / gym-duckietown

Self-driving car simulator for the Duckietown universe
http://duckietown.org

Smooth reward function for collision detection #24

Closed maximecb closed 6 years ago

maximecb commented 6 years ago

It seems very difficult, at the moment, to train an imitation learning agent to avoid collisions in loop_obstacles. I think part of the problem might be that the reward function is not smooth. You maximize the reward by staying along the lane curve, right until you collide, at which point you get -1000.

It would be better if we could have some smooth decrease in reward for being close to objects. This may be nontrivial to do well, as this reward function will "compete" with the one drawing the agent into the lane. The reward function needs to be properly tested both with agents trained with reinforcement learning, and agents trained with imitation learning.

bhairavmehta95 commented 6 years ago

If we define "safety circles" around objects (an example radius could be extent * scale * safety_factor, centered at object[pos]), then we can calculate the amount of overlap between our agent's safety circle and an obstacle's safety circle and use that as negative reward. We could even make it exponentially worse to have more overlapping area, rather than just a linear scale; I'd say exponential is better, since it is much more than 2x scarier / worse to get within 1m of an obstacle than within 2m.

The intersection check is easy and cheap, and while the area of the overlap requires some geometric manipulation, you can also do it pretty quickly with a formula.
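
Roughly what I have in mind for that formula, as a sketch only: the function names and the exponential constant below are placeholders, not anything that exists in the simulator.

```python
import numpy as np

def circle_overlap_area(c1, r1, c2, r2):
    """Area of intersection of two circles with centers c1, c2 and radii r1, r2."""
    d = np.linalg.norm(np.asarray(c1) - np.asarray(c2))
    if d >= r1 + r2:
        return 0.0                       # circles don't touch
    if d <= abs(r1 - r2):
        return np.pi * min(r1, r2) ** 2  # one circle entirely inside the other
    # standard lens-area formula for two intersecting circles
    a1 = r1 ** 2 * np.arccos((d ** 2 + r1 ** 2 - r2 ** 2) / (2 * d * r1))
    a2 = r2 ** 2 * np.arccos((d ** 2 + r2 ** 2 - r1 ** 2) / (2 * d * r2))
    a3 = 0.5 * np.sqrt((-d + r1 + r2) * (d + r1 - r2) * (d - r1 + r2) * (d + r1 + r2))
    return a1 + a2 - a3

def overlap_penalty(area, k=5.0):
    """Exponential penalty: deep overlaps are much worse than shallow ones (k is arbitrary)."""
    return -(np.exp(k * area) - 1.0)
```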

The main issue I see is which object you'd calculate this reward component with respect to, especially because Euclidean distances alone wouldn't be the best method and you wouldn't want to loop through every obstacle. One possible way to solve this:

Your simulator already stores the position of each object, the position of the agent, and the vector in which the agent is heading. If you project the position of each object onto the agent's directional vector, you get a sort of "number line".

From there, because our agent will only move forward, we can take the closest object that has a positive projection on this "number line" and use it as the proxy. (And we can do it all in numpy! :smiley:)
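
In numpy it could look something like this, a rough sketch where agent_pos, dir_vec, and obj_positions are placeholder arrays (dir_vec assumed to be a unit vector), not the simulator's actual attributes:

```python
import numpy as np

def closest_object_ahead(agent_pos, dir_vec, obj_positions):
    """Index of the closest object with a positive projection onto the agent's
    heading, or None if every object is behind the agent."""
    rel = obj_positions - agent_pos   # (N, 2) vectors from agent to each object
    proj = rel @ dir_vec              # signed distance along the heading ("number line")
    ahead = np.where(proj > 0)[0]     # keep only objects in front of the agent
    if ahead.size == 0:
        return None
    return ahead[np.argmin(proj[ahead])]
```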

It's not perfect, and testing may show we need a few objects rather than just the closest one, but let me know what you think.

maximecb commented 6 years ago

In terms of speed, once again, I think it's numpy to the rescue. We should check against every collidable object in parallel. Another optimization we may want to work on: the duckiebot already dies if it steps out of a drivable tile, so we don't have to check for collisions against objects that are fully outside of drivable tiles. This might be necessary, in fact, because if you look at the udem1 level, the house is very close to the edge of the drivable tiles. A safety buffer would likely penalize the agent for driving next to it.

We may want to give the agent the sum of the penalties incurred for being too close to any object; that would be simple enough. My main concern is that this negative reward may drive the agent to do crazy things. For instance, there is a negative reward for driving in the left tile, and an even worse one for driving into an obstacle in the right tile; maybe turning back the way you came incurs less penalty. I suppose the only way we can really know is to test these things out.
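
For concreteness, the kind of vectorized check I mean, as a rough sketch: all the names are placeholders, and the exponential on the penetration depth is just one arbitrary choice of per-object penalty.

```python
import numpy as np

def total_proximity_penalty(agent_pos, agent_radius, obj_positions, obj_radii, k=10.0):
    """Sum of per-object penalties, computed against every collidable object at once.
    An object contributes only while its safety circle overlaps the agent's."""
    dists = np.linalg.norm(obj_positions - agent_pos, axis=1)     # (N,) center distances
    overlap = np.maximum((obj_radii + agent_radius) - dists, 0.0) # penetration depth, 0 if clear
    return -np.sum(np.exp(k * overlap) - 1.0)                     # summed over all objects
```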

bhairavmehta95 commented 6 years ago

So are you proposing to give the agent a (negative) collision reward based on the sum of distances to all objects on drivable tiles? (I'm not sure what the "sum of the penalty incurred for being too close to any object" refers to.)

If we're computing Euclidean distances and then using those as part of the reward, my main issue with this is that the scheme will still negatively reward you after you pass an object, which shouldn't really be the case.

The collision checking for objects not on drivable tiles can be done when we load the objects themselves. Currently, we check whether the object is visible (if it's not, we don't add it to the array of collision-check objects), so we can just add an additional check to see whether it's on a drivable tile.
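
Something along these lines at load time; is_on_drivable_tile and the visible/pos fields here are stand-ins for whatever the loader actually exposes, so treat this as a sketch:

```python
def filter_collidable(objects, is_on_drivable_tile):
    """Keep only objects the agent could actually hit."""
    kept = []
    for obj in objects:
        if not obj.visible:
            continue  # existing check: invisible objects are skipped
        if not is_on_drivable_tile(obj.pos):
            continue  # new check: ignore objects fully off drivable tiles
        kept.append(obj)
    return kept
```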

Anyway, I can add this check to my open PR #28 and do some testing on how well it works.

maximecb commented 6 years ago

I think we largely mean the same thing: the sum of the penalties from all "safety circle" overlaps, computed in parallel.

I would save the not-on-drivable-tile check for another PR; I prefer to keep the PRs small and test things separately. We also need to keep efficiency in mind once again, preferably not looping over every drivable tile for every object.

bhairavmehta95 commented 6 years ago

I wrote both versions:

  1. Reward based on the closest object projected onto the agent's directional vector,
  2. Reward based on the sum of the penalties from all "safety circle" overlaps,

and started empirically testing them.

Drawbacks of 1: Because it's using the directional vector, it still has a similar inconsistency in the reward: as soon as you pass an object, you no longer get any negative reward from that object.

Drawbacks of 2: Checking all of the points at once gives us some NaN errors when computing the arc cosine (which is needed to compute the amount of overlap).
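
The NaNs most likely come from the arc-cosine argument landing outside [-1, 1] for pairs whose circles don't actually overlap (or from floating-point error). Clipping the argument is one way to guard the batched computation, though it only papers over those non-overlapping pairs; a sketch:

```python
import numpy as np

def safe_arccos(x):
    # arguments outside [-1, 1] (non-overlapping circles, floating-point error)
    # would otherwise produce NaNs
    return np.arccos(np.clip(x, -1.0, 1.0))
```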

Lmk what you think; I'd personally prefer (1), but either is fine.

maximecb commented 6 years ago

I don't love 1 because it depends on the direction vector, which it seems to me shouldn't matter.

For 2, I think you could do it without the arc cosine, because we don't need an accurate estimate of the amount of overlap. You can compute the distance between the centers of the two objects, subtract the sum of the safety-circle radii, and use that as an overlap metric.
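
Concretely, something like this (a sketch with placeholder names, with the sign flipped so that larger values mean deeper overlap):

```python
import numpy as np

def overlap_metric(agent_pos, agent_radius, obj_positions, obj_radii):
    """Sum of safety-circle radii minus center distance, clamped at zero:
    non-overlapping pairs contribute nothing, and no arc cosine is involved."""
    dists = np.linalg.norm(obj_positions - agent_pos, axis=1)
    return np.maximum((agent_radius + obj_radii) - dists, 0.0)

# the penalty term could then be, e.g., -np.sum(overlap_metric(...))
```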

bhairavmehta95 commented 6 years ago

Okay, I will use that. Hopefully the static obstacles PR and this PR will both be in within a few hours.

bhairavmehta95 commented 6 years ago

Both PRs are in, but I imagine there will be some merge conflicts (unsure) between this PR and #32. I will fix whatever is needed in the other as soon as one of them is merged.

maximecb commented 6 years ago

I'm adding the penalty. Still struggling to train an agent, unfortunately. As you said, maybe some hyperparameter tuning is required. Though this is also eroding my faith in RL.