PyTorch RL learns bad policy with default parameters

thomas-w-nl commented 4 years ago

Has anyone successfully trained a good policy with the current default parameters using the RL template? After training multiple times for over 1 million timesteps (60 000 episodes), the only policy that has been learned is to turn in a circle. I tried using the SteeringToWheelVelWrapper to learn only the heading with a fixed velocity of 0.5, but this did not fix the issue. I also tried to limit the number of rotations allowed, resetting the gym after more than 4 * 360 deg of angle difference. However, none of these approaches worked. Should I train much longer, or is something broken?
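For reference, the rotation limit described above can be sketched roughly as follows. This is only an illustration, not the code used in the runs: the names `wrap_angle` and `RotationLimiter` are hypothetical, the heading is assumed to be available in radians at every step, and "angle difference" is assumed to mean the net signed rotation since the last reset.

```python
import math


def wrap_angle(delta):
    """Wrap an angle difference (radians) into the interval [-pi, pi)."""
    return (delta + math.pi) % (2.0 * math.pi) - math.pi


class RotationLimiter:
    """Hypothetical helper: tracks net heading change since the last reset."""

    def __init__(self, max_turns=4.0):
        # 4 full turns corresponds to the 4 * 360 deg limit mentioned above.
        self.max_angle = max_turns * 2.0 * math.pi
        self.reset(0.0)

    def reset(self, heading):
        """Call at the start of every episode with the initial heading."""
        self.prev_heading = heading
        self.net_rotation = 0.0

    def update(self, heading):
        """Feed the current heading; returns True if the episode should end."""
        # Wrapping the per-step difference avoids a jump when the heading
        # crosses the +/- pi (or 0 / 2*pi) boundary.
        self.net_rotation += wrap_angle(heading - self.prev_heading)
        self.prev_heading = heading
        return abs(self.net_rotation) > self.max_angle
```

In an environment wrapper, `update` would be called once per step with the robot's heading, and the episode would be terminated when it returns True.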

Looking at the reward over time for a run of over 4000 episodes, further training does not appear likely to produce anything useful. (Validation reward is in blue.)

thomas-w-nl commented 4 years ago

After training for 30 000 episodes on "straight_road", with the robot always starting in the ideal position, it still does not learn to drive forwards and always turns straight off the map. Is something broken?

liampaull commented 4 years ago

I have also mostly seen this behavior. @bhairavmehta95 or @velythyl might have more insight.

Velythyl commented 4 years ago

I had the same behaviour when I started working on that repo.

I noticed a few bugs in the code, and have fixed them. I was going to do a PR, but I turned my attention to imitation learning lately so I didn't finish it.

I'll get on it tomorrow, I still have to clean up my code and commits but I should be able to open the PR either tomorrow or the day after (I'll have to train it to make sure everything works, and that takes a lot of time).

With the fixes, the car converges to either turning in a circle or going straight (it has a really hard time with curves). I didn't test it with a really long training time because of hardware constraints though, so that might be it.

Velythyl commented 4 years ago

It's still training, but just to be sure I've fixed the faulty behaviour - does this seem better to you? Right now it's only at 60k timesteps, so I'm sure with more computing power it could become better. It's not just turning in circles anymore, though it still likes turning more than going straight (but again, I think with more training that quirk might disappear).

It feels consistent with a "very early RL training that's still exploring the action space" model.

[GIF: fixedGif]

If this seems good I will open the PR.

Velythyl commented 4 years ago

Okay, I just saw it take two turns in a row at 90k timesteps, so I'll consider this fixed. I'll open the PR.

thomas-w-nl commented 4 years ago

That looks a lot more promising! In my experience the algorithm very quickly learns a max turning angle and really does not want to change. I'm very interested in the fixes, thanks a lot for your help. I'll start training for the night; it should do at least 1M timesteps by tomorrow.

After experimenting a little further, this code seems to deviate from the original DDPG by taking as many gradient steps at the end of each episode as there were timesteps in that episode. That seems a little excessive, and it did perform better after I reduced the number of gradient steps per episode.
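A minimal sketch of what reducing the gradient steps per episode could look like is shown here. The function names, the `ratio` and `max_steps` parameters, and the `update_fn` callback are all hypothetical and not taken from the repository; `ratio=1.0` reproduces the behaviour described above (one gradient step per timestep of the episode), while smaller values or a cap reduce it.

```python
def gradient_steps_for_episode(episode_timesteps, ratio=1.0, max_steps=None):
    """Hypothetical helper: number of gradient steps to run after an episode.

    ratio=1.0 gives one gradient step per environment timestep in the
    episode; ratio < 1.0 scales that down, and max_steps optionally caps it.
    """
    if episode_timesteps <= 0:
        return 0
    # Always do at least one update for a non-empty episode.
    steps = max(1, int(episode_timesteps * ratio))
    if max_steps is not None:
        steps = min(steps, max_steps)
    return steps


def run_updates(update_fn, episode_timesteps, ratio=1.0, max_steps=None):
    """Call update_fn once per gradient step and return how many were run.

    update_fn stands in for whatever performs one minibatch update of the
    actor and critic (an assumption, not the repository's actual API).
    """
    n = gradient_steps_for_episode(episode_timesteps, ratio, max_steps)
    for _ in range(n):
        update_fn()
    return n
```

For example, `run_updates(update_fn, 500, ratio=0.25)` would run 125 updates after a 500-timestep episode instead of 500.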

Velythyl commented 4 years ago

I don't have the rights to link issues or assign reviewers, but here is the PR: https://github.com/duckietown/challenge-aido_LF-baseline-RL-sim-pytorch/pull/33