decisionforce / HACO

[ICLR 2022] Official implementation of paper: Efficient Learning of Safe Driving Policy via Human-AI Copilot Optimization

Got bad performance when reproducing HACO #6

Open dq0803 opened 1 year ago

dq0803 commented 1 year ago

Hi, I just attempted to reproduce HACO with the keyboard by running `train_haco_keyboard_easy.py`, but encountered unsatisfactory training performance.

At the early stage, I could see the model improving with the help of human interventions. After around 20~40 iterations, the car had learned some driving skills and occasionally managed to reach the destination, albeit with uneven performance. However, after a few more iterations, strange things occurred. The car failed to start normally and would brake suddenly while driving. It seems the model forgot the skills it previously learned, and its performance worsened.

Could you please explain the reasons behind this issue? Is it related to improper timing of human intervention, an excessive focus on exploration, or some other factor?

The screenshot below shows the evaluation results from running `eval_haco.py` with `EPISODE_NUM_PER_CKPT = 2`.

[Screenshot: eval_res — evaluation results]
pengzhenghao commented 1 year ago

> Hi, I just attempted to reproduce HACO with the keyboard by running `train_haco_keyboard_easy.py`, but encountered unsatisfactory training performance.

Thank you for running our code!

> At the early stage, I could see the model improving with the help of human interventions. After around 20~40 iterations, the car had learned some driving skills and occasionally managed to reach the destination, albeit with uneven performance.

This is expected.

> However, after a few more iterations, strange things occurred. The car failed to start normally and would brake suddenly while driving. It seems the model forgot the skills it previously learned, and its performance worsened.

This is also expected and observed during our experiments.

> Could you please explain the reasons behind this issue? Is it related to improper timing of human intervention, an excessive focus on exploration, or some other factor?

Sure! First, I am really happy that you reproduced our experiments (even the strange behavior).

At the beginning of training, we sometimes take full control for 1 or 2 episodes in order to fill the human buffer with more useful data. Then we enter human-AI shared control and intervene when something goes wrong.
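For concreteness, here is a minimal sketch of that warm-up-then-shared-control loop. All names (`env`, `policy`, `human.wants_control`, `human.action`, `buffer`) are hypothetical placeholders, not the actual HACO/MetaDrive API:

```python
def run_shared_control(env, policy, human, buffer, num_episodes, warmup_episodes=2):
    """Hypothetical sketch of the warm-up + shared-control loop described above.

    `human.wants_control()` and `human.action()` are placeholder hooks for the
    keyboard interface; none of these names come from the actual HACO code.
    """
    for episode in range(num_episodes):
        obs, done = env.reset(), False
        while not done:
            agent_action = policy(obs)
            # Warm-up: the human drives alone to seed the buffer with useful data.
            if episode < warmup_episodes or human.wants_control():
                action, intervened = human.action(), True
            else:
                action, intervened = agent_action, False
            next_obs, reward, done, info = env.step(action)
            # Store both the agent's proposal and the executed action so the
            # learner can contrast them whenever an intervention occurred.
            buffer.add(obs, agent_action, action, intervened, next_obs, done)
            obs = next_obs
```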

In our experiments we are very conservative and keep the speed around 15~20 km/h (accordingly, the throttle we apply during intervention is also low). We never brake during intervention, because we found that otherwise the policy quickly converges to emergency stopping and never moves again. And, as you have already seen, after training for a long period the policy collapses.
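If you want to enforce that conservative style mechanically, one option is to clamp the human action before it is executed. This is only an illustration; the action layout (steering, throttle/brake in [-1, 1], with negative values meaning brake) is an assumption about a MetaDrive-style continuous action space, not verified against the repo:

```python
MAX_THROTTLE = 0.3  # illustrative cap, roughly matching 15~20 km/h cruising

def clamp_intervention(action):
    """Cap throttle and strip braking from a (steering, throttle_brake) action.

    Assumes throttle_brake < 0 means braking; that layout is an assumption,
    not taken from the HACO code.
    """
    steering, throttle_brake = action
    throttle_brake = min(max(throttle_brake, 0.0), MAX_THROTTLE)  # no brake, low throttle
    return steering, throttle_brake
```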

We have some hypotheses behind this:

  1. The Q values might become too large as an outcome of the CQL loss.
  2. Acceleration demonstrations occupy only a very small portion of the human buffer, so those samples are rarely drawn and therefore hard to learn from (a possible mitigation is sketched below).
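Regarding hypothesis 2, one conceivable mitigation (not something shipped in the repo) is to oversample the rare acceleration demonstrations when drawing training batches. The buffer layout below, with `human_action[1]` as throttle, is purely hypothetical:

```python
import numpy as np

def sample_batch(buffer, batch_size, accel_boost=5.0):
    """Hypothetical weighted sampling that over-represents acceleration demos.

    Assumes each transition stores the executed human action with throttle at
    index 1; this layout is illustrative, not HACO's actual replay buffer.
    """
    throttle = np.array([t.human_action[1] for t in buffer])
    weights = 1.0 + accel_boost * (throttle > 0.1)  # boost throttle-heavy samples
    probs = weights / weights.sum()
    idx = np.random.choice(len(buffer), size=batch_size, p=probs)
    return [buffer[i] for i in idx]
```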
xiaozhao12345 commented 8 months ago

Hello, how can I solve the problem of sudden emergency braking after several iterations?

pengzhenghao commented 7 months ago

> Hello, how can I solve the problem of sudden emergency braking after several iterations?

That's a good question! We observed that too!

The answer is:

  1. Do not brake at all as the human demonstrator.
  2. Keep a slow and almost constant speed.
  3. Try our new algorithm, PVP: https://github.com/metadriverse/pvp