IntelLabs / coach

Reinforcement Learning Coach by Intel AI Lab enables easy experimentation with state of the art Reinforcement Learning algorithms
https://intellabs.github.io/coach/
Apache License 2.0

The reward function in carla_environment.py #278

Closed fangchuan closed 5 years ago

fangchuan commented 5 years ago

Hi, I have recently been working on my graduation project in CARLA, and I noticed that the reward function for CARLA in Coach is completely different from the formula introduced in "CARLA: An Open Urban Driving Simulator". In the implementation of carla_environment.py, the reward is calculated this way:

`self.reward = speed_reward ...`

Honestly, I have trained my agent based on the reward formula from CARLA's paper, and it seemed to need many episodes to reach good performance; sometimes it did not even converge, although I used a similar network with the DDPG algorithm. Could you explain why you chose this reward formula? I would really appreciate it. @galnov @galleibo-intel @shadiendrawis @itaicaspi
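For reference, the paper's reward I used is roughly the following sketch (the measurement names are my own, and the coefficients are the ones I took from the paper, so please treat them as approximate):

```python
def carla_paper_reward(prev, cur):
    """Per-step reward in the spirit of the CARLA paper (distance in km, speed in km/h).
    `prev` and `cur` are dicts of measurements from two consecutive steps."""
    return (1000.0 * (prev['distance_to_goal'] - cur['distance_to_goal'])      # progress toward the goal
            + 0.05 * (cur['forward_speed'] - prev['forward_speed'])            # speed gain
            - 0.00002 * (cur['collision_damage'] - prev['collision_damage'])   # new collision damage
            - 2.0 * (cur['intersection_sidewalk'] - prev['intersection_sidewalk'])
            - 2.0 * (cur['intersection_otherlane'] - prev['intersection_otherlane']))
```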

fangchuan commented 5 years ago

[screenshot: 2019-04-02 15-36-10]

galnov commented 5 years ago

As the agent in the CARLA preset in Coach does not have a destination goal (unlike in the CARLA paper), we built it so that it learns to drive for as long as possible. The reward was defined so that the agent is encouraged to drive as fast as it can without colliding, crossing into other lanes, or going off road. To stabilize the driving, we also discourage unnecessary steering, hence the negative impact of steering on the reward.
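In rough pseudocode, the shaping has this form (illustrative names and coefficients, not the exact expression in carla_environment.py):

```python
def coach_style_reward(forward_speed, intersection_otherlane, intersection_offroad,
                       collision_intensity, steer):
    speed_reward = min(forward_speed, 30.0)        # reward driving fast, up to a cap
    lane_penalty = 5.0 * intersection_otherlane    # fraction of the car over other lanes
    offroad_penalty = 5.0 * intersection_offroad   # fraction of the car off the road
    collision_penalty = 100.0 if collision_intensity > 0 else 0.0
    steer_penalty = 10.0 * abs(steer)              # discourage unnecessary steering
    return speed_reward - lane_penalty - offroad_penalty - collision_penalty - steer_penalty
```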

fangchuan commented 5 years ago

@galnov, I'm grateful for your reply. Yes, I have tried the CARLA_DDPG preset, and it took about 1M steps for the agent to converge. Then I wanted to adapt it to a task with a fixed destination, so I revised carla_environment.py in several places:

- the task is a curved trajectory between start_position and end_position in Town01
- observation_space = Tuple([image_space, measurement_space])
- measurement_space = [higher_command, forward_speed, distance_to_goal, is_collision]
- distance_reward = previous_distance_to_goal - current_distance_to_goal, clipped to [-10, 10] (as sketched below)
- reward = distance_reward + speed_reward - ...
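A minimal sketch of the distance term I added (variable names as in my changes):

```python
import numpy as np

def distance_reward(previous_distance_to_goal, current_distance_to_goal):
    delta = previous_distance_to_goal - current_distance_to_goal  # > 0 when moving toward the goal
    return float(np.clip(delta, -10.0, 10.0))
```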

I also modified the InputEmbedderLayer used in CARLA_DDPG, as in the actor network architecture shown in the attached image; the critic network uses the same InputEmbedderLayer. Now my problem is that the agent does not seem to converge, and it has learned something unexpected: after about 800,000 training steps it still has not learned how to turn right. Could you help me figure out what is wrong with my approach? Please, I'm almost going crazy...

fangchuan commented 5 years ago

[image: oneline_actor_network]

fangchuan commented 5 years ago

Could it be caused by the choice of measurement data? I mean, higher_command, forward_speed, and distance_to_goal alone may not be enough for a function approximator (DNN) to produce a reference trajectory. Also, higher_command does not fit the MDP assumption well. How should I organize the measurement data? Should I add the agent's current location to the measurements?

galnov commented 5 years ago

I suggest you take a look at the Conditional Imitation Learning agent for an example of how to train an agent with high-level commands. It implements this paper.
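The key idea there is to handle the high-level command by branching rather than feeding it in as just another measurement: the network has one action head per command, and the command only selects which head is used. A rough sketch of that selection step (names and sizes are illustrative, not Coach's actual API):

```python
import numpy as np

NUM_COMMANDS = 4   # e.g. follow-lane, turn-left, turn-right, go-straight
ACTION_DIM = 2     # e.g. steering, throttle

def select_branch(branch_outputs, command):
    """branch_outputs: (NUM_COMMANDS, ACTION_DIM) array of per-branch actions;
    command: integer index of the current high-level command."""
    mask = np.zeros((NUM_COMMANDS, 1))
    mask[command] = 1.0
    return (branch_outputs * mask).sum(axis=0)   # only the selected branch contributes
```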

fangchuan commented 5 years ago

Well, I really appreciate your suggestion. @galnov