PaddlePaddle / MetaGym

Collection of Reinforcement Learning / Meta Reinforcement Learning Environments.
Apache License 2.0

In the Velocity Control task, is the next expected velocity in the global or body coordinate frame? #30

Closed: Harryoung closed this issue 4 years ago

Harryoung commented 4 years ago

As the README says, the next expected velocity is in the global coordinate frame.

For the "velocity_control" task, it lists:

next_target_g_v_x: next expected velocity of the drone in x direction, in global coordinate.
next_target_g_v_y: next expected velocity of the drone in y direction, in global coordinate.
next_target_g_v_z: next expected velocity of the drone in z direction, in global coordinate.

But in the code, it seems to be in the body coordinate frame:
https://github.com/PaddlePaddle/RLSchool/blob/d4fdae4c875638a0ca7018b70fd1ce0ac35938ca/rlschool/quadrotor/env.py#L262-L267
The self.velocity_targets used above is in the body coordinate frame; see:
https://github.com/PaddlePaddle/RLSchool/blob/d4fdae4c875638a0ca7018b70fd1ce0ac35938ca/rlschool/quadrotor/quadrotorsim.py#L320
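To make the frame difference concrete, here is a minimal numpy sketch (my own illustration, not the simulator's code): a drone yawed by 90 degrees while flying along global +x has its velocity along body -y, so reporting a body-frame vector under a next_target_g_v_* name is misleading.

```python
import numpy as np

# Illustration only (not the simulator's code): for a pure 90-degree yaw,
# the body->global rotation is a rotation about the z axis.
yaw = np.pi / 2
R = np.array([[np.cos(yaw), -np.sin(yaw), 0.0],
              [np.sin(yaw),  np.cos(yaw), 0.0],
              [0.0,          0.0,         1.0]])   # body frame -> global frame

v_global = np.array([1.0, 0.0, 0.0])   # flying along global +x
v_body = R.T @ v_global                # global -> body (R.T inverts a rotation)
print(v_global, v_body)                # [1 0 0] vs approximately [0 -1 0]
```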

xueeinstein commented 4 years ago

@Harryoung Thanks for your feedback! The expected velocity definitely should be the global velocity; otherwise, its direction depends on the flight pose and can be ambiguous. The renderer also visualizes the global velocity, see https://github.com/PaddlePaddle/RLSchool/blob/d4fdae4c875638a0ca7018b70fd1ce0ac35938ca/rlschool/quadrotor/render.py#L552-L557

However, the code snippet you referred to does have errors; I made a hotfix in PR-31. Please consider applying this patch until it is merged and released.

Harryoung commented 4 years ago

@xueeinstein Yes! Your fix is right. I debugged for a whole day and found a few other things that could also be wrong!

  1. The next_target_g_v is actually in the body coordinate frame (as reported in this issue).
  2. The velocity_lst generated in define_velocity_control_task is computed under a condition where the initial velocity, position, acceleration and so on are all set to 0, except for acc_z because of gravity. https://github.com/PaddlePaddle/RLSchool/blob/d4fdae4c875638a0ca7018b70fd1ce0ac35938ca/rlschool/quadrotor/quadrotorsim.py#L306-L310 But when we train the velocity control task and call env.reset(), the initial values are set with noise configured in config.json. https://github.com/PaddlePaddle/RLSchool/blob/d4fdae4c875638a0ca7018b70fd1ce0ac35938ca/rlschool/quadrotor/quadrotorsim.py#L239-L254 This makes the target velocities actually wrong even when applying exactly the same actions that were sampled in define_velocity_control_task. One might say that the neural network can absorb this little "error", but I would say that this error accumulates and will make the network very hard to converge.
  3. According to the code, the reward for the velocity control task consists of three parts: energy cost, healthy reward and velocity difference. https://github.com/PaddlePaddle/RLSchool/blob/d4fdae4c875638a0ca7018b70fd1ce0ac35938ca/rlschool/quadrotor/env.py#L210-L216 Here healthy_reward is set to 1, and the energy cost ranges from 0.002 to 42.53 (computed by setting all four voltages to min_voltage and max_voltage, respectively). The problem is that the velocity-difference cost is relatively small and is usually swamped by the other two terms, so the network will probably learn to apply low voltages to save energy and ignore the target velocity entirely. My idea is to change the factor 0.001 to a bigger one, such as 0.1 (see the rough scale check after this list). My test is running now.
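Just to make the scale argument in point 3 concrete, here is a back-of-the-envelope sketch (my own numbers, not the exact formula in env.py; I am assuming the velocity-difference term enters roughly linearly with the 0.001 factor):

```python
# Back-of-the-envelope scale check, using the numbers quoted above.
healthy_reward = 1.0
energy_cost_min, energy_cost_max = 0.002, 42.53  # all motors at min / max voltage

velocity_error = 5.0                  # hypothetical velocity difference
old_term = 0.001 * velocity_error     # 0.005: dwarfed by the other two terms
new_term = 0.1 * velocity_error       # 0.5: comparable to the healthy reward

print(old_term, new_term, healthy_reward, energy_cost_max)
```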
xueeinstein commented 4 years ago

@Harryoung

> One might say that the neural network can absorb this little "error", but I would say that this error accumulates and will make the network very hard to converge.

I don't agree that the error is accumulative. Actually, if you have trained the hovering control task well, you will find that no matter the random initial velocity, the drone quickly adapts to a velocity close to zero. It means that under this dynamics setting the motors control the velocity sensitively. The target velocity sequence is sampled by a random policy, so it is not likely to be very large at the beginning (think of a random walk). Intuitively, recovering the target velocity is still within the capability of the motors.
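As a toy illustration of the random-walk intuition (this is not how the simulator actually samples targets, which come from a random policy run through the dynamics):

```python
import numpy as np

# Toy illustration only: with small zero-mean velocity increments, the expected
# magnitude after t steps grows like sqrt(t), so the earliest targets stay
# close to the (near-zero) initial velocity.
rng = np.random.default_rng(0)
increments = rng.normal(0.0, 0.05, size=(1000, 3))
targets = np.cumsum(increments, axis=0)               # random-walk velocity targets
print(np.linalg.norm(targets[:5], axis=1).round(3))   # early targets are small
print(round(float(np.linalg.norm(targets[-1])), 3))   # later ones drift larger
```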

> The problem is that the velocity-difference cost is relatively small and is usually swamped by the other two terms, so the network will probably learn to apply low voltages to save energy and ignore the target velocity entirely. My idea is to change the factor 0.001 to a bigger one, such as 0.1.

You are probably right! For some reason, we only tested the velocity control task long ago, before v0.3.0 and v0.3.1, which introduced the noisy initial state and replaced the global velocity with the body velocity in the observation. Thus, the coefficient 0.001 may not be appropriate anymore. We may add comprehensive baseline tests for the noisy startup state later on.

Extra tip: currently, your RL agent controls the drone given its own body velocity and the target global velocity of the next timestamp, so it may be hard for it to figure out the relation between the reward signal and such an observation setting. Fortunately, the pose of the drone is also given, so if you add a preprocessing step to convert the body velocity to the global velocity, it should be much easier to converge.
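A minimal sketch of such a preprocessing step (the observation key names such as b_v_x and the Z-Y-X Euler convention are my assumptions here, so please check them against the actual environment):

```python
import numpy as np

def body_to_global_velocity(v_body, roll, pitch, yaw):
    # Body -> global rotation, assuming the Z-Y-X (yaw-pitch-roll) Euler convention.
    cr, sr = np.cos(roll), np.sin(roll)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cy, sy = np.cos(yaw), np.sin(yaw)
    R = np.array([
        [cy * cp, cy * sp * sr - sy * cr, cy * sp * cr + sy * sr],
        [sy * cp, sy * sp * sr + cy * cr, sy * sp * cr - cy * sr],
        [-sp,     cp * sr,                cp * cr],
    ])
    return R @ np.asarray(v_body)

def preprocess(obs):
    # Hypothetical observation wrapper: replace the body velocity with the
    # global velocity before feeding the observation to the agent.
    # Key names (b_v_*, roll/pitch/yaw) are illustrative assumptions.
    v_g = body_to_global_velocity(
        [obs["b_v_x"], obs["b_v_y"], obs["b_v_z"]],
        obs["roll"], obs["pitch"], obs["yaw"])
    obs = dict(obs)
    obs["g_v_x"], obs["g_v_y"], obs["g_v_z"] = v_g
    return obs
```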

Harryoung commented 4 years ago

@xueeinstein About the 2nd point, I think you are right ^_^! If the initial state were exactly the same between target generation and training, the task would become less meaningful. About the extra tip, I have implemented it, and I really hope it converges more easily, as you said.