Improbable-AI / walk-these-ways

Sim-to-real RL training and deployment tools for the Unitree Go1 robot.
https://gmargo11.github.io/walk-these-ways/

Reward function mismatch between paper and code #8

Closed anahrendra closed 1 year ago

anahrendra commented 1 year ago

Hi! I just noticed that there is a difference in the reward function between the paper and the code in this repo. In the paper, it is written as follows: [screenshot of the contact-tracking reward terms from the paper]

However, the code implementation is:

```python
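# Penalize contact forces on feet that should be in swing (desired_contact near 0)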
def _reward_tracking_contacts_shaped_force(self):
    foot_forces = torch.norm(self.env.contact_forces[:, self.env.feet_indices, :], dim=-1)
    desired_contact = self.env.desired_contact_states

    reward = 0
    for i in range(4):
        reward += - (1 - desired_contact[:, i]) * (
                    1 - torch.exp(-1 * foot_forces[:, i] ** 2 / self.env.cfg.rewards.gait_force_sigma))
    return reward / 4

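# Penalize foot velocities on feet that should be in stance (desired_contact near 1)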
def _reward_tracking_contacts_shaped_vel(self):
    foot_velocities = torch.norm(self.env.foot_velocities, dim=2).view(self.env.num_envs, -1)
    desired_contact = self.env.desired_contact_states
    reward = 0
    for i in range(4):
        reward += - (desired_contact[:, i] * (
                    1 - torch.exp(-1 * foot_velocities[:, i] ** 2 / self.env.cfg.rewards.gait_vel_sigma)))
    return reward / 4

```

The `1 - torch.exp()` part makes it different from the one in the paper. Could you tell me which one is correct?

Thanks!

gmargo11 commented 1 year ago

Hi @anahrendra, thanks for this note. You're right that the implementation and the paper differ slightly; I didn't catch this before! The code is correct; the paper contains a mistake here.

I think the effect of this should be a constant offset to the reward, since these two reward terms are added together at every timestep and $[(1.0 - C^{\text{cmd}}_{\text{foot}}) + C^{\text{cmd}}_{\text{foot}}] = 1.0$. So I would expect the impact of this change on the final policy to be minor. But do let me know if you see otherwise (there are a lot of operations being composed on this term).
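To make the algebra explicit (a sketch based on the code above, writing $C_i$ for `desired_contact[:, i]`, $f_i$ and $v_i$ for the per-foot force and velocity norms, and $\sigma_f$, $\sigma_v$ for `gait_force_sigma` and `gait_vel_sigma`):

$$-(1 - C_i)\left(1 - e^{-f_i^2/\sigma_f}\right) - C_i\left(1 - e^{-v_i^2/\sigma_v}\right) = (1 - C_i)\,e^{-f_i^2/\sigma_f} + C_i\,e^{-v_i^2/\sigma_v} - 1$$

That is, per foot and per timestep, the implemented penalties equal the paper-style exponential terms minus a constant 1, which should not change which behavior the reward prefers.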

-Gabe

anahrendra commented 1 year ago

Hi! Thanks for your implementation. That makes sense.

However, I tried running training with your code (with nothing changed at all), but I could not reproduce the results of your pretrained weights. A brief summary is as follows:

  1. At 4,000 iterations, the robot can still walk, but it cannot track the commanded gaits well.
  2. At 45,000 iterations, it completely fails to even hold its position.

Could you by any chance train directly with your GitHub code to verify whether something is missing? And how many iterations are required to produce a good result?

Thanks in advance for your help!

gmargo11 commented 1 year ago

@anahrendra , I did some local testing and I suspect this is because of the larger gravity range in this repo's default config: https://github.com/Improbable-AI/walk-these-ways/blob/master/scripts/train.py#L49

In the paper, we used a max gravity randomization of 1.0, but I seem to have provided the config with a more challenging max gravity randomization of 2.0 here. Training with this range might be unstable. Sorry about that!

Please try changing line 49 of train.py to:

```python
Cfg.domain_rand.gravity_range = [-1.0, 1.0]
```

And let me know if that fixes the issue. On my machine, it converges after around 10k iterations.

By the way, the friction range in this codebase is also a bit wider than in the paper, and the max footswing height is higher. See Table 6 in the appendix (https://arxiv.org/pdf/2212.03238.pdf) for the values used in the paper. I may update the repo shortly to restore all the original parameters.

P.S. To debug with faster convergence and verify that everything else is working properly, you can try turning off gravity randomization entirely (`Cfg.domain_rand.gravity_range = [-0.0, 0.0]`).
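For reference, here is a minimal sketch of the relevant lines in scripts/train.py (only `Cfg.domain_rand.gravity_range` is confirmed in this thread; both values are the ones mentioned above):

```python
# scripts/train.py, line 49: gravity domain randomization range
Cfg.domain_rand.gravity_range = [-1.0, 1.0]   # paper setting: max gravity randomization 1.0

# Debug-only alternative: turn off gravity randomization entirely
# Cfg.domain_rand.gravity_range = [-0.0, 0.0]
```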

anahrendra commented 1 year ago

Hi!

Thanks a lot for your detailed support. I will try your fix as soon as possible and let you know about the results.

Have a nice weekend!

gmargo11 commented 1 year ago

I have updated the default parameters in train.py (728058d24303cd209e5d82d8d2c0c240a40d2841).