Farama-Foundation / Gymnasium

An API standard for single-agent reinforcement learning environments, with popular reference environments and related utilities (formerly Gym)
https://gymnasium.farama.org
MIT License
7.34k stars 820 forks source link

[Bug Report] `MuJoCo.Ant` contact forces being off by default is based on a wrong experiment #214

Closed Kallinteris-Andreas closed 1 year ago

Kallinteris-Andreas commented 1 year ago

Describe the bug

The problem

due to https://github.com/openai/gym/pull/2762#issuecomment-1135362092 it was decided that use_contact_forces would default to False, but The 2 different problem factorizations, used DIFFERENT REWARD FUNCTIONS

As you can see here, the reward functions are indeed different: https://github.com/rodrigodelazcano/gym/blob/9c9741498dd0b613fb2d418f17d77ab5f6e60476/gym/envs/mujoco/ant_v4.py#L264

This behavior (of differing rewards functions) is also not documented at all (i can make a PR for that)

@rodrigodelazcano

Code at that commit: (it is same as the current code, as far we are concerned, with our current problem) https://github.com/rodrigodelazcano/gym/blob/9c9741498dd0b613fb2d418f17d77ab5f6e60476/gym/envs/mujoco/ant_v4.py

Code example

No response

System info

No response

Additional context

No response

Checklist

pseudo-rnd-thoughts commented 1 year ago

To confirm, the issue is that the reward function is dependant on the use_contact_forces which is not documented

Kallinteris-Andreas commented 1 year ago

@pseudo-rnd-thoughts There are 2 issues 1) The undocumented reward functions differences with use_contact_forces 2) The fact that use_contact_forces defaults to =False based on a wrong experiment (this should be investigated for a potential MuJoCo-v5 or Brax based environments)

pseudo-rnd-thoughts commented 1 year ago

Ok, could you make a PR to address 1 Could you explain why it was based on the wrong experiment? I wasn't involved and don't know anything about the experiments. What did they do? What is the issue with that? What do you propose the solution should be? What does your experiment show?

Kallinteris-Andreas commented 1 year ago

Ok so what happened (from what I understand it, I would like @rodrigodelazcano to comment here)

The problem of this, is that unlike other MuJoCo environments the Ant is significantly different between v2/v3 and v4. (I have not looked into if this has concluded in wrong conclusions in research paper)

The immediate step would be to run an ablation study with the same reward function with/without mujoco.cfrc_ext. I do not have computing power to spare to run this experiment anytime soon.

I do not know what the best solution to this problem is for MuJoCo environments. A possible solution would be to make ant-v5 (that would have the same reward function as ant-v2/v3)

At the very least, we should know if external forces matter for Ant environments for Gymnasium.Brax

rodrigodelazcano commented 1 year ago

@Kallinteris-Andreas yes, you are right. We missed taking into account the fact that adding/removing contact_cost to the reward function would affect the return differently in the long run. The initial justification of having a performance degradation due to contact forces in the observations is wrong, sorry about that. I'll run this weekend the suggested experiments to make sure that's the case.

The past versions of the environment (v2/v3) are being kept for reproducibility of past research. We haven't modified anything and they can still be used with mujoco_py and older versions of mujoco.

As you've mentioned v4 environments upgrade to use the latest mujoco bindings instead of mujoco-py since it's no longer maintained. In the process we also decided to fix the external contact forces issue you've mentioned that appeared with later mujoco versions mujoco>=2.0. Thus the reward function of v4 is no different from v2/v3 if external contact forces are included in observation and reward, with use_contact_forces.

However, because we observed successful learning without contact forces in https://github.com/openai/gym/pull/2762#issuecomment-1135362092, we decided to make the use of external forces (observation/reward) optional. Having said this I don't think there is a need for a v5 version other than your documentation updates mentioning that using contact forces will affect the reward function, which is highly important and I thank you for finding this out. What do you think @pseudo-rnd-thoughts?

pseudo-rnd-thoughts commented 1 year ago

Thanks both of you, my proposed answer would be to add a note on the ant documentation to note this difference between v2/3 and v4 (with default parameters) and using use_contact_forces for equivalence between mujoco-py and mujoco bindings. Additionally, once the experiments are completed, then we need to add a comment to the original PR which we can link in the documentation. This should avoid the need for a v5 environment but if users read the documentation they can understand the difference between versions, etc

Kallinteris-Andreas commented 1 year ago

Surprisingly including the contact forces to the observation space, does not appear to have any significant impact to the average performance but including the contact forces significantly reduces the training variance.

I think the default should to observe the contact forces, as stable training performance is important for evaluating training algorithms.

Note: also the worldbody contact forces were asserted to be 0 through the entire process

rodrigodelazcano commented 1 year ago

@Kallinteris-Andreas Thanks for taking the time to run the experiment. Sounds good setting the default to use contact forces, I approve. Can you make this change to the v4 environment in Gymnasium.

The removal of the worldbody elements can be done in v5.

Kallinteris-Andreas commented 1 year ago

I do not think we should change the defaults on v4 since users would use and get a different environment when they do:

import gymnasium
import gymnasium_robotics
env = gymnasium.make('Ant-v4')

depending on the gymnasium and gymnasium_robotics versions

also keep in mind that in v4 changing use_conctact_focres would also change the rewards

It would create unnecessary confusion