Closed Kallinteris-Andreas closed 1 year ago
To confirm, the issue is that the reward function depends on `use_contact_forces`, which is not documented.
@pseudo-rnd-thoughts
There are 2 issues:
1. The undocumented reward function differences with `use_contact_forces`
2. The fact that `use_contact_forces` defaults to `False` based on a flawed experiment (this should be investigated for potential MuJoCo-v5 or Brax-based environments)
Ok, could you make a PR to address 1? Could you explain why it was based on a flawed experiment? I wasn't involved and don't know anything about the experiments. What did they do? What is the issue with that? What do you propose the solution should be? What does your experiment show?
Ok, so what happened (from what I understand; I would like @rodrigodelazcano to comment here):
- With the `mujoco-py` bindings, the external forces (`mujoco.cfrc_ext`) were not working, which resulted in all of those observations being 0 and `contact_cost == 0`.
- With the new `mujoco` bindings, an option (`use_contact_forces`) was added to disable the external forces (`mujoco.cfrc_ext`), which alters both the observations and the reward function (by disabling `contact_cost`).
- An experiment concluded that disabling `mujoco.cfrc_ext` (`use_contact_forces=False`) would provide better results (here they made the crucial mistake of comparing 2 implementations with 2 different reward functions, because they forgot that `use_contact_forces` alters the reward function): https://github.com/openai/gym/pull/2762#issuecomment-1135362092
- Disabling `mujoco.cfrc_ext` provided significantly bigger returns (which makes sense, since it disables the biggest cost in the reward).

The problem with this is that, unlike the other MuJoCo environments, Ant is significantly different between v2/v3 and v4. (I have not looked into whether this has led to wrong conclusions in research papers.)
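To make the reward-function difference concrete, here is a minimal sketch of how toggling the flag changes the reward itself. This is an illustration, not the actual Gymnasium source; the weight (`5e-4`) and clipping range (`(-1, 1)`) reflect the Ant v3/v4 defaults as I understand them, and the function names are hypothetical:

```python
import numpy as np

def contact_cost(cfrc_ext, weight=5e-4, force_range=(-1.0, 1.0)):
    # Squared L2 norm of the clipped external contact forces
    # (mujoco.cfrc_ext), scaled by the contact cost weight.
    clipped = np.clip(cfrc_ext, *force_range)
    return weight * float(np.sum(np.square(clipped)))

def ant_reward(forward_reward, healthy_reward, ctrl_cost,
               cfrc_ext, use_contact_forces):
    reward = forward_reward + healthy_reward - ctrl_cost
    if use_contact_forces:
        # Only when use_contact_forces=True is contact_cost subtracted,
        # so the flag changes the reward function, not just the observation.
        reward -= contact_cost(cfrc_ext)
    return reward

# For the same trajectory, the use_contact_forces=False variant always
# reports a return at least as large, which is why comparing the two
# settings by raw return is misleading.
r_off = ant_reward(1.0, 1.0, 0.1, np.ones(84), use_contact_forces=False)
r_on = ant_reward(1.0, 1.0, 0.1, np.ones(84), use_contact_forces=True)
```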
The immediate step would be to run an ablation study with the same reward function, with and without `mujoco.cfrc_ext`.
I do not have computing power to spare to run this experiment anytime soon.
I do not know what the best solution to this problem is for MuJoCo environments.
A possible solution would be to make `Ant-v5` (which would have the same reward function as `Ant-v2`/`v3`).
At the very least, we should know whether external forces matter for the Ant environments in Gymnasium and Brax.
@Kallinteris-Andreas yes, you are right. We missed taking into account the fact that adding/removing `contact_cost` from the reward function would affect the return differently in the long run. The initial justification of a performance degradation due to contact forces in the observations is wrong, sorry about that. I'll run the suggested experiments this weekend to make sure that's the case.
The past versions of the environment (v2/v3) are being kept for reproducibility of past research. We haven't modified anything, and they can still be used with `mujoco_py` and older versions of mujoco.
As you've mentioned, the v4 environments upgrade to the latest `mujoco` bindings instead of `mujoco-py`, since the latter is no longer maintained. In the process we also decided to fix the external contact forces issue you've mentioned, which appeared with later mujoco versions (`mujoco>=2.0`). Thus the reward function of v4 is no different from v2/v3 if external contact forces are included in the observation and reward, via `use_contact_forces`.
However, because we observed successful learning without contact forces in https://github.com/openai/gym/pull/2762#issuecomment-1135362092, we decided to make the use of external forces (in the observation and reward) optional. Having said this, I don't think there is a need for a v5 version beyond your documentation updates mentioning that using contact forces will affect the reward function, which is highly important; thank you for finding this out. What do you think @pseudo-rnd-thoughts?
Thanks, both of you. My proposed answer would be to add a note to the Ant documentation describing this difference between v2/v3 and v4 (with default parameters), and explaining the use of `use_contact_forces` for equivalence between the mujoco-py and mujoco bindings.
Additionally, once the experiments are completed, we need to add a comment to the original PR, which we can link in the documentation. This should avoid the need for a v5 environment, and if users read the documentation they can understand the difference between versions.
Surprisingly, including the contact forces in the observation space does not appear to have any significant impact on average performance, but it does significantly reduce training variance.
I think the default should be to observe the contact forces, as stable training performance is important for evaluating training algorithms.
Note: the `worldbody` contact forces were also asserted to be 0 throughout the entire process.
@Kallinteris-Andreas Thanks for taking the time to run the experiment. Setting the default to use contact forces sounds good, I approve. Can you make this change to the v4 environment in Gymnasium?
The removal of the `worldbody` elements can be done in v5.
I do not think we should change the defaults on v4, since users would get a different environment when they do:

```python
import gymnasium
import gymnasium_robotics

env = gymnasium.make('Ant-v4')
```

depending on their `gymnasium` and `gymnasium_robotics` versions.
Also keep in mind that in v4, changing `use_contact_forces` would also change the rewards.
It would create unnecessary confusion.
Describe the bug
The problem:
Due to https://github.com/openai/gym/pull/2762#issuecomment-1135362092 it was decided that `use_contact_forces` would default to `False`, but the 2 different problem factorizations used DIFFERENT REWARD FUNCTIONS.
As you can see here, the reward functions are indeed different: https://github.com/rodrigodelazcano/gym/blob/9c9741498dd0b613fb2d418f17d77ab5f6e60476/gym/envs/mujoco/ant_v4.py#L264
This behavior (of differing reward functions) is also not documented at all (I can make a PR for that).
@rodrigodelazcano
Code at that commit (it is the same as the current code, as far as our current problem is concerned): https://github.com/rodrigodelazcano/gym/blob/9c9741498dd0b613fb2d418f17d77ab5f6e60476/gym/envs/mujoco/ant_v4.py