Farama-Foundation / Gymnasium-Robotics

A collection of robotics simulation environments for reinforcement learning
https://robotics.farama.org/
MIT License

[Question] Which of the MAMuJoCo environments are even "solvable"? #141

Open jcformanek opened 1 year ago

jcformanek commented 1 year ago

Question

TL;DR: do you have baselines for performance on the environments using some popular MARL algorithm, say MADDPG or another?

Hi there, first of all, thanks for maintaining MAMuJoCo. I have been experimenting with it for a few weeks now but am struggling to "solve" several of the scenarios using MATD3 / MADDPG. I was wondering if you have any baselines for the environments, i.e., have you demonstrated that they can be "solved" using some MARL algorithm? By "solved" I just mean some non-trivial return. In particular, my algorithm quickly learns to get scores of around 800 and 1000 on Ant and HalfCheetah respectively, but failed to break out of that local optimum until I added qvel,qpos to the global_categories. After adding qvel,qpos I now get scores of ~3000 and ~6000 respectively. I originally tried this because I suspected there was some important information missing in the agent observations, after I reduced the problem to a single-agent task on the joint observation and joint action and my TD3 implementation could not solve it.

I am now struggling to "solve" 2-agent Walker and 3-agent Hopper. I tried adding more values to the global_categories (qvel,qpos,cinert,cvel,qfrc_actuator,cfrc_ext), but my algorithm seems stuck around the ~500 and ~200 return marks. Because of my experience with Ant and HalfCheetah, I fear there is some important information omitted from the joint observation, making these tasks impossible to solve.
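For concreteness, this is roughly how I construct the scenarios with the extended global_categories (a minimal sketch; the mamujoco_v0 module path and the parallel_env keyword names are my reading of the docs, so adjust for your installed version):

# Sketch: 2-agent Ant with qpos/qvel added to the globally observable
# categories. Module path and keyword names are assumptions from the docs.
from gymnasium_robotics import mamujoco_v0

env = mamujoco_v0.parallel_env(
    scenario="Ant",
    agent_conf="2x4",
    agent_obsk=1,
    global_categories=("qpos", "qvel"),  # extra globally observable items
)

# Standard PettingZoo parallel API: one observation/action per agent.
observations, infos = env.reset(seed=0)
actions = {agent: env.action_space(agent).sample() for agent in env.agents}
observations, rewards, terminations, truncations, infos = env.step(actions)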

To rule this out and narrow the problem down to a bug in my implementation, I was hoping you had some kind of performance baseline for these environments. I tried to refer to the results reported in other papers using MAMuJoCo, but they all seem to use non-default settings for the scenarios, which in some cases makes the environment no longer a decentralised, partially observed multi-agent environment. For example, this paper gives each of the agents access to the full state of the environment as their observation. I would like to avoid this and only give agents access to their partial local observations. However, I feel that if a single-agent RL algorithm can't solve the tasks on the joint observation, then it's unrealistic to expect a MARL algorithm to succeed. What do you think?

I look forward to hearing from you.

Kallinteris-Andreas commented 1 year ago

Hey, sorry, I am in the process of validating the environments (which is why they are not yet included in a release).

1) Here is a single run of Hopper: [training-curve plot] (for now; I will rename the title at some point). I will publish a report once I have validated all the environments. I have written my own TD3/MATD3 implementation; my code is at https://github.com/Kallinteris-Andreas/ai

2) I have verified that factorization=None works exactly the same as Gymnasium/MuJoCo (I do not have graphs for that; see the sketch after point 4).

3) Based on the way you are asking about the observation categories, there seems to be some confusion about the supported observation types; I will update the documentation to fix that.

4) I have not read the paper you provided, but giving each agent full observability makes factorization pointless (it is like solving the single-agent problem with extra steps).
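A rough sketch of the check in point 2 (the module path, the agent_conf=None convention, and the agent handle are assumptions, not my exact validation code):

# Sketch: with no factorization the wrapper should behave like the
# single-agent Gymnasium/MuJoCo task (one agent owning all joints).
import gymnasium
from gymnasium_robotics import mamujoco_v0

single = gymnasium.make("Hopper-v4")
multi = mamujoco_v0.parallel_env(scenario="Hopper", agent_conf=None)

observations, infos = multi.reset(seed=0)
agent = multi.agents[0]  # the only agent when there is no factorization
# The agent's observation should match the single-agent Gymnasium Hopper observation.
assert multi.observation_space(agent).shape == single.observation_space.shape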

jcformanek commented 1 year ago

Thank you so much for the speedy response. This is helpful and I look forward to the outcomes of your investigation.

Just to be clear: in the plot above, the Hopper environment was configured with agent_conf=3x1 and agent_obsk=1, and you added no global_categories?

If you do not mind, I would also appreciate it if you could clarify what global_categories and local_categories are.

Kallinteris-Andreas commented 1 year ago

1) The default global_categories for Hopper 3x1 are ("qpos", "qvel") (note this was not the case in the original MaMuJoCo, or in any other port as far as I can tell), which makes more sense since all the benchmarks use it.

2) Yes, I used agent_conf=3x1 and agent_obsk=1, but with the default ("qpos", "qvel") global_categories. My configuration was (see the sketch after point 3 for how the TD3 keys are typically consumed):

domain:
  name: Hopper
  factorization: 3x1 # agent factorization used, check MaMuJoCo Doc for more info
  obsk: 1 # check MaMuJoCo Doc for more info
  total_timesteps: 2_000_000 # how many learn steps the agent should take
  #episodes: 1000
  algo: TD3 # Valid values: 'DDPG', 'TD3', 'MADDPG'
  init_learn_timestep: 25001 # at which timestep should the agent start learning
  #learning_starts_ep: 10 # Start Learning at episode X, before that fill the ERB with random actions
  evaluation_frequency: 5000 # how often the agent should be evaluated
  runs: 10 # number of statistical runs
  seed: 64 # seeds the environment
DDPG:
  gamma: 0.99 # Reward Discount rate
  tau: 0.01 # Target Network Update rate
  N: 100 # Experience Replay Buffer's mini-batch size
  experience_replay_buffer_size: 1000000
  sigma: 0.1 # standard deviation of the action process for exploration
  optimizer_gamma: 0.001 # the learning rate of the optimizers
  mu_bias: True # Bias for the actor module
  q_bias: True # Bias for the critic module
TD3:
  gamma: 0.99 # Reward Discount rate
  tau: 0.005 # Target Network Update rate
  N: 256 # Experience Replay Buffer's mini-batch size
  experience_replay_buffer_size: 1000000
  sigma_policy: 0.2 # Standard deviation of the action process for policy update
  sigma_explore: 0.1 # Standard deviation of the action process for exploration
  optimizer_gamma: 0.001 # The learning rate of the optimizers
  noise_policy_clip: 0.5 # Clamping for the target noise
  d: 2 # Policy Update Frequency
  mu_bias: True # Bias for the actor module
  q_bias: True # Bias for the critics module

3) WIP https://github.com/Farama-Foundation/Gymnasium-Robotics/pull/142
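As a rough illustration of how the TD3 keys in the config above are consumed (generic TD3 à la Fujimoto et al. 2018, not the exact code in my repo; the config path and tensor arguments are placeholders):

# Generic TD3 target computation driven by the config keys above;
# a sketch, not the repo's actual implementation.
import yaml
import torch

with open("config.yaml") as f:  # placeholder path
    td3 = yaml.safe_load(f)["TD3"]

def td3_target(actor_t, critic1_t, critic2_t, next_obs, reward, done):
    a_next = actor_t(next_obs)
    # Target policy smoothing: clipped Gaussian noise on the target action.
    noise = (torch.randn_like(a_next) * td3["sigma_policy"]).clamp(
        -td3["noise_policy_clip"], td3["noise_policy_clip"])
    a_next = (a_next + noise).clamp(-1.0, 1.0)
    # Clipped double-Q: take the minimum of the two target critics.
    q_next = torch.min(critic1_t(next_obs, a_next), critic2_t(next_obs, a_next))
    return reward + td3["gamma"] * (1.0 - done) * q_next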

jcformanek commented 1 year ago

I noticed in your MATD3 implementation that you use the environment state in the critic instead of the joint observation. Do you think the environments should be solvable given the joint observation but not the environment state? Or is it by design that an algorithm should incorporate the environment state information in order to succeed? One problem I have with the latter is that it limits the degree to which one could use these environments for evaluating independent learners, because there will always be information missing for them.
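To make sure we mean the same thing, here is roughly the distinction I am drawing between the two critic inputs (a sketch; the module path and whether the wrapper exposes an env.state() method are assumptions on my part):

import numpy as np
from gymnasium_robotics import mamujoco_v0  # module path is an assumption

env = mamujoco_v0.parallel_env(scenario="Ant", agent_conf="2x4", agent_obsk=1)
observations, infos = env.reset(seed=0)
actions = {a: env.action_space(a).sample() for a in env.agents}

# Joint observation: concatenate what the agents actually observe.
joint_obs = np.concatenate([observations[a] for a in env.agents])
joint_act = np.concatenate([actions[a] for a in env.agents])
critic_input_joint = np.concatenate([joint_obs, joint_act])

# Environment state: the underlying single-agent MuJoCo state
# (assuming the wrapper exposes it via a state() method, as PettingZoo allows).
# critic_input_state = np.concatenate([env.state(), joint_act])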

I would be really interested to know what you think.

Kallinteris-Andreas commented 1 year ago

1) In my case (with the default environment arguments), the agents' joint observation == the globally observable space.

2) There is no conceptual problem with the critic incorporating more information than the actor (centralized training, decentralized execution, CTDE); in fact, this is how cooperative MA-DDPG and MA-TD3 are supposed to work.

3) I suspect independent learners (i.e. with each critic only receiving local observations and actions) would not work on these environments, since they are a multi-agent factorization of an existing problem (the agents' actions are very interlinked).

4) If you are using independent agents, then your algorithm is not MA-TD3 but I-TD3.

5) For some observation-category configurations, I suspect that the joint observation (even if it is a subset of the total observable state) will be able to solve the problem; keep in mind that this is sometimes done in single-agent environments such as MuJoCo Ant.

6) Can you review this (part of the updated docs)?

"""
local_categories: The categories of local observations for each observation depth. It takes the form of a list where the k-th element is the list of items observable at the k-th depth. For example: if it is set to [["qpos", "qvel"], ["qvel"]], then each agent observes its own position and velocity elements and its neighbors' velocity elements. The default is: check each environment's page, "observation space" section.

global_categories: The categories of observations extracted from the globally observable space. For example: if it is set to ("qpos"), then out of the globally observable items of the environment, only the position items will be observed. The default is: check each environment's page, "observation space" section.
"""
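To make the snippet in point 6 concrete, a sketch of a configuration matching the example values (the module path is an assumption; the keyword names come from the snippet itself):

from gymnasium_robotics import mamujoco_v0

# Each agent observes its own qpos+qvel elements (depth 0), its neighbors'
# qvel elements (depth 1), and only position items from the global space.
env = mamujoco_v0.parallel_env(
    scenario="Hopper",
    agent_conf="3x1",
    agent_obsk=1,
    local_categories=[["qpos", "qvel"], ["qvel"]],
    global_categories=("qpos",),
)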

Thanks!

jcformanek commented 1 year ago

Thanks for the detailed response. I think your first point speaks to what I wanted to verify, namely that the intended design is that all the relevant information in the environment's state (i.e. the underlying single-agent MuJoCo state) is also contained in the joint observation of the MAMuJoCo environment (possibly duplicated a few times across the agents' observations).

Kallinteris-Andreas commented 1 year ago

@jcformanek if you want to try decentralized training methods, I would recommend starting with HalfCheetah, since it does not terminate (and therefore you do not have to assign blame for causing a terminal state).
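A quick sanity check of that property (a sketch; the module path and the 6x1 factorization are just for illustration):

from gymnasium_robotics import mamujoco_v0

# HalfCheetah episodes only end by truncation (time limit), never by
# termination, so there is no terminal state to assign blame for.
env = mamujoco_v0.parallel_env(scenario="HalfCheetah", agent_conf="6x1", agent_obsk=1)
observations, infos = env.reset(seed=0)
while env.agents:  # the agent list empties once the episode ends
    actions = {a: env.action_space(a).sample() for a in env.agents}
    observations, rewards, terminations, truncations, infos = env.step(actions)
    assert not any(terminations.values())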