Replicable-MARL / MARLlib

One repository is all that is necessary for Multi-agent Reinforcement Learning (MARL)
https://marllib.readthedocs.io
MIT License

Problem switching from a discrete to a continuous action space in a custom environment #217

Open shengqie opened 9 months ago

shengqie commented 9 months ago
Hello developers, I am trying to customize the air-combat environment (aircombat) that ships with MARLlib, but I have run into some problems during training after the customization. Specifically:

First, I reduced the 2v2 scenario defined by the environment to a competitive 1v1 multi-agent air-combat scenario. Training it with IPPO produced reasonable results. Then I changed the MultiDiscrete action space defined by the environment into a continuous action space, as follows:

    self.action_space = spaces.Box(low=-10., high=10., shape=(4,))

However, after defining the action space as continuous, the algorithms I used in MARLlib (IPPO, MADDPG, MAPPO, and others) trained extremely poorly: they could not produce effective policies, and their reward curves showed no upward trend and did not converge.

I have completed a similar implementation before with MAPPO and other algorithms, so I do not think the environment is the reason the algorithms fail. I would like to ask whether you have any insight, and whether MARLlib has particular coding conventions for continuous action spaces that I am not aware of. Thank you for your answer.
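For reference, a minimal sketch of this kind of action-space change in a gym-style environment; the class name, observation shape, and placeholder dynamics are illustrative only, not the actual aircombat code:

```python
import numpy as np
from gym import spaces


class ContinuousActionEnvSketch:
    """Illustrative sketch: an env whose MultiDiscrete action space
    has been replaced by a 4-dimensional continuous Box, as above."""

    def __init__(self):
        # Previously something like spaces.MultiDiscrete([...]);
        # now a continuous action in [-10, 10] per dimension.
        self.action_space = spaces.Box(low=-10., high=10., shape=(4,), dtype=np.float32)
        self.observation_space = spaces.Box(low=-np.inf, high=np.inf, shape=(12,), dtype=np.float32)

    def reset(self):
        return np.zeros(self.observation_space.shape, dtype=np.float32)

    def step(self, action):
        # Guard against out-of-range samples before using the action.
        action = np.clip(action, self.action_space.low, self.action_space.high)
        obs = np.zeros(self.observation_space.shape, dtype=np.float32)
        reward, done, info = 0.0, False, {}
        return obs, reward, done, info
```

Whether the training framework squashes, clips, or passes raw samples for Box actions is a framework-level convention, so the clipping here is only a defensive guard, independent of how MARLlib handles Box actions.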

shengqie commented 9 months ago

One more thing to add: after I converted the action space to a continuous action space, the algorithm ran into an error after a few thousand iterations:

```
Failure # 1 (occurred at 2024-01-26_02-47-08)
Traceback (most recent call last):
  File "/home/user/miniconda3/envs/marllib/lib/python3.8/site-packages/ray/tune/trial_runner.py", line 890, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/user/miniconda3/envs/marllib/lib/python3.8/site-packages/ray/tune/ray_trial_executor.py", line 788, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/user/miniconda3/envs/marllib/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/user/miniconda3/envs/marllib/lib/python3.8/site-packages/ray/worker.py", line 1625, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ValueError): ray::IPPOTrainer.train_buffered() (pid=1138520, ip=10.31.22.121, repr=IPPOTrainer)
  File "/home/user/miniconda3/envs/marllib/lib/python3.8/site-packages/ray/rllib/agents/ppo/ppo_torch_policy.py", line 46, in ppo_surrogate_loss
    curr_action_dist = dist_class(logits, model)
  File "/home/user/miniconda3/envs/marllib/lib/python3.8/site-packages/ray/rllib/models/torch/torch_action_dist.py", line 186, in __init__
    self.dist = torch.distributions.normal.Normal(mean, torch.exp(log_std))
  File "/home/user/miniconda3/envs/marllib/lib/python3.8/site-packages/torch/distributions/normal.py", line 50, in __init__
    super(Normal, self).__init__(batch_shape, validate_args=validate_args)
  File "/home/user/miniconda3/envs/marllib/lib/python3.8/site-packages/torch/distributions/distribution.py", line 53, in __init__
    raise ValueError("The parameter {} has invalid values".format(param))
ValueError: The parameter loc has invalid values
```
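For context on the traceback: the ValueError is raised by torch.distributions.Normal when its mean (loc) contains non-finite values, i.e. the policy network has started emitting NaN/Inf. A minimal, MARLlib-independent reproduction (assuming only PyTorch is installed):

```python
import torch

# Reproduce the failure mode from the traceback: Normal rejects a
# non-finite mean when argument validation is enabled.
mean = torch.tensor([float("nan"), 0.0])   # e.g. a policy head that produced NaN
log_std = torch.tensor([0.0, 0.0])

try:
    torch.distributions.normal.Normal(mean, torch.exp(log_std), validate_args=True)
except ValueError as err:
    print(err)  # reports invalid values for the `loc` parameter
```

The exact wording of the message differs across PyTorch versions, but it is raised whenever the mean tensor contains NaN or Inf.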