202219807 / 700099_MSC_22_039

Design neural network #6

Open 202219807 opened 1 year ago

202219807 commented 1 year ago

Algorithms:

Independent Q-Learning: Each agent runs standard Q-learning on its own, treating the other agents as part of the environment. It ignores the impact of its actions on the other agents and, because the others keep learning, each agent effectively faces a non-stationary environment, which can lead to non-cooperative behavior and suboptimal outcomes.

"Multi-agent reinforcement learning: Independent vs. cooperative agents" by M. Tan (1993).
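A minimal sketch of the independent Q-learning loop described above, assuming a toy two-agent environment with an illustrative interface `reset() -> (obs_a, obs_b)` and `step(act_a, act_b) -> ((obs_a, obs_b), (rew_a, rew_b), done)`; the environment, sizes, and hyperparameters are placeholders, not part of this issue.

```python
import numpy as np

N_STATES, N_ACTIONS = 10, 4
ALPHA, GAMMA, EPS = 0.1, 0.95, 0.1

def epsilon_greedy(q_row, rng):
    # Explore with probability EPS, otherwise exploit the agent's own Q-table.
    if rng.random() < EPS:
        return int(rng.integers(N_ACTIONS))
    return int(np.argmax(q_row))

def train_independent(env, episodes=1000, seed=0):
    rng = np.random.default_rng(seed)
    # Each agent keeps its own Q-table; the other agent is just "environment".
    q = [np.zeros((N_STATES, N_ACTIONS)) for _ in range(2)]
    for _ in range(episodes):
        obs, done = env.reset(), False
        while not done:
            acts = [epsilon_greedy(q[i][obs[i]], rng) for i in range(2)]
            next_obs, rews, done = env.step(*acts)
            for i in range(2):
                # Standard single-agent TD(0) update, ignoring the other agent's action.
                target = rews[i] + GAMMA * (0.0 if done else np.max(q[i][next_obs[i]]))
                q[i][obs[i], acts[i]] += ALPHA * (target - q[i][obs[i], acts[i]])
            obs = next_obs
    return q
```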

Q-Learning with Joint Actions: Agents learn joint action-value functions, taking into account the joint actions of all agents. This approach considers the interdependencies between agents, allowing for coordinated behavior and better performance.

"Learning to coordinate behaviors" by M. L. Littman (1994).

Coordinated Q-Learning: Agents maintain separate action-value functions, but they communicate and coordinate their actions to achieve better overall performance. This algorithm incorporates limited forms of communication or coordination between agents.

"Markov games as a framework for multi-agent reinforcement learning" by M. L. Littman (1994).

Q-Learning with Centralized Training and Decentralized Execution (Q-Learning CTDE): Agents learn individually, but during training, they have access to additional information, such as the states and actions of other agents. During execution, each agent acts based on its own learned policy, without requiring communication or coordination.

"Value decomposition networks for cooperative multi-agent learning" by A. Foerster et al. (2018).

Counterfactual Multi-Agent Policy Gradient (COMA): COMA is a policy gradient-based algorithm that explicitly accounts for the impact of an agent's actions on other agents' rewards. It addresses the credit assignment problem in MARL by using a counterfactual baseline to estimate the effect of an agent's action on the collective reward.

"Counterfactual multi-agent policy gradients" by J. Foerster et al. (2017).

Multi-Agent Deep Deterministic Policy Gradient (MADDPG): MADDPG extends the DDPG algorithm to MARL. It utilizes actor-critic networks and experience replay to learn individual policies for each agent. MADDPG introduces a centralized critic that incorporates the actions and observations of all agents to estimate the value function.

"Multi-agent actor-critic for mixed cooperative-competitive environments" by R. Lowe et al. (2017).

Multi-Agent Proximal Policy Optimization (MAPPO): MAPPO is a multi-agent extension of the Proximal Policy Optimization (PPO) algorithm. It uses centralized value functions for training and decentralized policies for execution. MAPPO addresses exploration, credit assignment, and coordination in MARL.

"Multi-agent actor-critic for mixed cooperative-competitive environments" by R. Lowe et al. (2017).

Hierarchical Cooperative Multi-Agent Deep Reinforcement Learning (H-MAC): H-MAC combines hierarchical reinforcement learning with multi-agent cooperation. It introduces a hierarchy of policies to learn high-level coordination and low-level individual agent behaviors simultaneously.

"Hierarchical deep multi-agent reinforcement learning" by W. Wang et al. (2018).

202219807 commented 11 months ago

Multi-Agent Reinforcement Learning: Explore more complex multi-agent scenarios, such as larger team sizes or cooperative tasks that involve multiple agents working together towards a common goal. Investigate the use of communication and coordination strategies to improve team performance.

Hierarchical Reinforcement Learning: Study the application of hierarchical RL approaches to football scenarios, where agents learn to operate at different levels of abstraction, such as low-level ball control and high-level strategic decision-making.

Transfer Learning: Investigate how agents trained in simplified scenarios or on smaller teams can transfer their learned policies to more complex environments, such as full 11-vs-11 games.

Heuristic Learning in Multi-Agent Systems: Heuristic learning involves using expert knowledge or rules to guide the learning process of agents.

Evolutionary Algorithms for Multi-Agent Systems: Evolutionary algorithms can be used to evolve agent policies or strategies over generations.
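A tiny sketch of evolving flat policy parameter vectors over generations, assuming a fitness function `evaluate(params) -> float` (e.g. average team reward over a few matches); the function name, sizes, and the simple elitist scheme are placeholders for illustration.

```python
import numpy as np

PARAM_DIM, POP_SIZE, N_ELITE, SIGMA, GENERATIONS = 100, 20, 5, 0.1, 50

def evolve(evaluate, seed=0):
    rng = np.random.default_rng(seed)
    population = rng.normal(size=(POP_SIZE, PARAM_DIM))
    for _ in range(GENERATIONS):
        fitness = np.array([evaluate(p) for p in population])
        elite = population[np.argsort(fitness)[-N_ELITE:]]        # keep the best policies
        # Next generation: Gaussian perturbations of randomly chosen elites.
        parents = elite[rng.integers(N_ELITE, size=POP_SIZE)]
        population = parents + SIGMA * rng.normal(size=(POP_SIZE, PARAM_DIM))
        population[:N_ELITE] = elite                              # elitism
    final_fitness = np.array([evaluate(p) for p in population])
    return population[np.argmax(final_fitness)]
```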

Multi-Agent Policy Gradient: Policy gradient methods are used to optimize the policies of multiple agents in a cooperative or competitive setting.

Communication and Coordination Learning: Learning to communicate and coordinate actions is crucial in multi-agent systems, especially when agents need to work together.

Benchmarking with Human Players: Conduct experiments where RL agents are trained and evaluated against human players to better understand the capabilities and limitations of RL algorithms in competitive settings.