Farama-Foundation / HighwayEnv

A minimalist environment for decision-making in autonomous driving
https://highway-env.farama.org/
MIT License

Convergence issue in a multi-agent environment #472

Open Mingjie-He opened 1 year ago

Mingjie-He commented 1 year ago

Thank you for your excellent work. However, when I train multiple agents in a highway environment, the network doesn't seem to converge. The checkpoint rewards obtained every thousand training episodes do not follow a consistent pattern: sometimes they are high, other times they are low. Here are the specific results:

[screenshot: checkpoint evaluation results]

I want to ask how I can determine whether the baseline DQN network has converged. I look forward to your response. @eleurent

The code I used for the experiment is from the B-GAP article, which primarily relies on the "rl-agents" repository. The environment configuration file used is "env_multi_agent.json", and the agent configuration file is "dqn.json".

[screenshots: the two configuration files]
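For context, a minimal sketch of how an rl-agents run typically loads such configuration files and launches training; the helpers `load_environment` / `load_agent` and the `Evaluation` class are assumed from the public rl-agents layout, and their exact signatures may differ between versions, so treat this as an assumption rather than the exact script used here:

```python
# Sketch only: assumed rl-agents entry points; adjust to the installed version.
from rl_agents.agents.common.factory import load_agent, load_environment
from rl_agents.trainer.evaluation import Evaluation

env = load_environment("env_multi_agent.json")   # multi-agent highway-v0 from the JSON config
agent = load_agent("dqn.json", env)              # DQN agent built from its JSON config

# Train for a few thousand episodes; checkpoints and stats are written to the run directory.
evaluation = Evaluation(env, agent, num_episodes=3000, display_env=False)
evaluation.train()
```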

Mingjie-He commented 1 year ago

In the results table, the columns are: the model, the date of the run, which checkpoint file was evaluated, the average episode duration over 100 evaluation episodes, the average speed, and the average reward.
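A hedged sketch of how per-checkpoint averages like these could be computed by rolling out a policy in the multi-agent environment; `policy` here is a placeholder for the trained agent's action selection, not the actual evaluation script used to produce the table:

```python
import numpy as np

def evaluate(env, policy, episodes=100):
    """Average episode duration, speed, and return over `episodes` rollouts.

    `env` is assumed to be a multi-agent highway-v0 instance (see the config
    posted below); `policy` maps the joint observation to one action per
    controlled vehicle.
    """
    durations, speeds, returns = [], [], []
    for _ in range(episodes):
        obs, info = env.reset()
        terminated = truncated = False
        steps, episode_return = 0, 0.0
        while not (terminated or truncated):
            obs, reward, terminated, truncated, info = env.step(policy(obs))
            episode_return += float(np.mean(reward))
            # Mean speed of the controlled vehicles at this step.
            speeds.append(np.mean([v.speed for v in env.unwrapped.controlled_vehicles]))
            steps += 1
        durations.append(steps)
        returns.append(episode_return)
    return np.mean(durations), np.mean(speeds), np.mean(returns)

# Example with a random policy: evaluate(env, lambda obs: env.action_space.sample())
```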

Mingjie-He commented 1 year ago

I apologize for my rather confusing way of asking the question. I have plotted a line graph of the rewards from three multi-agent DQN experiments in the highway-env environment, and it is evident that they are highly unstable. Additionally, I have made some modifications to the code for better readability. I would greatly appreciate it if you could provide some insights or guidance on this matter. Thank you in advance.

[plot: reward curves from the three multi-agent DQN runs]

env_multi_agent.json:

```json
{
    "id": "highway-v0",
    "import_module": "highway_env",
    "controlled_vehicles": 3,
    "action": {
        "type": "MultiAgentAction",
        "action_config": {
            "type": "DiscreteMetaAction"
        }
    },
    "observation": {
        "type": "MultiAgentObservation",
        "observation_config": {
            "type": "Kinematics",
            "absolute": "True",
            "normalize": "Flase"
        }
    }
}
```

and dqn.json:

```json
{
    "__class__": "<class 'rl_agents.agents.deep_q_network.pytorch.DQNAgent'>",
    "model": {
        "type": "MultiLayerPerceptron",
        "layers": [256, 256]
    },
    "double": false,
    "gamma": 0.8,
    "n_steps": 1,
    "batch_size": 32,
    "memory_capacity": 15000,
    "target_update": 50,
    "exploration": {
        "method": "EpsilonGreedy",
        "tau": 6000,
        "temperature": 1.0,
        "final_temperature": 0.05
    },
    "loss_function": "l2"
}
```
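As a cross-check, the environment half of this setup can also be configured directly in Python. This is only a sketch of the equivalent configuration (not the rl-agents loading path); note that the observation flags are plain booleans here, whereas quoted strings such as "True" or "Flase" are non-empty and therefore truthy, which may not match the intended setting:

```python
import gymnasium as gym
import highway_env  # noqa: F401  (registers highway-v0)

# Equivalent of env_multi_agent.json, passed directly to gymnasium.
config = {
    "controlled_vehicles": 3,
    "action": {
        "type": "MultiAgentAction",
        "action_config": {"type": "DiscreteMetaAction"},
    },
    "observation": {
        "type": "MultiAgentObservation",
        "observation_config": {
            "type": "Kinematics",
            "absolute": True,     # real booleans, not the strings from the JSON above
            "normalize": False,
        },
    },
}

env = gym.make("highway-v0", config=config)
obs, info = env.reset(seed=0)
# With MultiAgentAction, the env expects one discrete action per controlled vehicle.
obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
```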

eleurent commented 1 year ago

Hi,
On the one hand, it is quite usual for an RL policy to exhibit some variance. For instance, the initial states are random, which means that some episodes can be easier than others. On the other hand, we should definitely still expect the mean reward to increase throughout training. The plots you are showing look quite suspicious; they should not be so flat. I trained the multi-agent setting only once as far as I remember, but I definitely had the mean rewards (and the corresponding qualitative behaviour) improving significantly compared to the initial policy.

eleurent commented 1 year ago

I found some data from an old training run from 2020; here are some videos:

https://github.com/Farama-Foundation/HighwayEnv/assets/1706935/3338b498-bc9f-4584-a181-a1f02b7b1b7f

episode 1: attention is uniform, behaviour is random

https://github.com/Farama-Foundation/HighwayEnv/assets/1706935/7abc04a9-02e4-4d2e-9369-970da3a32f8c

episode 512: attention more focused, still collisions

https://github.com/Farama-Foundation/HighwayEnv/assets/1706935/d3a23e57-5ce7-4de1-adbb-9f2c74d1a3d0

episode 2000: safer policy

I also have the training stats; here is the reward averaged over a window of 50 episodes:

[plot: 50-episode moving average of the reward]
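For anyone reproducing this kind of curve, a minimal sketch of a 50-episode moving average over per-episode returns; the `episode_returns` array below is placeholder data, not the actual training log:

```python
import numpy as np
import matplotlib.pyplot as plt

def moving_average(values, window=50):
    """Trailing moving average used to smooth noisy episode returns."""
    values = np.asarray(values, dtype=float)
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode="valid")

# One total reward per training episode (placeholder data for illustration).
episode_returns = np.random.default_rng(0).normal(loc=5.0, scale=2.0, size=3000)

plt.plot(moving_average(episode_returns, window=50))
plt.xlabel("episode")
plt.ylabel("mean reward (50-episode window)")
plt.show()
```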

But the final policy was by no means perfect, it still had some variance and some collisions.

luiswirth commented 2 months ago

I believe there is a regression somewhere in the code of either HighwayEnv or rl-agents that broke the MultiAgent setting for the IntersectionEnv. It just won't converge. Could someone who has more understanding of this project please confirm? Maybe @eleurent ?