Theohhhu / UPDeT

Official Implementation of 'UPDeT: Universal Multi-agent Reinforcement Learning via Policy Decoupling with Transformers', ICLR 2021 (Spotlight)
MIT License

How to reproduce the transfer learning #15

Closed hellofinch closed 1 year ago

hellofinch commented 2 years ago

Hello, I'm interested in your work and want to reproduce the transfer learning result. As you mentioned, the model can be deployed to other scenarios without changing its architecture, and a figure is given for this. I want to reproduce that figure.

# --- Defaults ---

# --- pymarl options ---
runner: "episode" # Runs 1 env for an episode
mac: "basic_mac" # Basic controller
env: "sc2" # Environment name
env_args: {} # Arguments for the environment
batch_size_run: 1 # Number of environments to run in parallel
test_nepisode: 20 # Number of episodes to test for
test_interval: 2000 # Test after {} timesteps have passed
test_greedy: True # Use greedy evaluation (if False, will set epsilon floor to 0)
log_interval: 2000 # Log summary of stats after every {} timesteps
runner_log_interval: 2000 # Log runner stats (not test stats) every {} timesteps
learner_log_interval: 2000 # Log training stats every {} timesteps
t_max: 10000 # Stop running after this many timesteps
use_cuda: True # Use gpu by default unless it isn't available
buffer_cpu_only: True # If true we won't keep all of the replay buffer in vram

# --- Logging options ---
use_tensorboard: True # Log results to tensorboard
save_model: True # Save the models to disk
save_model_interval: 2000000 # Save models after this many timesteps
checkpoint_path: "" # Load a checkpoint from this path
evaluate: False # Evaluate model for test_nepisode episodes and quit (no training)
load_step: 0 # Load model trained on this many timesteps (0 loads the latest checkpoint)
save_replay: False # Saving the replay of the model loaded from checkpoint_path
local_results_path: "results" # Path for local results

# --- RL hyperparameters ---
gamma: 0.99
batch_size: 32 # Number of episodes to train on
buffer_size: 32 # Size of the replay buffer
lr: 0.0005 # Learning rate for agents
critic_lr: 0.0005 # Learning rate for critics
optim_alpha: 0.99 # RMSProp alpha
optim_eps: 0.00001 # RMSProp epsilon
grad_norm_clip: 10 # Reduce magnitude of gradients above this L2 norm

# --- Agent parameters. Should be set manually. ---
agent: "updet" # Options [updet, transformer_aggregation, rnn]
rnn_hidden_dim: 64 # Size of hidden state for default rnn agent
obs_agent_id: False # Include the agent's one_hot id in the observation
obs_last_action: False # Include the agent's last action (one_hot) in the observation

# --- Transformer parameters. Should be set manually. ---
token_dim: 5 # Marines. For other unit types (e.g. Zealot) this number can be different (6).
emb: 32 # embedding dimension of transformer
heads: 3 # head number of transformer
depth: 2 # block number of transformer
ally_num: 8 # number of allies (8 for the 8m map)
enemy_num: 8 # number of enemies (8 for the 8m map)

# --- Experiment running params ---
repeat_id: 1
label: "default_label"

This is the config I used to train on 8m; for transfer I change ally_num and enemy_num to 5. Should I also change checkpoint_path? Does the figure you gave show the win rate during training? How can I get the same one?

Theohhhu commented 2 years ago

Yes. Use the same checkpoint_path as the 8m model. There is a similar issue you can refer to: #4

Any further concerns are welcome.
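For reference, a minimal sketch (not from the thread; values illustrative) of the default.yaml keys that change when fine-tuning the 8m checkpoint on a 5m map. The placeholder path stands in for the actual 8m run directory, and the target map itself is assumed to be selected separately (e.g. via env_args.map_name in the SC2 env config):

# --- Transfer sketch: keys changed relative to the 8m training config ---
checkpoint_path: "results/models/<your-8m-run-directory>" # placeholder for the saved 8m model directory
load_step: 0 # 0 loads the latest saved checkpoint
ally_num: 5 # number of allies on the target map
enemy_num: 5 # number of enemies on the target map
token_dim: 5 # unchanged for Marine-only maps; other unit types may differ

Everything else can stay as in the 8m run, since the transformer weights do not depend on the number of entities; that is the point of the paper's transfer setting.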

hellofinch commented 2 years ago

[image] I get results like this. Is there something wrong? Here is my default.yaml when training on 5m.

# --- Defaults ---

# --- pymarl options ---
runner: "episode" # Runs 1 env for an episode
mac: "basic_mac" # Basic controller
env: "sc2" # Environment name
env_args: {} # Arguments for the environment
batch_size_run: 1 # Number of environments to run in parallel
test_nepisode: 20 # Number of episodes to test for
test_interval: 2000 # Test after {} timesteps have passed
test_greedy: True # Use greedy evaluation (if False, will set epsilon floor to 0)
log_interval: 2000 # Log summary of stats after every {} timesteps
runner_log_interval: 2000 # Log runner stats (not test stats) every {} timesteps
learner_log_interval: 2000 # Log training stats every {} timesteps
t_max: 10000 # Stop running after this many timesteps
use_cuda: True # Use gpu by default unless it isn't available
buffer_cpu_only: True # If true we won't keep all of the replay buffer in vram

# --- Logging options ---
use_tensorboard: True # Log results to tensorboard
save_model: True # Save the models to disk
save_model_interval: 2000000 # Save models after this many timesteps
checkpoint_path: "results/models/vdn-updet-8m-32-dim-3-heads-2-depth-seed-580488741" # Load a checkpoint from this path
evaluate: False # Evaluate model for test_nepisode episodes and quit (no training)
load_step: 1 # Load model trained on this many timesteps (0 loads the latest checkpoint)
save_replay: False # Saving the replay of the model loaded from checkpoint_path
local_results_path: "results" # Path for local results

# --- RL hyperparameters ---
gamma: 0.99
batch_size: 32 # Number of episodes to train on
buffer_size: 32 # Size of the replay buffer
lr: 0.0005 # Learning rate for agents
critic_lr: 0.0005 # Learning rate for critics
optim_alpha: 0.99 # RMSProp alpha
optim_eps: 0.00001 # RMSProp epsilon
grad_norm_clip: 10 # Reduce magnitude of gradients above this L2 norm

# --- Agent parameters. Should be set manually. ---
agent: "updet" # Options [updet, transformer_aggregation, rnn]
rnn_hidden_dim: 64 # Size of hidden state for default rnn agent
obs_agent_id: False # Include the agent's one_hot id in the observation
obs_last_action: False # Include the agent's last action (one_hot) in the observation

# --- Transformer parameters. Should be set manually. ---
token_dim: 5 # Marines. For other unit types (e.g. Zealot) this number can be different (6).
emb: 32 # embedding dimension of transformer
heads: 3 # head number of transformer
depth: 2 # block number of transformer
ally_num: 5 # number of allies (5 for the 5m map)
enemy_num: 5 # number of enemies (5 for the 5m map)

# --- Experiment running params ---
repeat_id: 1
label: "default_label"

Here are the files in that path. [image]

Theohhhu commented 1 year ago

Please make sure you use the original SMAC source package without any modification and strictly follow the PyMARL installation guide. Also, the performance metric should be test_battle_win_rate, not battle_win_rate. The latter can be lower because roughly 10 percent of actions are random (epsilon-greedy exploration) during the training rollouts.
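If it helps, here is a minimal evaluation-only sketch (using only keys already shown in the configs above; values illustrative) for measuring the greedy test win rate of a loaded checkpoint without any further training:

# --- Evaluation-only sketch (illustrative values) ---
evaluate: True # run test_nepisode episodes and quit (no training)
checkpoint_path: "results/models/<your-run-directory>" # placeholder for the saved model directory
load_step: 0 # 0 loads the latest saved checkpoint
test_nepisode: 32 # more test episodes give a smoother win-rate estimate
test_greedy: True # greedy actions, so the result reflects the test win rate rather than the noisy rollout rate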

hellofinch commented 1 year ago

Thanks!