facebookresearch / hanabi_SAD

Simplified Action Decoder for Deep Multi-Agent Reinforcement Learning

Could you clarify the input and output dimensions of the architectures? #11

Closed hnekoeiq closed 4 years ago

hnekoeiq commented 4 years ago

Hi, I am trying to figure out the observation space and action space for the IQL and SAD agents. In the original Hanabi environment with 2 players, the vectorized observation has a dimension of 658 and the action space has 20 valid actions in total. I have a few questions I would be grateful if you could clarify.

1. I noticed that the IQL observation space has 783 dimensions. This makes sense if you have added the agent's own hand, which is 125 bits, although I do not understand why this is needed when IQL agents have no auxiliary task. I am also not sure where you added it in the original vector (before the opponent's observation or after it).
2. The observation has 838 bits for the SAD agent, which again makes sense if you have added the extra greedy action (55 bits). But did you insert it after the exploratory action or concatenate it at the end of the vector?
3. In both cases, the action space has a dimension of 21 compared to the original 20. Could you clarify what the extra element is and how it was added?

Thanks in advance for your response.

hengyuan-hu commented 4 years ago

1) Yes, we reserve the first 125 bits for encoding the agent's own hand, but this is disabled now, so the first 125 bits are all 0. This was for a legacy experiment where we used a centralized value function.

2) It is appended to the end of the vector.

3) There is one extra action for "pass", used when it is not the bot's turn to act.

You can refer to cpp/hanabi_env.h for the details.
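To summarize how the numbers above add up, here is a rough sketch of the 2-player input layout. The slice sizes come from this discussion; the exact field order and encoding live in cpp/hanabi_env.h, so treat this as an illustrative breakdown rather than the actual code:

```python
# Assumed layout of the 2-player input vector, per the discussion above.
OWN_HAND = 125       # reserved for encoding own hand; currently all zeros
BASE_OBS = 658       # original Hanabi vectorized observation
GREEDY_ACTION = 55   # greedy-action block appended for SAD

IQL_INPUT = OWN_HAND + BASE_OBS                  # 783
SAD_INPUT = OWN_HAND + BASE_OBS + GREEDY_ACTION  # 838

NUM_ACTIONS = 20 + 1  # 20 legal Hanabi moves + 1 "pass" when it is not the bot's turn

assert IQL_INPUT == 783 and SAD_INPUT == 838 and NUM_ACTIONS == 21
```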

hnekoeiq commented 4 years ago

Thank you!

hnekoeiq commented 3 years ago

Hi @hengyuan-hu, I had a related question regarding evaluating SAD agents. Based on the SAD paper, the agents don't use the extra greedy action during execution. Could you help me figure out how to disable it during evaluation?

Ideally, I want to be able to evaluate one SAD agent against a non-SAD agent. Is there a simple way to reach this goal without changing the observation vector of the non-SAD agent?

hengyuan-hu commented 3 years ago

As mentioned in the SAD paper,

Since we set ε to 0 at test time we can simply use the, now greedy, environment action obtained from the observation function as our greedy-action input.

At test time the SAD agent uses the same input size, and the extra greedy-action block is just a duplicate of the executed action.

As mentioned in our previous discussion, the extra-action section is appended to the end of the input, so the easiest way to evaluate a SAD agent against a non-SAD agent in the current version of the code is to use the SAD featurization and feed the sliced non-SAD version to the non-SAD bot. You can also modify the code to produce both the SAD and non-SAD versions of the input.

i.e. add another feature here: https://github.com/facebookresearch/hanabi_SAD/blob/54a8d34f6ab192898121f8d3935339e63f1f4b35/cpp/hanabi_env.cc#L198
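For the first option, a minimal Python sketch of the slicing idea (the function name, tensor shapes, and the assumption that the non-SAD bot expects the 783-dim IQL featurization from this repo are illustrative assumptions, not part of the codebase):

```python
import torch

SAD_DIM = 838        # full SAD featurization
GREEDY_ACTION = 55   # trailing greedy-action block, appended last per the discussion above

def split_sad_obs(obs: torch.Tensor):
    """Given a SAD observation of shape [..., 838], return the full SAD input
    unchanged and a sliced non-SAD input with the trailing greedy-action block
    removed (shape [..., 783])."""
    assert obs.shape[-1] == SAD_DIM
    sad_input = obs
    non_sad_input = obs[..., : SAD_DIM - GREEDY_ACTION]
    return sad_input, non_sad_input
```

Because the greedy-action block sits at the very end, slicing it off leaves the rest of the vector untouched, so the non-SAD bot sees exactly the observation it was trained on.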