facebookresearch / hanabi_SAD

Simplified Action Decoder for Deep Multi-Agent Reinforcement Learning

Which level OBL was uploaded in the March 2021 push? #28

Open hoseasiu opened 3 years ago

hoseasiu commented 3 years ago

Hi Hengyuan,

We've been trying out the OBL model that you uploaded, and it's a very good agent - certainly the most human-like and performant of the learning-based agents I've played with. Two questions came up when we tried it that we were hoping you could clarify.

1) The paper refers to multiple levels of OBL bots, but only one was uploaded, and it wasn't clear from the readme or the bot name which one it was. Which level is it? In our (human) interactions with it, it occasionally played cards without full information, especially when given a hint on a newly drawn card, which seems to indicate a deviation from the optimal grounded policy and suggests to me that it is a higher-level OBL agent?

2) We also noticed that the bot sometimes makes incorrect play attempts on cards with full information, again typically when the cards are newly drawn and hinted at. This seems to be a case where a convention learned at a higher level overrides the optimal grounded policy? Is that consistent with your experience?

Thanks! Hosea

hengyuan-hu commented 3 years ago

The OBL agent available here is a level 4 agent, so it is not the grounded policy.

That is very abnormal. How do you play with the agent? If you are using the UI in the SPARTA repo together with the convert_model.py script here, then the converted model is wrong. The OBL agents use a public-private network, which is different from the model in convert_model.py.

0xJchen commented 3 years ago

Hi Hengyuan, does it make sense to interpret the public-private network as a way to accelerate training and inference? In the LBS paper, the model is described as follows:

To avoid having to re-unroll the policies for the other agents from the beginning of the game for each of the sampled τ, LBS uses a specific RNN architecture.

My understanding is that instead of each agent individually unrolling its priv_s through its own LSTM (like in SAD), agents can now share the encoding of the public observation, and their private observations are simply forwarded through an MLP. I wonder if my understanding is correct.
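Something like this minimal sketch (the class name PublicPrivateNet, the layer sizes, and the elementwise combination are placeholders I made up, not this repo's actual model)?

```python
# Minimal sketch of the public-private idea; all names and sizes are placeholders.
import torch
import torch.nn as nn

class PublicPrivateNet(nn.Module):
    def __init__(self, pub_dim, priv_dim, hid_dim, num_action):
        super().__init__()
        # recurrent branch sees only the public observation
        self.pub_lstm = nn.LSTM(pub_dim, hid_dim, batch_first=True)
        # private observation only passes through a feed-forward branch
        self.priv_mlp = nn.Sequential(
            nn.Linear(priv_dim, hid_dim), nn.ReLU(),
            nn.Linear(hid_dim, hid_dim), nn.ReLU(),
        )
        self.fc_a = nn.Linear(hid_dim, num_action)

    def forward(self, pub_obs, priv_obs):
        # pub_obs: [batch, seq, pub_dim]; priv_obs: [batch, seq, priv_dim]
        pub_h, _ = self.pub_lstm(pub_obs)
        priv_h = self.priv_mlp(priv_obs)
        # merge the two streams and produce per-action values
        return self.fc_a(pub_h * priv_h)
```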

hoseasiu commented 3 years ago

@hengyuan-hu Thanks for the response on the OBL level, that makes sense to me.

@keenlooks did most of the work to make the OBL model work with a slightly modified version of the webapp from the SPARTA repo, but from what I gather, he didn't use the convert_model.py script. It was based on the code from https://github.com/facebookresearch/hanabi_SAD/blob/master/pyhanabi/tools/obl_model.py @keenlooks - any other details there?

keenlooks commented 3 years ago

@hoseasiu that's correct. I created a forward method in the class in obl_model.py with the inputs/outputs SPARTA expected, loaded the weight values from obl.pthw, then exported the class via torch.jit.save.
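Roughly the following flow (the class name OBLModel, the assumption that obl.pthw is a plain state_dict, and the output path are placeholders, not the exact code I used):

```python
# Rough sketch of the conversion: load the released weights into a wrapper
# whose forward() matches the inputs/outputs SPARTA expects, then export it.
import torch
from obl_model import OBLModel  # hypothetical wrapper class in obl_model.py

model = OBLModel()
model.load_state_dict(torch.load("obl.pthw", map_location="cpu"))
model.eval()

scripted = torch.jit.script(model)       # compile to TorchScript
torch.jit.save(scripted, "obl_for_sparta.pt")
```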

hengyuan-hu commented 3 years ago

@keenlooks That sounds right. Have you checked the self-play score of the converted JIT model? @hoseasiu Can you elaborate a bit more on "makes incorrect play attempts on cards with full information, again typically when the cards are newly drawn and hinted at"? If the card is newly drawn, how does the bot know the full information? Have you hinted both color and rank? For this bot, I think if you hint the color of the newly drawn card it will have a very high tendency to play it, a convention learned from the previous-level OBL belief.

hengyuan-hu commented 3 years ago

@PeppaCat If we use a private LSTM, then we not only need to sample a hand (o_private) but also the entire history of my hand (tau_private). Therefore we have to use a network structure where the recurrent part does not depend on tau_private; both the feed-forward and the public-private networks satisfy this requirement.
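As a minimal illustration of why this matters (dimensions and layers below are arbitrary placeholders, not the real model): because the recurrent part consumes only public features, it is unrolled once and its output reused for every sampled hand, and only the cheap feed-forward private branch is re-run.

```python
# The LSTM depends only on public features, so its hidden states are computed
# once; each sampled private hand only needs a feed-forward pass.
import torch
import torch.nn as nn

pub_lstm = nn.LSTM(64, 128, batch_first=True)             # public (recurrent) branch
priv_mlp = nn.Sequential(nn.Linear(32, 128), nn.ReLU())   # private (feed-forward) branch
fc_a = nn.Linear(128, 20)

pub_obs = torch.randn(1, 10, 64)       # one public trajectory, 10 time steps
pub_h, _ = pub_lstm(pub_obs)           # unroll the recurrent part once

for _ in range(5):                     # e.g. 5 hands sampled from the belief
    priv_obs = torch.randn(1, 10, 32)  # a sampled private observation sequence
    priv_h = priv_mlp(priv_obs)        # feed-forward only, cheap to recompute
    q = fc_a(pub_h * priv_h)           # combine streams, no re-unrolling needed
```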

hoseasiu commented 3 years ago

By "newly drawn," I just mean that it's the newest card in the bot's hand. It will have been there for at least long enough for it to receive two hints that give it full information on the card, but in the interim, the bot didn't draw anything new, so it's still the newest card. In the cases we saw, the bot would play that newest card after the second applicable hint, even though when taken together, the revealed information on that card gave it perfect information that the card was in fact not playable. We can post some examples the next time we test.

keenlooks commented 3 years ago

@hengyuan-hu I have not checked the self-play score of the converted JIT model. @hoseasiu do you know if you all have been able to check that yet?