datamllab / rlcard

Reinforcement Learning / AI Bots in Card (Poker) Games - Blackjack, Leduc, Texas, DouDizhu, Mahjong, UNO.
http://www.rlcard.org
MIT License

legal_action filtering in predict() of DQN #215

Open zedwei opened 3 years ago

zedwei commented 3 years ago

First of all, thanks a lot for open-sourcing this excellent framework. It helped me a lot to learn RL and develop the AI strategy for my own card game.

During debugging, I noticed that the current DQN agent doesn't apply legal_action filtering in its predict() function. In my opinion, this leads to two "potential" problems -

  1. When the legal_action set is a small subset of the overall action space (usually the case for complicated card games like DouDizhu or Tractor), it is very likely that the argmax here does not fall on any of the legal actions. In that case, the sampling distribution in the step() function (this line) degenerates into a uniform distribution (random sampling).
  2. The effective epsilon (exploration) factor will be much smaller than the assigned value. E.g. assume there are 100 actions overall, 10 of them are legal in the current state, and the assigned epsilon is 0.1. Right now the code works like below -
    • First, epsilon is distributed over ALL 100 actions (0.001 each), and the best action is assigned 0.901 (0.001 + 0.9).
    • Second, the illegal actions are removed, leaving 9 actions with probability 0.001 and 1 (best) action with probability 0.901.
    • Last, the probabilities are normalized, which ends up with 9 actions at ~0.0011 and 1 (best) action at ~0.990. But IMO it should be 9 actions at 0.01 and 1 (best) action at 0.91 instead.

I think this comes down to whether legal_action filtering should be applied before or after best-action selection. I'm not sure if there is any theory to support either way, but in my own card game, applying the filtering BEFORE best-action selection works much better and leads to quicker/more meaningful learning. A rough sketch of what I mean is below.
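A rough sketch of what I have in mind (not the actual RLCard code; q_values, legal_actions, and epsilon are placeholder names for the network output over the full action space, the list of valid action indices, and the exploration rate):

    import numpy as np

    def predict_masked(q_values, legal_actions, epsilon):
        # Mask illegal actions before the argmax so the greedy action is always legal.
        masked_q = np.full(q_values.shape, -np.inf)
        masked_q[legal_actions] = q_values[legal_actions]
        best_action = int(np.argmax(masked_q))

        # Spread epsilon over the legal actions only, so the effective
        # exploration rate matches the assigned epsilon.
        probs = np.zeros(len(q_values))
        probs[legal_actions] = epsilon / len(legal_actions)
        probs[best_action] += 1.0 - epsilon
        return np.random.choice(len(q_values), p=probs)

With 100 actions, 10 of them legal, and epsilon = 0.1, this gives each non-best legal action probability 0.01 and the best action 0.91, matching the numbers above.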

daochenzha commented 3 years ago

@zedwei Thanks a lot! It is a very good point that we didn't notice before. I agree with you that it is better to remove illegal actions before the argmax. We will do an internal comparison between removing illegal actions before and after the argmax and let you know.

BTW, you mentioned you observe better performance on your own card game when applying the filtering before best-action selection. May I know what game you are working on and how large the action space is? In addition, please keep in mind that we welcome contributions of new game implementations to enrich the toolkit if you are interested :)

zedwei commented 3 years ago

I'm working on the Tractor (拖拉机, also known as 80分 or 升级) card game :). The action space is a bit tricky because Tractor allows 甩牌 (a "throw" of several card combinations at once) and 贴牌 (discarding arbitrary cards when you cannot follow suit), especially the latter, which can be any combination of cards and would lead to a huge action space. At the current stage, I temporarily ignore 甩牌 and reduce 贴牌 to a single "pass" action that follows some rules. With this trick, the action space is around 200 at the moment, but legal_action is usually much smaller than that, especially when you're NOT the first player in the round (i.e. <20). I guess it's a similar case in DouDizhu?

I initially tried NFSP, and it didn't work well (before fixing the issue I mentioned in this thread). Then I debugged into it, simplified the training to plain DQN, and found this potential issue. After changing the logic, the training became much more efficient (at least it can now beat the naive rule-based agent I developed). I guess the agent was simply "exploring" randomly before the fix and seldom took the best action. Now I'm going back to NFSP, and may consider trying DeepCFR as well.

Yes, I'm definitely happy to contribute once I clean up/refactor my code a bit (if no one is already working on the Tractor game).

daochenzha commented 3 years ago

@zedwei Tractor should be very similar to DouDizhu (large action space, but only a small part of it is legal). I don't think anyone is working on Tractor, so contributions are greatly welcome!

We have been working on DouDizhu for a while, and we find that making use of action features can efficiently handle the large action space. This feature will be supported in our new version. Hopefully, it will help in DouDizhu as well.

We will move the legal-action filtering into the predict function in the new version. Thanks again for pointing this out!

zedwei commented 3 years ago

This is great to know! Although my trained agent can now beat the rule-based agent, it's still far from competing with humans. E.g. it was able to learn to play "A"s, pairs, and tractors first, but it couldn't (yet) learn to 吊主牌 (lead a trump) to pass the turn to its partner when it is itself out of good moves. It couldn't learn to exhaust one suit first so it can "kill" others' hands when they play that suit, etc. These are "implicit" strategies that human players have summarized from experience, and they sometimes require the partner's collaboration to work (team play). I'm not sure if today's RL/DL algorithms are powerful enough to learn these strategies by self-play. If so, what would be the "trick" to make this happen? Any thought/experience on this would be much appreciated!

daochenzha commented 3 years ago

I think so. Given enough training, the agent should gradually learn these skills.

zedwei commented 3 years ago

@daochenzha - Similarly, this line in the DQN train() function probably also needs to be reconsidered/experimented with:
best_actions = np.argmax(q_values_next, axis=1)

Theoretically it makes more sense (at least to me) to apply the legal_action filter to the "next best action" as well during training. Right now the code takes the argmax over all actions without any legal_action filtering.

Let me know if this makes sense. I'm still running experiments, but initial evidence shows that adding legal_action filtering during training leads to better results in my Tractor game. A rough sketch of what I mean is below.
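Something along these lines (a sketch only; q_values_next, legal_actions_batch, and the other names are illustrative and follow the style of the surrounding train() code, not the exact RLCard implementation):

    import numpy as np

    # Restrict the next-state argmax to legal actions by masking illegal entries with -inf.
    # q_values_next: shape (batch_size, num_actions); legal_actions_batch: one list of
    # legal action indices per sampled transition.
    masked_q_next = np.full(q_values_next.shape, -np.inf)
    for i, legal in enumerate(legal_actions_batch):
        masked_q_next[i, legal] = q_values_next[i, legal]
    best_actions = np.argmax(masked_q_next, axis=1)
    # best_actions then feeds into the Q-learning target exactly as before.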

daochenzha commented 3 years ago

@zedwei Thanks for the feedback! Yes, I agree with you. But I am unsure about the most efficient way to implement this. I expect we would need to save the legal actions to the buffer as well and sample them along with the transitions, so that we don't need to recompute the legal actions every time we sample a transition. Then we take the argmax based on the indices of the legal actions. Is that how you implemented it?

You may take a look at the code in the dmc branch, where the legal actions are applied in predict as you suggested: https://github.com/datamllab/rlcard/blob/dmc/rlcard/agents/dqn_agent.py#L177-L180

Please let me know if you have any suggestions.

zedwei commented 3 years ago

Yes, in my own implementation I do push the legal_actions of the next state into the buffer - basically by adding another field to the Transition tuple. This is much faster than re-computing the legal actions during training. It looks roughly like the sketch below.
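A simplified sketch of that idea (the field and function names here are illustrative, not the exact RLCard memory code):

    from collections import namedtuple

    # Store the legal actions of the next state alongside the usual fields, so the
    # training loop can mask the target argmax without re-querying the game engine.
    Transition = namedtuple(
        'Transition',
        ['state', 'action', 'reward', 'next_state', 'done', 'legal_actions']
    )

    def save_transition(memory, state, action, reward, next_state, done, legal_actions):
        # legal_actions: indices of the valid actions in next_state
        memory.append(Transition(state, action, reward, next_state, done, legal_actions))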

daochenzha commented 3 years ago

@zedwei I have implemented this. See https://github.com/datamllab/rlcard/blob/dmc/rlcard/agents/dqn_agent.py#L195-L200

Interestingly, when I ran it on DouDizhu, I didn't observe a clear difference.

zedwei commented 3 years ago

Interesting. I'll give your implementation a try in my game when I have time and compare it with my own. From looking at the code, they're doing the same thing. Btw, below is my naive implementation without np matrix operations. You may give it a try in DouDizhu as well.

        # For each sampled transition, pick the highest-Q action among the legal
        # actions of the next state, instead of taking the argmax over all actions.
        best_actions = []
        for batch in range(self.batch_size):
            next_action = 0
            max_q = None
            for action in legal_actions_batch[batch]:
                if max_q is None or q_values_next[batch][action] > max_q:
                    next_action = action
                    max_q = q_values_next[batch][action]
            best_actions.append(next_action)