datamllab / rlcard

Reinforcement Learning / AI Bots in Card (Poker) Games - Blackjack, Leduc, Texas, DouDizhu, Mahjong, UNO.
http://www.rlcard.org
MIT License

Doudizhu performance worse than Rule based? #160

Closed · ammaddd closed this issue 3 years ago

ammaddd commented 4 years ago

I trained the DouDizhu DQN agent for 100,000 episodes, but its performance is worse than the rule-based agent. Why is that?

daochenzha commented 4 years ago

@ammaddd This is normal: the current DouDizhu environment has sparse rewards, so the agent may get stuck. A possible way to improve performance is to use the rule agent to generate training data and train the model with supervised learning, then load the supervised-learning weights and keep training the agent with RL.
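A minimal sketch of the data-generation step described here. The rule-model id 'doudizhu-rule-v1', the RecordingAgent wrapper, and the state['obs'] layout are assumptions and should be checked against your rlcard version:

```python
import rlcard
from rlcard import models


class RecordingAgent:
    """Wraps a rule agent and records every (observation, action) pair it plays."""

    def __init__(self, rule_agent):
        self.rule_agent = rule_agent
        self.use_raw = rule_agent.use_raw   # let the env decode actions as usual
        self.samples = []                   # collected (obs, action) pairs

    def step(self, state):
        action = self.rule_agent.step(state)
        # state['obs'] is the encoded observation; raw actions may still need
        # to be mapped to integer action ids before supervised training.
        self.samples.append((state['obs'], action))
        return action

    def eval_step(self, state):
        return self.step(state), {}


env = rlcard.make('doudizhu')
rule_agents = models.load('doudizhu-rule-v1').agents    # one rule agent per seat
recorders = [RecordingAgent(agent) for agent in rule_agents]
env.set_agents(recorders)

for _ in range(1000):                  # rule-agent self-play games
    env.run(is_training=False)

dataset = [pair for rec in recorders for pair in rec.samples]
print('collected', len(dataset), 'state-action pairs')
```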

ammaddd commented 4 years ago


What do you mean by "generating training data"? Is the generated data just all the possible moves at that instant? Or do you mean we should train the DQN against the rule-based agent, i.e., Player1: DQN, Player2: Rule-Based, Player3: Rule-Based? Will this improve performance?

daochenzha commented 4 years ago

@ammaddd Hi, sorry, I just saw your reply. "Generating training data" means that we let Player1: rule agent, Player2: rule agent, Player3: rule agent play the game and record the state-action pairs and the rewards. Then we use these state-action pairs as training data to train the DQN network with supervised learning. This "initializes" the DQN so that it plays like the rule agent. After that, we keep training the DQN with reinforcement learning.

The idea is that training from scratch with RL is hard in sparse-reward environments like DouDizhu. "Initializing" the network with the weights from supervised learning may accelerate learning and achieve better performance. This could be a way to train an agent that beats the rule agent. However, some engineering effort is required to implement this idea.
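A hedged sketch of the supervised "initialization" step: behavioral cloning on (observation, action-id) pairs, e.g. the pairs collected by the recording sketch above, after which the weights would be copied into the DQN's Q-network. The dataset variable, the 309-action assumption, and the attribute path dqn_agent.q_estimator.qnet refer to rlcard's PyTorch DQNAgent at the time of writing; treat them as assumptions:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# dataset: list of (obs, action_id) pairs from rule-agent self-play,
# with raw actions already mapped to integer action ids.
obs = torch.tensor([o.flatten() for o, _ in dataset], dtype=torch.float32)
actions = torch.tensor([a for _, a in dataset], dtype=torch.long)
loader = DataLoader(TensorDataset(obs, actions), batch_size=256, shuffle=True)

num_actions = 309            # rlcard's abstracted DouDizhu action space (check env.num_actions)
policy = nn.Sequential(      # should mirror the DQN Q-network architecture
    nn.Linear(obs.shape[1], 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, num_actions),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(10):      # behavioral cloning: predict the rule agent's move
    for x, y in loader:
        optimizer.zero_grad()
        loss = loss_fn(policy(x), y)
        loss.backward()
        optimizer.step()

# Copy the pretrained weights into the DQN before continuing with RL,
# e.g. (attribute names are version-dependent and assume matching architectures):
# dqn_agent.q_estimator.qnet.load_state_dict(policy.state_dict())
```

The cloned network's outputs are not calibrated Q-values; the point is only to bias the agent toward rule-like play before RL fine-tuning.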

billh0420 commented 4 years ago

You could also modify the DQN code to retrieve the move that the rule agent would play in a given position and use that move as the move the DQN agent plays. I think this would require fewer than 10 lines of code change.
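One way to read this suggestion: during training, let the DQN act with the rule agent's move so that its replay buffer and Q-network learn from rule-like play. A minimal, hypothetical wrapper sketch; it assumes the rule agent returns actions in the same format the environment expects from the DQN (with raw rule agents you would map raw actions to action ids first):

```python
class RuleGuidedDQN:
    """DQN agent whose training-time moves are taken from a rule agent."""

    def __init__(self, dqn_agent, rule_agent):
        self.dqn_agent = dqn_agent
        self.rule_agent = rule_agent
        self.use_raw = False   # assumes rule actions are mapped to action ids

    def feed(self, transition):
        # Transitions still fill the DQN replay buffer and trigger updates.
        self.dqn_agent.feed(transition)

    def step(self, state):
        # Training time: play the move the rule agent would play here.
        return self.rule_agent.step(state)

    def eval_step(self, state):
        # Evaluation time: use the learned Q-network.
        return self.dqn_agent.eval_step(state)
```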

ammaddd commented 4 years ago

But the result graph shows it converging well. If it were stuck in a local optimum, wouldn't the convergence be less smooth? Does this mean that a high reward does not always mean a good model? At a reward of 0.72, the model is still worse than the rule-based agent.

[reward curve figure]

This is a match between DQN vs. Rule-Based vs. Rule-Based. Only 6% of the games were won by the DQN.

[win-rate figure]

daochenzha commented 4 years ago

@ammaddd I guess the top figure shows the rewards against the random agent. However, the random agent is weak; even if the DQN can beat random agents, it may still be far behind the rule agent. A better way is to train the DQN against the rule agent, or to "initialize" the DQN with the rules and keep training.
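A hedged sketch of "train DQN against the rule agent": seat the learning DQN as one player and rule agents as the other two. The DQNAgent constructor arguments, the rule-model id, and the reorganize/feed pattern follow rlcard's example scripts at the time of writing and may differ across versions:

```python
import rlcard
from rlcard import models
from rlcard.agents import DQNAgent
from rlcard.utils import reorganize

env = rlcard.make('doudizhu')
dqn_agent = DQNAgent(
    num_actions=env.num_actions,        # older versions use action_num instead
    state_shape=env.state_shape[0],
    mlp_layers=[512, 512],
)
rule_agents = models.load('doudizhu-rule-v1').agents

# Player 0 learns; players 1 and 2 follow the rules.
env.set_agents([dqn_agent, rule_agents[1], rule_agents[2]])

for episode in range(100000):
    trajectories, payoffs = env.run(is_training=True)
    trajectories = reorganize(trajectories, payoffs)
    for transition in trajectories[0]:  # feed only the DQN player's transitions
        dqn_agent.feed(transition)
```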

daochenzha commented 3 years ago

We now have a strong DouDizhu agent at https://github.com/datamllab/rlcard/tree/master/rlcard/agents/dmc_agent

See also https://github.com/kwai/DouZero.