YuriCat / MuZeroJupyterExample


Help please #2

Open Zeta36 opened 4 years ago

Zeta36 commented 4 years ago

I'm trying to fill in the pseudocode.py that DeepMind attached to its MuZero paper. For the network part I used a structure very similar to yours in this repository, but I can't get the training process to converge.

By the way, I tried the Connect4 game instead of tic-tac-toe.

Could you please @YuriCat take a look just in case I made some mistake? https://github.com/Zeta36/muzero

Thank you!!

YuriCat commented 4 years ago

Actually, my sample also fails to reach an optimal policy. I think MuZero might need many self-play games in the early stage to learn a good abstract transition model.

I looked at part of your code and found one difference. In board games, for unroll steps that fall after the terminal state, the policy training target is random (uniform) actions, and the value target alternates between +1 and -1 if one of the players has won.

Since this point was not clearly stated in the original paper, I asked one of its authors about it, and he explained that technique to me.
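To make that concrete, here is a minimal sketch of how such absorbing-state targets could be generated. This is an illustrative assumption, not YuriCat's or DeepMind's actual code; the function name, `winner_to_play` convention, and target layout are all hypothetical:

```python
import numpy as np

def targets_after_terminal(num_actions, num_unroll_steps, winner_to_play):
    """Hypothetical sketch: training targets for unroll steps past the
    terminal state of a two-player board game.

    winner_to_play: +1 if the player to move at the terminal state won,
    -1 if they lost, 0 for a draw.
    """
    # Policy target: uniform over all actions ("random actions").
    uniform_policy = np.full(num_actions, 1.0 / num_actions)
    targets = []
    value = winner_to_play
    for _ in range(num_unroll_steps):
        targets.append((value, uniform_policy))
        # Perspective flips each ply, so the value alternates +1/-1
        # (and stays 0 for a draw).
        value = -value
    return targets
```

With a decisive game this yields alternating +1/-1 value targets and uniform policy targets for every step unrolled past the end of the game.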

I hope this will help you!

Zeta36 commented 4 years ago

Thank you very much for your reply @YuriCat. I also think the main problem is hardware. Let's wait until we can get a few hundred TPU cards at home :P.

Please push any new progress you make to your repo.

johan-gras commented 4 years ago

In case it helps, I have a working implementation for the gym Cartpole environment (1 player). The optimal policy is reached relatively quickly (500 episodes or so). Nonetheless, Cartpole is a lot simpler to solve than a board game :)

https://github.com/johan-gras/MuZero

ZHANGRUI666 commented 4 years ago


Hello, Zeta! How is everything going? Has your MuZero converged? I built one too, and it also fails to reach an optimal policy.

ipsec commented 3 years ago

My code isn't converging either. Can someone help me figure out where I'm going wrong?

My repo https://github.com/ipsec/muzero