Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model By: DeepMind

Link: SemanticScholar

Comment: This is a DeepMind paper published in 2019. It is the follow-up to AlphaGo and AlphaZero. AlphaGo targets only Go, and AlphaZero covers Go, chess and shogi using known game rules, while MuZero targets a wider range of games, including Atari, "without knowing underlying dynamics". It achieves state-of-the-art (SOTA) results.

Problem/Prior Work: ... Once a model has been constructed, it is straightforward to apply MDP planning algorithms, such as value iteration [31] or Monte-Carlo tree search (MCTS) [7], to compute the optimal value or optimal policy for the MDP. ... A quite different approach to model-based RL has recently been developed, focused end-to-end on predicting the value function [41]. The main idea of these methods is to construct an abstract MDP model such that planning in the abstract MDP is equivalent to planning in the real environment. This equivalence is achieved by ensuring value equivalence, i.e. that, starting from the same real state, the cumulative reward of a trajectory through the abstract MDP matches the cumulative reward of a trajectory in the real environment.
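
Example (my own toy sketch, not from the paper): value equivalence can be checked by brute force on a tiny deterministic environment. Everything here is a hypothetical illustration: the line-walk environment, the names `real_step`, `encode`, `abstract_step`, `rollout_return` and `is_value_equivalent`, and the hand-written abstract model, which stands in for what MuZero would learn with neural networks. The point is only that the abstract state (distance to goal) looks nothing like the real state (position), yet every action sequence yields the same cumulative reward in both.

```python
from itertools import product

GOAL = 3  # assumed toy setting: positions 0..GOAL on a line


def real_step(pos, action):
    """Real environment: move by action (-1 or +1), clipped to [0, GOAL]."""
    nxt = min(max(pos + action, 0), GOAL)
    reward = 1.0 if nxt == GOAL else 0.0
    return nxt, reward


def encode(pos):
    """Hypothetical representation: abstract state is the distance to the goal."""
    return GOAL - pos


def abstract_step(dist, action):
    """Abstract dynamics: never sees the real position, only the abstract state."""
    nxt = min(max(dist - action, 0), GOAL)
    reward = 1.0 if nxt == 0 else 0.0
    return nxt, reward


def rollout_return(step, state, actions, discount=1.0):
    """Cumulative (discounted) reward of an action sequence under a step function."""
    total, scale = 0.0, 1.0
    for action in actions:
        state, reward = step(state, action)
        total += scale * reward
        scale *= discount
    return total


def is_value_equivalent(horizon=4):
    """Compare returns for every start state and every action sequence."""
    for pos in range(GOAL + 1):
        for actions in product((-1, 1), repeat=horizon):
            real = rollout_return(real_step, pos, actions)
            abstract = rollout_return(abstract_step, encode(pos), actions)
            if real != abstract:
                return False
    return True


if __name__ == "__main__":
    print(is_value_equivalent())
```

Because the returns agree for all trajectories, a planner such as value iteration or MCTS run purely inside the abstract model would rank actions the same way as one run against the real environment, which is the property the quoted passage describes.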

Innovation:
