garbear / xbmc

XBMC Main Repository
http://xbmc.org

Reinforcement Learning Bots #89

Open garbear opened 6 years ago

garbear commented 6 years ago

Introduction

Here, I introduce the Emulation Equation, a game-theoretic equation for emulation.

The equation represents all emulation and enables powerful features: all gameplay (human or otherwise) becomes data for training an artificially intelligent game-playing agent.

For demonstration purposes, I've split the explanation into two theories: the Special Theory of Emulation and the General Theory of Emulation.

For the physics nerds, this is analogous to how Einstein introduced relativity: a special theory first, then a general theory that builds on it.

Background

Reinforcement learning is popular in game-playing AI because the reward signal is often sparse (not evident every frame) and depends on actions taken much earlier in the game.

The Q-learning algorithm is well-suited for teaching reinforcement learners how to play a video game because it does not require a model of the environment, which would be difficult to create for even the most basic computing machine.

Recall the update rule from Q-learning:

Q(S_t, A_t) ← Q(S_t, A_t) + α [ R_t+1 + γ max_a Q(S_t+1, a) - Q(S_t, A_t) ]

Understanding this in depth is outside the scope of this documentation. Just know that it learns over "emulation equations", which describe a series of frames in the discrete time domain.
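To make the update concrete, here is a minimal tabular Q-learning sketch in Python. The state encoding, action set, learning rate, and discount factor are illustrative placeholders, not anything specified in this issue.

```python
from collections import defaultdict

Q = defaultdict(float)      # maps (state, action) -> estimated value
ALPHA = 0.1                 # learning rate (assumed placeholder)
GAMMA = 0.99                # discount factor (assumed placeholder)
ACTIONS = range(16)         # hypothetical set of controller inputs

def q_update(s_t, a_t, r_next, s_next):
    """One Q-learning step for the transition (S_t, A_t) -> (R_t+1, S_t+1)."""
    best_next = max(Q[(s_next, a)] for a in ACTIONS)
    td_error = r_next + GAMMA * best_next - Q[(s_t, a_t)]
    Q[(s_t, a_t)] += ALPHA * td_error
```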

Here I present two theories of emulation. The Special Theory of Emulation gives the smallest equation needed for emulation. The General Theory of Emulation expands on this equation, allowing it to be used for Q-learning.

Special Theory of Emulation

The Special Theory of Emulation presents the smallest equation (the emulation equation) needed to represent all emulation.

Emulation variables

Game-theoretic emulation uses two variables:

State: S

Action: A

Time series

Emulation occurs at discrete time steps, so every time step has its own instance of these variables:

S == S_t
A == A_t

The emulation history is therefore a time series of tuples containing these two variables:

(S_0, A_0, S_1, A_1, ...)
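One simple way to record this history in code is a list of per-frame records; the names below are purely illustrative.

```python
from typing import Any, List, NamedTuple

class Step(NamedTuple):
    state: Any    # S_t: the emulator state (e.g. a savestate blob)
    action: Any   # A_t: the controller input at time t

history: List[Step] = []   # grows as (S_0, A_0), (S_1, A_1), ...
```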

Emulation model

Time steps occur by applying a set of functions to the emulation variables:

PlayFrame()

GetInput()

Emulation equation

The emulation equation is a time series model consisting of the initial conditions, as well as the model used for each time step.

The initial condition of all variables is the empty set:

S_0 = ∅
A_0 = ∅

The variables then evolve by applying the functions in sequence:

S_t+1 = PlayFrame(S_t, A_t)
A_t+1 = GetInput(S_t+1, A_t)
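A minimal sketch of this loop in Python, where play_frame and get_input are hypothetical stand-ins for PlayFrame() and GetInput():

```python
def run_emulation(play_frame, get_input, num_frames):
    """Iterate the special-theory emulation equation for num_frames steps.

    play_frame(S_t, A_t) -> S_t+1     stands in for PlayFrame()
    get_input(S_t+1, A_t) -> A_t+1    stands in for GetInput()
    """
    s, a = None, None              # S_0 = A_0 = empty set
    history = [(s, a)]
    for _ in range(num_frames):
        s = play_frame(s, a)       # S_t+1 = PlayFrame(S_t, A_t)
        a = get_input(s, a)        # A_t+1 = GetInput(S_t+1, A_t)
        history.append((s, a))
    return history
```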

Summary

The emulation equation describes something fundamental in every emulator: play a frame, get input, repeat.

Interestingly, this fundamental concept was not assumed a priori. It emerged as a model while working through the algorithm.

Surprising facts also appear. A_0 is empty: the first frame is played with all buttons unpressed. Upon deeper inspection, this is because Q-learners get no value from an Action without a prior State observation.

Next, we present the General Theory of Emulation, which expands on these fundamentals to include two new concepts needed in the Q-learning algorithm.

The General Theory of Emulation

The general theory of emulation extends the emulation equation so that it can be used for Q-learning.

Note: I also wanted to choose strategies for my Q-learners, such as "walk up" or "reach level 2". I extended Q-learning to depend on a Policy variable in the time series. When the strategy is the identity function (no strategy), this extended learning algorithm reduces to plain Q-learning.

Emulation variables

Reinforcement learning adds two variables:

Reward: R

Policy: π

Time series

Emulation occurs at discrete time steps, so every time step has its own instance of these variables:

S == S_t
R == R_t
π == π_t
A == A_t

The emulation history is therefore a time series of tuples containing these four variables:

(S_0, R_0, π_0, A_0, S_1, R_1, π_1, A_1, ...)

Emulation model

Reinforcement learning also needs two more functions:

GetReward()

Strategize()

Emulation equation

The emulation equation is a time series model consisting of the initial conditions, as well as the model used for each time step.

The initial condition of all variables is the empty set:

S_0 = ∅
R_0 = ∅
π_0 = ∅
A_0 = ∅

The variables then evolve by applying the functions in sequence:

S_t+1 = PlayFrame(S_t, R_t, π_t, A_t)
R_t+1 = GetReward(S_t+1, R_t, π_t, A_t)
π_t+1 = Strategize(S_t+1, R_t+1, π_t, A_t)
A_t+1 = GetInput(S_t+1, R_t+1, π_t+1, A_t)
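As with the special theory, here is an illustrative Python sketch of this loop; the four callables are hypothetical stand-ins for the functions named above:

```python
def run_emulation_rl(play_frame, get_reward, strategize, get_input, num_frames):
    """Iterate the general-theory emulation equation for num_frames steps."""
    s = r = pi = a = None                # S_0 = R_0 = π_0 = A_0 = empty set
    history = [(s, r, pi, a)]
    for _ in range(num_frames):
        s = play_frame(s, r, pi, a)      # S_t+1 = PlayFrame(S_t, R_t, π_t, A_t)
        r = get_reward(s, r, pi, a)      # R_t+1 = GetReward(S_t+1, R_t, π_t, A_t)
        pi = strategize(s, r, pi, a)     # π_t+1 = Strategize(S_t+1, R_t+1, π_t, A_t)
        a = get_input(s, r, pi, a)       # A_t+1 = GetInput(S_t+1, R_t+1, π_t+1, A_t)
        history.append((s, r, pi, a))
    return history
```

The history this loop accumulates is exactly the four-variable time series above, which is the data a Q-learner would train on.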

LipkeGu commented 6 years ago

As far as I understand, you mean to develop a "virtual" player which is available in each game / ROM?

garbear commented 6 years ago

Right. So far the math in this issue just describes the data we need to gather to make this happen. Then it can be uploaded to the cloud for training, and depending on the state of embedded TensorFlow, inference can be run locally, or in the cloud if we get netplay support.