garbear / xbmc

XBMC Main Repository
http://xbmc.org

Reinforcement Learning Bots #89

Open garbear opened 6 years ago

garbear commented 6 years ago

Introduction

Here, I introduce the Emulation Equation, a game-theoretic equation for emulation.

The equation represents all emulation and enables powerful features: all gameplay (human or otherwise) becomes data for training an artificially intelligent game-playing agent.

For demonstration purposes, I've split the explanation into two theories: the Special Theory of Emulation and the General Theory of Emulation.

For the physics nerds, this is analogous to how Einstein introduced relativity: a special theory first, then a general theory that builds on it.

Background

Reinforcement learning is popular in game-playing AI because the reward signal is often sparse (not evident every frame) and depends on actions taken much earlier in the game.

The Q-learning algorithm is well-suited for teaching reinforcement learners how to play a video game because it does not require a model of the environment, which would be difficult to create for even the most basic computing machine.

Recall the update rule from Q-learning:

Q(S_t, A_t) ← Q(S_t, A_t) + α [ R_t+1 + γ max_a Q(S_t+1, a) - Q(S_t, A_t) ]

Understanding this in depth is outside the scope of this documentation. Just know that it learns over "emulation equations", which describe a series of frames in the discrete time domain.
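To make the update concrete, here is a minimal tabular Q-learning sketch in Python. The state encoding, action set, learning rate, and discount factor are illustrative placeholders, not anything specified in this issue.

```python
from collections import defaultdict

Q = defaultdict(float)      # maps (state, action) -> estimated value
ALPHA = 0.1                 # learning rate (assumed placeholder)
GAMMA = 0.99                # discount factor (assumed placeholder)
ACTIONS = range(16)         # hypothetical set of controller inputs

def q_update(s_t, a_t, r_next, s_next):
    """One Q-learning step for the transition (S_t, A_t) -> (R_t+1, S_t+1)."""
    best_next = max(Q[(s_next, a)] for a in ACTIONS)
    td_error = r_next + GAMMA * best_next - Q[(s_t, a_t)]
    Q[(s_t, a_t)] += ALPHA * td_error
```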

Here I present two theories of emulation. The Special Theory of Emulation gives the smallest equation needed for emulation. The General Theory of Emulation expands on this equation, allowing it to be used for Q-learning.

Special Theory of Emulation

The Special Theory of Emulation presents the smallest equation (the emulation equation) needed to represent all emulation.

Emulation variables

Game-theoretic emulation uses two variables:

State: S

Action: A

Time series

Emulation occurs at discrete time steps, so every time step has its own instance of these variables:

S == S_t
A == A_t

The emulation history is therefore a time series of tuples containing these two variables:

(S_0, A_0, S_1, A_1, ...)
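One simple way to record this history in code is a list of per-frame records; the names below are purely illustrative.

```python
from typing import Any, List, NamedTuple

class Step(NamedTuple):
    state: Any    # S_t: the emulator state (e.g. a savestate blob)
    action: Any   # A_t: the controller input at time t

history: List[Step] = []   # grows as (S_0, A_0), (S_1, A_1), ...
```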

Emulation model

Time steps occur by applying a set of functions to the emulation variables:

PlayFrame()

GetInput()

Emulation equation

The emulation equation is a time series model consisting of the initial conditions, as well as the model used for each time step.

The initial condition of all variables is the empty set:

S_0 = ∅
A_0 = ∅

The variables then evolve by applying the functions in sequence:

S_t+1 = PlayFrame(S_t, A_t)
A_t+1 = GetInput(S_t+1, A_t)
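A minimal sketch of this loop in Python, where play_frame and get_input are hypothetical stand-ins for PlayFrame() and GetInput():

```python
def run_emulation(play_frame, get_input, num_frames):
    """Iterate the special-theory emulation equation for num_frames steps.

    play_frame(S_t, A_t) -> S_t+1     stands in for PlayFrame()
    get_input(S_t+1, A_t) -> A_t+1    stands in for GetInput()
    """
    s, a = None, None              # S_0 = A_0 = empty set
    history = [(s, a)]
    for _ in range(num_frames):
        s = play_frame(s, a)       # S_t+1 = PlayFrame(S_t, A_t)
        a = get_input(s, a)        # A_t+1 = GetInput(S_t+1, A_t)
        history.append((s, a))
    return history
```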

Summary

The emulation equation describes something fundamental in every emulator: play a frame, get input, repeat.

Interestingly, this fundamental concept was not assumed a priori. It emerged as a model while working through the algorithm.

Surprising facts also appear. A_0 is empty: the first frame is played with all buttons unpressed. Upon deeper inspection, this is because Q-learners get no value from an Action without a prior State observation.

Next, we present the General Theory of Emulation, which expands on these fundamentals to include two new concepts needed in the Q-learning algorithm.

The General Theory of Emulation

The general theory of emulation extends the emulation equation so that it can be used for Q-learning.

Note: I also wanted to choose strategies for my Q-learners, such as "walk up" or "reach level 2". I extended Q-learning to depend on a Policy variable in the time series. When the strategy is the identity function (no strategy), this extended learning algorithm reduces to plain Q-learning.

Emulation variables

Reinforcement learning adds two variables:

Reward: R

Policy: π

Time series

Emulation occurs at discrete time steps, so every time step has its own instance of these variables:

S == S_t
R == R_t
π == π_t
A == A_t

The emulation history is therefore a time series of tuples containing these four variables:

(S_0, R_0, π_0, A_0, S_1, R_1, π_1, A_1, ...)

Emulation model

Reinforcement learning also needs two more functions:

GetReward()

Strategize()

Emulation equation

The emulation equation is a time series model consisting of the initial conditions, as well as the model used for each time step.

The initial condition of all variables is the empty set:

S_0 = ∅
R_0 = ∅
π_0 = ∅
A_0 = ∅

The variables then evolve by applying the functions in sequence:

S_t+1 = PlayFrame(S_t, R_t, π_t, A_t)
R_t+1 = GetReward(S_t+1, R_t, π_t, A_t)
π_t+1 = Strategize(S_t+1, R_t+1, π_t, A_t)
A_t+1 = GetInput(S_t+1, R_t+1, π_t+1, A_t)
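As with the special theory, here is an illustrative Python sketch of this loop; the four callables are hypothetical stand-ins for the functions named above:

```python
def run_emulation_rl(play_frame, get_reward, strategize, get_input, num_frames):
    """Iterate the general-theory emulation equation for num_frames steps."""
    s = r = pi = a = None                # S_0 = R_0 = π_0 = A_0 = empty set
    history = [(s, r, pi, a)]
    for _ in range(num_frames):
        s = play_frame(s, r, pi, a)      # S_t+1 = PlayFrame(S_t, R_t, π_t, A_t)
        r = get_reward(s, r, pi, a)      # R_t+1 = GetReward(S_t+1, R_t, π_t, A_t)
        pi = strategize(s, r, pi, a)     # π_t+1 = Strategize(S_t+1, R_t+1, π_t, A_t)
        a = get_input(s, r, pi, a)       # A_t+1 = GetInput(S_t+1, R_t+1, π_t+1, A_t)
        history.append((s, r, pi, a))
    return history
```

The history this loop accumulates is exactly the four-variable time series above, which is the data a Q-learner would train on.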

LipkeGu commented 6 years ago

As far as I understand, you mean to develop a "virtual" player which is available in each game / ROM?

garbear commented 6 years ago

Right. So far the math in this issue just describes the data we need to gather to make this happen. Then it can be uploaded to the cloud for training, and depending on the state of embedded TensorFlow, inference can be run locally, or in the cloud if we get netplay support.