Replace single-path searches with multi-path rollouts

ariasanovsky commented 11 months ago

This addresses several problems:

as mentioned in #60, models are either converging to trivial models or diverging
as a result, searches are shallow and estimates do not correspond to the costs at terminal nodes
also, model predictions dominate the observation data, when the observation data should come from paths that reach terminal nodes
making mathematically rigorous experimental changes to the algorithms requires a lot of lifetime management
refactor the algorithm so that weights are not required to proportionally define loss over the observation set
handle search node exhaustion better: there are other ways to reduce the memory footprint of ActionData which don't sacrifice the accuracy earned by exhausting nodes in the search tree

Objectives:

refactor SearchTree<P> with a layer of ergonomic indirection instead of lifetime management
- [x] redefine SearchTree
- old: StateNode and BTreeMap<P, StateNode>
- new: BTreeMap<P, usize> and a Vec<StateNode>
  - ~~[ ] later issue: arbitrary index maps~~
  - not required yet
- [x] redefine StateNode with:
- c: f32 $= c(s)$
- c_star: f32 $=c_T^\ast(s)$
- in_neighborhood: Vec<usize> $= N_T^{-}(s)$
- active_actions/exhausted_actions $= \mathcal{A}(s)$
- [x] redefine ActionData with:
- a: usize $=a$
- s_prime: Option<NonZeroUsize> $=a\cdot s$
- g_sa: f32 $=g_T(s, a)$
- [x] redefine Transition with explicit positions
- [x] replace the use of upper estimate functions with decay terms on arcs in the search tree
- [x] refactor rollouts so that they act on the search tree with multiple paths instead of single paths
- more specifically, SearchTree::roll_out should act on a SearchPath which holds/borrows:
  1. a mutated clone of the root node
  2. a corresponding path
  3. a Vec of transition data
[x] do not use any model predictions when writing the observation vector
- instead, use model predictions to bias the search
- we want the observation data corresponding to the transition $(s_0, a)$ to use $c_T^\ast(a\cdot s_0)$
[x] refactor out the weights and instead sum loss over all transitions of the form $(s_0, a)$ where the value $c_T^\ast(a\cdot s_0)$ is defined
cascade exhaustion down
- ~~[ ] when a node at path $p$ is exhausted, consider all solutions to $a\cdot q = p$ and exhaust the action corresponding to $(q, a)$~~
- overkill
- ~~[ ] ?in lieu of, or in addition to, in_neighborhoods~~
- moved to #66
- ~~[ ] to accommodate this, we could tag ActionData with an enum marking partial initialization~~
- ~~i.e., g_sa could be in one of 3 states: uninit, active, or exhausted~~
- ~~alternatively, this could be eliminated with a SoA refactor~~
  - ~~actions: Vec<ActionData> could instead be indices used to slice into shared Vec<_>s~~
  - overkill
[x] delete old StateNode, SearchTree, ActionData, Transition, TransitionMetadata, etc
~~[ ] ?SoA refactor~~
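
A minimal sketch of the index indirection described in the SearchTree objective above: paths map to positions in a Vec, so nodes refer to each other by `usize` rather than by reference and no lifetimes appear in the API. The node type `N` is left generic here, and the `insert`/`get` helpers are hypothetical names for illustration, not the crate's API.

```rust
use std::collections::BTreeMap;

/// Paths map to indices; nodes live in one flat Vec.
struct SearchTree<P, N> {
    positions: BTreeMap<P, usize>,
    nodes: Vec<N>,
}

impl<P: Ord, N> SearchTree<P, N> {
    fn new() -> Self {
        Self { positions: BTreeMap::new(), nodes: Vec::new() }
    }

    /// Inserts `node` at `path` unless a node already exists there.
    /// Returns the index of the node stored for `path`.
    fn insert(&mut self, path: P, node: N) -> usize {
        let next = self.nodes.len();
        let index = *self.positions.entry(path).or_insert(next);
        // Existing indices are always < next, so equality means "new".
        if index == next {
            self.nodes.push(node);
        }
        index
    }

    /// Looks a node up by path through the index map.
    fn get(&self, path: &P) -> Option<&N> {
        self.positions.get(path).map(|&i| &self.nodes[i])
    }
}
```

Because callers hold plain indices, a node can be mutated through `nodes[i]` while other indices are kept around, which is the ergonomic gain over holding references into a recursive StateNode.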

Learning Loop

In each epoch, we have a BATCH-sized set of roots and equally many search trees.
A rollout consists of a search through the tree, ending with/acting on a search path.
With BATCH different search paths, we write a (BATCH $\times$ STATE)-dimensional tensor to evaluate, corresponding to states $s_i$ where no node was found in the corresponding search tree.
The model writes a prediction used to populate the corresponding search tree.
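
As a rough illustration of the (BATCH $\times$ STATE)-dimensional tensor, here is a sketch that flattens one vectorized state per search path into a row-major buffer. The function name `vectorize_batch` and the assumption that each state is already a fixed-length `[f32; STATE]` array are hypothetical.

```rust
/// Flattens BATCH states of dimension STATE into one row-major buffer
/// of length BATCH * STATE, ready to hand to the model.
fn vectorize_batch<const STATE: usize>(states: &[[f32; STATE]]) -> Vec<f32> {
    states.iter().flatten().copied().collect()
}
```

Row `i` of the buffer then corresponds to the state $s_i$ reached by the `i`-th search path.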

Rollout

A search path is initialized with:
- a clone $s_i$ of the root $s_0$,
- an empty path $p_0 = \emptyset$, and
- an empty sequence $\tau = ()$ of transition data/indices.
A rollout mutates the search path across multiple episodes, provided the root node for $s_0$ is not exhausted.
In each episode, we select an action $a_i\in\mathcal{A}(s_{i-1})$ which:
- mutates $s_{i-1}\to s_i = a_i\cdot s_{i-1}$,
- extends $p_{i-1}\to p_i = a_i\cdot p_{i-1}$, and
- appends transition $\tau_i$ to $\tau$.
If $s_i$ is exhausted:
- The search path terminates.
- Data is updated inside the tree at $s_0,\dots,s_i$ and at $a_0,\dots,a_i$.
- $s_i$ is replaced with a clone of $s_0$
- $p_i$ is cleared to an empty path
- $\tau$ is emptied
Else, if $s_i$ does not have a node:
- The rollout ends and $s_i$ is vectorized into the buffer, awaiting prediction.
Else:
- we increment $i\to i+1$ and continue the loop
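
The loop above can be sketched as follows. This is a toy under loud assumptions: states are `u32` counters, `apply` is a hypothetical transition, the root node sits at index 0, the policy just takes the first active action, and the only tree update on termination is retiring the last action taken. The real update would also touch $c_T^\ast$ and $g_T$ along the whole path.

```rust
use std::collections::BTreeMap;

/// Toy node: only the actions still worth exploring.
struct Node {
    active_actions: Vec<usize>,
}

impl Node {
    fn is_exhausted(&self) -> bool {
        self.active_actions.is_empty()
    }
}

struct Tree {
    positions: BTreeMap<Vec<usize>, usize>, // path -> index into `nodes`
    nodes: Vec<Node>,
}

#[derive(Debug, PartialEq)]
enum Outcome {
    /// The path reached a state with no node; it awaits a prediction.
    NeedsPrediction { state: u32, path: Vec<usize> },
    /// Every action at the root has been exhausted.
    RootExhausted,
}

/// Hypothetical transition: action `a` adds `a + 1` to a counter state.
fn apply(state: u32, a: usize) -> u32 {
    state + a as u32 + 1
}

fn roll_out(tree: &mut Tree, root: u32) -> Outcome {
    let (mut state, mut path) = (root, Vec::new());
    let mut tau: Vec<(usize, usize)> = Vec::new(); // (node index, action)
    let mut node = 0; // assumption: the root node is stored at index 0
    loop {
        if tree.nodes[0].is_exhausted() {
            return Outcome::RootExhausted;
        }
        // Placeholder policy: take the first active action.
        let a = tree.nodes[node].active_actions[0];
        state = apply(state, a);
        path.push(a);
        tau.push((node, a));
        match tree.positions.get(&path).copied() {
            // No node for s_i: the rollout ends here.
            None => return Outcome::NeedsPrediction { state, path },
            // s_i is exhausted: update the tree, then reset the search path.
            Some(next) if tree.nodes[next].is_exhausted() => {
                let (parent, a) = *tau.last().unwrap();
                tree.nodes[parent].active_actions.retain(|&b| b != a);
                state = root;
                path.clear();
                tau.clear();
                node = 0;
            }
            // Otherwise continue from s_i.
            Some(next) => node = next,
        }
    }
}
```

Each pass through the loop either returns, moves one step deeper along a path that already exists in `positions`, or retires an action, so the sketch always terminates.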

Node Initialization

For each $s\in\mathcal{S}$, we combine:
- $c(s)\in\mathcal{C}$,
- a (possibly trivial) reward function $r(s, \cdot): \mathcal{A}(s)\to\mathcal{R}$, and
- a model prediction $h_\theta(s, \cdot): \mathcal{A}(s)\to\mathcal{H}$,
to initialize an action-value function $g_T(s, \cdot): \mathcal{A}(s)\to\mathbb{R}$ on newly inserted tree nodes.
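
A sketch of node initialization, assuming rewards and predictions arrive as slices indexed by action. How $c(s)$, $r(s, a)$, and $h_\theta(s, a)$ are combined is not fixed above, so the rule is passed in as a closure; `init_actions` and `combine` are hypothetical names.

```rust
use std::num::NonZeroUsize;

#[derive(Debug, PartialEq)]
struct ActionData {
    a: usize,
    s_prime: Option<NonZeroUsize>,
    g_sa: f32,
}

/// Builds the action list of a newly inserted node.
/// `rewards[a]` and `predictions[a]` stand in for r(s, a) and h_theta(s, a);
/// `combine(c, r, h)` yields the initial g_T(s, a).
fn init_actions(
    c: f32,
    rewards: &[f32],
    predictions: &[f32],
    combine: impl Fn(f32, f32, f32) -> f32,
) -> Vec<ActionData> {
    rewards
        .iter()
        .zip(predictions)
        .enumerate()
        .map(|(a, (&r, &h))| ActionData {
            a,
            s_prime: None, // the child position is not known yet
            g_sa: combine(c, r, h),
        })
        .collect()
}
```

For instance, `init_actions(c, &r, &h, |_c, r, h| r + h)` would seed $g_T(s, a)$ with reward plus predicted improvement; that particular rule is only an example.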

Search Tree

Since we are moving to a Vec<StateNode>, we will have an easier time keeping track of $\text{argmin}$ data with indices. For each node $n$ corresponding to some state $s$, we log in StateNode:

- the value $c(s)$, and
- $c_T^\ast(s)$ as either the index or the index and value, where

$$c_T^\ast(s) = (\text{arg})\text{min}\left(c(s'): s'\in V(T[s..])\right)$$

- data about actions, e.g., Vec<ActionData>, where ActionData may contain:
  - the index $0\leq i_a < \left|\mathcal{A}\right|$ corresponding to $a\in\mathcal{A}(s)$
  - the position Option<NonZeroUsize> of $a\cdot s$ in the Vec<StateNode> if it is known
  - a quantity $g_T(s, a)$ which updates as follows:
    - suppose $(s, a)$ is in a search path which terminates (at an exhausted node)
    - if the path produces an improvement in $c_T^\ast(s)$, then we max-update $g_T(s, a)$ with $c(s) - c_T^\ast(s)$
    - else, we decay $g_T(s, a)$ by a factor of $\gamma\in(0, 1]$, e.g., $\gamma = 0.95$
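
The update rule for $g_T(s, a)$ can be sketched as a pure function. It assumes "improvement" means the terminated path strictly lowered $c_T^\ast(s)$; `update_g` and its parameter names are hypothetical.

```rust
/// Updates g_T(s, a) after a search path through (s, a) terminates.
/// `c` is c(s); `c_star_before` and `c_star_after` are c_T^*(s)
/// before and after the path's data is written into the tree.
fn update_g(g_sa: f32, c: f32, c_star_before: f32, c_star_after: f32, gamma: f32) -> f32 {
    if c_star_after < c_star_before {
        // Improvement: max-update with c(s) - c_T^*(s).
        g_sa.max(c - c_star_after)
    } else {
        // No improvement: decay by gamma.
        g_sa * gamma
    }
}
```

With $\gamma = 1$ the decay branch leaves $g_T(s, a)$ unchanged, so the arc only ever gains value.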

Dominating Sets

We have used the notation $T[s..]$ to indicate the branch of $T$ starting from $s$. We may extend $T$ to its transitively closed reachability digraph $D = \text{Reach}(T)$. We can say that $s'$ dominates $s$ if $(s, s')\in E(D)$ and $c(s')\leq c(s)$, and use the arcs satisfying this condition to define the domination sub-digraph $\text{Dom} \subseteq D$. By tracking $c_T^\ast(s)$, we dynamically retain a minimal cover of $\text{Dom}$ with little overhead.
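
One way to track $c_T^\ast(s)$ dynamically is to push an improved value up through the in-neighborhoods whenever a node's $c_T^\ast$ drops. This is a sketch under the assumption that `in_neighborhood` holds the indices of predecessor nodes; `propagate_c_star` is a hypothetical helper, and only the value (not the argmin index) is tracked.

```rust
struct StateNode {
    c: f32,
    c_star: f32,
    in_neighborhood: Vec<usize>,
}

/// Propagates the c_star of `nodes[start]` to every ancestor it improves.
fn propagate_c_star(nodes: &mut [StateNode], start: usize) {
    let mut stack = vec![start];
    while let Some(i) = stack.pop() {
        let best = nodes[i].c_star;
        let parents = nodes[i].in_neighborhood.clone();
        for p in parents {
            // Only strict improvements are pushed further up.
            if best < nodes[p].c_star {
                nodes[p].c_star = best;
                stack.push(p);
            }
        }
    }
}
```

Ancestors whose $c_T^\ast$ is already at least as good are never revisited, so the work is bounded by the part of the tree the new value actually dominates.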

Interpreting and updating $h_\theta$

Goal: use $h_\theta$ to initialize $g_T(s,\cdot)$ values, but only use $c_T^\ast(s)$ when providing updates to $h_\theta$

For example, we may let $h_\theta(s, a)$ be:

predicted improvement in $c$ from $s$: calculated as $h_T(s, a) = c(s) - c_T^\ast(a\cdot s)$
or from $a\cdot s$: calculated as $h_T(s, a) = c(a\cdot s) - c_T^\ast(a\cdot s)$
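
The two candidate targets can be compared side by side in a short sketch. The `Target` enum, the `(c(a·s), c_T^*(a·s))` child summary, and the squared-error loss are all assumptions for illustration; the only part taken from the objectives is that loss is summed, unweighted, over the actions where $c_T^\ast(a\cdot s)$ is defined.

```rust
/// Which difference h_theta(s, a) is asked to predict.
#[derive(Clone, Copy)]
enum Target {
    FromState, // c(s) - c_T^*(a·s)
    FromChild, // c(a·s) - c_T^*(a·s)
}

/// `child` is (c(a·s), c_T^*(a·s)), or `None` if c_T^*(a·s) is undefined.
fn target(kind: Target, c_s: f32, child: Option<(f32, f32)>) -> Option<f32> {
    child.map(|(c_child, c_star_child)| match kind {
        Target::FromState => c_s - c_star_child,
        Target::FromChild => c_child - c_star_child,
    })
}

/// Unweighted squared error (an assumed loss) summed over the actions
/// whose target is defined; undefined targets contribute nothing.
fn loss(predictions: &[f32], targets: &[Option<f32>]) -> f32 {
    predictions
        .iter()
        .zip(targets)
        .filter_map(|(&h, &t)| t.map(|t| (h - t).powi(2)))
        .sum()
}
```

No model prediction appears on the target side: both variants are computed purely from values stored in the tree.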

ariasanovsky commented 11 months ago

We'll start with

First draft

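use std::collections::{BTreeMap, VecDeque};
use std::num::NonZeroUsize;

struct SearchTree<P> {
    // maps a path to the position of its node in `nodes`
    positions: BTreeMap<P, NonZeroUsize>,
    nodes: Vec<StateNode>,
}

struct StateNode {
    c: f32,                      // c(s)
    c_star: f32,                 // c_T^*(s)
    in_neighborhood: Vec<usize>, // N_T^-(s)
    active_actions: VecDeque<ActionData>,
    exhausted_actions: Vec<ActionData>,
}

struct ActionData {
    a: usize,                      // a
    s_prime: Option<NonZeroUsize>, // position of a·s, if known
    g_sa: f32,                     // g_T(s, a)
}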

and then probably migrate to an SoA.
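
On the choice of Option<NonZeroUsize> for `s_prime`: the standard library guarantees that `Option<NonZeroUsize>` has the same size as `usize`, so the "unknown child" case costs no extra space per action. The sketch below assumes that position 0 is never the target of an action (for example, because it is reserved for the root); `child_index` and `same_size` are hypothetical helpers.

```rust
use std::mem::size_of;
use std::num::NonZeroUsize;

/// Converts an optional child position into a plain index into `nodes`.
fn child_index(s_prime: Option<NonZeroUsize>) -> Option<usize> {
    s_prime.map(NonZeroUsize::get)
}

/// `None` is stored in the otherwise unused zero bit pattern.
fn same_size() -> bool {
    size_of::<Option<NonZeroUsize>>() == size_of::<usize>()
}
```

`NonZeroUsize::new(0)` returns `None`, so a zero position cannot be constructed by accident.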

ariasanovsky commented 11 months ago

Multi-path rollouts