ariasanovsky closed this issue 1 year ago
Creating this issue resolves #29
After looking at search data, I realized that in some search spaces the agent is incentivized too strongly toward exploration and rarely reaches terminal nodes. The formula $$u(s, a) = \overline{g}^\ast(s, a) + c\cdot p(s, a)\cdot\dfrac{\sqrt{n(s)}}{1+n(s, a)}$$ places incentives on the estimated gain $\overline{g}^\ast(s, a)$ and on under-visited actions, but doesn't factor in the depth of the node corresponding to $s$ or whether the paths visiting $(s, a)$ reached terminal nodes. So I am testing different upper estimate functions and making $u(s, a)$ user-specified.
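A user-specified estimate could be exposed through a trait along these lines (a minimal sketch; the trait name, the `depth` parameter, and the depth-discount idea are all hypothetical, not the crate's actual API):

```rust
/// Hypothetical trait for pluggable upper estimates.
trait UpperEstimate {
    /// u(s, a) from the mean gain g_bar(s, a), the prior p(s, a),
    /// the visit counts n(s) and n(s, a), and the depth of s.
    fn upper_estimate(&self, g_bar: f64, p: f64, n_s: u32, n_sa: u32, depth: u32) -> f64;
}

/// The current formula, plus an illustrative depth discount on the
/// exploration term so deeper nodes lean toward exploitation.
struct Puct {
    c: f64,
    depth_discount: f64,
}

impl UpperEstimate for Puct {
    fn upper_estimate(&self, g_bar: f64, p: f64, n_s: u32, n_sa: u32, depth: u32) -> f64 {
        let explore = self.c * p * (n_s as f64).sqrt() / (1.0 + n_sa as f64);
        // With depth_discount = 1.0 this reduces to the original formula.
        g_bar + explore * self.depth_discount.powi(depth as i32)
    }
}
```

Search code would then take an `impl UpperEstimate` (or a trait object) instead of hard-coding the formula.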
Moving this to #14.
This is an elaboration on a task mentioned in #14
To discourage the tree from revisiting previously visited terminal nodes:
Exhaustion

- when a search path reaches a terminal node, mark the `ActionData` of the penultimate node corresponding to the action leading to the terminal node as `Exhausted`
- make the node state an `enum` with `Active`/`Exhausted` variants; to conform, `StateData` holds a `Vec` of `ActionData`
- when a node has 0 actions which are `Active`, switch the node from `Active` to `Exhausted`
- record these updates through `StateData`'s `CostLog` or a separate helper struct

Better upper estimate
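The bookkeeping above could be sketched as follows (field layout and method names beyond `ActionData`/`StateData` are guesses for illustration, not the repo's actual definitions):

```rust
/// Shared Active/Exhausted state for both actions and nodes.
#[derive(Clone, Copy, PartialEq, Debug)]
enum NodeState {
    Active,
    Exhausted,
}

struct ActionData {
    state: NodeState,
    // ... per-action statistics (n(s, a), g-bar, p) would live here
}

struct StateData {
    state: NodeState,
    actions: Vec<ActionData>,
}

impl StateData {
    /// Mark the action that led to a terminal node as Exhausted;
    /// if 0 actions remain Active, switch the whole node from
    /// Active to Exhausted.
    fn exhaust_action(&mut self, i: usize) {
        self.actions[i].state = NodeState::Exhausted;
        if self.actions.iter().all(|a| a.state == NodeState::Exhausted) {
            self.state = NodeState::Exhausted;
        }
    }
}
```

During selection, the search would skip `Exhausted` actions entirely, so previously visited terminal nodes are never revisited.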