It is difficult to use an MDP agent on a bandit task, mainly because of the eligibility-trace update.
On a contextual 2-armed bandit task, the final action is $\mathbf u' = (0.5, 0.5)^\top$. The 0.5s are necessary to compute the target $y_t = r_t - \mathbf u'^\top \mathbf Q \mathbf x'$, in which $\mathbf u'^\top \mathbf Q \mathbf x'$ averages the two action values at the final state.
However, the eligibility trace is updated with the same action vector,

$$\mathbf Z_t = \gamma \lambda \mathbf Z_{t-1} + \mathbf u_t \mathbf x_t^\top,$$
which in a 4-state (2 contexts, 2 outcomes) task with $\lambda = \gamma = 1$, and where $\mathbf x = (1, 0, 0, 0)^\top$, $\mathbf u = (1, 0)^\top$ and $\mathbf x' = (0, 0, 1, 0)^\top$, should result in a trace of

$$\mathbf Z = \mathbf u \mathbf x^\top = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix},$$

i.e. eligibility only for the action actually taken in the observed context. Running the same update with the pseudo-action $\mathbf u' = (0.5, 0.5)^\top$ instead adds spurious entries of $0.5$ in the column for $\mathbf x'$.
As it stands, the setup can produce either the correct trace or the correct target, but not both.
I think the solution may be to separate the trace-updating function from the value-function update.
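The proposed separation can be sketched as follows. Everything here is an assumption, not the actual codebase: the class `LinearQAgent`, the method names `update_trace` and `update_values`, and the learning rate are all hypothetical. The point is that once the trace update is its own function, the bandit task can use $\mathbf u' = (0.5, 0.5)^\top$ for the target while simply never calling the trace update for that terminal pseudo-action.

```python
# Sketch of a linear Q-agent with a matrix eligibility trace, where the
# trace update is separated from the value (weight) update.  All names
# here are hypothetical; stdlib only, no numpy.

def outer(u, x):
    """Outer product u x^T as a nested list (actions x states)."""
    return [[ui * xj for xj in x] for ui in u]

class LinearQAgent:
    def __init__(self, n_actions, n_states, alpha=0.1, gamma=1.0, lam=1.0):
        self.Q = [[0.0] * n_states for _ in range(n_actions)]  # value matrix
        self.Z = [[0.0] * n_states for _ in range(n_actions)]  # eligibility trace
        self.alpha, self.gamma, self.lam = alpha, gamma, lam

    def q_value(self, u, x):
        """u^T Q x for one-hot (or mixed, e.g. (0.5, 0.5)) vectors."""
        return sum(u[a] * self.Q[a][s] * x[s]
                   for a in range(len(u)) for s in range(len(x)))

    def update_trace(self, u, x):
        """Z <- gamma * lam * Z + u x^T.  Called only for actions that were
        actually taken, never for the terminal pseudo-action u'."""
        ux = outer(u, x)
        for a in range(len(u)):
            for s in range(len(x)):
                self.Z[a][s] = self.gamma * self.lam * self.Z[a][s] + ux[a][s]

    def update_values(self, delta):
        """Q <- Q + alpha * delta * Z, independent of the trace update."""
        for a in range(len(self.Q)):
            for s in range(len(self.Q[a])):
                self.Q[a][s] += self.alpha * delta * self.Z[a][s]

# Usage on the 2-context / 2-outcome bandit from the text:
agent = LinearQAgent(n_actions=2, n_states=4)
x, u = [1, 0, 0, 0], [1, 0]                 # context and chosen arm
x_next, u_next = [0, 0, 1, 0], [0.5, 0.5]   # outcome state, averaging pseudo-action

agent.update_trace(u, x)  # Z now has a single nonzero entry for (action 1, context 1)
target = 1.0 - agent.q_value(u_next, x_next)  # target form from the note, with r_t = 1
delta = target - agent.q_value(u, x)
agent.update_values(delta)  # trace was never touched by u'
```

Because `update_trace` is never called with `u_next`, the trace keeps the single-entry shape above while the target still averages over both arms via `q_value(u_next, x_next)`.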