It is difficult to use an MDP agent on a bandit task, mainly because of the eligibility-trace update.
On a contextual 2-armed bandit task, the final action is $\mathbf u' = (0.5, 0.5)^\top$. The 0.5s are necessary to compute the target $y_t = r_t - \mathbf u'^\top \mathbf Q \mathbf x'$, in which $\mathbf u'^\top \mathbf Q \mathbf x'$ averages the two action values at the final state.
However, the eligibility trace is updated with the same action vector,

$$\mathbf Z_t = \gamma \lambda \mathbf Z_{t-1} + \mathbf u_t \mathbf x_t^\top,$$
which in a 4-state (2 contexts, 2 outcomes) task with $\lambda = \gamma = 1$, and where $\mathbf x = (1, 0, 0, 0)^\top$, $\mathbf u = (1, 0)^\top$ and $\mathbf x' = (0, 0, 1, 0)^\top$, should result in a trace of

$$\mathbf Z = \mathbf u \mathbf x^\top = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix},$$

i.e. eligibility only for the action actually taken in the observed context. Running the same update with the pseudo-action $\mathbf u' = (0.5, 0.5)^\top$ instead adds spurious entries of $0.5$ in the column for $\mathbf x'$.
As it stands, the setup can produce either the correct trace or the correct target, but not both.
I think the solution may be to separate the trace-updating function from the value-function update.
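The proposed separation can be sketched as follows. Everything here is an assumption, not the actual codebase: the class `LinearQAgent`, the method names `update_trace` and `update_values`, and the learning rate are all hypothetical. The point is that once the trace update is its own function, the bandit task can use $\mathbf u' = (0.5, 0.5)^\top$ for the target while simply never calling the trace update for that terminal pseudo-action.

```python
# Sketch of a linear Q-agent with a matrix eligibility trace, where the
# trace update is separated from the value (weight) update.  All names
# here are hypothetical; stdlib only, no numpy.

def outer(u, x):
    """Outer product u x^T as a nested list (actions x states)."""
    return [[ui * xj for xj in x] for ui in u]

class LinearQAgent:
    def __init__(self, n_actions, n_states, alpha=0.1, gamma=1.0, lam=1.0):
        self.Q = [[0.0] * n_states for _ in range(n_actions)]  # value matrix
        self.Z = [[0.0] * n_states for _ in range(n_actions)]  # eligibility trace
        self.alpha, self.gamma, self.lam = alpha, gamma, lam

    def q_value(self, u, x):
        """u^T Q x for one-hot (or mixed, e.g. (0.5, 0.5)) vectors."""
        return sum(u[a] * self.Q[a][s] * x[s]
                   for a in range(len(u)) for s in range(len(x)))

    def update_trace(self, u, x):
        """Z <- gamma * lam * Z + u x^T.  Called only for actions that were
        actually taken, never for the terminal pseudo-action u'."""
        ux = outer(u, x)
        for a in range(len(u)):
            for s in range(len(x)):
                self.Z[a][s] = self.gamma * self.lam * self.Z[a][s] + ux[a][s]

    def update_values(self, delta):
        """Q <- Q + alpha * delta * Z, independent of the trace update."""
        for a in range(len(self.Q)):
            for s in range(len(self.Q[a])):
                self.Q[a][s] += self.alpha * delta * self.Z[a][s]

# Usage on the 2-context / 2-outcome bandit from the text:
agent = LinearQAgent(n_actions=2, n_states=4)
x, u = [1, 0, 0, 0], [1, 0]                 # context and chosen arm
x_next, u_next = [0, 0, 1, 0], [0.5, 0.5]   # outcome state, averaging pseudo-action

agent.update_trace(u, x)  # Z now has a single nonzero entry for (action 1, context 1)
target = 1.0 - agent.q_value(u_next, x_next)  # target form from the note, with r_t = 1
delta = target - agent.q_value(u, x)
agent.update_values(delta)  # trace was never touched by u'
```

Because `update_trace` is never called with `u_next`, the trace keeps the single-entry shape above while the target still averages over both arms via `q_value(u_next, x_next)`.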