zsunberg opened 1 year ago
`missing` and `argmax` do not play well together...

```julia
julia> argmax([1.0, missing, 3.0])
2

julia> argmin([1.0, missing, 3.0])
2
```
We could use `nothing`:

```julia
julia> argmax([1.0, nothing, 3.0])
ERROR: MethodError: no method matching isless(::Float64, ::Nothing)

julia> argmax(something.([1.0, nothing, 3.0], -Inf))
3
```
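For completeness, two other ways to skip the problem entries, sketched under the assumption of Julia ≥ 1.7 (where `argmax` over `skipmissing` returns an index into the parent array):

```julia
# Two ways to get an argmax that ignores missing entries.
# Assumes Julia >= 1.7 for argmax over skipmissing.
v = [1.0, missing, 3.0]

i1 = argmax(skipmissing(v))        # index into the parent array, skipping missings
i2 = argmax(coalesce.(v, -Inf))    # replace missing with -Inf, like something.(..., -Inf)
```

Note that `coalesce.(v, -Inf)` encodes the "treat a missing entry as infinitely bad" interpretation discussed below.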
but returning `nothing` from `value` does not seem right. I guess maybe `-Inf` is actually the right thing, because this policy considers any action without an alpha vector to be infinitely bad?
I think this is problematic even in the case where every action has an alpha vector: `value(policy, b, a)` will only be correct for the optimal action, but might be wrong for dominated actions if you are just maximizing over the alpha vectors corresponding to `a`.
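To make the naive per-action maximization concrete, here is an illustrative stand-in (the field names mirror the `AlphaVectorPolicy` convention of a list of alpha vectors with a parallel `action_map`, but this toy struct and `naive_actionvalue` are hypothetical, not the package code):

```julia
using LinearAlgebra: dot

# Toy stand-in for an alpha-vector policy: one alpha vector per entry,
# with the associated action stored in a parallel action_map.
struct ToyAlphaPolicy
    alphas::Vector{Vector{Float64}}
    action_map::Vector{Symbol}
end

# The naive estimate under discussion: maximize b ⋅ α only over alpha
# vectors whose associated action is a; return missing if a has none.
function naive_actionvalue(p::ToyAlphaPolicy, b::Vector{Float64}, a::Symbol)
    vals = [dot(α, b) for (α, act) in zip(p.alphas, p.action_map) if act == a]
    return isempty(vals) ? missing : maximum(vals)
end
```

This is exactly the estimate that can be wrong for dominated actions: pruning a dominated alpha vector for `a` lowers `naive_actionvalue` for `a` even though the policy itself is unchanged.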
For example, in `TigerPOMDP`, consider a belief `b` for which it is optimal to open a door. If you want to compute `value(policy, b, :listen)`, the alpha vectors corresponding to `:listen` give a suboptimal action value estimate for `:listen` in this belief `b`. The alpha vector for listening in `b` is plotted in red below and is globally dominated (on reachable beliefs) by other alpha vectors, but not dominated if you only consider `:listen`. So `value(policy, b, :listen)` should be approximately 25.4, not 24.7.
I think that to implement this correctly, a one-step lookahead for `a` is necessary, which would also solve the issue of actions that don't have alpha vectors.
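A minimal sketch of what that one-step lookahead would compute, on a made-up 2-state, 2-observation problem (all numbers are invented for illustration; only the structure `Q(b,a) = R(b,a) + γ Σ_o P(o|b,a) V(τ(b,a,o))` reflects the proposal):

```julia
using LinearAlgebra: dot

# Invented toy problem: 2 states, 2 observations, static state.
R = Dict(:stay => [1.0, 0.0], :go => [0.0, 1.0])  # per-state reward for each action
Z = [0.8 0.2; 0.3 0.7]       # Z[s, o] = P(o | s), action-independent for simplicity
T = [1.0 0.0; 0.0 1.0]       # identity transition matrix
alphas = [[1.0, 0.5], [0.2, 1.2]]                 # some alpha vectors
V(b) = maximum(dot(α, b) for α in alphas)         # value from the alpha vectors

# One-step lookahead: Q(b, a) = R(b, a) + γ Σ_o P(o|b,a) V(τ(b, a, o))
function lookahead(b, a; γ = 0.95)
    q = dot(R[a], b)                          # expected immediate reward R(b, a)
    bp = T' * b                               # predicted state distribution
    for o in 1:2
        po = dot(Z[:, o], bp)                 # P(o | b, a)
        po > 0 || continue
        bo = (Z[:, o] .* bp) ./ po            # updated belief τ(b, a, o)
        q += γ * po * V(bo)                   # discounted future value
    end
    return q
end

lookahead([0.5, 0.5], :stay)
```

The loop over `o` is exactly where the efficiency concern raised later in this thread comes from: it enumerates the observation space.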
`value` is not required to return the true value of the policy; it is just what the policy thinks the value is. For example, QMDP returns an `AlphaVectorPolicy`, but the alpha vectors are significant over-estimates.
A single-step look-ahead might be a viable way to solve the no-alpha-vector-for-this-action problem, but it would not make `value` always return the true optimal value for that action, since we may not have the optimal alpha vectors in the part of the belief space that the lookahead explores.
I agree with everything you say: `value` needs to return neither the optimal value nor the policy's value. But I still think `AlphaVectorPolicy`s without a one-step look-ahead are only theoretically justified for computing `value(policy, b)`, not for computing `value(policy, b, a)` by maximizing only over the alpha vectors that have `a` as their associated action. Otherwise, policies that kept dominated alpha vectors would give better estimates of `value(policy, b, a)` for sub-optimal actions.
I concede your point that there is no objective theoretical definition of `value(p, b, a)`, but some difficulties would come up if we tried to do a backup in every call to `value(p, b, a)`, including:

- `value(p, b, a)` with a backup would be inefficient if the observation space is large, or impossible if it is continuous.
- `maximum(actionvalues(p, b))` might be greater than `value(p, b)`.

Also, when Maxime originally implemented `actionvalues`, he interpreted it to not include a backup. Overall, these reasons convince me that we should not do a backup in `value(p, b, a)`.
This should be implemented, but is not. One question is what to return for an action that doesn't have an alpha vector. Should it be `missing` or `-Inf`? Probably `missing`. If we make that decision, `actionvalues` should probably also be updated.

Relevant to https://github.com/JuliaPOMDP/POMDPs.jl/discussions/492#discussioncomment-6096862