JuliaPOMDP / POMDPs.jl

MDPs and POMDPs in Julia - An interface for defining, solving, and simulating fully and partially observable Markov decision processes on discrete and continuous spaces.
http://juliapomdp.github.io/POMDPs.jl/latest/

value(policy, b, a) for AlphaVectorPolicy #504

Open zsunberg opened 1 year ago

zsunberg commented 1 year ago

This should be implemented, but is not. One question is what to return for an action that doesn't have an alpha vector: should it be missing or -Inf? Probably missing. If we make that decision, actionvalues should probably also be updated.

Relevant to https://github.com/JuliaPOMDP/POMDPs.jl/discussions/492#discussioncomment-6096862
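
A minimal sketch of what a backup-free method could look like (an illustration, not an existing implementation). It assumes the alphas and action_map fields of the current AlphaVectorPolicy and a belief b that supports pdf; the -Inf sentinel is one of the two options discussed here and could be swapped for missing.

using POMDPs, POMDPTools
using LinearAlgebra: dot

function POMDPs.value(p::AlphaVectorPolicy, b, a)
    # belief as a probability vector over the ordered states
    bvec = [pdf(b, s) for s in ordered_states(p.pomdp)]
    best = -Inf  # sentinel for actions with no alpha vector (could be `missing` instead)
    for (α, α_act) in zip(p.alphas, p.action_map)
        if α_act == a
            best = max(best, dot(α, bvec))  # best alpha vector associated with `a`
        end
    end
    return best
end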

zsunberg commented 1 year ago

missing and argmax do not play well together...

julia> argmax([1.0, missing, 3.0])
2

julia> argmin([1.0, missing, 3.0])
2
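
For comparison, one possible workaround (not from the thread) is to coalesce missing to -Inf before taking the argmax:

julia> argmax(coalesce.([1.0, missing, 3.0], -Inf))
3
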
zsunberg commented 1 year ago

We could use nothing:

julia> argmax([1.0, nothing, 3.0])
ERROR: MethodError: no method matching isless(::Float64, ::Nothing)

julia> argmax(something.([1.0, nothing, 3.0], -Inf))
3

but returning nothing from value does not seem right.

zsunberg commented 1 year ago

I guess maybe -Inf is actually the right thing, because this policy considers any action without an alpha vector to be infinitely bad?

johannes-fischer commented 1 year ago

I think this is problematic even in the case where every action has an alpha vector: value(policy, b, a) will only be correct for the optimal action, but might be wrong for dominated actions if you are just maximizing over the alpha vectors corresponding to a.

For example, in TigerPOMDP, consider a belief b for which it is optimal to open a door. If you want to compute value(policy, b, :listen), the alpha vectors corresponding to :listen give a suboptimal action-value estimate for :listen at this belief b. The alpha vector for listening at b is plotted in red below; it is globally dominated (on reachable beliefs) by other alpha vectors, but not dominated if you only consider :listen. So value(policy, b, :listen) should be approximately 25.4, not 24.7.

[Figures: tiger_action_values and tiger_action_values_zoom]

I think to implement this correctly a 1-step lookahead for a is necessary, which would also solve the issue of actions that don't have alpha vectors.
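
For concreteness, a rough sketch of such a one-step lookahead, assuming a discrete POMDP so that the observation sum is tractable and a belief b that supports pdf; lookahead_value is a hypothetical helper name, not part of the package.

using POMDPs, POMDPTools
using LinearAlgebra: dot

# Q(b, a) = E_b[R(s, a)] + γ Σ_o P(o | b, a) * value(p, b′),  where b′ = update(up, b, a, o)
function lookahead_value(p::AlphaVectorPolicy, b, a)
    m = p.pomdp
    up = DiscreteUpdater(m)
    γ = discount(m)

    # expected immediate reward under the current belief
    r = sum(pdf(b, s) * reward(m, s, a) for s in states(m))

    # expected discounted value of the updated belief, summed over observations
    ev = 0.0
    for o in observations(m)
        # P(o | b, a) = Σ_s Σ_s′ b(s) T(s′ | s, a) Z(o | s, a, s′)
        po = sum(pdf(b, s) * pdf(transition(m, s, a), sp) * pdf(observation(m, s, a, sp), o)
                 for s in states(m), sp in states(m))
        po > 0.0 || continue
        ev += po * value(p, update(up, b, a, o))  # max over all alpha vectors at the updated belief
    end
    return r + γ * ev
end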

zsunberg commented 1 year ago

value is not required to return the true value of the policy; it is just what the policy thinks the value is. For example, QMDP returns an AlphaVectorPolicy, but its alpha vectors are significant overestimates.

A single-step lookahead might be a viable way to solve the no-alpha-vector-for-this-action problem, but it would not make value always return the true optimal value for that action, since we may not have the optimal alpha vectors in the part of the belief space that the lookahead explores.

johannes-fischer commented 1 year ago

I agree with everything you say; value needs to return neither the optimal value nor the policy's true value. But I still think AlphaVectorPolicys without a one-step lookahead are only theoretically justified for computing value(policy, b), not for computing value(policy, b, a) by maximizing only over the alpha vectors whose associated action is a. Otherwise, policies that kept dominated alpha vectors would give better estimates of value(policy, b, a) for suboptimal actions.

zsunberg commented 1 year ago

I concede your point that there is no objective theoretical definition of value(p, b, a), but some difficulties would come up if we tried to do a backup in every call to value(p, b, a), including:

  1. It would involve iterating over the observation space. Some algorithms (e.g. QMDP) avoid or approximate the observation space so that they can produce alpha vectors even when the observation space is large or continuous. value(p, b, a) with a backup would be inefficient if the observation space is large and impossible if it is continuous.
  2. If the alpha vectors are approximate, a backup could reveal an action with a higher value, so maximum(actionvalues(p, b)) might be greater than value(p, b).

Also, when Maxime originally implemented actionvalues, he interpreted it to not include a backup. Overall, these reasons convince me that we should not do a backup in value(p, b, a).
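
Consistent with that decision, a backup-free actionvalues might look roughly like the sketch below; actionvalues_noback is a hypothetical name, and the alphas and action_map fields are assumed from the current AlphaVectorPolicy. With this definition, the maximum of the returned vector always equals value(p, b), so the inconsistency mentioned in point 2 cannot arise.

using POMDPs, POMDPTools
using LinearAlgebra: dot

function actionvalues_noback(p::AlphaVectorPolicy, b)
    bvec = [pdf(b, s) for s in ordered_states(p.pomdp)]
    # one entry per action; -Inf where no alpha vector exists for that action
    qs = fill(-Inf, length(actions(p.pomdp)))
    for (α, a) in zip(p.alphas, p.action_map)
        ai = actionindex(p.pomdp, a)
        qs[ai] = max(qs[ai], dot(α, bvec))
    end
    return qs
end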