JuliaPOMDP / PointBasedValueIteration.jl

Point-based value iteration solver for POMDPs
MIT License

Better to use reward(m, s, a, sp, o) and observation(m, s, a, sp) #1

Closed: zsunberg closed this issue 4 years ago

zsunberg commented 4 years ago

Hi @dominikstrb, one potential improvement: it is better for the solver to use the versions of the interface functions with more arguments, e.g. reward(m, s, a, sp, o) or reward(m, s, a, sp) and observation(m, s, a, sp), instead of reward(m, s, a) and observation(m, a, sp). This will make the solver compatible with more problems.
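For example, here is a rough sketch (not the solver's actual code, and assuming the transition and observation distributions support the support and pdf functions, as the discrete distributions in the POMDPs.jl ecosystem do) of how the expected immediate reward R(s, a) could be computed from the most general five-argument form. POMDPs.jl falls back to the shorter forms automatically, so this also covers models that only implement reward(m, s, a):

using POMDPs

# Sketch only: expected immediate reward R(s, a) obtained from reward(m, s, a, sp, o)
# by averaging over next states sp and observations o.
function expected_reward(m::POMDP, s, a)
    r = 0.0
    td = transition(m, s, a)              # distribution over next states sp
    for sp in support(td)
        od = observation(m, s, a, sp)     # distribution over observations o
        for o in support(od)
            r += pdf(td, sp) * pdf(od, o) * reward(m, s, a, sp, o)
        end
    end
    return r
end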

dominikstrb commented 4 years ago

Hi @zsunberg, thanks for the suggestion! I am not sure if it is possible to use reward(m, s, a, sp, o) in this case. Looking at the original paper for the algorithm (http://www.fore.robot.cc/papers/Pineau03a.pdf), the backup equation is

$$\Gamma^{a,*} \leftarrow \alpha^{a,*}(s) = R(s, a)$$

$$\Gamma^{a,o} \leftarrow \alpha_i^{a,o}(s) = \gamma \sum_{s' \in S} T(s' \mid s, a)\, O(o \mid s', a)\, \alpha_i(s'), \quad \forall \alpha_i \in V$$

$$\Gamma_b^{a} = \Gamma^{a,*} + \sum_{o \in \Omega} \operatorname*{arg\,max}_{\alpha \in \Gamma^{a,o}} (\alpha \cdot b)$$

So it looks like the reward function cannot depend on the next state or the observation. Am I overlooking something? I am not very familiar with POMDP solution methods and basically just tried to stick to the paper.

For the observation function, I have now switched to the more general version.
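Roughly, the alpha^{a,o} vectors from the paper now look like this (a simplified sketch, not the exact solver code, assuming discrete spaces and the ordered_states helper from POMDPModelTools):

using POMDPs, POMDPModelTools

# Sketch: alpha^{a,o}(s) = γ Σ_{s'} T(s'|s,a) O(o|s,a,s') α(s'), using the more
# general observation(m, s, a, sp) so that observation models that also depend
# on the previous state work as well.
function alpha_ao(m::POMDP, α::AbstractVector, a, o)
    γ = discount(m)
    S = ordered_states(m)
    return [γ * sum(pdf(transition(m, s, a), sp) *
                    pdf(observation(m, s, a, sp), o) *
                    α[stateindex(m, sp)] for sp in S)
            for s in S]
end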

zsunberg commented 4 years ago

In the latest version of POMDPModelTools, there is a way to automatically cache the results of taking expectations of reward(m, s, a, sp, o): https://juliapomdp.github.io/POMDPModelTools.jl/dev/model_transformations/#State-Action-Reward-Model

using POMDPModelTools

# somewhere at the beginning
r = StateActionReward(pomdp)
# when you need to use R(s, a) in the backups
r(s, a)

This will allow the solver to work with models that have reward(m, s, a, sp, o).
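For example (an illustrative snippet only; TigerPOMDP from POMDPModels is just a stand-in model):

using POMDPModelTools, POMDPModels

pomdp = TigerPOMDP()           # stand-in model for illustration
r = StateActionReward(pomdp)   # behaves like a mean reward function of (s, a)
s = first(states(pomdp))
a = first(actions(pomdp))
r(s, a)                        # drop-in replacement for R(s, a) in the backups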

dominikstrb commented 4 years ago

Great, thanks! Fixed it.