FlyingWorkshop closed this issue 6 months ago.
I'm not sure about the history of the ExplorationPolicy abstract type, but it doesn't look like it was designed to work with built-in simulators like stepthrough. Most of the simulators call action_info(policy, state) to get the action (note: action_info calls action(policy, state) and returns nothing for the info by default: link).
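For reference, the default fallback is roughly the following (a sketch of the POMDPTools behavior, not the verbatim source):

using POMDPs

# Sketch of the default fallback: delegate to the two-argument `action`
# and report `nothing` as the extra info.
action_info(p::Policy, x) = (action(p, x), nothing)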
From the documentation for the ExplorationPolicy type:

Sampling from an exploration policy is done using action(exploration_policy, on_policy, k, state), where k is used to determine the exploration parameter.
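For example, this is roughly how the four-argument form is used (a sketch; the SimpleGridWorld model and the function-of-k epsilon are illustrative assumptions):

using POMDPs, POMDPModels, POMDPTools

mdp = SimpleGridWorld()
on_policy = RandomPolicy(mdp)          # the policy being explored around
expl = EpsGreedyPolicy(mdp, k -> 0.1)  # epsilon may depend on the schedule counter k
s = rand(initialstate(mdp))
a = action(expl, on_policy, 1, s)      # k = 1 sets the exploration parameter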
Based on the current documentation, this behavior is expected. However, there is probably a good argument for redefining how we construct the exploration policies so that the on_policy and k are part of the struct. Then we could define action(policy::ExplorationPolicy, state) appropriately, per the comment above (see the sketch below). Since I am not familiar with the development background here, I am not confident about secondary issues; it would be a breaking change, since we would be redefining the structs of those policies.
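A hypothetical sketch of that redesign (all names here are illustrative, not an existing API):

using POMDPs

# Carry the on-policy and the schedule counter k inside the struct so the
# standard two-argument action(policy, state) works with the simulators.
mutable struct StatefulExploration{E, P} <: POMDPs.Policy
    exploration_policy::E   # e.g. an EpsGreedyPolicy
    on_policy::P            # the policy being explored around
    k::Int                  # exploration-schedule counter
end

function POMDPs.action(p::StatefulExploration, s)
    a = action(p.exploration_policy, p.on_policy, p.k, s)
    p.k += 1   # advance the schedule each time an action is drawn
    return a
end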
Also, reference #497
Yeah, the exploration policy interface was designed for reinforcement learning solvers where the exploration should be decayed, but it is not really a Policy. I would not object to a redesign of that interface.
If you just want an epsilon-greedy policy for a rollout, I'd recommend:
using POMDPs

# Epsilon-greedy wrapper: with probability epsilon take a uniformly random
# action; otherwise defer to the original policy.
struct MyEpsGreedy{M, P} <: POMDPs.Policy
    pomdp::M
    original_policy::P
    epsilon::Float64
end

function POMDPs.action(p::MyEpsGreedy, s)
    if rand() < p.epsilon
        return rand(actions(p.pomdp))        # explore: uniform random action
    else
        return action(p.original_policy, s)  # exploit: follow the base policy
    end
end
policy = MyEpsGreedy(pomdp, original_policy, 0.05)
Closing. Please continue the discussion at https://github.com/JuliaPOMDP/POMDPs.jl/issues/497.
I'm trying to sample beliefs using the implemented exploration policies (SoftmaxPolicy and EpsGreedyPolicy), but they don't work with stepthrough or the other simulator techniques that I've tried.

Steps to recreate:
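A minimal sketch of the kind of call that fails (the SimpleGridWorld model here is an illustrative assumption):

using POMDPs, POMDPModels, POMDPTools

mdp = SimpleGridWorld()
expl = EpsGreedyPolicy(mdp, 0.1)

# stepthrough calls action_info(expl, s), which falls back to the two-argument
# action(expl, s); no such method exists for an ExplorationPolicy, so this errors.
for (s, a, r) in stepthrough(mdp, expl, "s,a,r", max_steps=5)
    @show s, a, r
end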
Error: