MaximeBouton opened 4 years ago
Yeah. We also need an interface for custom decay schedules for eps.
Hmm... yes this is a good question.
I think the third option is reasonable. We might consider calling it ExplorationStrategy instead (an ExplorationStrategy might contain a Policy that it uses for exploration).
I think action is an acceptable name for the function. The meaning of action is just "select an action based on the inputs", so I don't think it clashes too badly with action(::Policy, s). However, it seems like the exploration strategy would often contain some state, like the position in the epsilon decay schedule, so maybe the name should have a ! in it.
We should also think about exactly what the arguments should be. Is env needed? What if we pass in the on-policy action instead of current_policy, i.e.
action(::ExplorationStrategy, on_policy_action, obs, rng)
We could also consider leaving out the on-policy action from the call altogether and say: if action returns nothing, use the on-policy action.
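For concreteness, a minimal sketch of what that variant could look like (EpsGreedy and its fields are hypothetical names for illustration, not an existing type):

using Random # for AbstractRNG

abstract type ExplorationStrategy end

# Hypothetical epsilon-greedy strategy under the proposed signature.
struct EpsGreedy{A} <: ExplorationStrategy
    eps::Float64
    actions::Vector{A}
end

function action(strat::EpsGreedy, on_policy_action, obs, rng::AbstractRNG)
    if rand(rng) < strat.eps
        return rand(rng, strat.actions) # explore: random action
    else
        return on_policy_action # exploit (or `return nothing` under the nothing-convention)
    end
end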
(Note: I might be using some of the wrong words because I have less experience with RL.)
We can't pass in just the actions because for certain exploration strategies like Softmax one needs the q values for all actions. Otherwise the idea sounds reasonable. Should that be another package then?
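As a rough sketch of why: a softmax strategy samples from a distribution over all actions, so it needs every Q value rather than a single on-policy action (SoftmaxStrategy below is illustrative only, not an existing type):

using Random # for AbstractRNG

# Illustrative softmax strategy: samples actions with probability
# proportional to exp(Q / temperature), so it needs all the Q values.
struct SoftmaxStrategy{A}
    temperature::Float64
    actions::Vector{A}
end

function action(strat::SoftmaxStrategy, q_values::AbstractVector, rng::AbstractRNG)
    weights = exp.(q_values ./ strat.temperature)
    probs = weights ./ sum(weights)
    r, c = rand(rng), 0.0
    for (i, p) in enumerate(probs) # inverse-CDF sampling over the actions
        c += p
        r <= c && return strat.actions[i]
    end
    return strat.actions[end] # guard against floating-point round-off
end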
We can't pass in just the actions because for certain exploration strategies like Softmax one needs the q values for all actions.
Ah, I see - that makes sense.
Should that be another package then?
Did you mean "Should that be in another package then?" or "Should we create another package?"
I think yes, it should be somewhere besides one of the learning packages, but I would hope to not create a new one. My philosophy on packages has changed a lot since we broke up POMDPToolbox. Now I think it would have been much better to create better documentation and perhaps use submodules than to have a bunch of small packages!
For now I think this could live in POMDPPolicies. I might take a shot at it next week.
Do we really want to have action and action!? I think it might be confusing, and I am not sure we are really respecting that convention now: if you have an MCTS policy, we are modifying internal fields of the policy object and we still call action.
Suggestion:
using POMDPs # for Policy and action
using Random # for AbstractRNG

abstract type AbstractSchedule end # define linear decay, exponential decay, and more

# update_value!(::AbstractSchedule, ::Real) returns the new value of epsilon;
# the schedule itself may be mutated (e.g. to advance a step counter)
function update_value! end

mutable struct EpsGreedyPolicy{A} <: Policy
    eps::Real
    schedule::AbstractSchedule
    policy::Policy
    rng::AbstractRNG
    actions::Vector{A}
end

function POMDPs.action(p::EpsGreedyPolicy, s)
    p.eps = update_value!(p.schedule, p.eps) # update the value of epsilon according to the schedule
    if rand(p.rng) < p.eps
        return rand(p.rng, p.actions) # explore: random action
    else
        return action(p.policy, s)   # exploit: action of the wrapped policy
    end
end
Should update_value! have any restriction on the second argument?
No! I will submit a proper PR next week; this is just to give you an overview of the idea.
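To make the idea concrete, a hedged sketch of one possible schedule under this interface (LinearDecaySchedule is a hypothetical name; only AbstractSchedule and update_value! come from the suggestion above):

# Hypothetical linear decay from `start` to `stop` over `steps` calls.
mutable struct LinearDecaySchedule <: AbstractSchedule
    start::Float64
    stop::Float64
    steps::Int
    t::Int # number of calls so far; mutated on each call, hence the !
end

function update_value!(s::LinearDecaySchedule, eps::Real)
    s.t += 1
    frac = min(s.t / s.steps, 1.0)
    return s.start + frac * (s.stop - s.start)
end

Note that this particular schedule ignores the old value of eps entirely, which is one reason it is unclear what restriction the second argument should have.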
What would be a good interface for specifying the exploration policy? It is implemented differently here and in DeepQLearning.jl.
- Here, the solver uses EpsGreedyPolicy and relies on the internals of that policy to access the Q values. I think it is pretty bad: EpsGreedyPolicy should be agnostic to the type of policy used for the greedy part (right now it assumes a tabular policy, I think), and if we improve EpsGreedyPolicy then the code here will break.
- In DeepQLearning.jl, the user must pass in a function f, and f(policy, env, obs, global_step, rng) will be called to return the action. I took inspiration from MCTS.jl for this. However, it is not super convenient to define a decaying epsilon schedule with this approach.
- A third option would be to define an ExplorationPolicy type with a method action(::ExplorationPolicy, current_policy, env, obs, rng). Dispatching on the type of ExplorationPolicy and having users implement their own type seems more julian than passing a function. However, this action method is not super consistent with the rest of the POMDPs.jl interface, since it takes the current policy and the environment as input.
Any thoughts?
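For comparison, a rough sketch of how the function-based style of the second option is typically used; the body below is illustrative, not the actual DeepQLearning.jl code, and it assumes the environment exposes its action set via actions(env):

using POMDPs # assumed here for `action`; `actions(env)` is also an assumption
using Random # for AbstractRNG

# User-supplied exploration function matching f(policy, env, obs, global_step, rng).
function my_exploration(policy, env, obs, global_step, rng::AbstractRNG)
    eps = max(0.01, 1.0 - global_step / 10_000) # hand-rolled linear decay
    if rand(rng) < eps
        return rand(rng, actions(env)) # explore: assumes the env exposes its action set
    else
        return action(policy, obs) # exploit: on-policy action
    end
end

The inconvenience mentioned above is visible here: the decay state (the global_step bookkeeping and the decay constants) lives inside the user's function rather than in a reusable schedule object.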