MaximeBouton opened 4 years ago
Yeah. We also need an interface for custom decay schedules for eps.
Hmm... yes this is a good question.
I think the third option is reasonable. We might consider calling it ExplorationStrategy instead (an ExplorationStrategy might contain a Policy that it uses for exploration).
I think action is an acceptable name for the function. The meaning of action is just "select an action based on the inputs", so I don't think it clashes too badly with action(::Policy, s). However, it seems like the exploration strategy would often contain some state, like the position in the epsilon decay schedule, so maybe the name should have a ! in it.
We should also think about exactly what the arguments should be. Is env needed? What if we pass in the on-policy action instead of current_policy, i.e.
action(::ExplorationStrategy, on_policy_action, obs, rng)
We could also consider leaving out the on-policy action from the call altogether and say: if action returns nothing, use the on-policy action.
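For concreteness, a minimal sketch of what that variant could look like (EpsGreedy and its fields are hypothetical names for illustration, not an existing type):

using Random # for AbstractRNG

abstract type ExplorationStrategy end

# Hypothetical epsilon-greedy strategy under the proposed signature.
struct EpsGreedy{A} <: ExplorationStrategy
    eps::Float64
    actions::Vector{A}
end

function action(strat::EpsGreedy, on_policy_action, obs, rng::AbstractRNG)
    if rand(rng) < strat.eps
        return rand(rng, strat.actions) # explore: random action
    else
        return on_policy_action # exploit (or `return nothing` under the nothing-convention)
    end
end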
(Note: I might be using some of the wrong words because I have less experience with RL.)
We can't pass in just the actions because for certain exploration strategies like Softmax one needs the q values for all actions. Otherwise the idea sounds reasonable. Should that be another package then?
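As a rough sketch of why: a softmax strategy samples from a distribution over all actions, so it needs every Q value rather than a single on-policy action (SoftmaxStrategy below is illustrative only, not an existing type):

using Random # for AbstractRNG

# Illustrative softmax strategy: samples actions with probability
# proportional to exp(Q / temperature), so it needs all the Q values.
struct SoftmaxStrategy{A}
    temperature::Float64
    actions::Vector{A}
end

function action(strat::SoftmaxStrategy, q_values::AbstractVector, rng::AbstractRNG)
    weights = exp.(q_values ./ strat.temperature)
    probs = weights ./ sum(weights)
    r, c = rand(rng), 0.0
    for (i, p) in enumerate(probs) # inverse-CDF sampling over the actions
        c += p
        r <= c && return strat.actions[i]
    end
    return strat.actions[end] # guard against floating-point round-off
end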
We can't pass in just the actions because for certain exploration strategies like Softmax one needs the q values for all actions.
Ah, I see - that makes sense.
Should that be another package then?
Did you mean "Should that be in another package then?" or "Should we create another package?"
I think yes, it should be somewhere besides one of the learning packages, but I would hope to not create a new one. My philosophy on packages has changed a lot since we broke up POMDPToolbox. Now I think it would have been much better to create better documentation and perhaps use submodules than to have a bunch of small packages!
For now I think this could live in POMDPPolicies. I might take a shot at it next week.
Do we really want to have action and action!? I think it might be confusing, and I am not sure we are really respecting that convention now: if you have an MCTS policy, we are modifying internal fields of the policy object and we still call action.
Suggestion:
using POMDPs # for Policy and action
using Random # for AbstractRNG

abstract type AbstractSchedule end # define linear decay, exponential decay, and more

# update_value!(::AbstractSchedule, ::Real) returns the new value of epsilon;
# the schedule itself may be mutated (e.g. to advance a step counter)
function update_value! end

mutable struct EpsGreedyPolicy{A} <: Policy
    eps::Real
    schedule::AbstractSchedule
    policy::Policy
    rng::AbstractRNG
    actions::Vector{A}
end

function POMDPs.action(p::EpsGreedyPolicy, s)
    p.eps = update_value!(p.schedule, p.eps) # update the value of epsilon according to the schedule
    if rand(p.rng) < p.eps
        return rand(p.rng, p.actions) # explore: random action
    else
        return action(p.policy, s)   # exploit: action of the wrapped policy
    end
end
Should update_value! have any restriction on the second argument?
No! I will submit a proper PR next week; this is just to give you an overview of the idea.
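To make the idea concrete, a hedged sketch of one possible schedule under this interface (LinearDecaySchedule is a hypothetical name; only AbstractSchedule and update_value! come from the suggestion above):

# Hypothetical linear decay from `start` to `stop` over `steps` calls.
mutable struct LinearDecaySchedule <: AbstractSchedule
    start::Float64
    stop::Float64
    steps::Int
    t::Int # number of calls so far; mutated on each call, hence the !
end

function update_value!(s::LinearDecaySchedule, eps::Real)
    s.t += 1
    frac = min(s.t / s.steps, 1.0)
    return s.start + frac * (s.stop - s.start)
end

Note that this particular schedule ignores the old value of eps entirely, which is one reason it is unclear what restriction the second argument should have.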
What would be a good interface for specifying the exploration policy? It is implemented differently here and in DeepQLearning.jl.
- Here, the solver uses EpsGreedyPolicy and relies on the internals of that policy to access the Q values. I think it is pretty bad: EpsGreedyPolicy should be agnostic to the type of policy used for the greedy part (right now it assumes a tabular policy, I think), and if we improve EpsGreedyPolicy then the code here will break.
- In DeepQLearning.jl, the user must pass in a function f, and f(policy, env, obs, global_step, rng) will be called to return the action. I took inspiration from MCTS.jl for this. However, it is not super convenient to define a decaying epsilon schedule with this approach.
- A third option would be to define an ExplorationPolicy type with a method action(::ExplorationPolicy, current_policy, env, obs, rng). Dispatching on the type of ExplorationPolicy and having users implement their own type seems more julian than passing a function. However, this action method is not super consistent with the rest of the POMDPs.jl interface, since it takes the current policy and the environment as input.
Any thoughts?
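For comparison, a rough sketch of how the function-based style of the second option is typically used; the body below is illustrative, not the actual DeepQLearning.jl code, and it assumes the environment exposes its action set via actions(env):

using POMDPs # assumed here for `action`; `actions(env)` is also an assumption
using Random # for AbstractRNG

# User-supplied exploration function matching f(policy, env, obs, global_step, rng).
function my_exploration(policy, env, obs, global_step, rng::AbstractRNG)
    eps = max(0.01, 1.0 - global_step / 10_000) # hand-rolled linear decay
    if rand(rng) < eps
        return rand(rng, actions(env)) # explore: assumes the env exposes its action set
    else
        return action(policy, obs) # exploit: on-policy action
    end
end

The inconvenience mentioned above is visible here: the decay state (the global_step bookkeeping and the decay constants) lives inside the user's function rather than in a reusable schedule object.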