JuliaPOMDP / POMDPs.jl

MDPs and POMDPs in Julia - An interface for defining, solving, and simulating fully and partially observable Markov decision processes on discrete and continuous spaces.
http://juliapomdp.github.io/POMDPs.jl/latest/

Add two functions to EpsGreedyPolicy to let it select actions by current state. #449

Closed NeroBlackstone closed 1 year ago

NeroBlackstone commented 1 year ago

This work is based on this discussion. I have finished the work; this gist is a runnable EpsGreedyPolicy demo, including a grid world MDP. When the state is (1,1) (the bottom left corner), EpsGreedyPolicy receives [RIGHT, UP] as the actions, so the agent will not hit the wall.

Add two functions to EpsGreedyPolicy to let it select actions by the current state.

## add actions argument
function MyEpsGreedyPolicy(problem, eps::Function, actions;
    rng::AbstractRNG=Random.GLOBAL_RNG)
    return MyEpsGreedyPolicy(eps, rng, actions)
end

function MyEpsGreedyPolicy(problem, eps::Real, actions;
    rng::AbstractRNG=Random.GLOBAL_RNG)
    return MyEpsGreedyPolicy(x -> eps, rng, actions)
end
##
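
For illustration, here is a hypothetical usage sketch (the validActions helper, the mdp variable, and the 0.1 epsilon are assumptions, not part of the gist): the solver constructs the policy with the actions that are valid in the current state.

## Hypothetical usage sketch; validActions, mdp, and the 0.1 epsilon are assumptions.
exploration_policy = MyEpsGreedyPolicy(mdp, 0.1, validActions(s))
a = action(exploration_policy, on_policy, k, s)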

If you think this is the right way, I want to open a pull request.

Thanks for your code review.

NeroBlackstone commented 1 year ago

Another solution, which I think is better, is to add a keyword argument to EpsGreedyPolicy's POMDPs.action:

function POMDPs.action(p::MyEpsGreedyPolicy, on_policy::Policy, k, s; getValidActions::Function = x -> x)
    va = getValidActions(s)
    if rand(p.rng) < p.eps(k)
        return rand(p.rng, va)
    else
        return action(on_policy, s)
    end
end

Then call it in the solver like this:

function validActions(s::State)::Vector{Action}
    ......
end

exploration_policy = MyEpsGreedyPolicy(mdp,0.9)
a = action(exploration_policy, on_policy, 0, s, getValidActions = validActions)

So you don't need to construct another EpsGreedyPolicy in the solver.

What do you think? Is this one better? I'm willing to contribute.

P.S.: action(on_policy, s) may still take actions not included in what the validActions function returns, but I think that has nothing to do with ExplorationPolicy; it must be handled in on_policy, so I did not change that code.

NeroBlackstone commented 1 year ago

I think both of these solutions are acceptable, and solver programming will have more flexibility. Maybe we need to add both of them.

zsunberg commented 1 year ago

Hi @NeroBlackstone Thanks for your contribution and sorry for the delay in responding!

(PO)MDPs themselves can have state-dependent action spaces, see https://juliapomdp.github.io/POMDPs.jl/stable/def_pomdp/#state-dep-action . So the information about which actions can be taken from which state should be stored within the problem.
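
For example, a minimal sketch of how a problem can declare state-dependent actions directly (the MyGridWorld type and the hits_wall helper are hypothetical, just for illustration):

# Hypothetical problem type; hits_wall is an assumed helper, not part of POMDPs.jl.
POMDPs.actions(m::MyGridWorld) = [:up, :down, :left, :right]                       # full action space
POMDPs.actions(m::MyGridWorld, s) = filter(a -> !hits_wall(m, s, a), actions(m))   # actions valid in state s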

So, I think it would be best to just do the following:

  1. Modify EpsGreedyPolicy so that it stores the problem in field m.
  2. Modify action so that it calls the actions function with the state argument.
    function POMDPs.action(p::MyEpsGreedyPolicy, on_policy::Policy, k, s)
        va = actions(p.m, s) # <----- This is the important part
        if rand(p.rng) < p.eps(k)
            return rand(p.rng, va)
        else
            return action(on_policy, s)
        end
    end

Does that seem like a good solution?

NeroBlackstone commented 1 year ago

Hi @zsunberg, thanks for your review, and sorry for my delayed response. I did not notice that POMDPs.jl already supports state-dependent actions; your solution is really great and inspiring.

If we add the field m, we may want to remove the field actions, since the actions can be obtained from the (PO)MDP.

What about this:

struct MyEpsGreedyPolicy{T<:Function,R<:AbstractRNG} <: ExplorationPolicy
    eps::T
    rng::R
    m::Union{MDP,POMDP}
end

function MyEpsGreedyPolicy(problem::Union{MDP,POMDP}, eps::Function;
    rng::AbstractRNG=Random.GLOBAL_RNG)
    return MyEpsGreedyPolicy(eps, rng, problem)
end
function MyEpsGreedyPolicy(problem::Union{MDP,POMDP}, eps::Real;
    rng::AbstractRNG=Random.GLOBAL_RNG)
    return MyEpsGreedyPolicy(x -> eps, rng, problem)
end

function POMDPs.action(p::MyEpsGreedyPolicy, on_policy::Policy, k, s)
    if rand(p.rng) < p.eps(k)
        return rand(p.rng, actions(p.m,s))
    else
        return action(on_policy, s)
    end
end
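
A hypothetical usage sketch (mdp, on_policy, and the State type stand in for the grid world demo; they are assumptions, not code from the PR):

# Hypothetical usage; mdp, on_policy, and State stand in for the grid world demo.
p = MyEpsGreedyPolicy(mdp, 0.1)
s = State(1, 1)                 # bottom left corner
a = action(p, on_policy, 1, s)  # when exploring, samples only from actions(mdp, s)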

I have updated the runnable grid world demo, and MyEpsGreedyPolicy works well; it does not break existing solvers at all.

I still think adding the whole environment model to EpsGreedyPolicy is a little "heavy", but I don't have a better idea.

Do you think it's an acceptable solution?

Thanks for your code review again :)

NeroBlackstone commented 1 year ago

Another related issue I want to raise here is about documentation improvement.

State- or belief-dependent action spaces

actions = function (s)
    if s == 1
        return [1,2,3]
    else
        return [2,3]
    end
end

I think this example would be better for that code block:

actions = function (s = nothing)
    if s == 1
        return [2,3]       # <--- return state-dependent actions here
    else
        return [1,2,3]     # <--- return the full action space here
    end
end

Adding = nothing because, with the original (PO)MDP definition, actions(mdp) throws an error; we can't get the full action space from that function.
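
A hedged illustration of the intended behavior, assuming the problem forwards actions(mdp) to the stored function with no arguments, as described above:

# Assumes mdp was defined with the actions function above.
actions(mdp)     # s defaults to nothing -> else branch -> full action space [1,2,3]
actions(mdp, 1)  # -> state-dependent actions [2,3]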

zsunberg commented 1 year ago

Great! The only change I would suggest is the following:

struct MyEpsGreedyPolicy{T<:Function,R<:AbstractRNG,M<:Union{MDP,POMDP}} <: ExplorationPolicy
    eps::T
    rng::R
    m::M
end

This will result in much faster code. (if you don't understand why, feel free to ask)
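
Roughly, a minimal illustration (not from the package) of why the type parameter matters: with an abstractly typed field the compiler cannot infer the concrete type of m, so code that uses it is dynamically dispatched, while a type parameter lets each instance carry a concrete, fully inferable type.

using POMDPs

# Abstractly typed field: the concrete type of m is unknown to the compiler,
# so every use of p.m (e.g. actions(p.m, s)) goes through dynamic dispatch.
struct AbstractlyTyped
    m::Union{MDP,POMDP}
end

# Parameterized field: each instance carries a concrete M, so the compiler can
# infer the type of p.m and specialize the code that uses it.
struct ConcretelyTyped{M<:Union{MDP,POMDP}}
    m::M
end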

I like your change to the documentation - can you put it in a separate PR?

NeroBlackstone commented 1 year ago

I have opened a PR for this code change.

> struct MyEpsGreedyPolicy{T<:Function,R<:AbstractRNG,M<:Union{MDP,POMDP}} <: ExplorationPolicy
>     eps::T
>     rng::R
>     m::M
> end
>
> This will result in much faster code. (if you don't understand why, feel free to ask)

You are so kind. Could you tell me why this change will make the code faster? I'm curious.

> I like your change to the documentation - can you put it in a separate PR?

Okay, I have changed the documentation in this PR.