sglyon opened 8 years ago
We should also have a reset!(::AbstractEnvironment) method.
Also we should discuss "mandatory" fields for an environment.
Do you guys know about https://github.com/JuliaPOMDP/POMDPs.jl ? POMDPs are a pretty related problem to RL. Could be some synergy there.
Thanks... I didn't know about that org. Certainly related... I'll have to review it.
After reviewing what's in JuliaPOMDP, I'd like to move forward with what we've already discussed, and possibly write some (conditionally included) code to link our abstractions to the JuliaPOMDP world. They have a large web of interdependencies, most of which are not registered packages, and their own package manager to handle it. I'm wary of depending on that, and I'd prefer to make our own path.
In addition, they have much more of a focus (seemingly) on finite, table-based, offline solvers, and I don't know if there's really that much overlap beyond some of the verbs and test environments.
tl;dr Someday we should revisit whether we can merge, but for now I think it's easier to stay at arm's length.
@spencerlyon2 I'm thinking about how/where we could convey that there are no more steps to take (i.e. the episode is done). I wonder if we can fit environments into the iteration pattern, or maybe an "episode" is an iterator/wrapper around an environment and policy:
immutable Episode{E,P}
    env::E
    policy::P
    total_reward::Float64
    niter::Int
end
# todo: start, next, done definitions
for (s,a,r) in Episode(env, policy)
    # do something, or not?
end
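As a rough sketch of how this episode-as-iterator idea could work (everything here is hypothetical: the Countdown environment, RandomPolicy, and all the method names are invented for illustration; the protocol shown is current Julia's iterate rather than the start/next/done of the time):

```julia
# Hypothetical sketch: Episode as an iterator over (s, a, r, s′) transitions.
# The verbs (state, reward, actions, step!, finished) follow the thread's
# proposals; every concrete type here is invented.

mutable struct Countdown         # toy environment: a counter that goes to 0
    s::Int
    r::Float64
end
state(env::Countdown)    = env.s
reward(env::Countdown)   = env.r
actions(env::Countdown)  = [-1]          # only one valid action: decrement
finished(env::Countdown) = env.s <= 0
function step!(env::Countdown, s, a)
    env.s = s + a
    env.r = 1.0                          # reward of 1 per step taken
    env.r, env.s                         # (r, s′)
end

struct RandomPolicy end                  # samples from the valid action set
action(::RandomPolicy, r, s, A) = rand(A)

mutable struct Episode{E,P}
    env::E
    policy::P
    total_reward::Float64
    niter::Int
end
Episode(env, policy) = Episode(env, policy, 0.0, 0)

function Base.iterate(ep::Episode, i = 1)
    finished(ep.env) && return nothing   # no more steps: episode is done
    s = state(ep.env)
    a = action(ep.policy, reward(ep.env), s, actions(ep.env))
    r, s′ = step!(ep.env, s, a)
    ep.total_reward += r
    ep.niter = i
    (s, a, r, s′), i + 1
end

ep = Episode(Countdown(3, 0.0), RandomPolicy())
for (s, a, r, s′) in ep
    # each iteration yields one transition; ep.total_reward accumulates
end
```

The nice property is that "the episode is done" falls out of the iteration protocol itself: returning nothing ends the for loop, so callers never poll a done flag.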
Also thinking we need the state as input to step, and that it should be mutating:
r, s′ = step!(env, s, a)
I'd like for a "sarsa" update to make sense:
# assume we have (s,a) for this iteration already (it's last iteration's a′)
r, s′ = step!(env, s, a)
done(env) && break # ??
a′ = action(policy, r, s′, actions(env, s′))
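Spelled out as a self-contained toy example (ChainEnv, the value table q, and the one-action policy are all invented for this sketch; only the verbs state, actions, step!, and done follow the snippet above):

```julia
# Hypothetical 5-state chain: move right each step, reward 1 on reaching
# state 5. Everything here is invented for illustration.
mutable struct ChainEnv
    s::Int                               # states 1..5; episode ends at 5
end
state(env::ChainEnv)      = env.s
done(env::ChainEnv)       = env.s == 5
actions(env::ChainEnv, s) = [1]          # single valid action: move right
function step!(env::ChainEnv, s, a)
    s′ = s + a
    env.s = s′
    r = s′ == 5 ? 1.0 : 0.0              # reward only at the goal
    r, s′
end

# One sarsa-style episode; q plays the role of q(s, a) with one action.
function run_episode!(env, q; α = 0.5, γ = 0.9)
    s = state(env)
    a = first(actions(env, s))
    while true
        r, s′ = step!(env, s, a)
        if done(env)
            q[s] += α * (r - q[s])       # terminal update: no bootstrapping
            break
        end
        a′ = first(actions(env, s′))     # a′ is chosen before the update
        q[s] += α * (r + γ * q[s′] - q[s])   # q[s′] stands in for q(s′, a′)
        s, a = s′, a′
    end
    q
end

run_episode!(ChainEnv(1), zeros(5))
```

After one episode only the state adjacent to the goal has been credited; further episodes would propagate the value backward along the chain.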
step(env, s, a)
sounds good to me.
What would mutate in a call to step!?
I just wanted to say that I am hugely excited about the prospect of a machine learning API with different backends, similar to the absolutely amazing Plots.jl. I am not good enough to contribute substantially to this except from a user's point of view, but hugs to you all for this :-)
Presumably the call to step! is expected to mutate the environment in some way. There may be some cases where that's not true, but it's probably the exception.
I just wanted to say that I am hugely excited about the prospect of a machine learning API with different backends, similar to the absolutely amazing Plots.jl.
Thanks... yeah I hope all works out well.
I am not good enough to contribute substantially to this except from a user's point of view
If you have time to contribute, we could use help with QA... either writing tests, or just using and reporting back on problems/suggestions!
I made many of the changes we discussed, and wrote out a bare minimum readme. Please review when you have time @spencerlyon2. No hurry.
FWIW I think the current API is very easy to understand, and I think it will be quite easy for users to use it for their own extensions. One thing I am not quite sure about is check_constraints at https://github.com/tbreloff/Reinforce.jl/blob/a9218995afb699846f48dcf06695d0156278f47e/src/episodes.jl#L31. Maybe I'm just not used to this, but what if the constraints are not satisfied? Shouldn't a new action be chosen? Or am I misunderstanding the function names here?
@pkofod I agree. Maybe this line could be a = check_constraints(...) instead?
Yes, something like that. Would check_constraints then potentially be recursive? I mean, a new a could violate the constraints as well, and then check_constraints would have to be called within check_constraints, or?
Edit: alternatively something like
function Base.next(ep::Episode, i)
    env = ep.env
    s = state(env)
    local a  # declared outside the loop so `a` is still in scope after it
    passed = false
    while !passed
        a = action(ep.policy, reward(env), s, actions(env))
        passed = check_constraints(env, s, a)
    end
    r, s′ = step!(env, s, a)
    ep.total_reward += r
    ep.niter = i + 1
    (s, a, r, s′), i + 1
end
I'd be ok throwing an error too. In theory the agent should be choosing something from the valid set of actions.
When I wrote that bit I was thinking that we should throw an error.
Really I want a way to communicate that the action space may be current-state dependent.
The agent is given a valid action set in the line before, so I think it's fine to throw an error.
Of course. My question is then, is it really necessary? Is it just "to be sure"? I mean, it will only error if the actions method for the subtype of AbstractPolicy is incorrectly implemented.
@tbreloff I'm probably missing something obvious here, but in that routine I don't see where we get an action set based on the current state.
What I have in mind is that each period s \in \mathcal{S} and the agent must pick a \in A(s) \subseteq \mathcal{A} (a la Barto & Sutton).
I think that a better way to do this would be to define actions(::Env, ::State) so that the action set passed to the actor is always valid given the current state. Then there would be no need for the check_constraints routine.
Thoughts?
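A tiny hypothetical example of what such state-dependent action sets could look like, where actions(env, s) does the constraining itself (the Edge type and its rules are invented for illustration):

```julia
# Hypothetical: positions 1..width on a ledge; the valid action set A(s)
# depends on the state, so actions(env, s) never offers an illegal move.
struct Edge
    width::Int
end
actions(env::Edge, s) = filter(a -> 1 <= s + a <= env.width, [-1, 1])

actions(Edge(3), 1)   # only +1 is valid at the left boundary
actions(Edge(3), 2)   # both -1 and +1 are valid in the interior
```

With this arrangement the policy can only ever be handed valid actions, and check_constraints reduces to a sanity assertion.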
Makes sense to have the state in actions so the environment can constrain the actions before one is picked.
In my mind that's already true because I assume the environment knows the state, so actions(env) can be based on the state if appropriate. But I agree it would be more general (and allow for dispatch) if the call was 'A = actions(env, state)' instead.
The check_constraints could just '@assert a in A'
Also, I can't remember if I switched it yet, but I added "math sets" to LearnBase and we should be returning those from 'actions'.
Ok, I didn't understand that. We should make a note somewhere so that this is very clear to users (and devs).
I can't recall, do we already have a way to describe the action space for the env (the \mathcal{A} I talked about earlier)?
Note that the action space is different from the set of appropriate actions, given the state. Sorry for the double message, I just didn't think I made that clear
I don't remember if we distinguish actions from action_space, but that certainly makes sense.
I'm going to make the following API changes:
actions(env) --> A
done(env) --> bool
# becomes:
actions(env, s) --> A
finished(env, s′) --> bool
and I'll change check_constraints into an assert.
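A runnable sketch of how the renamed verbs could fit together (ToyEnv, run!, and the trivial policy are all invented for this sketch; the @assert marks where check_constraints used to be):

```julia
# Hypothetical environment whose state-dependent action set becomes empty
# at the terminal state, so finished(env, s′) can key off it.
mutable struct ToyEnv
    s::Int
end
actions(env::ToyEnv, s) = s < 3 ? [1] : Int[]     # A(s), empty at the end
finished(env::ToyEnv, s′) = isempty(actions(env, s′))
function step!(env::ToyEnv, s, a)
    env.s = s + a
    1.0, env.s                                    # (r, s′): reward 1 per step
end

# Run one episode, asserting validity where check_constraints used to be.
function run!(env)
    s = env.s
    total = 0.0
    while true
        A = actions(env, s)
        a = first(A)              # stand-in for action(policy, r, s, A)
        @assert a in A            # replaces check_constraints(env, s, a)
        r, s′ = step!(env, s, a)
        total += r
        finished(env, s′) && break
        s = s′
    end
    total
end

run!(ToyEnv(1))
```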
ref: f21b5cf
I want to link the solvers to the work I'm doing in StochasticOptimization, so there are still a lot of changes to be made here, but hopefully environments can be developed without too much churn.
After a chat with @tbreloff, we roughly decided on the following core API methods:

actions(::AbstractEnvironment, s::State) -> A(s): returns the set A(s) that contains all valid actions from the state s.

step(::AbstractEnvironment, a::Action) -> s′, r: returns the next state s′ and the reward r associated with taking action a. Note that the current state s is a field of the AbstractEnvironment; this method is also responsible for setting that state field to s′.

action(::AbstractPolicy, s′::State, r::Reward, A(s′)::Actions) -> a′: should be implemented by each subtype of AbstractPolicy; it observes a state transition (the result of calling the step method with the previous action a) and outputs the next action a′.

@tbreloff also mentioned an episode(::AbstractEnvironment) method, but I wasn't clear on its purpose, so I'll let him fill in the blanks there.
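A hypothetical toy illustration of these three methods (Walk1D and GreedyPolicy are invented; note that step mutates the environment's state field, as described):

```julia
# Toy environment: a position on a line with the goal at 0.
# The current state lives in the env, as the API above specifies.
mutable struct Walk1D
    s::Int
end
actions(env::Walk1D, s) = s > 0 ? [-1] : [1]   # A(s): always move toward 0
function step(env::Walk1D, a)
    s′ = env.s + a
    r = -1.0                  # cost of 1 per step
    env.s = s′                # step is responsible for updating the state
    s′, r
end

struct GreedyPolicy end       # observes (s′, r, A(s′)) and picks the next a′
action(::GreedyPolicy, s′, r, A) = first(A)

env = Walk1D(2)
s′, r = step(env, first(actions(env, env.s)))
a′ = action(GreedyPolicy(), s′, r, actions(env, s′))
```

The policy never sees the environment directly; it only sees the transition and the valid action set for the new state, which is the separation of concerns the thread converged on.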