I was asked this question by at least three people, so I'd better write down my thoughts here (and in the upcoming new documentation).
Originally, we had a similar API named `interact!`, written by @jbrea. (Though I'm not sure of the reason for that name; it's essentially the same as `step!` in OpenAI Gym if I understand it correctly.)

`step!`, together with some other functions in OpenAI Gym, is undoubtedly becoming the de facto standard API in the RL world. Even in Julia, Gym.jl, Reinforce.jl, RLInterface.jl and MinimalRLCore.jl adopt `step!`. However, that doesn't mean it is the right way to do things.

In my opinion, one critical issue with `step!` is that it conflates two different operations into one: 1. feed an action to the environment; 2. get an observation from the environment. (There's a similar problem with `pop!` in data structures, but that's another topic.)
In single-agent sequential environments, `step!` works well with the following process:

```julia
# Given policy and env
observation = reset!(env)
while true
    action = policy(observation)
    observation, reward, done, info = step!(env, action)
    done && break
end
```
But when it comes to simultaneous environments, we have to change the return type of the `step!` function slightly:

```julia
# Given policy and env
observation = reset!(env)
while true
    action = policy(observation)
    task = step!(env, action)
    observation, reward, done, info = fetch(task)
    done && break
end
```
Until now, it's not a big change. After all, a function's signature in Julia doesn't include the return type, so we can return a `Future` instead. (But do note that this already breaks the definition in OpenAI Gym.)
Now consider multi-agent environments; things become much more complicated. See more discussions at openai/gym#934:

- `reset!` should be triggered by a (global) policy
- `reset!` and `step!` must return observations and other info of all the players
- players that are not acting at the current step need a no-op action

The root reason for those complexities is that `step!` is synchronous by design. I'm not the first person to realize this problem. In Ray/RLlib, environments are treated as async by design; see more details here:
> In many situations, it does not make sense for an environment to be “stepped” by RLlib. For example, if a policy is to be used in a web serving system, then it is more natural for an agent to query a service that serves policy decisions, and for that service to learn from experience over time. This case also naturally arises with external simulators (e.g. Unity3D, other game engines, or the Gazebo robotics simulator) that run independently outside the control of RLlib, but may still want to leverage RLlib for training.
So we modified the APIs in OpenAI Gym a little:

- `reset!` is kept, but it should only return `nothing` (or at least package users shouldn't rely on the result of `reset!`).
- Instead of `step!`, the environment itself is callable and `env(action)` also returns `nothing`. The only reason to use a functional struct here is to reduce the burden of remembering an extra API name ;)
- Agents can `observe` the environment independently at any time.
- No assumption is made about the return type of `observe`. Generally `get_state(obs)`, `get_reward(obs)`, and `get_terminal(obs)` are required. The fact that the result of `observe` can be of any type is useful for some search-based algorithms like MCTS.

I must admit that treating all environments as async does bring in some inconveniences. For environments which are essentially sync, we have to store the state, reward, and info in the environment after applying an action; see ReinforcementLearningEnvironments.jl#mountain_car for an example. But I think it's worth doing so.
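To make the convention concrete, here is a minimal, hedged sketch of what an interaction loop looks like under it (just an illustration of the functions described above, not code from any package):

```julia
# Minimal sketch of the async-friendly convention described above.
# `policy`, `env`, `reset!`, `observe`, `get_state`, `get_reward`, and
# `get_terminal` are the names discussed in this comment; `policy` and `env`
# are assumed to be given.
reset!(env)                      # returns nothing
while true
    obs = observe(env)           # the observation can be of any type
    get_terminal(obs) && break
    action = policy(get_state(obs))
    env(action)                  # the environment is callable; also returns nothing
end
```

Compared with `step!`, feeding the action and reading the observation are now two separate calls, so a blocking implementation can live in whichever of the two is more natural for the environment.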
(Also cc @sebastian-engel, @zsunberg, @mkschleg, @MaximeBouton, @jbrea in case they are interested in discussions here.)
> Though I'm not sure the reason to use this name

This dates back to a time when I hadn't yet looked at how other RL packages work; I thought of the agent as basically interacting with the environment, since it sends an action and receives an observation, hence `interact!`. I really like the new convention that `step!` just steps the environment forward and `observe` is used to observe the current state of the environment.
Thanks for your thorough explanation!

Am I correct that with your API, calls to `observe`, `get_terminal` and `get_reward` are blocking until all agents have submitted an action? More generally, do you have a canonical example of a simultaneous environment?
> Am I correct that with your API, calls to observe, get_terminal and get_reward are blocking until all agents have submitted an action?

It depends. It can be blocking either when calling `env(action)`, `observe(env)`, or at the first call of `get_*(obs)`. For the first case, you can refer to ReinforcementLearningBase.jl#MultiThreadEnv. For the latter two, I don't have an example yet.
> More generally, do you have a canonical example of a simultaneous environment?

Unfortunately, no😳.
Thanks! Also, I was wondering: is there currently a function to reset an environment to a previous state (or equivalently create a new environment with a custom initial state)?
I am asking because although Gym environments typically do not offer this functionality, it is essential for tree-based planning algorithms such as AlphaZero.
> is there currently a function to reset an environment to a previous state (or equivalently create a new environment with a custom initial state)?

No. To support this operation we would need to separate the environment into two parts: 1) the description part, like `action_space`, `num_of_player`, etc., and 2) the internal-state-related part. And I'm not sure how to handle that gracefully yet.

Based on my limited experience with MCTS, I found that implementing a `Base.clone(env::AbstractEnv)` would be enough to mimic the behavior you described above. I can add an example later. (You may watch https://github.com/JuliaReinforcementLearning/ReinforcementLearningZoo.jl/issues/32)
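As a rough, hedged sketch (not code from any package) of how a copy can stand in for "reset to a previous state", assuming the proposed `clone` plus the `env(action)`/`observe`/`get_state` convention from earlier in this thread, and hypothetical actions `a1`, `a2`:

```julia
# Given env, and actions a1, a2 to explore.
saved = clone(env)            # snapshot the current state by copying the whole env

env(a1)                       # explore a branch on the original environment
env(a2)

env = saved                   # "undo": drop the explored branch and go back
get_state(observe(env))       # same state as before a1 and a2 were applied
```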
In POMDPs.jl, we made a different decision where the separation of the state and model is central. Instead of using the `step!` paradigm, we have a generative model, so, for instance, if `m` is an `MDP` object and `s` is any state (that you have stored in your tree, for instance), you can call

```julia
sp, r = @gen(:sp, :r)(m, s, a, rng)
```

to generate a new state and reward (in POMDPs 0.8.4+ - we are still finalizing some issues and updating documentation to move to POMDPs 1.0). We've also tried to make it really easy to define simple models in a few lines with QuickPOMDPs.
That being said, it is true that using a `step!` or RLBase-style interface will make it easier to wrap environments that others have written (though it could be done with POMDPs.jl), and the only thing you need to add to a `step!`-style simulator for it to work with MCTS is the ability to copy the environment, not to initialize it at an arbitrary state.
In any case, I don't think it will be too hard to adjust to different interfaces in the future. Probably best to just get it to work with one MDP, and then think hard about the interface in the second round. As mentioned in the RLZoo README, Make it work, make it right, make it fast is a good mantra.
I made the choice to go w/ the API in MinimalRLCore for a few reasons. The biggest is just where I'm studying and who I learned RL from initially (i.e. Adam White at UofA/IU). In our course we heavily use the RLGlue interface, which Adam made during his graduate degree w/ Brian Tanner. The API is very much inspired by this and modernized to remove some of the cruft of the original (they had constraints that I don't have to deal w/ in Julia). The focus of MinimalRLCore was also to create an API which lets people do what they need to for research, even if I didn't imagine it initially. I find that I run into walls a lot when adopting an RL API, although Julia helps a lot here w/ multiple dispatch. One example is dealing w/ a non-global RNG which is shared between the agent and environment, or defining a reset which sets the state to a provided value (very necessary for Monte Carlo rollouts when working on prediction).
While it is true that the API I provided isn't really designed w/ async in mind, this was partially on purpose and partially due to how I'm actually using it in my research. Users can overload `step!` for any of their envs that may be async, so I don't really see it as an issue that needs to be addressed. If this were to be supported later, I would probably have a separate abstract type. I don't feel like the assumption should be that all envs are async, or that you have multiple agents running around in an env instance (like A3C, for example). This usually adds complexity that I don't really want to deal w/ as a researcher.
Oh man, this is great to get us all in the same room talking :) (@maximebouton @rejuvyesh, @lassepe you might be interested in this). I think we should make an actually really minimal interface that can be used for MCTS and RL and put it in a package (after a quick look, MinimalRLCore and RLInterface are almost there, but not quite). Should we move the discussion to discourse?
I would submit that the minimal interface for MCTS would have:

```julia
step!(env, a)  # returns an observation, reward, and done
actions(env)   # returns only the valid actions at the current state of the environment
reset!(env)
clone(env)     # creates a complete copy at the current state - it is assumed that
               # the two environments are now completely independent of each other
```
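For what it's worth, here is a quick, hedged sketch of how a tree-search rollout could use just these functions (illustrative only, not from an existing package; it assumes `actions(env)` returns a collection you can `rand` from):

```julia
# Why `clone` is enough for tree search with this minimal interface:
# simulate on a copy, leaving the original env (the search node) untouched.
function rollout(env)
    sim = clone(env)              # independent copy at the current state
    total = 0.0
    done = false
    while !done
        a = rand(actions(sim))    # pick any valid action at random
        _, r, done = step!(sim, a)
        total += r
    end
    return total
end
```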
The other option is to explicitly separate the state from the environment.
A popular example of an interface with this explicit observation concept, as @findmyway mentioned, is dm-lab, I believe.
Yeah, I must say that the explicit observation interface in RLBase is a very nice feature for some of the more complicated use cases.
This afternoon, I was thinking about a way to have a common core of basic and some optional functionality that we can all link into. My idea is a CommonRL package that all of our packages (which are optimized for different use cases) depend on, allowing for interoperability at least at the `reset!` and `step!` level. Here is my sketch: https://gist.github.com/zsunberg/a6cae2f92b5f8fae8f624dc173bc5c6b .
I would 100% be up to helping with this.
One thing I still have an issue w/ in Julia is the implicit enforcement of interfaces (which is why MinimalRLCore separates what is called from what is implemented by users). But I think if we were to have a common package w/ good docs, this shouldn't be an issue (and I guess I should be more trusting of users :P).
I think having some way of expressing what observation types are being returned would be useful, but I've never landed on a design I like. The dict of types is reasonable, but feels really pythony. I was also playing around w/ the idea of dispatching on value types with symbols, but that was a bit onerous. Maybe we should use traits here.
@zsunberg's sketch is a really nice starting point. I'd also be glad to support such a common core package.
> The dict of types is reasonable, but feels really pythony. I was also playing around w/ the idea of dispatching on value types with symbols, this was a bit onerous though. Maybe we should use traits here.

@mkschleg I'm feeling the same 😄.
@zsunberg I also like the idea of a common core package!
@mkschleg:

> I think having some way of expressing what observation types are being returned would be useful, but I never have landed on a design I like. The dict of types is reasonable, but feels really pythony. I was also playing around w/ the idea of dispatching on value types with symbols, this was a bit onerous though. Maybe we should use traits here.

What I like best so far is to have `observe` return a named tuple or a custom structure. With these we could use traits like `has_state(observation::NamedTuple{N,T}) where {N, T} = :state in N` with fallback `has_state(o) = hasproperty(o, :state)`. To help the developer of a new environment capture API expectations early on, I like the idea of having test functions, e.g. `basic_env_test`, that could also throw warnings like `has_state(observation) || @warn "Many algorithms expect observations to return a field named :state."`.
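A small self-contained sketch of this trait idea (the names `has_state`, `MyObs`, and `basic_obs_check` are the ones proposed here or made up for illustration, not part of any existing package):

```julia
# Trait: does the observation carry a :state field?
has_state(observation::NamedTuple{N,T}) where {N,T} = :state in N
has_state(o) = hasproperty(o, :state)      # fallback for custom structs

struct MyObs                               # hypothetical custom observation type
    state::Vector{Float64}
    reward::Float64
end

obs_nt = (state = [0.0, 1.0], reward = 1.0, terminal = false)
has_state(obs_nt)              # true, via the NamedTuple method
has_state(MyObs([0.0], 0.0))   # true, via the hasproperty fallback

# Test-function idea: warn environment developers early.
function basic_obs_check(observation)
    has_state(observation) ||
        @warn "Many algorithms expect observations to return a field named :state."
    return nothing
end
```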
Ok, great, I think the common core package should live in the JuliaReinforcementLearning org. Can you invite me to the org, @findmyway ? Thanks.
> I think having some way of expressing what observation types are being returned would be useful

@mkschleg Do you mean the caller of `step!` chooses the type to be returned, or the environment communicates to the caller which type it will return?

> I was also playing around w/ the idea of dispatching on value types with symbols, this was a bit onerous though. Maybe we should use traits here.

If I understand what you're saying correctly, we do this in POMDPs.jl, haha. For example, you can use `sp, o = @gen(:sp, :o)(m, s, a, rng)`, where `m` is a POMDP, to get the next state and observation, or `o, r = @gen(:o, :r)(m, s, a, rng)` to get the next observation and reward. The macro expands to a call that dispatches on a value type with symbols. It works pretty well, but is a bit esoteric - you have to know what the symbols mean.
@jbrea Could you help to send the invitation?
@jbrea , @findmyway , it looks like you invited me to collaborate on the ReinforcementLearning.jl package - I was hoping to join the JuliaReinforcementLearning org so that I can create a new package owned by the org.
@zsunberg, sure; sorry, github has too many buttons :stuck_out_tongue_winking_eye:
@jbrea The named tuple is reasonable. I've had this as an option for agents as well, to make evaluation a bit easier for some of the wrapping functionality (like running episodes).

@zsunberg The way you do it in POMDPs.jl is interesting! I hadn't quite dug into it as much yet, but I should prioritize that.

What I have been doing is something like

```julia
struct Env{V<:Val}
    dispatch_on::V
end
```

and dispatching on specific value types. Definitely not the best way to do it, but it has been useful when there are several observation types for an environment (like Atari w/ color and BW frames).
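For concreteness, a hedged sketch of this kind of `Val`-based dispatch; `AtariLikeEnv`, `observe`, and the symbols `:color` / `:bw` are hypothetical names used only for illustration:

```julia
using Statistics: mean

struct AtariLikeEnv{V<:Val}
    dispatch_on::V
    frame::Array{UInt8,3}   # height × width × RGB channels
end

# Color observation: return the raw RGB frame.
observe(env::AtariLikeEnv{Val{:color}}) = env.frame

# Black-and-white observation: average over the channel dimension.
observe(env::AtariLikeEnv{Val{:bw}}) = dropdims(mean(env.frame; dims=3); dims=3)

env_color = AtariLikeEnv(Val(:color), rand(UInt8, 84, 84, 3))
env_bw    = AtariLikeEnv(Val(:bw), rand(UInt8, 84, 84, 3))
size(observe(env_color))  # (84, 84, 3)
size(observe(env_bw))     # (84, 84)
```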
I'd be happy to help out with this and help refine the interface. I'd love to have a common core that I can just pull from rather than have to maintain my own. So if you are looking for collaborators on the repo let me know.
Alright - start filing issues!! https://github.com/JuliaReinforcementLearning/CommonRLInterface.jl
@mkschleg hmm, yeah that seems like a reasonable way to do it. Although, options like color vs black and white frames are very domain-specific, so I'm not sure they belong in this interface. It would make sense to have a general way to deal with data type expectations (e.g. `AbstractArray{Float32}` vs `AbstractArray{Float64}`). Feel free to file an issue on that repo to discuss further.
Thinking about the traits thing a bit more, I'm not sure it belongs in the base interface. The designer of the environment will be able to manage this through traits/dispatch; the interface doesn't have to plan for it (Yay Julia!).
Thanks for all the discussions here.
I removed the observation layer in the latest version, making the environment more transparent to agents/policies.
Support for CommonRLInterface.jl is also included in https://github.com/JuliaReinforcementLearning/ReinforcementLearningBase.jl/pull/58
In the next minor release, I hope ReinforcementLearningBase.jl and CommonRLInterface.jl can converge to something stable after experimenting with more algorithms.
In the documentation for `AbstractEnv`, you write the following remark:

Would you care to elaborate on what you mean here?