jbrea closed this issue 4 years ago.
I think it would be reasonable to add `observe`. We may also want to add `interact!(env, a)`, or maybe just `act!(env, a)`, that does not return anything. However, I think we should keep `step!` returning the standard four things that people usually expect.
I didn't fully understand why `interact!`/`observe` is better for the asynchronous case. Can't you just do the following?

```julia
result = @async step!(env, a)
o, r, done, info = fetch(result)
```
The problem is in multi-agent environments. Let's take the tic-tac-toe environment for example:

```julia
# given policy_x, policy_o, env
obs_0, reward_0, done_0, info_0 = reset!(env)
action_x = policy_x(obs_0)
obs_1, reward_1, done_1, info_1 = step!(env, action_x)
```
OK, now the question is, what's `obs_1`? Is it the observation from player X's perspective? Then how does player O get its observation (obviously it's not `obs_0`)? So one solution I've seen is that `step!(env, action)` returns the info of all players, and in each ply every player extracts the parts it needs. Like I said in the original answer, the problem with `step!` is that it combines two steps in one. And I personally think it's more intuitive to do `env(action)` and then `observe(env)`. And we really don't need to care whether the `env` is sync or async.
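To make the decoupled style concrete, here is a minimal sketch (the types and names are hypothetical, not from any existing package): acting and observing are separate calls, so each player can fetch its own view after all plies of a round.

```julia
# Hypothetical sketch of the act-then-observe style for tic-tac-toe.
mutable struct TicTacToeEnv
    board::Vector{Int}   # 0 = empty, 1 = X, 2 = O
end

# acting mutates the environment and returns nothing
act!(env::TicTacToeEnv, pos::Int, player::Int) = (env.board[pos] = player; nothing)

# observing is a pure read; here every player sees the same board
observe(env::TicTacToeEnv) = copy(env.board)

env = TicTacToeEnv(zeros(Int, 9))
act!(env, 5, 1)      # X plays the center
act!(env, 1, 2)      # O plays a corner
obs = observe(env)   # both players can now read the post-round board
```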
Anyway, my point is, we don't have any technical debt now and it's a good chance to review some existing conventions. 😃
In the tic-tac-toe example, the observation of the board at the end of each play is independent of player perspective. Every agent sees the observation in the same way, but the issue is that an observation needs to be the board after all agents have interacted with the environment. That's a blocking operation, and it doesn't matter if `step!` blocks or if `interact!` doesn't block but `observe` does.

But all those details are beyond the scope of an interface. I think the right way to handle it is with an environment that handles blocking etc. correctly for multi-agent use, but those are implementation details.
That being said, I am in favor of `interact!` + `observe`, because

1. `step!` is easily defined by calling those two in sequence
2. `step!` forces a restriction on environment writers, whereas decoupling the two parts of `step!` does not

Ultimately, (2) is more flexible, and it is better for the interface to be flexible.

Okay, but another example is a 3D environment with multiple agents. I don't think `step!` vs `interact!`/`observe` is the key issue in this case. Here, each agent does see a different perspective (e.g. a visual stimulus). Is `observe(env)` a sufficient interface? The crux is what the returned observation means. Intuitively, it should be the perspective of the agent that called `observe`, but how does the environment know that?
One solution is for the environment to be "distributed": a single 3D environment chopped up into subperspectives. The `env` that each agent is using in code is a subperspective. But again, these are implementation details. I don't feel that the interface can strongly influence the correct asynchronous design.
> The env that each agent is using in code is a subperspective.

I agree. I think the best way to create a multiplayer environment is to create multiple connected environments: each one looks like a single-player environment to that agent.
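The "multiple connected environments" idea can be sketched as one shared world wrapped so that each agent's handle looks like a single-agent environment. Everything below is hypothetical and only illustrative:

```julia
# Hypothetical sketch: a shared world plus per-agent wrapper environments.
mutable struct SharedWorld
    state::Vector{Float64}
end

struct AgentView
    world::SharedWorld
    id::Int
end

# acting goes through the shared world ...
act!(v::AgentView, a::Float64) = (v.world.state[v.id] += a; nothing)

# ... but observing returns only this agent's perspective
observe(v::AgentView) = v.world.state[v.id]

world = SharedWorld(zeros(2))
agent1 = AgentView(world, 1)
agent2 = AgentView(world, 2)
act!(agent1, 0.5)
observe(agent1)   # this agent sees its own updated slice
observe(agent2)   # the other agent's view is unaffected
```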
@findmyway are you advocating replacing `step!` with `interact!`/`observe`? Or just adding them as optional interface functions?

What is the behavior of `observe` if it is called multiple times between calls to `interact!`?
> What is the behavior of `observe` if it is called multiple times between calls to `interact!`?

It depends. In most common cases, the results are exactly the same. But in client-server style environments, the observations might differ as time goes by. In multi-agent environments, it depends on whether other agents have already called `interact!` on the environment.
> @findmyway are you advocating replacing `step!` with `interact!`/`observe`? Or just adding them as optional interface functions?

I'd prefer to replace `step!` with `observe`.
> Is observe(env) a sufficient interface? The crux is what does the returned observation mean? Intuitively, it should be the perspective of the agent that called observe, but how does the environment know that?

For multi-agent environments, we need to expand `observe(env)` a little to allow `observe(env, player)`. Then `observe(env)` in a multi-agent environment means the observation from a bird's-eye view (which may be useful in imperfect-information environments?).
If `observe` gets called twice between calls to `interact!` and both times the reward returned by the `observe` call is 1.0, has the agent accrued 1.0 reward or 2.0?
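For concreteness, one convention that avoids the double-counting ambiguity (a hypothetical sketch, not anything decided in this thread) is to accrue reward inside `interact!` and keep `observe` a pure read:

```julia
# Hypothetical sketch: reward accrues once per interaction, observing is free.
mutable struct BanditEnv
    cumulative_reward::Float64
    last_obs::Int
end

function interact!(env::BanditEnv, a::Int)
    env.cumulative_reward += 1.0   # reward accrues exactly once per interaction
    env.last_obs = a
    nothing
end

# a pure read: calling it twice never double counts the reward
observe(env::BanditEnv) = (obs = env.last_obs, reward = env.cumulative_reward)

env = BanditEnv(0.0, 0)
interact!(env, 3)
r1 = observe(env).reward
r2 = observe(env).reward   # same value: the agent accrued 1.0, not 2.0
```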
I would prefer to stick with `step!` for the required interface since it is the de facto standard, and I think it is best to have zero barriers to immediately understanding the package. In the basic RL case, where the world is abstracted as a (PO)MDP, I don't think `step!` is really conflating two things. For every step, you take an action and get an observation and reward; you cannot choose whether or when to observe.
Side note: regarding "it's a good chance to review some existing conventions". Yeah, I think there is a lot of room for improvement if we don't stick to conventions. (E.g. personally, I think a really Julian thing would be to not require the environment to be mutable, so you would use `step` instead of `step!`; this might help with things like differentiability, putting it on specialized hardware, etc.) However, I think the required interface of this package is not the best place for that. The required interface should be the conservative base that we all build on, since we don't even have that yet :smile:
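The non-mutating idea in the side note might look like this (a purely illustrative sketch; none of these names exist anywhere): `step` returns a fresh environment instead of mutating one.

```julia
# Hypothetical sketch of a non-mutating step on an immutable environment.
struct CounterEnv
    count::Int
end

# returns (new_env, observation, reward, done) without touching the old env
function step(env::CounterEnv, a::Int)
    newenv = CounterEnv(env.count + a)
    return newenv, newenv.count, Float64(a), newenv.count >= 10
end

env = CounterEnv(0)
env, obs, reward, done = step(env, 3)   # the binding is rebound, nothing mutates
```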
If we want to consider replacing `step!` in the required interface, I suggest we label this as a decision thread. I think discussing whether and how we want to explicitly support multi-agent environments should also be its own thread.
@rejuvyesh, what are your thoughts? You have up/down-voted a few comments, but it's hard to tell which parts of the comments you are reacting to :smile:

Also @maximebouton, @mkschleg, if you have any comments on `step!` vs `interact!`/`observe`, they would be much appreciated.
An advantage of `observe`, whether it is optional or required, is that it might allow for observation configuration in the future.
If `step!` is just `interact!` with returns, i.e. `step!(env, a) = begin interact!(env, a); observe(env) end`, I would go with `step!` mandatory and `observe` optional. This way we get the de facto standard for many cases (`step!`) and the flexibility for multi-agent settings (`step!` without using the returns, and `observe` whenever needed). Wouldn't this work, @findmyway?
Yes, I think it will work.
@zsunberg I was disagreeing with this:

> I think the best way to create a multiplayer environment is to create multiple connected environments - each one looks like a single-player environment to that agent.

There are many factors in a multi-agent environment, but it's not as simple as connecting multiple single-agent environments. This idea of having a separate `observe` is also pretty useful if we are interested in modeling more real-time systems, where the environment has its own clock and the agents can observe and interact however they see fit.
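The real-time case can be sketched as an environment that keeps evolving on its own clock, so what `observe` returns depends on when you call it (everything here is hypothetical and only illustrative):

```julia
# Hypothetical sketch: an environment with its own wall clock.
mutable struct ClockedEnv
    start::Float64
    setpoint::Float64
end
ClockedEnv() = ClockedEnv(time(), 0.0)

# interacting sets a control input; the world keeps evolving regardless
interact!(env::ClockedEnv, a::Float64) = (env.setpoint = a; nothing)

# the observation includes elapsed wall-clock time, so two observes
# between interacts can legitimately differ
observe(env::ClockedEnv) = (t = time() - env.start, setpoint = env.setpoint)

env = ClockedEnv()
o1 = observe(env)
sleep(0.01)
o2 = observe(env)   # o2.t > o1.t even though no interact! happened
```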
I think `step!` should be the "standard" way of implementing an environment, and it should be part of the required interface. It seems like the `observe`/`interact!` workflow serves more specific use cases, and most tinkerers will not use it. However, I think `observe` is an important concept and allows the following things:

- observing without calling `step!` (might be cheaper)
- `observe(env, player)`; as mentioned before, I think we need the two methods (together with `observe(env)`)

The difference between `observe` and `interact!` is a bit unclear to me, though; `observe` seems like a specific case of interaction with the environment, and I am not sure why we need both.
> (step! without using the returns and observe whenever needed).

@jbrea, @findmyway I'm not sure this will work that well. `step!` will block until it returns. I think we would want `interact!` (possibly named `act!`) as well.
> The difference between observe and interact! is a bit unclear to me though.

`interact!` is just `step!` without the return values.
What will `observe` return? In particular, will it return the reward?

I believe `observe` only returns the current observations for the agent; the reward is returned only when the agent interacts.
> I believe `observe` only returns the current observations for the agent; the reward is returned only when the agent interacts.

This makes more sense to me.
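That split can be sketched as follows (hypothetical names, just to pin down the semantics): `interact!` returns the reward, and `observe` returns only the observation.

```julia
# Hypothetical sketch: reward from interact!, observation from observe.
mutable struct LineEnv
    pos::Int
end

function interact!(env::LineEnv, a::Int)
    env.pos += a
    return Float64(-abs(env.pos))   # the reward is only available here
end

observe(env::LineEnv) = env.pos     # observation only, no reward

env = LineEnv(0)
r = interact!(env, 2)   # r == -2.0
observe(env)            # 2
```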
I think some concrete examples might really help this discussion. We have touched on the multi-agent case and the real-time case, but it would probably be easier to reason about them if we wrote out some code snippets or pointed to existing code (I made the usage-example tag for this purpose). We also may want to wait to design features like this until someone is actively working on a problem that requires it (perhaps @findmyway already is :smile:).
I want to cc @pmm09c into this discussion. Peter and his colleague approached me about working on scalable multi-agent RL, and they have already written some papers on the topic. Perhaps they have some input/usage examples that are relevant?
@darsnack thanks, @rejuvyesh is actually who we're working with along with a few other folks so happy to see him here.
@rejuvyesh, for context: we're super focused on supporting distributed compute for handling complex/slow multi-agent envs on the grid. It seemed like it made a lot of sense to use Julia for this, so I started working on a Reverb-like buffering system to help facilitate that.
Anyway, our use case might be a bit niche, but I'm glad to track everything going on here so I can try my best to keep things compatible.
> Should we add `observe(env)` as an optional interface function? See this explanation by @findmyway for why this may be useful in the multi-agent setting. Or should we even make this mandatory and have `reset!` and `step!` return `nothing`?