RatInABox custom Gym environment for RL

Matkicail commented 1 year ago

Hi,

Your repo is really cool and the visuals are excellent. However, when going through the RL example section, I find that although the example is decent, it still leaves a lot of the skeleton showing (in terms of all the direct stuff you do with the environment and the agent). I would really like to give PPO with an intrinsic curiosity module a go, since your environment is quite interesting. Specifically, to see what the rat would do if only rewarded with curiosity vs the environment reward and curiosity. It would be also be interesting to see if it has rat-like exploration and what would happen if the reward was as sparse as possible (e.g a reward for only reaching the reward square).

However, it feels a bit difficult to interact with it and get something like that working immediately. Mainly because the environment is not making use (at least in this case https://github.com/TomGeorge1234/RatInABox/blob/1.x/demos/reinforcement_learning_example.ipynb) of the standard Markov Decision Process abstraction. If you had this abstraction, I think it would be super cool since it makes it much easier to try out a variety of RL methods quickly and easily on the task, as it cleanly separates the RL part and your environment.

Not sure if this is a useful suggestion, but if it is I would be super keen to interact with the environment if it has this modification given how cool and interesting your repo looks.

TomGeorge1234 commented 1 year ago

Hi, and thanks for the question. I interpret from it that what you are asking is whether RatInABox will make use of the the gymnasium framework for standardising RL . If so, the answer is that we are looking into this and support the idea but it probably won't be about for a month or two. Also note that RatInABox is not first and foremost an RL package so inherent support for RL won't be our primary goal.

Another user (@SynapticSage) actually had a go at doing exactly this and kindly made a GitHub gist. It's not complete but might give you some pointers for how to do this yourself before we formally support this.

If I interpreted your question correctly let me know as I'll rename this thread and let it become the forum for discussing RiaB--Gym compatibility.

Matkicail commented 1 year ago

Yes, definitely you interpreted what I was asking correctly. When reading through the notebook I noticed some of the extra information that you capture and store, so I understand that the goal here is definitely more than RL. I'm looking forward to seeing the RL component of it and will check out that gist.

SynapticSage commented 1 year ago

... slightly more functional fork now. Previous code couldn't run -- was more provisional in nature.

https://github.com/SynapticSage/RatInABox/blob/task_environments/ratinabox/contribs/TaskEnvironment.py

Example code in "main" renders an agent + envrionment + spatial goal every 100 frames.

Missing:

Action space gym specification per agent -- one could imagine different controllers for the same agent which could take inputs from neurons or the environment or random
Reward structure
Storing statistics about the episode
Classes for dynamic things in tasks
pytest section -- (not yet tested for more than 1 spatial goal or agent)

TomGeorge1234 commented 1 year ago

Nice! I'm very impressed, just took a look and it seems really nice so I'm excited to have a play around. At some point perhaps would be great to PR this into the main branch but I'll leave some comments once I've taken a look and leave this thread open in the meantime.

Appreciate the legwork !

SynapticSage commented 1 year ago

Awesome. Ya comments would be appreciated. Not mature enough for a PR. Maybe eventually.

Top priority---my guess--would be figuring out how the agent should interface and register a controller for its actions to its member environment. Gymnasium environments store an observation space and an action space as attributes. And the action_space attribute can be a list/dict of action spaces for the registered agents who belong to the task.

But the main rub is that action spaces themselves, I imagine, could be based on different things. E.g. drift_velocity in your script. And then there's something upstream of the controller pulling the strings of the controller, e.g. an agent/neuron. And may need some brainstorming on how to organize interfacing up a given action_space to a agent's controller to variables within {agents, neurons, other-things}.

TomGeorge1234 commented 1 year ago

Yeah, I think I see. To me it seems quite natural that the state space should be a concatenation of the current firing rates of a list of RatInABox Neurons. So if the Agent has some PlaceCells and GridCells its state is entirely determined by the firing rates of those two populations (and nothing else). The action space should be, as you say, drift_velocity (and perhaps the ratio parameter determining drift-to-random motion proportions, since this is kind of close to $\epsilon$ in $\epsilon$-greedy it may be more of a hyperparameter).

Happy to jump on a call some time if it would help, keen to get this off the ground and I'm aware of other people who would use this functionality!

SynapticSage commented 1 year ago

For sure -- think I'm on the same page. Makes total sense neurons would inform the action space for obvious reasons---and the focus of your package. I was only probing to see if there was an appetite for knobs for other moments of inertia or input from other objects.

Pushed some updates. There's an action space (assuming drift_velocity knobs) for each agent, reward, and a loop invoking Gymnasium's step() paradigm.

while True:
    action = SomeNeurons.dift_velocity()
    new_state, reward, done, info = env.step(action)
    env.render()
    if done:
        break

Example of one of those step() cycles above (although not steered by value neurons; diffusion input is borrowed from the nearby goal-diffusion in your reward_leaning_example.ipynb) RIB_gymIntegration

Edit: I added a few pytest cases with fixtures. Also organized the gym.render()-ing pipeline into logical parts---it was messy and an overly bloated method. Theh TaskEnvironment now has render() for agents, environments. Spatial goal env merely adds render of goals to the end of the render pipeline. Tests pass for '2d' environments. Lot of broken things for '1d' environments.

TomGeorge1234 commented 1 year ago

Really like where this is going. I ran the code, it works great! Some thoughts:

Continuous time : We should make the time continuous so on each update() there are two updates: self.time += self.dt (defaulting to say 0.01) and self.steps += 1. The spirit of RiaB is very much to treat time and space continuous...a departure from classic RL where time is a more abstract thing.
Handy RiaB geometry funcs There's a useful (if poorly named) function called Env.get_distance_between___accounting_for_environment(pos1, pos2,wall_geometry="line_of_sight")...this does as it says on the tin and will return very large number if there is a wall obstructing the straight line path between pos1 and pos2. This will be necessary in the future to leverage full flexibility of RiaB environments with complex wall structures.
Aesthetics Is it possible to maintain consistency with RiaB plotting functions? i.e. use Agent.plot_trajectory(). I'm suspecting maybe this wouldn't be easy as you are doing this efficiently by resetting (rather than replotting) the data each time, nice work btw. I wouldn't die on this hill but I would like to maintain aesthetic consistency where possible. It might help if I restructured how plotting works and wrote a function which returns a matplotlib LineCollection. Lmk.
Multiple Agents I really like that you can have multiple Agents. I'm aware this is probably a faff for you to handle but please keep going with it! By the way, wouldn't this be better handled by petting zoo (I'm no expert here!). Also has gym been replaced with gymnasium - if so we should switch for future-proofing.
Rewards In the long run rewards could be defined by some neurons. I'm sort of on the fence about this, maybe it's nice to keep rewards external to the agent. Anyway, might consider adding small delay after entering reward zone before termination (or having a gaussian continuous reward function). So rewards are slightly less sparse.
1D On the other hand, very happy for you to let go of maintaining functionality in 1D Environments, at least for now. :)))
Action space Lastly in terms of the action space. I sort of agree with your pseudo code above ^ except it should be more general. Whilst it could be something like SomeNeurons.drift_velocity()...in theory it could be something more complex e.g. actor-critic SomeActorTrainedBySomeOtherPackageInterfacingWithThisGym.drift_velocity() or PPO or Q-learning whatever ! The example I gave of linear ValueFunction ascent is pretty basic and ultimately quite limiting.

SynapticSage commented 1 year ago

Continuous time and geometry funcs: Great suggestion! I hadn't considered these yet. I have now implemented the changes.
Aesthetics: This might be feasible. render() already calls the environment's plot and could call agent's plot first to initialize the line collections. I'll see if it's quick to implement.

Further down the line, one might consider steering away from caching everything inside the environment class. Rather, have each object (environment, agent) be responsible for implementing its own render(), ie, cache its own plot objects. It's potentially odd having a task environment cache plot objects about non-task environment features...maybe... For example, if someone implements a new agent, e.g., a replay agent, it might be awkward to modify a TaskEnvironment to render it appropriately.

A possible alternative: maybe one could call the environment'senv.render(). Whereon, env would intialize or update its cached plot objects and then call render() on any agents/objectives that it contains if they also implement a render(), and so on... Maybe even in some cases, neurons attached to agents have a render() method turned on to plot inside the env.step() loop.

Multiple agents: I was imagining that some users may eventually want to study social vector cells or social tasks; maybe RiAB could help with that. Not tricky to allow---basically just an extra loop.
gym/gymnasium: I'm behind the times. I learned how to use gym in 2019. Didn't know that openai moved its functionality to gymnasium. Seems to keep the same interface as far I can tell. Good points. Wouldn't want the objects calling deprecated code.
Same story for pettingzoo: it arrived after my last brush with the API, to my knowledge.
Reward: Great ideas! The reward pulses were just a quick n' dirty implementation win. Will check it out.
Action space: 🙂 Had meant as more of an example as opposed to prescription. Aware.

TomGeorge1234 commented 1 year ago

Ok, I will definitely consider restructuring some of the plotting stuff (as you say, seems like each Agent and Environment classes could have an internal _render() function which plots everything it needs to and returns all matplotlib plot object. The plot_trajectory() or whatever function would then just set the data on all these objects). It's a medium sized change however and not essential so for now let's presume to work around it.

Maybe worth looking into petting zoo. I don't know if it would streamline the multiagent stuff, presumably they built it for a reason 🤷🏼‍♂️

Nice progress though! keep me updated how things get on, looking forward to having it complete!

SynapticSage commented 1 year ago

Excellent. Ya, no rush on the _render(). It's fine without it.

Pushed some changes aligning the code with pettingzoo yesterday

step works as expected for the multi-agent case.
It's potentially annoying to work with dicts of actions per agent for single-agent environments, so added a step1() wrapper function that makes the task environment behave like gymnasium. step1 is short for "step with 1 agent". Cuts all the boilerplate and fluff associated with mutli-agent---if people are into that.

Generalizing the reward structure next: Provisioned structure for it.

Plan is to attach one cache object containing an agent's active rewards to the environment and put a reference to that same cache inside the agents. That way rewards can be triggered externally from the environment or internally from the neurons that are part of the agent.
Rewards are a function with an expiration time instead of a point: presets are constant, linear or exponential; users can set a function by hand.

SynapticSage commented 1 year ago

Posted an update with non-sparse rewards. Have a few features that decide a reward's dynamics:

initial value
decay rate for the reward from the initial value (linear, constant, exponential)
follower term that can be used to attach a ramping signal or goal gradient to a reward
expiration time: how long does the reward stay active from the moment it appears. None means that the reward never expires, and continues following its gradient and decay dynamics.

TomGeorge1234 commented 1 year ago

Awesome!! Happy to see such rapid progress progress being made. Thanks for upgrading to pettingzoo, I see what you mean about it being a bit annoying for single agents but your solution seems good. And also thanks for updating the plotting stuff, really nice :))

I like the reward flexibility here - both being able to have multiple rewards and agent or environment dependent rewards. Seems sensible that rewards can be environment, not agent, specific in the cases that multiple agents are competing for the same external rewards. It's also really nice how the rewards can be set on each trial and are stored in a growing cache and expire after a set amount of time.

Could you please clarify to me the difference/connection between Objectives and Rewards. I'm a tiny bit confused about the causality here. Objectives just determine the conditions for ending a trial but are they linked to the rewards? As I understand it, each time an Environment is reset new goal positions are sample each defining a spatial Objective. Does this then trigger the creation of a Reward object which gets added to the cache?

SynapticSage commented 1 year ago

Ya, no worries. Happy to clarify! Not particularly married to this scheme if its awkward.

They are triggering creation of reward objects, as you say, in the example instance of TaskEnvironment.

In more detail, Objective represents a specific task rule or condition that must be fulfilled to solve the task. When an agent resolves an Objective, the Objective may optionally release a Reward object. A task episode consists of a list of Objective, and the episode ends when all Objectives are completed. In this way, Objectives represent a collective terminal state.

Now that said, a task environment does allow having one without the other: you could create a rule without a reward or create a reward without a rule. In other words, a user constructing a task could choose to release elicit Reward/(punishment) without an Objective or vice-versa. An objective is, if anything, just an organizing class to build a task rule. I imagine if people contribute different tasks (as they do neurons) to RiAB, objectives from different tasks could be mixed and matched to make new tasks.

When an environment reset()s, the env replenishes a list of Objective from a pool of potential objectives for a task episode. This is controlled by how users design the reset() for their task; it could be as random or deterministic the task maker would like.

I'll try to simplify the objective/task interface. I sense that it could be more straightforward than it is. Not a crisp design.

SynapticSage commented 1 year ago

Commit from earlier this week.

`Objective` and `Goal` have similar semantic feelings ---a 
potential source of confusion. Additionally, `Objective` could 
be wrongfully confused by the user as the objective function of 
some RL or otherwise algorithm running the agents. Hence, I 
refactored everything clarify; `Objective` has been changed to 
`Goal`, which cleans up semantics.

----------
New object 
----------

Similar to how `RewardCache` tracks a collection of rewards, 
`GoalCache` now tracks a collection of goals. In particular, 
how one or many agents can interact with them. Some things it 
can handle for the user:
- do goals have to be completed in sequence? or does order 
  not matter?
- Does each agent have to finish a given goal? Or is it 
  consumed by another agent?

`GoalCache` is basically a fancy goal list that handles 
tracking how agents satisfy goals under those differing 
schemata.

Still a few bugs to be ironed out. Hoping that object relations read/feel a little more intuitive and clean now.

TomGeorge1234 commented 1 year ago

Very happy with this. It makes sense that rewards and objectives (now goals) should be disentangled as in reality they needn't be one-to-one. At the end of the day some kind of demo and readme/doc will be necessary to explain all this to end users (as it isn't. trivial) but we can cross that bridge when we come to it. The flexibility you've coded into this will pay off in the long run; as it stands (once bug etc. are ironed out) this could account for so many spatial behaviour set-ups, very impressed.

I thought I'd share this here so others can see the progress made so far:

https://user-images.githubusercontent.com/41446693/231140004-9a408a29-0a7c-4801-82a8-b8e68c0cb02e.mov

SynapticSage commented 1 year ago

Totally agree that the TaskEnvironment, Goal, and Reward objects might not be trivial for users to quickly grasp. Creating a markdown document doc file with illustrations will make it much easier to comprehend---learning tools from comment headers can be challenging.

Currently, there are some bugs and untested features:

goalorder = sequence (bug)
agentmode = noninteracting (untested)
a couple reward-related options (untested)

Busy lab week ri now; I will try to address these on Friday/Saturday, possibly set some pytests. After which, I imagine it might be ready to document and PR into RiAB.

Awesome---glad that things are coming together.

SynapticSage commented 1 year ago

Bugs ironed out. Should be ready to draft some doc files later this week.

TomGeorge1234 commented 1 year ago

Brilliant! Feel free to PR when you feel it's ready. As this is a standalone script the requirements for inclusion aren't too high and we can carry on ironing out bugs once it's live.

TomGeorge1234 commented 1 year ago

Hi @SynapticSage how is this coming along? Reckon it will be ready for a PR some time soon?

SynapticSage commented 1 year ago

Apologies for the delay. Hectic month; PR fell off my radar.

I'll submit it this weekend. Code is operational, but the tutorial needs a little more info here and there. I'll be continuing to refine it after the PR -- as suggested.

Thanks for your patience!

TomGeorge1234 commented 1 year ago

No worries, was just checking everything was fine. Definitely happy for you to continue to refine it afterwards, much better that way

TomGeorge1234 commented 1 year ago

Closing as this has now been pushed to contribs. Thanks!

TomGeorge1234 commented 1 year ago

Hi @SynapticSage, I would like to make a SpatialGoalEnvironment where there is a short delay after completing the spatial goal before the episode terminates, this is to allow the Agent to experience the temporally extended reward - otherwise I'm concerned that the reward will be experienced in the following episode (where, here, I'm teleporting the agent to a new location) and credit misattributed. What's the best way to do this?

(also heads up some minor QOL improvements to TaskEnvironment files in latest push)

A related question:

Are these lines important? The default to remove an agent from an environment once goals are completed is causing problems if I want that episode to carry on a little bit afterward (throws an error because no agents are around to receive the actions). Shouldn't this be optional or am I misunderstanding something?
```
# If any terminal agents, remove from set of active agents
truncations = self._dict(self._is_truncated_state())
for agent, term in self._dict(self._is_terminal_state()).items():
if term and agent in self.agents or truncations[agent]:
    self.agents.remove(agent)
```

SynapticSage commented 1 year ago

Just shot over a PR for this. The PR adds an shortcut option for episode padding. Sans the PR, one could just add an unrewarded timer() goal at the end of the goal sequence. But since this is a super common need (e.g. for ISIs), just added a more direct shortcut for this: delay_episode_terminate TE attribute. It's more or less accomplished by something similar, padding an unrewarded timer goal when agents finish their episode goals.

Nice QOL improvements: I do think the code formatting could use standards.

SynapticSage commented 1 year ago

The side question:

Yes, unfortunately. It's a pettingzoo attribute. pettingzoo expects this agents variable to track the keys of active agents. In other words, it tracks agents that are still able to receive actions/updates during the episode. I wasn't originally planning to have a variable like this, but pettingzoo requires it to pass the environment test. if an agent terminates, it's supposed to be removed. at present, terminated ≡ all_agents_finished_goals. may be possible to uncouple those.

TomGeorge1234 commented 1 year ago

Life saver! Thanks Ryan, super quick and elegant fix. Btw, to me it seems more natural to define the time delay in simulation time coordinates (env.t) rather than real time (time.time). Do you agree? I switched it and pushed the changed but happy to roll back or have it be a parameter.

Also I made a Reward instance called no_reward_default which is a reward which quickly expires and gives no reward. I made this the default for the TimeElapsedGoal instance created by the TaskEnv for the delay period so that it just acts as a pure delay and doesn't then give an additional reward afterwards (open to suggestions if you have thought of a better way to create unrewarded Goals)

SynapticSage commented 1 year ago

Absolutely! env.t. Totally my mistake. Was distracted while I was typing that part out.

Regarding the reward, you can also give it a None object instead of no_reward_default, but maybe it's nice to be explicit with a well-named variable.

TomGeorge1234 commented 1 year ago

A None object would be cleaner but doesn't work because of the line self.reward.goal = self in the Goal class which throws AttributeError: 'NoneType' object has no attribute 'goal'. Happy to leave as is for now

SynapticSage commented 1 year ago

Sounds good. last PR or one before made a change where goals assign a reference to their reward. Must have broke the None option.

Will try to resurrect it next PR.

RatInABox-Lab / RatInABox

RatInABox custom Gym environment for RL #30

Missing: