Closed SeanNaren closed 7 years ago
Looks good! Might as well copy as much of the API as we can. Could rename `actionSpec` to `actionSpace`? Think that returning it from a function call is a bit cleaner than having it as a property, but we could change that if you feel that it's better. `actionTaken` from `env:step(action)` should definitely be replaced with the optional `info`, and we can leave it up to people how they want to add information to the `info` table. A `render` would be nice, but beyond the scope of this PR.
Otherwise agreed with making the reward structure the same as gym, provided the environment dynamics are the same. Anyone who was using `rlenvs` seriously for benchmarking can simply use `v1`.
Could you point me to where `actionTaken` is? Can't seem to find it. I've updated the names of the functions; hopefully that's what you had in mind!
I'll get to work on the reward schemes for the environments that cross over to gym!
Ah, sorry, that is only mentioned in the docs but isn't implemented in any of the envs provided; the context can be found in https://github.com/Kaixhin/Atari/pull/56. I've notified the author of that PR that we're moving towards a more gym-compatible API, so as to put it on hold for now.
Sorry, but do you mind renaming `stateSpace` to `getStateSpace`, etc.? Since it's implemented as a function in `rlenvs`, I think the getter/setter nomenclature makes it a little clearer.
I tried to change the rewards for the CartPole environment to match the reward scheme of gym, but ran into issues.
In gym there is +1 for the time the pole is up and 0 for the time the pole is down. Just as a sanity check, should this then be `reward = 1` and this be `reward = 0`? Getting strange behaviour when replacing this environment with the gym one in torch-twrl!
Once I've resolved why this is the case I want to add a function to the init that will list all available environments, if that sounds good!
EDIT: I think I found the reason for the inconsistencies: gym has a max number of steps for each of its environments (they terminate after that many steps to prevent an episode going on forever). Should we implement this in rlenvs?
Yep, the sanity check looks correct to me; let's go ahead with changing the reward to match.
I see you've already implemented the list, looks good. Remember to document it in the README as well.
Is the max steps setting there for all environments (including Atari)? If it's only there for the simple envs then it would be a good thing to add it into `opts` for these environments, with the gym defaults (if any).
FYI I'm going to add the environment from Towards Deep Symbolic Reinforcement Learning into the `v1` branch in about 3 weeks. If you could merge and update it into `twrl` that would be appreciated :)
Sounds good, I'll work towards that. So for the max steps thing, here is how it's done for a few environments in openai-gym. How about I expose the `step` method in the `Env` class, and reference a variable called `maxSteps` that is set per environment? Using this variable I return termination in the `Env` class. Something like this:
```lua
local classic = require 'classic'

local Env = classic.class('Env')

-- Denote interfaces
Env:mustHave('start')
Env:mustHave('act') -- change name here to prevent overload
Env:mustHave('getStateSpace')
Env:mustHave('getActionSpace')
Env:mustHave('getRewardSpace')

function Env:step(action)
  local reward, state, terminal = self:act(action)
  self.currentStep = self.currentStep == nil and 0 or self.currentStep
  if self.maxSteps and self.currentStep % self.maxSteps == 0 then -- env defines maxSteps
    terminal = true
    self.currentStep = 0
  end
  self.currentStep = self.currentStep + 1
  return reward, state, terminal
end

return Env
```
Overall the solution is good, but a few suggestions. I'd go with `Env:mustHave('_step')` to make it pretty clear that the internal method is hidden, and to allow people to use `act` for whatever reason. Secondly, if you haven't already, it's worth checking exactly how gym defines the max number of timesteps: does it include the state from starting the environment? Finally, I'm not sure why you went for `self.currentStep % self.maxSteps == 0` instead of just `self.currentStep == self.maxSteps`?
Not entirely sure tbh, had a brain fart :P Made the changes, and will now go through and add default max steps as done in openai-gym (I've set the default to 1000, which is the default for openai-gym, but if it's an issue let me know!).
I think the cleanest way to do this is to have `Env.lua` actually implement `_init` with `self.maxSteps = opts.maxSteps or 1000`, and have every env first call `super._init(self, opts)`. This is where it should go, and it'll stop recalculating it on every `step`. Also, if I'm correct, this is going to run into problems unless you reset `self.currentStep` in `start` as well. Which probably means handling the logic in `_init`, `start` and `step`, and using `super` calls.
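A minimal sketch of that layout might look like the following. Plain Lua tables stand in for the `classic` class and the toy env is purely illustrative; only the method names and the 1000-step default come from the discussion:

```lua
-- Sketch: maxSteps set once in the constructor, counter reset in start,
-- episode capped in step. Not the actual Env.lua implementation.
local Env = {}
Env.__index = Env

function Env.new(opts)
  opts = opts or {}
  local self = setmetatable({}, Env)
  self.maxSteps = opts.maxSteps or 1000 -- computed once, not on every step
  return self
end

function Env:start()
  self.currentStep = 0 -- reset the episode counter on every new episode
  return self:_start()
end

function Env:step(action)
  local reward, state, terminal = self:_step(action)
  self.currentStep = self.currentStep + 1
  if self.maxSteps and self.currentStep >= self.maxSteps then
    terminal = true -- capped by maxSteps, matching gym's behaviour
  end
  return reward, state, terminal
end

-- Toy env that never terminates on its own, to show the cap kicking in
local toy = Env.new({maxSteps = 3})
toy._start = function(self) return 0 end
toy._step = function(self, action) return 1, 0, false end

toy:start()
local reward, state, terminal
for i = 1, 3 do
  reward, state, terminal = toy:step(0)
end
print(terminal) -- true: terminated by maxSteps, not by the env itself
```

Because the counter is reset in `start` and incremented in `step`, episodes are capped even when the environment itself never signals termination.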
After `super._init(self, opts)` I think it's worth matching the `maxSteps` to gym if it actually differs from 1000, according to this.
I've made a few changes: added an init function to the `Env.lua` class, as well as the `super._init` call to all the envs. I've added default max steps and some logic in the `Env.lua` class to handle the different step cases; let me know of any feedback!
Also, for rendering, shall we make it possible to render via the `Env.lua` class by passing in a render boolean in the opts? I can get to work on that if you think it would be good!
@korymath Rendering would be nice; any suggestions for how best to integrate it with gym/twrl?
A boolean in the opts would be a good means by which to pass the option, but is perhaps not the full solution.
My inclination is that this would allow for a strictly Lua RL infrastructure, free of any connection to gym, and thus the rendering could be completely separate from gym's rendering. Having not done much graphics work in Lua, I am not sure how best to make this happen.
One potential approach would be to save the transitions and format them so that they can be loaded and replayed in the gym infrastructure (that is, if the environments are exactly the same as those in gym; but that may be limiting, and would require offline processing between learning and display).
I would argue that, while rendering is nice, it is not critical to learning (unless there is a learning signal derived from pixels, or a human observer of the visualization, or maybe a third case?). Thus, even a basic render of the learning performance could be generated every N steps, much like how the OpenAI gym allows a video to be rendered.
Nice work on the max steps discussion; that's an interesting quirk of the OpenAI gym code, which I think is based on the online leaderboard and the desire for shorter episode lengths for direct comparison.
I added a `:render()` method which checks that a `:getDisplay()` method exists to retrieve a displayable screen, and renders it using qt (check the `Env.lua` class). I've also updated the `experiment.lua` class, check it out!
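As a rough sketch of how that check might look — the method names `:render()` and `:getDisplay()` come from the comment above, while the `qt` guard and `image.display` call are assumptions about the qlua wiring, not the actual `Env.lua` code:

```lua
local Env = {} -- stand-in for the class in Env.lua
Env.__index = Env

function Env:render()
  -- fail loudly if the concrete env cannot produce a displayable screen
  assert(self.getDisplay, 'render requires the env to implement :getDisplay()')
  local screen = self:getDisplay()
  if qt then -- qt only exists as a global when running under qlua
    local image = require 'image'
    -- reuse the same window across calls so the display updates in place
    self.window = image.display({image = screen, win = self.window})
  end
end

-- Outside qlua (qt == nil) the call just fetches the screen and returns:
local e = setmetatable({getDisplay = function() return {} end}, Env)
e:render()
print('render ok')
```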
This looks great!
Thoughts on how to handle execution on a server? Maybe similar to gym and allow creation of a 'fake' display for output?
Kory Mathewson
Thanks @korymath! That's an interesting idea, maybe similar to how the gym API is currently interfaced with? If this is deemed necessary I could start thinking about how to implement it (probably out of the scope of this PR).
Sounds great!
Agree it is probably another large feature build.
Sorry, a small piece of extra work! I've just managed to add "XOWorld", the little domain featured in "Towards Deep Symbolic Reinforcement Learning", to the `v1` branch as well as `master`; I haven't added it to `twrl` though.
Seems like the merge of the branch replayed the commits on top... once I've finished with XOWorld I think I'll rebase the branch and open a new PR, is that alright?
Also, there is a small issue here: a comma is missing at the end :)
EDIT: XOWorld is an awesome environment!! It should be fully implemented now, which should wrap everything up. Just thinking about unit tests now to make sure everything works correctly!
Thanks for spotting that; now fixed on `v1` and merged into `master`.
Yes, a rebase would clean this up, good job for suggesting :) There shouldn't be any more changes due now.
Thanks! In terms of unit tests, any suggestions on functionality to verify and check?
Good question. `mustHave` already helps a little bit. Seeing what I just did, maybe static syntax checking would be useful? Going further, `argcheck` might be worth adding in.
I suppose you could test an environment by loading it, starting it, and doing at least one step. Not sure if there's anything that needs to be done if someone tries to `step` before `start`. Checking that `maxSteps` works really depends on the environment (e.g. Cartpole will fail quickly with a random agent), unless you manually set it to a very low number.
I started thinking about testing Atari, but there's licensing issues with ROMs. I thought maybe there might be a homebrew ROM that we could include in the repo, but then it'd need to be coded into ALE...
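The load/start/step smoke test mentioned above, plus a guard for stepping before starting, could be sketched like this. The stand-in env and the `started` flag are illustrative assumptions, not how rlenvs actually handles the case:

```lua
-- Sketch: minimal smoke test plus a step-before-start guard.
local Env = {}
Env.__index = Env

function Env.new()
  return setmetatable({started = false}, Env)
end

function Env:start()
  self.started = true
  self.currentStep = 0
  return 0 -- initial state (stand-in)
end

function Env:step(action)
  -- fail loudly rather than crash on a nil currentStep
  assert(self.started, 'call start() before step()')
  self.currentStep = self.currentStep + 1
  return 1, 0, false
end

local env = Env.new()
local ok = pcall(env.step, env, 0)
print(ok) -- false: stepping before start raises an error

env:start()
ok = pcall(env.step, env, 0)
print(ok) -- true: stepping after start succeeds
```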
Feel free to use torch-twrl as a testing base if you want. It has some basic testing functionality built up here: https://github.com/twitter/torch-twrl/blob/master/test/test-gym.lua
It runs a few tests for different environments; the runTest code is from the gym-http-api, and you can see the full extent of it here: https://github.com/openai/gym-http-api/blob/master/binding-lua/test_api.lua It is a pretty verbose test that runs through the RL environment functionality.
It is quite easy to find the ROMs if you are particularly motivated.
Thanks for that, I can definitely try to imitate those tests for all the rlenvs environments!
EDIT: Also saw Malmo added (which is awesomesauce!), I'll pull this in and get it implemented as well. EDIT2: I will wait!
@SeanNaren I'm trying to get Malmo working myself, so bear with me a while. The new plan is to target `v1`, then merge into `master` (they're now the same, even though the v2 rockspec shouldn't be in `v1`, but oh well). I'll aim to get it done by the end of today, but will report back once I've got it working successfully.
OK, that was a bit of a mess, but I seem to have knocked down the requirements nicely and condensed the code. Had a quick look, but couldn't get `qlua` working; the screen does, however, display in the running client. Also couldn't get it working with Atari because of some issue with `luaposix`, which is a shame, but I won't have time to investigate further.
Right, I think that's Minecraft handled (I still need to test that it works; will do once I'm home on my own machine) and now just the tests.
Currently I just run through every environment, but need to add some assertions (https://github.com/Kaixhin/rlenvs/blob/twrl/tests/test.lua); however, I'm not sure what I could assert on. In the torch-twrl tests we assert on a successful run-through (return true as long as everything runs correctly).
Great work!
You can do simple assertions on the types of the returned variables, or assert on a range.
For example, `runTest` could return an error code via a pcall; if the return is fine, then assert on that.
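A sketch of that pcall pattern — `runTest` here is a trivial stand-in, not the actual torch-twrl runner, which would load, start, and step the environment end-to-end:

```lua
-- Stand-in test runner: the real one would exercise a named env end-to-end.
local function runTest(envName)
  local reward, state, terminal = 1, 0, false -- stand-in step results
  -- simple type assertions on the step's return values
  assert(type(reward) == 'number', 'reward should be a number')
  assert(type(terminal) == 'boolean', 'terminal should be a boolean')
  return true
end

-- Wrap in pcall so a crashing env fails the test rather than the whole suite
local ok, result = pcall(runTest, 'CartPole')
assert(ok, 'runTest raised an error')
assert(result == true, 'runTest did not complete successfully')
print('tests passed')
```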
Added assertions, check them out! I also rebased the branch into one nice commit, check here. I think the changes are done here, so it'd be great if you could have a quick look and let me know if anything needs changing!
What do we plan to merge into? I can then open the PR on the rebased branch.
Unfortunately I'll be away this weekend so can't look at it till next week, but the plan is to merge this into `master` and update the readme to show installing the `v2` rockspec. It's a bit messy, but then we can look at the `v1` branch and make sure that is fine (e.g. by deleting the `v2` rockspec inside that branch, but hopefully that's it).
This way anyone who's still using the old API (e.g. the Atari repo) can still do so by requiring the `v1` rockspec. :) No need to backport new environments unless we're requested to. `twrl` will of course use `v2`.
Sounds like a plan and no worries, whenever you get time!
Looks good! Firstly, can you merge the latest Minecraft stuff (both in the env and the readme)? Secondly, if you could use 2-space indentation for consistency, that'd be awesome. I'll have another look after, and then we should be able to merge `twrl-rebase` into `master`.
Having issues merging master into the rebased branch. Can you see any commits from master that aren't in the rebased branch?
Ah, sorry, the latest work should be in `v1`, which should be merged into `twrl-rebase`, which should then eventually be merged into `master`.
This is continued in #14
Just opening this PR to track progress based on the previous PR. Need to test each environment with twrl to make sure they behave correctly; if you spot any issues or have any feedback, let me know :)
It may also be nice to set the rewards similar to gym as well, for comparison's sake!