I would try to submit a PR for this myself, but it seems like it may require rewriting the observation boundaries for each environment, and I don't know where those values came from. Any help?
They are mostly from the MuJoCo models themselves.
For the robot, they come from the kinematics of each joint.
For elements of the environment, they come from the state variables of those objects. Technically, they could appear anywhere, so those values might be [-np.inf, +np.inf]. Some, such as rotation, may be limited depending on the representation used.
Thanks for the reply! When I said that I don't know where those values come from, I was referring to the hard-coded values of the bounds of the observation space in each environment. For example, in sawyer_assembly_peg.py, we have the following initialization:

```python
hand_low=(-0.5, 0.40, 0.05)
hand_high=(0.5, 1, 0.5)
obj_low=(0, 0.6, 0.02)
obj_high=(0, 0.6, 0.02)
```
These values seem more or less arbitrarily chosen, and they aren't actual bounds on the observations, because MuJoCo doesn't consider these bounds and often violates them. It seems that the easiest fix would be to set each low and high value to -np.inf and np.inf, respectively, but this feels hacky and I'm not sure if it might break something somewhere else. Do you know where those values came from and/or whether they can be overwritten without causing issues?
I tried overwriting those values and there are still issues. In sawyer_assembly_peg.py, I replaced the four lines above (lines 30-33) with the following:

```python
BIG = 10000
hand_low = (-BIG, -BIG, -BIG)
hand_high = (BIG, BIG, BIG)
obj_low = (-BIG, -BIG, -BIG)
obj_high = (BIG, BIG, BIG)
goal_low = (-BIG, -BIG, -BIG)
goal_high = (BIG, BIG, BIG)
```
Note that BIG must be finite because these observation bounds are used on line 199 as arguments to np.random.uniform. If you make this change and then run the script I posted in my original post (modified to only run "assembly-v1"), there are still out-of-bounds issues. If you print out the values of the observations in that script, you can see that the size of the observations tends to scale with the value of BIG. I think it's likely that this is because of that call to np.random.uniform that I pointed out earlier: when reset() is called on the environment, the initial object and goal positions are generated uniformly between the bounds of the observation space. So a bigger bound on the observation space just means that the object/goal position will likely have higher magnitude, and some percentage of the time it will be initialized very close to the bound, at which point it is a likely candidate for going out of bounds at some point during the rollout.
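To illustrate the mechanism (a rough sketch of the sampling pattern described above, not the actual metaworld code):

```python
import numpy as np

# Reset-time sampling draws initial positions uniformly between the
# observation bounds, so inflating the bounds inflates typical initial states.
BIG = 10000
obj_low = np.full(3, -BIG)
obj_high = np.full(3, BIG)
init_obj_pos = np.random.uniform(obj_low, obj_high)
# The expected magnitude of init_obj_pos grows linearly with BIG, and a
# position sampled near a bound needs only a tiny perturbation during the
# rollout to leave the space entirely.
```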
I think that this may be emblematic of the larger problem that MuJoCo doesn't seem to interact with those observation bounds in any way, so there is no way to guarantee that the objects will actually stay within those bounds. This isn't really something I'm personally interested in working on; I'm just going to ditch RLlib and look for/make an implementation that isn't conditioned on the observation space never being violated. I suppose I'll leave this issue open, since it does seem like this might be an issue in the future for code that expects this condition to be met.
@ryanjulian @avnishn hello all,
I was working on issue #31 and realized that understanding this issue would help with that one as well. I have been recreating the initial issue with a slightly modified version of the launcher mtcrawshaw wrote above. The first change is that the MuJoCo environments have a max_path_length of 150, so I adjusted TIMESTEPS_PER_ENV to 150 instead of 1000. I am assuming this doesn't change the intent of the script.
Secondly, due to some inheritance issue, the observation_space.low and observation_space.high values referenced in the script are universally arrays of ±inf, instead of the values set in the constructors of an individual environment. For example, SawyerBinPickingEnv has values

```
low = [-0.5 0.4 0.07 -0.5 0.4 0.07]
high = [0.5 1. 0.5 0.5 1. 0.5]
```

yet in the script it thinks high and low are arrays of ±inf. The values are properly set in the constructor of SawyerBinPickingEnv, but since ML1 creates MujocoEnv objects, it gets the incorrect values of low and high (using the default MujocoEnv values instead of the specific SawyerBinPickingEnv values). I am working on a fix for this issue, and will ask you as I get further along what to test to see if it breaks anything.
I modified the script one more time to just hardcode the values for testing purposes. The final launcher script is below.
```python
import numpy as np
from metaworld.benchmarks import ML1
from gym.spaces import Box

TIMESTEPS_PER_ENV = 150


def main():
    # Iterate over environment names.
    # for env_name in ML1.available_tasks():
    for env_name in ['bin-picking-v1']:
        # Create environment.
        env = ML1.get_train_tasks(env_name)
        tasks = env.sample_tasks(1)
        env.set_task(tasks[0])

        # Get boundaries of observation space and initial observation.
        low = env.observation_space.low
        high = env.observation_space.high

        # Override above values with the bounds set in the
        # SawyerBinPickingEnv constructor.
        hand_low = (-0.5, 0.40, 0.07)
        hand_high = (0.5, 1, 0.5)
        obj_low = (-0.5, 0.40, 0.07)
        obj_high = (0.5, 1, 0.5)
        space = Box(
            np.hstack((hand_low, obj_low, hand_low)),
            np.hstack((hand_high, obj_high, hand_high)),
        )
        low = space.low
        high = space.high
        print(low, high)

        obs = env.reset()

        # Create list of indices of observation space whose bounds are violated.
        broken_indices = []

        # Run environment.
        for _ in range(TIMESTEPS_PER_ENV):
            # Test if observation is outside observation space.
            if np.any(np.logical_or(obs < low, obs > high)):
                current_indices = np.argwhere(np.logical_or(obs < low, obs > high))
                current_indices = current_indices.reshape((-1,)).tolist()
                for current_index in current_indices:
                    if current_index not in broken_indices:
                        broken_indices.append(current_index)

            # Sample action and perform environment step.
            a = env.action_space.sample()
            obs, reward, done, info = env.step(a)

        # Print out which indices of observation space were violated.
        broken_indices = sorted(broken_indices)
        print("%s broken indices: %r" % (env_name, broken_indices))


if __name__ == "__main__":
    main()
```
And I was able to recreate the error. I now have a few questions regarding how to fix this. What do these observation space upper and lower bounds represent? Will I need to add a bounds check to every step function for every environment? And how does the reward get modified? Do I limit the observation to be within the bounds or do I raise an error of some sort? Any advice would be greatly appreciated.
@ryanjulian @avnishn Hello All
Update: I have a quick fix which checks at every step whether the observation is within the bounds. This is essentially adding the following code to the _get_obs() function of every environment:
```python
class ObservationSpaceBoundsError(Exception):
    """Custom exception: an observation left the declared observation space."""


def _get_obs(self):
    hand = self.get_endeff_pos()
    objPos = self.data.get_geom_xpos('objGeom')
    flat_obs = np.concatenate((hand, objPos))
    # My Fix - Check if observation space is violated
    if not self.observation_space.contains(flat_obs):
        raise ObservationSpaceBoundsError
    # ... (rest of _get_obs unchanged)
```
However, this is not very intelligent, as it does not make the model any less likely to go out of bounds, and in the case of many models (for example SawyerReachPushPickPlaceEnv, SawyerBinPickingEnv, and the above SawyerAssemblyPeg) the high and low values are the same, meaning the bounds are exceeded as soon as the first step is taken.
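To make the low == high problem concrete (a toy check against gym's Box, not metaworld code):

```python
import numpy as np
from gym.spaces import Box

# When low == high the space contains exactly one point, so any motion at
# all puts the observation out of bounds.
space = Box(np.array([0., 0.6, 0.02]), np.array([0., 0.6, 0.02]))
print(space.contains(np.array([0., 0.6, 0.02])))   # True
print(space.contains(np.array([0., 0.6, 0.021])))  # False
```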
Who can I talk to either to get the real bounds of each of these environments, or to get advice on updating the reward function to prevent the bounds from ever being broken?
@ryanjulian @avnishn Hello All,
I spent some more time looking at where the values used to define an environment's observation space are being updated, and they are being changed when we call self.sim.step() inside the do_simulation() function in mujoco_env.py.
For example, SawyerReachPushPickPlaceEnv uses the hand position and goal object position to define its observation space (each environment defines its observation space differently). Every step, the environment updates MuJoCo's ctrl variable and runs do_simulation, which steps through the simulation and updates the hand position and goal object position in the environment. This step process does not take into account the bounds defined in the environment.
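For context, the step path being described is roughly the following (paraphrasing gym's mujoco_env, so treat it as a sketch); note that nothing in it ever consults self.observation_space:

```python
def do_simulation(self, ctrl, n_frames):
    self.sim.data.ctrl[:] = ctrl  # apply the control signal
    for _ in range(n_frames):
        self.sim.step()  # MuJoCo integrates the physics freely, with no
                         # knowledge of the Python-side observation bounds
```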
self.sim is a MuJoCo simulation object, and since it's Cython it's a bit of a black box (and the documentation isn't super clear on using it). Is there a way I can define observation space bounds in the MuJoCo environment? Let me know if there is any resource I can view to aid in this.
Hi @adibellathur -- sorry for taking so long to respond.
thank you so much for making an effort to fix this. i really appreciate it.
your fix in https://github.com/rlworkgroup/metaworld/issues/39#issuecomment-629577260 is a good start, especially to help you find the roots of the problem. it would also be a great feature for an automated test which makes sure this doesn't happen again (i.e. it could be included in tests/helpers.py::step_env, for instance)
as you intuited, it's probably not a great permanent fix for a couple of reasons:

1. this simulation code runs in the most time-critical path of large algorithm training, and checking at every step in the simulation makes it slower (you would need a profiler to measure how much, though)
2. more importantly than speed, raising an exception here isn't very helpful to a user, because by the time it happens it's already "too late". if it's rare, this exception could be raised a few minutes or hours into an unattended training run, on a far-off server machine, crashing a training process, perhaps running overnight, and losing someone 1-12 hours of work. what's more, since this code is a benchmark, its users don't really have any control over whether that exception is raised, because there's no configuration they can choose to avoid it. this, along with killing their training job, would be really frustrating and would lead to people mostly just giving up on the library.
(2) is why it's important to ensure that this issue is fixed by making the code correct from the start. to be trivially correct, you can just set all the observation spaces to +/- inf in all axes. this may seem a little daft, but it is actually really common in these simulations. the observation space bounds (as opposed to the action space bounds) are not even used by common algorithms, but they can be helpful for debugging.
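A minimal sketch of that trivially-correct option (the dimensionality here is just an example; it varies per environment):

```python
import numpy as np
from gym.spaces import Box

obs_dim = 6  # e.g. hand xyz + object xyz; varies per environment
# An unbounded Box is always correct: no observation can ever violate it.
observation_space = Box(
    low=np.full(obs_dim, -np.inf),
    high=np.full(obs_dim, np.inf),
    dtype=np.float64,
)
```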
if you want to be a little more helpful, you can deduce some real observation space bounds which are loose but correct. these don't have to be particularly tight to be correct; for instance, you could make them all the maximum/minimum limits of the workspace or robot arm gripper reach (whichever is largest) for all dimensions, assuming that they are all Cartesian positions. how tight you can make them depends on how much time you invest in determining the model's actual limits.
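For example (a sketch; the ~1.2 m reach figure comes from later in this thread, and the true workspace extents would need to be checked against the XMLs):

```python
import numpy as np
from gym.spaces import Box

# Loose but correct: every Cartesian coordinate in the observation must lie
# within the arm's maximum reach of the base.
REACH = 1.2
obs_dim = 6  # illustrative
observation_space = Box(
    low=np.full(obs_dim, -REACH),
    high=np.full(obs_dim, REACH),
    dtype=np.float64,
)
```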
to determine tight bounds, you will have to delve into the MuJoCo model XMLs and probably run the simulation interactively, and use your judgement to figure out each of these. unfortunately, the original designer just didn't do this, which is how we got here.
here's a quick blueprint of how this would work: inside the mujoco_py MjSim object (e.g. dir(self.sim)) there's a mapping from geoms to the state vector, which will be very helpful. This is a good time to note in code comments what each element actually is.

Thanks for the input Ryan. I haven't looked at this issue in a while but I am the OP. I don't think that setting the observation bounds to +/- inf will solve the issue without some more modifications. Somewhere up in the first couple of comments I wrote about how in sawyer_assembly_peg.py (and likely other environments) the bounds are used as an argument to np.random.uniform to randomly initialize something (the initial arm location?), and passing np.inf into this function raises an error. With that in mind, we can probably just change the observation bounds to inf and pass some large finite interval to np.random.uniform. Just something to keep in mind. Not sure if any other issues would arise from changing the bounds to inf, but I thought I'd throw this in here for anyone who decides to attempt a fix.
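A sketch of that workaround (the names init_low/init_high are hypothetical; the point is that the sampling interval is decoupled from the advertised space):

```python
import numpy as np
from gym.spaces import Box

# Advertised observation space: unbounded, so it can never be violated.
observation_space = Box(-np.inf, np.inf, shape=(6,), dtype=np.float64)

# Hypothetical finite interval used only for random initialization --
# np.random.uniform raises OverflowError when given infinite bounds.
init_low = np.array([-0.5, 0.40, 0.02])
init_high = np.array([0.5, 1.0, 0.02])
init_obj_pos = np.random.uniform(init_low, init_high)
```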
Hello @ryanjulian
Thank you so much for all the insight! This all makes a lot of sense. From what I can understand, the issue is really that the bounds hardcoded into the environments are not representative of the actual simulation, as opposed to the simulation not following the hardcoded bounds. So the proper fix for this issue is to update the bounds so they more accurately reflect the bounds of the actual simulation environment, as opposed to modifying the simulation so that it adheres to the arbitrary bounds. Is this the correct way to look at the problem?
Overall, my plan forward is as follows: move the bounds check into tests/helpers.py::step_env(), so that it can be used for testing. This check will no longer be in each specific environment. A rough sketch of what that might look like is below.
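Something like this, perhaps (the actual signature of step_env is assumed):

```python
def step_env(env, max_path_length=150):
    """Step an env with random actions, asserting observations stay in bounds."""
    obs = env.reset()
    for t in range(max_path_length):
        assert env.observation_space.contains(obs), (
            "step %d: observation outside observation_space: %r" % (t, obs))
        obs, reward, done, info = env.step(env.action_space.sample())
        if done:
            break
```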
@mtcrawshaw thanks for following up so fast! and thanks for pointing out my error. you are correct -- we need to update the initial arm position sampling code to avoid this issue.
@adibellathur that sounds right to me!
hello @mtcrawshaw
Thanks for the insight! I'll try seeing if there is a finite set of bounds that can be used for the observation_space, so we don't have to worry about passing np.inf. If not, I might have to try something more complicated to make sure random initialization is still possible.
@ryanjulian @avnishn Hello All,
Update on the observation space: I've been working on getting the SawyerReachPushPickPlaceEnv observation space to be within the bounds, and have settled on the following bounds. I wanted to run the logic by you before sending a PR. Sorry for such a long comment.
The observation_space for this env is made up of:

```
[hand_pos_X, hand_pos_Y, hand_pos_Z, objGeom_pos_X, objGeom_pos_Y, objGeom_pos_Z]
```

and I updated the bounds to the following values:

```python
hand_low = (-1, -1, 0)
hand_high = (1, 1, 1)
obj_low = (-0.4, 0.2, 0.012)
obj_high = (0.4, 1.0, 0.034)
```
For the object, the bounds needed to be on the table plane in the environment, since the initial position is randomly instantiated from these bounds (@mtcrawshaw can you confirm?), so I found the x and y bounds for the table plane and used those values. Anything outside of the table plane caused the disk to behave weirdly.
The z values were found by finding the lowest and highest points of the middle of the disk, as explained in the image below (the highest z value occurs when the object gets tipped over and is balancing on an edge).
I added a small amount of buffer, because oftentimes the observation would sit right at a bound but, due to rounding, the number wasn't exact and broke the observation space bounds (for example, the resting state of the puck had a z of 0.01499 most of the time, and it occasionally clipped through the floor with a minimum z value of 0.013).
For the arm, I stuck the arm out as far as it could go, and it never seemed to top 1.0 units in any direction. I made the arm's lower z bound 0 just so it couldn't clip through the table plane. These bounds aren't particularly tight, but since they are not used for any random initialization, and the arm is much more strongly constrained by MuJoCo, I didn't worry too much about finding the exact bounds.
So far I've been running this environment for 15000 steps on repeat for the past hour and haven't come across any cases where these bounds are broken, with one exception: occasionally the arm will hit the puck and the puck will start to roll; it then rolls off the table and breaks the bounds. I figure this should be flagged anyway, so I didn't worry about that.
Let me know if this approach is correct. I can start doing this for more environments if it seems like I'm solving the problem correctly.
Excellent work. This is definitely the correct approach.
In my experience, the radius of the Sawyer is around 1.2m at most. I think the origin of the environment is at the Sawyer base, so the gripper tip could never be much greater than 1.2. Of course, because of geometry, many arm motions are not possible at that radius (basically just pushing and pointing).
If you haven't yet, you should go look at the XML files associated with each environment in the assets directory. This will contain a lot of the coordinates and sizes which you might currently be determining experimentally?
@adibellathur @haydenshively is this still an issue?
Yes I believe it is. @adibellathur please correct me if I'm wrong, but I believe what we report as an "observation space" is really an "initialization space." Though we've changed its dimensions in the past weeks, we haven't changed its meaning.
Yeah, the root of this issue initially is that self.observation_space is really a misnomer. It's the space the env's initial state is sampled from when random_init=True, not the true observation space of the environment.
@haydenshively did this get renamed in the new API? If not, we should rename the variable to something else, and either leave the observation space implicit like gym does, or add a new object variable to let people access the true observation space of the env.
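For illustration, the split might look something like this (a sketch; the attribute name _random_reset_space and the class layout are hypothetical):

```python
import numpy as np
from gym.spaces import Box

class SawyerEnvSketch:
    """Separate the initialization region from the observation space."""

    def __init__(self, random_init=True):
        # Region that initial states are sampled from (what the current code
        # misleadingly calls observation_space).
        self._random_reset_space = Box(
            np.array([-0.5, 0.40, 0.02]), np.array([0.5, 1.0, 0.02]))
        # The true observation space, left unbounded as gym often does.
        self.observation_space = Box(-np.inf, np.inf, shape=(12,))
        self.random_init = random_init

    def sample_initial_pos(self):
        if self.random_init:
            return self._random_reset_space.sample()
        return self._random_reset_space.low
```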
Hi all! I was wondering if there was any more progress here? I have been trying, as a familiarizing exercise, to run RLlib's PPO on ML1('reach-v1') (through an otherwise currently private interface, so I cannot share the source code). I've replaced the bounds in sawyer_reach_push_pick_place.py with those from the long comment by @adibellathur on May 21. I get the following error, which originates from https://github.com/ray-project/ray/blob/ray-0.8.6/rllib/evaluation/rollout_worker.py#L263:

```
ValueError: ('Observation outside expected value range', Box(12,),
array([-0.03265107, 0.51487826, 0.2368844 , 0.18518629, 0.20118554, 0.01935325, 0. , 0. , 0. , 0. , 0. , 0. ]))
```

which does appear to be outside the bounds of self.observation_space. I realize I've not given you much to go on here, but I was wondering if anybody had any ideas.
Hi @Shushman thanks for checking out Metaworld! I think to begin with, in order to answer your question better, can you please tell us the commit of metaworld that you are using?
The reason why I ask is that in the last 3 weeks, we've updated Metaworld to a brand new API and changed some things internally with environments.
To accompany this change, we've been trying to solve this issue via #181. I'm sure that there are modifications that are good places to start, and the set of modifications that you're looking for may in fact exist. @haydenshively may be able to comment more.
Did this help? Lmk 😄
Thanks @avnishn
I am currently using 2d441eed70b6f5cb1f35883b0517c4bd2812268c, though I have been locally editing it and gradually loosening the bounds as I keep getting observations outside them (which is definitely not the right thing to do long-term). I can track #181 and update it when it is merged.
@Shushman and @mtcrawshaw, I believe that we have closed this issue via #181. Please do let us know if you have any more questions or problems; we're happy to help!
Thanks @avnishn. I had temporarily reverted to the v1 API for now, but I'll check out the v2 soon. I also wanted to confirm: the plots in the paper on arXiv are on the v1 problems, right?
That's a good question, @Shushman. We don't have access to the version of metaworld that was used to generate the plots. The version with the new API (what is most recently on master) is most true to the original paper. We strongly urge you to use that when using metaworld for your benchmarking.
@Shushman note that the v2 envs (referring to any environment with modifications from the published work) exist in this repository but are not yet part of the benchmark. All environments in the benchmark are still faithful to the original paper.
If you are referring to the "new" revised API, that's a different way of accessing exactly the same challenges from the paper. The "new" API is actually more faithful to the one used in the paper, and I urge you to use it.
I updated to the latest master of Aug 17.

```python
import metaworld, random

ml1 = metaworld.ML1('reach-v1')
env = ml1.train_classes['reach-v1']()
env.set_task(random.choice(ml1.train_tasks))
env.reset()
```

```
array([-0.03265107, 0.51487826, 0.2368844 , 0.08047285, 0.61476518,
       0.02      , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        ])
```
But then if I check the low and high of the observation space:

```python
env.observation_space.low
```

```
array([-0.51,  0.38, -0.05,  -inf,  -inf,  -inf,  -inf,  -inf,  -inf,
       -0.1 ,  0.8 ,  0.05], dtype=float32)
```

Is this not out of bounds for the last but one element? (0.0 < 0.8)
@haydenshively
The observation spaces were written and tested for the case where the environments are fully observable. When they are partially observable, the last 3 elements of the observation get zeroed out, which is what's happening here... and then the observation space is incorrect. I can push a fix tonight.
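Presumably the fix has roughly this shape (a guess at the structure, not the actual patch; the hand/goal high values are illustrative, while the low values match the array printed above):

```python
import numpy as np
from gym.spaces import Box

partially_observable = True  # the goal is hidden from the agent

hand_low = np.array([-0.51, 0.38, -0.05])
hand_high = np.array([0.51, 1.03, 0.3])   # illustrative
obj_low, obj_high = np.full(6, -np.inf), np.full(6, np.inf)

# When the goal is hidden, the env emits zeros in its place, so the
# advertised bounds for those elements must be zero as well:
goal_low = np.zeros(3) if partially_observable else np.array([-0.1, 0.8, 0.05])
goal_high = np.zeros(3) if partially_observable else np.array([0.1, 0.9, 0.3])  # illustrative

observation_space = Box(
    np.hstack((hand_low, obj_low, goal_low)),
    np.hstack((hand_high, obj_high, goal_high)),
)
```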
The following is a minimal working example which shows that all of the environments produce observations outside of their observation space. All it does is iterate over each environment from ML1, sample and set a task for the given environment, then take random actions in the environment and test whether or not the observations are inside the observation space, and at which indices (if any) an observation lies outside of the bounds of the observation space. You will get different results depending on the value of TIMESTEPS_PER_ENV, but setting this value to 1000 should yield violating observations for most environments. This is an issue, say, for RL implementations like RLlib which expect observations to be inside the observation space, and makes the environments incompatible with such libraries. This might be related to issue #31, though that issue only points out incorrect observation space boundaries regarding the goal coordinates, and the script (reproduced in modified form earlier in this thread) should show that there are violations in other dimensions as well.