Farama-Foundation / Metaworld

Collections of robotics environments geared towards benchmarking multi-task and meta reinforcement learning
https://metaworld.farama.org/
MIT License

ML1 Tasks for Constant Goals #24

Closed michaelzhiluo closed 4 years ago

michaelzhiluo commented 4 years ago

Currently, we are trying to use specific environments in ML1 with a constant goal per task in a MAML setting (i.e., env.reset() changes the initial positions but keeps the goal constant).

However, we are not clear on what a task means in the ML1 setting. Based on the code for one of the environments we are trying to run, calling self.set_task updates self.goal. However, when the environment is reset, self._state_goal is initially set to self.goal but is then reassigned a randomly generated goal, concatenated with the initial reacher arm positions, which also appear to be random. When self.random_init is False it works as intended, but then the starting states are constant as well.

We are wondering whether there is a way to define a task using the Metaworld API such that, for a given task, the goal position is held constant while the initial observation changes when env.reset() is called.
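
To make the ask concrete, here is a rough sketch of the behavior we are after, written against the ML1 API from the README (whether set_task actually gives these semantics is exactly our question):

from metaworld.benchmarks import ML1

# Desired semantics (sketch only): after set_task, the goal stays fixed across
# resets, while the initial arm/object positions are re-randomized each reset.
env = ML1.get_train_tasks('pick-place-v1')
task = env.sample_tasks(1)[0]
env.set_task(task)

obs_a = env.reset()  # initial positions may differ between these two resets...
obs_b = env.reset()
# ...but the goal tied to `task` should be identical in both episodes.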

varun-intel commented 4 years ago

I am having a similar problem. If I run these lines following the README:

from metaworld.benchmarks import ML1

env = ML1.get_train_tasks('pick-place-v1')  # Create an environment with pick_place tasks
tasks = env.sample_tasks(1)  # Sample a task (in this case, a goal variation)
env.set_task(tasks[0])  # Set task

obs = env.reset()  # Reset environment
a = env.action_space.sample()  # Sample an action
obs, reward, done, info = env.step(a)  # Step the environment with the sampled random action
print('Goal in info: {}'.format(info['goal']))

obs = env.reset()  # Reset environment
a = env.action_space.sample()  # Sample an action
obs, reward, done, info = env.step(a)  # Step the environment with the sampled random action
print('Goal in info: {}'.format(info['goal']))

the output is:

Goal in info: [0.08237526 0.81740662 0.21224671]
Goal in info: [0.02779365 0.80701352 0.05445151]

According to the paper, each task in ML1 should have a single goal.

michaelzhiluo commented 4 years ago

@ryanjulian Is there a way to have each task represent a single goal while varying initial states?

michaelzhiluo commented 4 years ago

@ryanjulian @tianheyu927 For the MAML experiments, the paper says the number of rollouts per task is 10 for ML1. Did all the envs in a single task have self.random_init set to False, meaning that all episodes had the same starting state and goal?

ryanjulian commented 4 years ago

@michaelzhiluo @varun-intel Thank you for your questions. I apologize for the delay. The team was out most of last week for the US Thanksgiving holiday. Work-life balance is a core value in this org, and you should generally not expect responses over major holidays.

Before we answer, I should note that the public API surface of metaworld is limited to the metaworld.benchmarks module. Everything else, especially the internal environment APIs, should be considered private and extremely unstable. Limiting the API in this way is one of the compromises we had to make to deliver this benchmark to the community.

I'll let @tianheyu927 and @zhanpenghe comment on the logic used for goals and random_init.

michaelzhiluo commented 4 years ago

@ryanjulian @tianheyu927 @zhanpenghe Thank you for fixing part of the code for self.random_init! We have one more question, as we want to reproduce MAML results similar to those reported for ML1 in the appendix of the paper.

[screenshot from the paper's appendix]

The paper states that each task in ML1 is one of 50 random initial object and goal positions. We are wondering about the exact details of what constitutes a task in a MAML setup.

Suppose there are 20 workers and 10 environments per worker (a vectorized setup). In MAML, each worker represents a different task. For the Metaworld implementation in the paper, does every environment within the same worker have the same goal and initial object position? Or is each task a collection of randomized object and goal positions, meaning that the environments within a worker have different goals and initial object positions?

We ask because self.random_init selects between these two cases: False gives the former, True the latter.

michaelzhiluo commented 4 years ago

@ryanjulian @tianheyu927 @zhanpenghe Just to follow up: we are wondering how a task was defined for the experiments run on ML1. Specifically, we are interested in how each task is defined across vectorized environments.

Your response is greatly appreciated!

ryanjulian commented 4 years ago

Initial conditions are not randomized in ML1. The important snippet is here: https://github.com/rlworkgroup/metaworld/blob/dfdbc7cf495678ee96b360d1e6e199acc141b36c/metaworld/benchmarks/ml1.py#L22, which sets the constructor arg random_init to False for all environments in the ML1 benchmark.

ML1 varies the goal position, but the diversity your meta-learner is exposed to during meta-training is controlled (it only gets to see 50 unique goals).

Though ML1 measures performance on intra-task variation and ML10/ML45 measure meta-learning performance on inter-task variation, the interfaces are the same. The set_task interface is designed to allow for efficient vectorized sampling: if you want to transmit a meta-batch to remote or vectorized samplers, you can construct the environments once and only transmit the task information to each environment.

You can see this in the ML1 example in the README, when we call

# outer loop, meta-batch sampling
env = ML1.get_train_tasks('pick-place-v1')
# sample a meta-batch
tasks = env.sample_tasks(1)  
# configure a single environment to represent a single element of the meta-batch
env.set_task(tasks[0])

# inner-loop, single-task sampling
obs = env.reset()
a = env.action_space.sample() 
obs, reward, done, info = env.step(a)  # Step the environment with the sampled random action

Pseudocode for a naive parallelization of meta-batch sampling might look something like this. My example assumes your meta-batch size and your parallelization height (number of environments) are the same.

import pickle

# setup
env = ML1.get_train_tasks('pick-place-v1')
meta_batch_size = 10
envs = [pickle.loads(pickle.dumps(env)) for _ in range(meta_batch_size)]

for i in range(num_meta_itrs):
    # outer loop, meta-batch sampling
    tasks = env.sample_tasks(meta_batch_size)
    for e, t in zip(envs, tasks):  # parallel-for
        e.set_task(t)

    # inner loop, single-task sampling
    for e in envs:  # parallel-for
        path_length = 0
        done = False
        obs = e.reset()
        while not done and path_length < max_path_length:
            a = policy.sample(obs)
            obs, reward, done, info = e.step(a)
            path_length += 1

Per-step vectorization would be similar, but there's a lot more bookkeeping to deal with different vectors terminating at different steps. Meta-test evaluations of ML1 look similar, but with get_test_tasks instead.
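
For illustration only, a minimal sketch of that extra bookkeeping, reusing the envs, policy, and max_path_length placeholders from the pseudocode above and using a done-mask to track which vector entries have already terminated:

import numpy as np

# Sketch of per-step vectorized sampling (not Metaworld-specific).
# `envs`, `policy`, and `max_path_length` are the same placeholders as above.
obs = [e.reset() for e in envs]
done_mask = np.zeros(len(envs), dtype=bool)

for _ in range(max_path_length):
    for i, e in enumerate(envs):
        if done_mask[i]:
            continue  # this entry already terminated; stop stepping it
        a = policy.sample(obs[i])
        obs[i], reward, done, info = e.step(a)
        done_mask[i] = done
    if done_mask.all():
        break  # every environment in the vector has finished its episode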

ML1 selects a new set of train/test goals each time you construct it by calling ML1.get_train_tasks() or ML1.get_test_tasks(). This can present a problem for multi-process or multi-machine sampling, in which many workers might construct ML1 instances in separate processes, giving them different sets of train/test goals.

Doing this wrong could accidentally expose your meta-learner to far more meta-train configurations than the benchmark allows. I agree that we should probably rethink the API to make this harder to mess up. I recommend that you configure your sampler to stuff the value of task into env_info, and then verify in your optimization process that your samples don't come from more than 50 unique tasks.
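
As a rough illustration of that check (the paths structure and the 'task' key in env_info are assumptions about your sampler, not part of the Metaworld API):

# Hypothetical sanity check in the optimization process: collect the task
# recorded in env_info for every sample and verify that no more than 50
# unique meta-train configurations ever appear.
def assert_task_budget(paths, max_unique_tasks=50):
    seen = set()
    for path in paths:
        for env_info in path['env_infos']:
            seen.add(repr(env_info['task']))  # repr() just makes the task hashable
    assert len(seen) <= max_unique_tasks, (
        'sampled {} unique tasks, but the benchmark allows only {}'
        .format(len(seen), max_unique_tasks))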

If you're going to use remote (multi-process or multi-machine) sampling for ML1, you have two options:

  1. Construct instances of ML1 in as many remote processes as you like, but always sample the meta-batch (call ML1.sample_tasks()) from a single process and transmit the tasks to your workers, which then call env.set_task() (a minimal sketch follows the edit note below).
  2. Construct a single ML1 instance (ML1.get_train_tasks()) on a main process and transmit it to worker processes by pickling/unpickling. This preserves the set of train tasks across machines, which is otherwise chosen anew every time the benchmark is constructed. The same logic applies to ML1.get_test_tasks(). Workers may then sample tasks locally, because each one is sampling from the same pre-chosen set anyway.

Edit: I realized solution (2) doesn't work with our current pickling implementation (which reconstructs the object anew during unpickling)
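
A minimal sketch of option (1), with the worker transport left abstract (the workers list and run_task call are placeholders for whatever RPC/queue mechanism your sampler uses):

from metaworld.benchmarks import ML1

# Option (1) sketch: every worker process has already constructed its own ML1
# instance; only the main process samples the meta-batch, and the sampled
# tasks are shipped to the workers, which then call env.set_task(task).
main_env = ML1.get_train_tasks('pick-place-v1')
meta_batch_size = 20

tasks = main_env.sample_tasks(meta_batch_size)  # sample only in this process
for worker, task in zip(workers, tasks):
    worker.run_task(task)  # worker side: env.set_task(task), then roll out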

ryanjulian commented 4 years ago

@ahtsan @krzentner @naeioi @yonghyuc @lywong92 @CatherineSue @avnishn

ryanjulian commented 4 years ago

See https://github.com/rlworkgroup/metaworld/pull/40, which will make case (2) in https://github.com/rlworkgroup/metaworld/issues/24#issuecomment-576996005 work.

michaelzhiluo commented 4 years ago

Thanks for the clarification on ML1! We are able to reproduce similar results for reach-v1 with MAML. Having all workers (i.e., tasks) and each worker's environments share 50 tasks in total significantly improved reward.

Lastly, just FYI, a small fix: set args_kwargs[task_name]['kwargs']['random_init'] = False in ml1.py. That ensures the initial state and goal position stay constant across calls to env.reset().

ryanjulian commented 4 years ago

@michaelzhiluo if you found a flaw, please open a PR so that everyone can share in your fix :)

avnishn commented 4 years ago

@michaelzhiluo, I believe we've fixed this in our most recent update! The Metaworld API has changed, so please use the new API for any future projects and update any ongoing ones :)