Add trajectory-based dynamics model

natolambert commented 2 years ago

TODO for this WIP PR:

[x] New PID based / linear feedback agent(s)
[ ] Make PID accept vector inputs
[x] Training example
[ ] Migrate example to colab
[ ] Add tests

Types of changes

[ ] Docs change / refactoring / dependency upgrade
[ ] Bug fix (non-breaking change which fixes an issue)
[x] New feature (non-breaking change which adds functionality)
[ ] Breaking change (fix or feature that would cause existing functionality to change)

Motivation and Context / Related issue

I'm collaborating with some folks on Berkeley looking to apply the trajectory-based model to real world robotics, so I wanted to integrate it into this library to give it more longevity.

The paper is here. The core of the paper is proposing a long-term prediction focused dynamics model. The parametrization is:

$$ s{t+1} = f\theta(s_0, t, \phi),$$

where $\phi$ are closed form control parameters (e.g. PID)

Potentially this #66 , I think we will need to modify the replay buffer to

store control parameter vector
store time indices (which may be close with the trajectory formulation)

How Has This Been Tested (if it applies)

I am going to build a notebook to validate and demonstrate it, currently it is a fork of the PETS example. I will iterate

Checklist

[ ] The documentation is up-to-date with the changes I made.
[x] I have read the CONTRIBUTING document and completed the CLA (see CONTRIBUTING).
[ ] All tests passed, and additional code has been covered with new tests.

natolambert commented 2 years ago

@luisenp do you know with the current replay buffer trajectory storing, if at training time it will be easy to get the "trajectory time index" corresponding to the step of the episode?

The trajectory-based model is trained on the full trajectory and each sub-trajectory of it, so I will need a tool to get all sub trajectories at the minimum. Happy to discuss on a call if it helps!

natolambert commented 2 years ago

Also, I'm not sure what to do with pre-commits, they're failing because I am using a different version of python I think?

gi[INFO] Installing environment for https://github.com/psf/black.
[INFO] Once installed this environment will be reused.
[INFO] This may take a few minutes...
t puAn unexpected error has occurred: CalledProcessError: command: ('/Users/nato/miniconda3/envs/mbrl/bin/python', '-mvirtualenv', '/Users/nato/.cache/pre-commit/repor4ja1kkq/py_env-python3.7', '-p', 'python3.7')
return code: 1
expected return code: 0
stdout:
    RuntimeError: failed to find interpreter for Builtin discover of python_spec='python3.7'

Though if I change it, it wants me to stage them 😥

[ERROR] Your pre-commit configuration is unstaged.
git add .pre-commit-config.yaml to fix this.

natolambert commented 2 years ago

@robertocalandra @albertwilcox may be interested.

luisenp commented 2 years ago

Hi @natolambert, thanks for the PR! I have a few questions:

I'm wondering if having a new replay buffer class for this is overkill, couldn't we just store the control parameter and time index as part of the state in a regular replay buffer?
By getting "all subtrajectories", are you referring to all N*(N+1) / 2 subtrajectories of a trajectory of length N? My first though is that the cleanest would be to create a new iterator class that gets the full trajectories from the replay buffer, and then serves the subtrajectories in its loop.

Also, can you try checking out branch lep.change_precommit_yaml and let me know if this solves the precommit issues you are having?

natolambert commented 2 years ago

I really like the sounds of that, let me try and make those additions. It removes complexity in a way that I think is fitting.

natolambert commented 2 years ago

This seems like a good direction, I'm realizing I will need to add another class similar to OneDTransitionRewardModel because the prediction formulation is as follows (rather than one step):

$$ s{t+h} = f\theta(s_t, h, \psi)$$.

Not sure if I need a new MLP function, if I do it will just be for different trajectory propagation options (because it's no longer recursive, but rather all elements in one forward pass).

Will push more changes shortly. Relatedly, why is it called OneDTransitionRewardModel? that's always confused me.

I guess maybe we can spoof it as follows: Inputs: state = state + time index, action = control parameters Outputs: state at future time index

With the right propagation function, nothing needs to change, including the replay buffer. We'll see if I can get it to work.

luisenp commented 2 years ago

This seems like a good direction, I'm realizing I will need to add another class similar to OneDTransitionRewardModel because the prediction formulation is as follows (rather than one step):

$$ s{t+h} = f\theta(s_t, h, \psi)$$.

Not sure if I need a new MLP function, if I do it will just be for different trajectory propagation options (because it's no longer recursive, but rather all elements in one forward pass).

Will push more changes shortly. Relatedly, why is it called OneDTransitionRewardModel? that's always confused me.

I guess maybe we can spoof it as follows: Inputs: state = state + time index, action = control parameters Outputs: state at future time index

With the right propagation function, nothing needs to change, including the replay buffer. We'll see if I can get it to work.

TBH, I don't love the name OneDTransitionRewardModel either. A previous version was called ProprioceptiveModel, but an earlier user pointed out that in some cases the same type model could be used for inputs that wouldn't be considered proprioceptive. We brainstormed a bit on what a general name for this model would be, and this was the winning option. It essentially refers to the fact that the input to this model is a 1D array with [states, actions] and the output is a 1D array with [next_states, rewards].

natolambert commented 2 years ago

Your feedback is making me think I may not actually need any of the new code I proposed other than some agent definitions for the environment I want to use.

Hacking all the input-output pairs to be correct has gotten pretty far. Let me finish this, then we can see if we want to add more (which will make it clearer for people who are interested). The clear difference is trajectory propagation when compared with the GaussianMLP class.

natolambert commented 2 years ago

Got "loss going down" which is always an exciting point. I will add more validation soon. Screen Shot 2022-08-09 at 7 46 59 PM

natolambert commented 2 years ago

@luisenp I got the initial notebook done. I've actually set it up so I can reset the rest of the files, and just merge the notebook when we are happy with it. I think the difference in the MBRL-Lib reacher env vs the one used in the paper is causing some numerical instability, but the general principle is close to being validated!

Here's a long-term prediction accuracy comparing the one-step model to the traj-based model. I'm not 100% sure I'm predicting with the traj-based model right, because I couldn't use the wrappers. It predicts by passing an initial state, control parameters, and a vector of time indices.

Let me know if you look at the structure of the notebook!

Some things I want to get to:

set up batched inference so I can use ensembles (any tips on this?),
try a different environment,
see if more data helps numerical stability.

Here's the plot comparing long-term prediction accuracy.

luisenp commented 2 years ago

That's a nice plot! I'll take a closer look at this soon (hopefully tomorrow Friday, if I have time). Thanks!

natolambert commented 2 years ago

@luisenp I guess the most relevant question is if any of the code in-the-works here interests you in the library or if we should just go notebook-only.

I ended up removing most of them. I can add a commit that reset's everything except for the notebook. I re-used all vanilla stuff in a way that could be slightly confusing. In the case when people are interested in this, we could add official code that makes it much simpler to use. Some of the things I'm using in potentially non-intended manner:

trajectory storing in an extra replay buffer because trajectory-based model trains on sub-trajectories,
putting control parameters + time horizon into actions (the naming discrepancy could be confusing for people),
had to deconstruct util.rollout_agent_trajectories to store the data I needed (even if I just wanted one replay buffer).

I think that until the model gets more traction, notebook only is probably okay (I will reset all the other changes in the PR). It shows a good research use-case for how the library can be used flexibility.

luisenp commented 2 years ago

If everything you need to run this example is in the notebook, then that's definitely a good starting point! In that case I can focus on reviewing the notebook more carefully. We don't need to remove the other files yet, maybe some of them will be useful later (thinking of the PID agent here).

That said, it would be useful if you can remove any superfluous stuff from the notebook, and maybe add some short text highlighting the parts I should focus more on when reviewing?

natolambert commented 2 years ago

Yeah will do, in that case I will do a full pass to clean things up (including addressing the things in the PID agent and making it more compatible with the style guidelines / precommits).

natolambert commented 2 years ago

@luisenp added tests, made batch friendly, and cleaned everything.

natolambert commented 2 years ago

After our discussion today, I think I am happy with this as a compromise:

transfer the notebook to colab, and link to colab in readme + link to source in my mbrl-lib fork (after minor improvements).
make this PR just for PID control.

If it gets solid uptake, we can polish it further. Thoughts?

luisenp commented 2 years ago

Hi Nathan, that sounds good, I'm OK with this plan.

natolambert commented 2 years ago

Colab is here: https://colab.research.google.com/drive/15lodC9KyzzQCv9hQY3wtAe-yYOdk9vZB?usp=sharing My dev repo is here: https://github.com/natolambert/mbrl-lib-dev/tree/traj-model (will merge into my main branch when we're done here)

natolambert commented 2 years ago

Any thoughts on adding a vanilla reward function, like no_termination (for unrolling states where you don't care about reward, or for in reacher where there is no reward function right now).

def no_reward(act: torch.Tensor, next_obs: torch.Tensor) -> torch.Tensor:
    return torch.Tensor(0)

facebookresearch / mbrl-lib