facebookresearch / mbrl-lib

Library for Model Based RL
MIT License
959 stars 158 forks source link

Pddm #88

Closed freiberg-roman closed 2 years ago

freiberg-roman commented 3 years ago

(WIP) PDDM implementation

Motivation and Context / Related issue

PR for PDDM's MPPI planner, support for sequenced batches, and in the near future proper settings and benchmarks for MuJoCo environments.

Checklist

luisenp commented 3 years ago

Hi @freiberg-roman, this is awesome! Thanks a lot for your interest and contribution. I'll take a closer look soon. As a high level question, have you tested the PDDM implementation in some environments already? Do you have any plots you can share?

Also, from a very quick glance it looks like formatting can be improved in some places. Please install pre-commit to run some hooks automatically for you before a commit (sorry, I should have added instructions for this in the README).

freiberg-roman commented 3 years ago

Hi @freiberg-roman, this is awesome! Thanks a lot for your interest and contribution. I'll take a closer look soon. As a high level question, have you tested the PDDM implementation in some environments already? Do you have any plots you can share?

Also, from a very quick glance it looks like formatting can be improved in some places. Please install pre-commit to run some hooks automatically for you before a commit (sorry, I should have added instructions for this in the README).

cleaned up a bit. I'm currently benchmarking on common MuJoCo environments and will probably give an update next week. On trivial environments like CartPole it seems to run well.

luisenp commented 3 years ago

Hi @freiberg-roman. I took a look at the sequential SequenceTransitionIterator, which is actually something I was about to start writing for a PlaNet implementation in #86. I like most of the logic of what you did, but have some comments about how to structure it differently. We could have some discussion about this here, but I'm also thinking it would be easier if I just take your code as a starting point and make the changes. I can do a PR to your fork (just for the replay buffer changes) so you can give some input, as it will probably mean you'll have to change some of your PDDM code; once it's ready we can merge the new replay buffer into master from your fork. How does that sound?

freiberg-roman commented 3 years ago

Hi @freiberg-roman. I took a look at the sequential SequenceTransitionIterator, which is actually something I was about to start writing for a PlaNet implementation in #86. I like most of the logic of what you did, but have some comments about how to structure it differently. We could have some discussion about this here, but I'm also thinking it would be easier if I just take your code as a starting point and make the changes. I can do a PR to your fork (just for the replay buffer changes) so you can give some input, as it will probably mean you'll have to change some of your PDDM code; once it's ready we can merge the new replay buffer into master from your fork. How does that sound?

Sure, I'm planning a proper rework anyway, so it would be nice if we could find a common API for the replay buffer now. This version is more or less a hack to get it running.

luisenp commented 3 years ago

@freiberg-roman I made a new pull request for the sequential iterator (#91), would you mind taking a look? I ended up only using a small portion of your code, since I did some refactoring of the base classes to make it cleaner.

Note that this iterator returns an extra dimension for the sequence, so it's a bit different than what you currently have in this PR. You will need to change a few things in your PDDM code, but hopefully just some reshapes in the right places.

freiberg-roman commented 3 years ago

@luisenp hello, sorry for not giving any update had a lot going on in the last few weeks. Unfortunately there were some bugs and I have to restart my experiments. I already had good results with ground truth dynamics (also added a small notebook to play around with) and am in the process of investigating most components separately to get the reported performance out of PDDM. This could probably take a month. I think the code is mostly done in terms of structure. If you would like, we could work on finalizing this PR and I could start a new one just for tuning. What do you think about this?

luisenp commented 3 years ago

Hi @freiberg-roman, thanks for the update! I've also been busy with other things which is why I haven't done a great job at following up on PRs lately, but things should clear up after this week. Let me start by playing around a bit with your code first and run the examples you currently have, and we can go from there. One possibility would be to merge the trajectory optimizer separately if it's working as expected with ground truth dynamics, and defer integrating PDDM until we have a bit more understanding of the issues.

luisenp commented 3 years ago

Hi @freiberg-roman. Sorry for the delay, I just noticed that the last thing I said was that I would take a look at your examples, and somehow I thought I had asked you a question and was waiting for answer. In any case, is this the notebook to test on ground truth dynamics?

freiberg-roman commented 3 years ago

Hi @luisenp. Sorry, I'm short on time at the moment, but I'm still working on it (hope it's not time critical). I have been testing components in a separate repository and will start reintegrating the changes soon. I will share my W&B reports when everything is ready. This notebook is similar to the one I am currently using for ground truth dynamics.

I have a question about the PETS HalfCheetah environment. What is the reason for adding sine and cosine to the 2nd and 3rd observation dimension?

luisenp commented 3 years ago

Hi @freiberg-roman, thanks for the update, and looking forward to seeing the changes, and let me know if there is anything I can do to help :)

With regards to the PETS question, this comes from the original implementation of PETS, and we kept it here mostly for the sake of completeness and for debugging our implementation. If I'm not mistaken, the rationale is that it makes it easier to learn angles, since the model doesn't have to learn something like 0 being close to 2 * pi. Personally, if I were to properly benchmark different MBRL algorithms, I'd do it in the unmodified mujoco-gym versions, and not on custom versions such as this one.

luisenp commented 3 years ago

Hi @freiberg-roman. This PR is quite big now :) Are there any sub components that are stable and would be ready to integrate in master? We could make separate PRs for some of them and start adding some of your contributions to master. For example the MPPI optimizer?

freiberg-roman commented 3 years ago

Hi @luisenp. Good idea, I wasn't aware of the LOC count. Did a small PR just for one environment #119 . The performance with GT is similar to the notebook version added here.

freiberg-roman commented 2 years ago

This PR is quite outdated. I think it would be better to continue with other PR's first :))